LoomVideo

Unifying Multimodal Inputs into Video Generation and Editing

Peking University · Alibaba Group

News

  • 📌 LoomVideo is built upon Wan 2.2 TI2V 5B with Qwen3-VL-8B as the multimodal encoder. If you find our work useful, please consider giving our GitHub repository a star and citing our paper 🙏
  • [2026-06-05] Our paper is now available on arXiv!
  • [2026-06-02] We release the codebase and model weights of LoomVideo!
  • [2026-06-02] We release the project page of LoomVideo!

Abstract

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier field. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation inevitably doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling and directly adding the clean source video latent to the noised target latent, this elegant design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is seamlessly integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, exhibiting exceptional superiority in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41× acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.
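The quadrupling claim follows from the quadratic cost of full self-attention in the sequence length. The short sketch below illustrates the arithmetic only; the token count is a hypothetical placeholder and `attention_pair_count` is an illustrative helper, not part of the LoomVideo codebase.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


# Hypothetical token count for the noised target video latent (assumption).
target_tokens = 10_000

# Token concatenation appends an equally long source-video sequence.
concat_tokens = 2 * target_tokens

# Scale-and-Add conditioning keeps the sequence length unchanged.
scale_add_tokens = target_tokens

baseline = attention_pair_count(target_tokens)
concat_ratio = attention_pair_count(concat_tokens) / baseline
scale_add_ratio = attention_pair_count(scale_add_tokens) / baseline

print(f"token concat:  {concat_ratio:.0f}x attention pairs")
print(f"scale-and-add: {scale_add_ratio:.0f}x attention pairs")
```

The ratio is independent of the chosen token count: doubling the sequence always yields four times as many attention pairs, while adding the source latent element-wise leaves the count unchanged.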

Model Architecture

LoomVideo Architecture

Overall Architecture of LoomVideo. It employs Deepstack injection, Scale-and-Add conditioning, and Negative Temporal RoPE index for efficient unified video generation and editing.

LoomVideo is an integrated MLLM + DiT + VAE framework. We replace the T5 text encoder with Qwen3-VL-8B to process interleaved multimodal inputs (text, reference images, and source videos). The VAE compresses visual inputs into latents, and the DiT generates high-fidelity videos from these conditions. Three key designs are introduced:

  • (1) Deepstack Injection: We extract hidden states from every MLLM layer and inject them into corresponding DiT layers via cross-attention, ensuring deep layer-to-layer semantic alignment and enhancing instruction-following without heavy adapters.
  • (2) Scale-and-Add Conditioning: We scale the clean source video latent by a timestep-dependent factor and add it directly to the noised target, introducing zero additional tokens and achieving 5.41× inference acceleration while supporting complex non-rigid edits.
  • (3) Negative Temporal RoPE: Reference images are assigned negative temporal indices (−τ, −2τ, ...) while target frames use positive indices, enabling the model to robustly distinguish references from targets with minimal extra tokens.
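Designs (2) and (3) can be illustrated with a minimal pure-Python sketch. It is not the released implementation. The linear scale schedule, the flow-style noising formula, the flat-list latents, and the choice to start target frame indices at 0 are all assumptions made for illustration, and every function name here is hypothetical.

```python
import random
from typing import Callable, List


def noise_target(target: List[float], noise: List[float], t: float) -> List[float]:
    """Assumed flow-style interpolation: t=0 is clean, t=1 is pure noise."""
    return [(1.0 - t) * x + t * n for x, n in zip(target, noise)]


def scale_and_add(
    noised_target: List[float],
    source_latent: List[float],
    t: float,
    scale_fn: Callable[[float], float],
) -> List[float]:
    """Add the scaled clean source latent to the noised target, element-wise.

    The output has the same length as the noised target, so no tokens are added.
    """
    if len(noised_target) != len(source_latent):
        raise ValueError("source and target latents must have the same shape")
    scale = scale_fn(t)
    return [z + scale * c for z, c in zip(noised_target, source_latent)]


def temporal_indices(num_refs: int, num_frames: int, tau: int) -> List[int]:
    """Temporal positions: -tau, -2*tau, ... for references, 0, 1, ... for frames."""
    ref_positions = [-tau * (k + 1) for k in range(num_refs)]
    frame_positions = list(range(num_frames))  # starting at 0 is an assumption
    return ref_positions + frame_positions


def linear_scale(t: float) -> float:
    """Placeholder timestep-dependent factor; the real schedule is not given here."""
    return t


random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(6)]
target = [random.gauss(0.0, 1.0) for _ in range(6)]
noise = [random.gauss(0.0, 1.0) for _ in range(6)]

z_t = noise_target(target, noise, t=0.8)
dit_input = scale_and_add(z_t, source, t=0.8, scale_fn=linear_scale)

print(len(dit_input) == len(z_t))  # True: sequence length is unchanged
print(temporal_indices(num_refs=3, num_frames=4, tau=2))  # [-2, -4, -6, 0, 1, 2, 3]
```

The point of the sketch is structural: the conditioned input keeps the target's shape, and reference images occupy a temporal range that never overlaps with the target frames.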

Qualitative Examples

FashionVideoBench

text_video_to_video / product_edit
Input

Prompt: Remove the large white lace collar from the shirt.

Input

Prompt: Change the white shirt to a red polka-dot top.

Input

Prompt: Replace the person's white slippers with black boots.

Input

Prompt: Remove the vase from the cabinet.

Input

Prompt: Remove the logo from the clothing.

Input

Prompt: Remove the pearl bag from the person's hand.

Input

Prompt: Add a silver necklace around the person's neck.

Input

Prompt: Remove the necklace from the person's neck.

Input

Prompt: Remove the fur collar from the clothing.

Input

Prompt: Add a delicate butterfly brooch to the left shoulder of the jacket.

Input

Prompt: Change the subject's floral dress to a black blazer dress.

Input

Prompt: Hang a small brown leather handbag on the woman's left shoulder.

text_video_to_video / model_edit
Input

Prompt: Replace the person: A man with a tanned complexion, dark hair, smiling and showing his teeth. He is wearing a wide-brimmed woven straw hat with a yellow cord chin strap, and a solid black long-sleeved shirt with a small logo on the upper left sleeve.

Input

Prompt: Replace the person: A woman with short, wavy brown hair, fair skin, and bright red lipstick with drop earrings. She wears a dark green short-sleeved blouse with button-down front and a knee-length black A-line skirt with dark green floral pattern, carrying a small structured black handbag.

Input

Prompt: Replace the person: A young man with fair skin and styled black hair, wearing dark sunglasses with thick black frames. He is dressed in a black leather jacket over a plain white t-shirt, with white stripes on the sleeves, black pants with a western-style silver buckle belt, a silver geometric pendant necklace, and holding the strap of a quilted black bag.

Input

Prompt: Replace the person: A woman with long, wavy dark brown hair, pearl stud earrings and a pearl necklace. She wears an off-white blouse with vertical pintuck pleating and scalloped lace collar, tucked into medium-blue flared denim jeans, carrying a book under her right arm.

Input

Prompt: Replace the person: A woman with fair skin and brown hair styled in a high textured bun with braided elements. She wears a solid black zip-up vest over a plain white long-sleeved undershirt, with a stand-up collar and subtle zippered pockets, displaying a bright smile.

Input

Prompt: Replace the person: A highly muscular man with short brown hair and light stubble. He wears a tight-fitting olive green short-sleeved t-shirt and tan cargo shorts, equipped with an olive green tactical chest rig with black adjustable straps and buckles, and a black wristwatch.

Input

Prompt: Replace the person: A woman with long straight dark hair wearing dark sunglasses. She is dressed in a dark olive green quilted coat with diamond-stitched pattern, a fitted white turtleneck, off-white flared pants, and pointy-toed white boots with dark soles.

Input

Prompt: Replace the person: A man with short dark hair and stubble, wearing aviator-style sunglasses with gold frames and brown lenses. He is dressed in a long-sleeved burgundy V-neck cardigan with buttons, ribbed cuffs, two square patch pockets, a beige undershirt visible beneath, and standard-fit dark blue denim jeans.

freeform_edit
Input

Prompt: Edit this video: The video features the same man as the original video, who has short, graying hair, a beard, and wears brown-rimmed glasses. He is wearing the same white short-sleeved t-shirt with a rectangular striped graphic and a small sailboat design as the original video, along with the same dark blue cargo shorts and the same silver watch on his left wrist as the original video. The background is an outdoor urban setting with dark walls, deep shadows, and a paved street surface featuring a solid white line. The camera angle is a medium shot that remains static as the man, initially facing away with his hands in his pockets, turns around to face the camera.

Input

Prompt: Edit this video: The video features the same woman as the original video, wearing a light blue short-sleeved polo shirt with white stripes on the collar and cuffs, paired with a dark brown skirt. She has long brown hair and is wearing small stud earrings. The background shows a bright room with white walls and a stack of books on a surface to the left. The same man as the original video walks across the background from right to left, wearing a matching light blue polo shirt and brown pants. The camera remains static throughout the video, capturing the scene in a medium shot.

Input

Prompt: Edit this video: The video features the same man and woman as the original video, wearing the same blue denim chef coats and aprons. They are standing side-by-side in the same modern kitchen background as the original video. The camera remains static throughout the video, capturing them from the waist up in a medium shot. The man stands on the left, initially adjusting his apron and then resting his hands near his pockets. The woman stands on the right, starting with her hands clasped in front of her. She then raises her right arm, bending it at the elbow, and rests her left hand on her right arm. Both individuals look towards the camera and then slightly off to the side.

Input

Prompt: Edit this video: The video features the same woman as the original video, walking forward on an outdoor sidewalk. She is wearing the same light blue long-sleeved blouse with dark blue trim along the collar, cuffs, and a large bow tie at the neck as the original video. She pairs this top with black trousers and carries a red quilted handbag in her left hand. Her dark hair is pulled back, and she wears red lipstick, a delicate necklace, rings, and the same dangling earrings and watch as the original video. The background consists of a building exterior with a wooden door on the left, a white wall with blue signs, and a large white cylindrical structure overhead, with a blurred street and greenery visible behind her. The camera captures her in a medium shot, tracking backward to keep her centered in the frame as she walks towards the lens.

text_video_product_image_to_video
Input
Reference image

Prompt: Replace the black ankle boots with white high-top sneakers

Input
Reference image

Prompt: Replace the handbag with a tote bag.

Input
Reference image

Prompt: Change the color of the shoes to navy blue with white soles

Input
Reference image

Prompt: Change the white platform sneakers on feet to black leather loafers

Input
Reference image

Prompt: Replace the baseball cap with a hat.

Input
Reference image

Prompt: Replace the straw hat with a hat.

Input
Reference image

Prompt: Change the blue gradient knit sneakers to red leather sneakers.

Input
Reference image

Prompt: Replace the baseball cap with a hat.

Input
Reference image

Prompt: Replace the black sneakers worn on the feet with red running shoes

Input
Reference image

Prompt: Replace the floral embroidered boots with classic black leather ankle boots

Input
Reference image

Prompt: Change the straw sun hat to a hat.

Input
Reference image

Prompt: Change the boots to classic black leather shoes

video_model_image_to_video
Input
Reference image

Prompt: Generate a video that follows the movement of people in the reference video and the people and background in the reference image.

Input
Reference image

Prompt: Generate a video that follows the movement of people in the reference video and the people and background in the reference image.

Input
Reference image

Prompt: Generate a video that follows the movement of people in the reference video and the people and background in the reference image.

Input
Reference image

Prompt: Generate a video that follows the movement of people in the reference video and the people and background in the reference image.

Input
Reference image

Prompt: Generate a video that follows the movement of people in the reference video and the people and background in the reference image.

Input
Reference image

Prompt: Generate a video that follows the movement of people in the reference video and the people and background in the reference image.

Input
Reference image

Prompt: Generate a video that follows the movement of people in the reference video and the people and background in the reference image.

Input
Reference image

Prompt: Generate a video that follows the movement of people in the reference video and the people and background in the reference image.

text_multi_image_to_video
Input
Reference image

Prompt: Generate a video with reference images: The woman (@Image 1) wears a pink blazer and white shirt with lace decoration, along with chain earrings and a watch, standing in a pure white studio. She slightly closes her eyes then opens them, turns her head to look left, and raises her hand to adjust the suit collar.

Input
Reference image

Prompt: Generate a video with reference images: The man (@Image 1) wearing a suit (@Image 1) and a T-shirt raises both hands to adjust his collar, then slightly lowers his head and closes his eyes in a pure white studio background.

Input
Reference image

Prompt: Generate a video with reference images: The man wearing a navy blue zip-up jacket, white shirt, black suit pants, a belt with a silver "Z" buckle, and black square sunglasses, standing in front of the glass railing of the background (@Image 1), resting his arms on the railing and slightly turning his head to gaze into the distance.

Input
Reference image Reference image

Prompt: Generate a video with reference images: The woman wearing a blue top (@Image 2) and black trousers, grabbing a white bedsheet with both hands and pulling it upwards to smooth it out, standing in the corridor (@Image 1).

Input
Reference image Reference image

Prompt: Generate a video with reference images: The boy (@Image 2) wearing a beanie, sunglasses, a colorful windbreaker jacket, white T-shirt, and distressed jeans, standing indoors in front of a black table (@Image 1), stretching his arm out then dancing by swinging his arms up and down to the rhythm.

Input
Reference image Reference image

Prompt: Generate a video with reference images: The man wearing a Polo shirt (@Image 2), black casual pants, white sneakers, sunglasses, and a watch, striding forward on the lawn (@Image 1) with one hand in his pocket.

Input
Reference image Reference image

Prompt: Generate a video with reference images: The man (@Image 2) wearing a white Polo shirt with dark blue stripes stands in front of the background (@Image 1), starting with his head down, then slowly raising his head to look directly at the camera.

Input
Reference image 1 Reference image 2 Reference image 3

Prompt: Generate a video with reference images: The woman (@Image 2) wearing a pure white cropped lapel jacket, necklace (@Image 3), and long silver tassel earrings, standing in front of the wall (@Image 1), smiling at the camera, raising her hand to adjust her jacket collar, then brushing her side hair.

RefVIE-Bench

Image + Video + Text → Video
Input
Reference image

Prompt: Replace the tower with a twisting, spiral-shaped skyscraper in the style of the Turning Torso. All other parts of the video must remain unchanged.

Input
Reference image

Prompt: Replace the background with a serene anime-style landscape, featuring a massive, fluffy cumulonimbus cloud towering over a windswept grassy hill, ensuring it appears in the same position and pose within the video scene.

Input
Reference image

Prompt: Replace the man's outfit with a matte black techwear tactical jacket. All other parts of the video must remain unchanged.

Input
Reference image

Prompt: Replace the background with a serene anime-style landscape, featuring a massive, fluffy cumulonimbus cloud towering over a windswept grassy hill, ensuring it appears in the same position and pose within the video scene.

Input
Reference image

Prompt: Add a vintage-style teddy bear made of curly mohair fabric in light brown. All other parts of the video must remain unchanged.

Input
Reference image

Prompt: Replace the bus with a massive matte black luxury SUV. Prominent front grille, aggressive LED headlight design, and wide stance. All other parts of the video must remain unchanged.

Input
Reference image

Prompt: Add a Golden Retriever wearing a red bow tie on the sofa. All other parts of the video must remain unchanged.

Input
Reference image

Prompt: Replace the background with a Chinese ink painting, featuring a large golden mountain peak rising above swirling clouds, ensuring it appears in the same position and pose within the video scene.

Input
Reference image

Prompt: Replace the background with a whimsical fairytale landscape, featuring colorful mushroom houses and a floating castle in a magical forest, ensuring it appears in the same position and pose within the video scene.

Input
Reference image

Prompt: Add a pair of cyberpunk visor sunglasses on the girl's face. It must be tracked and integrated realistically and consistently across all frames, without altering any other video content.

Input
Reference image

Prompt: Add an oversized, cozy bean bag chair made of grey chunky knit wool beside the table. All other parts of the video must remain unchanged.

Input
Reference image

Prompt: Replace the woman's hair with a chic, short bob haircut. All other parts of the video must remain unchanged.

OpenVE-Bench

Video + Text → Video
Input

Prompt: Apply the Impressionist aesthetic to this video, ensuring seamless temporal consistency across all frames. The result should emulate the fluid brushstroke techniques and atmospheric focus of 19th-century Impressionist art, with each frame retaining the original motion, character actions, and camera movements.

Input

Prompt: Apply the Cyberpunk animation style to this video, ensuring seamless temporal consistency across all frames. The result should emulate the dynamic, futuristic ambiance of Cyberpunk media, complete with flickering neon signs, rain-slicked streets reflecting holographic light, and a gritty urban backdrop.

Input

Prompt: Apply the Gongbi animation style to this video, ensuring seamless temporal consistency across every frame. The result should mirror the elegance of traditional Chinese ink painting, with fluid brushstroke transitions, consistent color palettes, and meticulous line work.

Input

Prompt: Apply the soft, warm ambiance of dawn to this video, ensuring seamless temporal consistency across all frames. The final output should exude the gentle glow of early morning light, with pink and golden hues permeating each frame.

Input

Prompt: Apply the Cartoon Style to this video, ensuring seamless temporal consistency across all frames. The final output should mimic the aesthetic of classic animated cartoons, complete with dynamic character animations, exaggerated movements, and a vivid color scheme.

Input

Prompt: Replace the man's dark jacket and light shirt with a light blue short-sleeve polo shirt, maintaining the same position and pointing pose within the scene.

Input

Prompt: Add a neatly trimmed dark beard to the man, ensuring he maintains the same position and pose within the scene.

Input

Prompt: Replace the tree with a golden-leaved tree that shimmers softly, ensuring it maintains the same position and pose within the video scene.

Input

Prompt: Replace the background with a dynamic urban rooftop at dusk. The scene should have slowly moving clouds, twinkling city lights, and a gentle breeze causing slight movement in rooftop objects. The subject should remain perfectly still.

Input

Prompt: Overlay an animated colorful kite in the sky above the vehicle, slightly to the right of center. The kite and its tail should flutter naturally in the wind and be tracked relative to the sky as the camera moves. All other parts of the video must remain unchanged.

Input

Prompt: Transform the moon into a giant bioluminescent jellyfish by making its surface translucent with glowing blue and purple hues and adding long, flowing tentacles trailing beneath it, while preserving its position in space.

Input

Prompt: Transform the paddleboarder into a glowing ethereal water guardian figure. Change the water to sparkle with bioluminescent waves and add glowing mythical water creatures swimming nearby. Replace the sun with a large luminous moon casting a mystical glow over the shoreline.

Input

Prompt: Remove the subtitles at the bottom of the video.

Input

Prompt: Remove the subtitles at the top of the video.

Input

Prompt: Remove the metallic pipe with a smooth cylindrical form, evenly spaced rivets, muted gray weathered surface, consistent orientation (open end outward), gentle curvature, and solitary presence from the entire video sequence. The background must be reconstructed with temporal consistency, and all other video content must remain unchanged.

Input

Prompt: Remove the person wearing a blue and black striped sports jersey, black shorts, with a well-groomed beard, short hair, and intricate tattoos on both arms from the entire video. The background must be reconstructed with temporal consistency, and all other video content must remain unchanged.

Intelligent-VBench-TIV2V

Image + Video + Text → Video
Input
Reference image

Prompt: Replace the car moving forward with the matte green vehicle in the image.

Input
Reference image

Prompt: Replace the green t-shirt of the man with the suit in the image.

Input
Reference image

Prompt: Replace the pressure cooker with the red pot in the image.

Input
Reference image

Prompt: Replace the woman with the woman with silver hair in the image.

Input
Reference image

Prompt: Replace the black leather chair with the chair in the image.

Input
Reference image

Prompt: Replace the white fence with the flower-decorated fence in the image.

Input
Reference image

Prompt: Replace the background with the scene shown in the image.

Input
Reference image

Prompt: Replace the background with the scene shown in the image.

Input
Reference image

Prompt: Add the young boy in the image to stand on the red carpet to the left, facing slightly right, with the festival backdrop and crowd visible behind him.

Input
Reference image

Prompt: Add the woman in the image to stand to the left of the man and look at the man.

Input
Reference image

Prompt: Add the snowman in the image to stand on the snow field near the footprints.

Input
Reference image

Prompt: Add the man in the image seated at the table by the pool.

Input
Reference image

Prompt: Replace yellow metal table with the brown table in the image.

Input
Reference image

Prompt: Replace the woman with the silver-haired woman in the image.

Input
Reference image

Prompt: Replace the brown leather sofa with the sofa in the image.

Input
Reference image

Prompt: Replace the woman's outfit with the evening gown in the image.

Intelligent-VBench-MI2V

Image(s) + Text → Video
Input
Reference image

Prompt: The camera slowly pulls away from the glazed pastry in the image, making it gradually smaller as more of the surrounding yellow plate becomes visible.

Input
Reference image

Prompt: The girl in the image sits at an electronic keyboard, fingers moving across the keys as she plays.

Input
Reference image

Prompt: The woman in the image sits in a vehicle, talking to the camera.

Input
Reference image 1 Reference image 2

Prompt: The woman in the first image moves forward over the background in the second image, her expression shifts from a smile to an animated speaking expression.

Input
Reference image 1 Reference image 2

Prompt: The woman in the first image holds the makeup brush in the second image and applies eyeshadow along her upper lash line.

Input
Reference image 1 Reference image 2

Prompt: The woman in the first image's facial expression changes as she talks over the background shown in the second image.

Input
Reference image 1 Reference image 2

Prompt: The man in the first image does push ups over the background shown in the second image, his muscles bending and contracting as he moves.

Input
Reference image 1 Reference image 2

Prompt: The woman in the first image speaks over the background in the second image, her mouth opening and closing repeatedly.

Input
Reference image 1 Reference image 2

Prompt: The dark gray Citroën C3 Aircross SUV in the first image drives slowly on a rugged mountain road over the background shown in the second image.

Input
Reference image 1 Reference image 2

Prompt: The man in the first image slowly lowers the glass in the second image from his mouth, his head lifts slightly, and his gaze looks up.

Input
Reference image 1 Reference image 2

Prompt: The garbage truck in the first image remains stationary with flashing red lights. The man in the second image stands near it, adjusting his posture while watching the vehicle.

Input
Reference image 1 Reference image 2

Prompt: The red panda in the first image sits on a tree branch over the background shown in the second image, chewing on a bundle of green bamboo leaves held in its paws.

Input
Reference image 1 Reference image 2

Prompt: The black bird with a white beak in the first image stands motionless in the shallows among tall reeds over the background shown in the second image, making slight, repetitive head movements.

Input
Reference image

Prompt: The woman in the image's long dark hair flutters across her face due to the wind. She looks down at the man beside her with a somber expression, making minor head adjustments.

Input
Reference image

Prompt: The woman in the image, initially talking with her mouth open, speaks more quietly while lowering her head slightly.

Input
Reference image 1 Reference image 2

Prompt: The right hand grips the metal whisk in the first image and stirs the creamy mixture in the second image within a glass bowl held by the left hand.

VBench

Text → Video

Prompt: Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks.

Prompt: A beautiful coastal beach in spring, waves lapping on sand by Vincent van Gogh

Prompt: Sunset time lapse at the beach with moving clouds and colors in the sky.

Prompt: A cat wearing sunglasses at a pool

Prompt: A cute happy Corgi playing in park, sunset

Prompt: Vampire makeup face of beautiful girl, red contact lenses.

Prompt: Ashtray full of butts on table, smoke flowing on black background, close-up

Prompt: A bigfoot walking in the snowstorm.

Prompt: a motorcycle gliding through a snowy field

Prompt: an airplane soaring through a clear blue sky

Prompt: A jellyfish floating through the ocean, with bioluminescent tentacles

Prompt: An epic tornado attacking above a glowing city at night, the tornado is made of smoke

Prompt: A space shuttle launching into orbit, with flames and smoke billowing out from the engines

Prompt: Clown fish swimming through the coral reef

Prompt: A super cool giant robot in Cyberpunk Beijing

Prompt: a zebra bending down to drink water from a river

Main Results

Results on VBench

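| Model | Params | Imaging Quality | Overall Consistency | Subject Consistency | Average |
|---|---|---|---|---|---|
| Wan 2.2 | 5B | 69.82 | 22.41 | 95.28 | 62.50 |
| UniVideo | 13B | 69.34 | 22.62 | 97.08 | 63.01 |
| OmniWeaving | 8.3B | 61.78 | 22.46 | 94.12 | 59.45 |
| LoomVideo (Stage 3) | 5B | 67.13 | 23.74 | 94.60 | 61.82 |
| LoomVideo (RL) | 5B | 70.92 | 23.59 | 94.93 | 63.15 |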
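The Average column in the VBench results is consistent with an unweighted mean of the three metric columns. This is an inference from the reported numbers rather than something the page states, and the small check below only reproduces that arithmetic.

```python
# (Imaging Quality, Overall Consistency, Subject Consistency, reported Average)
vbench_rows = {
    "Wan 2.2": (69.82, 22.41, 95.28, 62.50),
    "UniVideo": (69.34, 22.62, 97.08, 63.01),
    "OmniWeaving": (61.78, 22.46, 94.12, 59.45),
    "LoomVideo (Stage 3)": (67.13, 23.74, 94.60, 61.82),
    "LoomVideo (RL)": (70.92, 23.59, 94.93, 63.15),
}

computed = {}
for name, (*metrics, reported) in vbench_rows.items():
    # Assumed aggregation: plain mean of the three metrics.
    mean = sum(metrics) / len(metrics)
    computed[name] = mean
    print(f"{name}: computed {mean:.2f}, reported {reported:.2f}")
```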

Results on OpenVE-Bench

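| Type | Model | Params | Global Style | BG Change | Local Change | Local Remove | Local Add | Subtitle Edit | Creative Edit | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Specialized | OmniVideo | 1.3B | 1.11 | 1.18 | 1.14 | 1.14 | 1.36 | 1.00 | 2.26 | 1.31 |
| Specialized | InsViE | 2B | 2.20 | 1.06 | 1.48 | 1.36 | 1.17 | 2.18 | 2.02 | 1.64 |
| Specialized | Ditto | 14B | 4.01 | 1.68 | 2.03 | 1.53 | 1.41 | 2.81 | 1.23 | 2.10 |
| Specialized | OpenVE-Edit | 5B | 3.16 | 2.36 | 2.98 | 1.85 | 2.15 | 2.91 | 2.31 | 2.53 |
| Specialized | Kiwi-Edit | 5B | 3.62 | 2.57 | 3.76 | 3.36 | 2.57 | 2.91 | 3.08 | 3.12 |
| Unified | VACE | 14B | 1.49 | 1.55 | 2.07 | 1.46 | 1.26 | 1.48 | 1.47 | 1.54 |
| Unified | VINO | 13B | 3.95 | 2.39 | 3.51 | 3.20 | 2.68 | 2.65 | 3.01 | 3.07 |
| Unified | UniVideo | 13B | 3.47 | 2.58 | 3.41 | 2.99 | 2.83 | 2.87 | 3.07 | 3.05 |
| Unified | OmniWeaving | 8.3B | 3.68 | 2.16 | 3.78 | 2.68 | 1.83 | 3.48 | 2.80 | 2.92 |
| Unified | LoomVideo (Stage 2) | 5B | 3.81 | 2.46 | 3.04 | 3.33 | 2.21 | 3.64 | 3.54 | 3.15 |
| Unified | LoomVideo (Stage 3) | 5B | 3.62 | 2.26 | 3.32 | 2.82 | 2.40 | 3.30 | 2.86 | 2.94 |
| Unified | LoomVideo (RL) | 5B | 3.85 | 2.37 | 3.41 | 3.12 | 2.19 | 3.42 | 3.23 | 3.05 |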

Results on RefVIE-Bench

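| Model | Identity | Temporal | Physical | Reference Sim | Matting Quality | Video Quality | Overall |
|---|---|---|---|---|---|---|---|
| Closed-Source Models | | | | | | | |
| Runway Aleph | 3.79 | 3.65 | 3.58 | 3.33 | 2.81 | 2.58 | 3.29 |
| Kling-O1 | 4.75 | 4.66 | 4.60 | 3.95 | 3.21 | 2.75 | 3.99 |
| Open-Source Models | | | | | | | |
| Kiwi-Edit (All data) | 3.51 | 2.96 | 2.91 | 3.40 | 2.58 | 2.40 | 2.96 |
| Kiwi-Edit (Ref. data only) | 3.98 | 3.40 | 3.34 | 3.72 | 2.90 | 2.51 | 3.31 |
| VINO | 4.18 | 4.03 | 3.74 | 2.93 | 2.60 | 2.40 | 3.53 |
| UniVideo | 4.19 | 3.80 | 3.61 | 2.90 | 2.22 | 2.12 | 3.38 |
| OmniWeaving | 3.29 | 2.96 | 2.82 | 3.45 | 2.55 | 2.35 | 2.94 |
| LoomVideo (Stage 3) | 4.29 | 3.90 | 2.72 | 3.75 | 2.65 | 2.38 | 3.62 |
| LoomVideo (RL) | 4.50 | 3.98 | 3.90 | 3.88 | 2.90 | 2.50 | 3.78 |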

Results on FashionVideoBench

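Columns are grouped as Split by Metrics (SC, PF, VQ) and Split by Task (Product Edit through MI2V), followed by Overall.

| Model | SC | PF | VQ | Product Edit | Model Edit | Freeform Edit | PRef Edit | MRef Edit | MI2V | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| UniVideo | 4.08 | 4.34 | 4.37 | 3.84 | 4.93 | 4.20 | 4.24 | 4.05 | 4.29 | 4.26 |
| OmniWeaving | 3.28 | 3.71 | 3.70 | 3.67 | 4.04 | 3.49 | 2.95 | 3.49 | 3.72 | 3.56 |
| VINO | 4.18 | 4.51 | 4.45 | 4.02 | 4.83 | 4.27 | 4.27 | 4.43 | 4.45 | 4.38 |
| LoomVideo (Stage 3) | 4.45 | 4.74 | 4.61 | 4.59 | 4.95 | 4.47 | 4.51 | 4.37 | 4.70 | 4.60 |
| LoomVideo (RL) | 4.44 | 4.71 | 4.62 | 4.59 | 4.92 | 4.45 | 4.51 | 4.37 | 4.70 | 4.59 |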

Note: SC = Subject Consistency, PF = Prompt Following, VQ = Video Quality. FashionVideoBench is curated from held-out internal data (strictly excluded from the training set) with 50 test cases per sub-task (300 total). Gemini 2.5 Pro is used as the automated judge.

Inference Speed Comparison

| Model | Params | Source Token Injection | T2V (s) | TV2V (s) | Speedup |
|---|---|---|---|---|---|
| Wan 2.2 | 5B + 5.68B (UMT5-XXL) | - | 138.61 | - | - |
| UniVideo (hidden) | 13B + 7B | Token Concat | 1792.65 | 6140.18 | - |
| OmniWeaving | 8.3B + 7B | Channel Concat | 824.93 | 899.32 | 1.00× |
| VINO | 13B + 4B | Token Concat | 2793.52 | 9555.13 | - |
| LoomVideo | 5B + 8B | Scale-and-Add | 132.23 | 166.30 | 6.24× / 5.41× |

All models are tested on the same GPU type, generating 480×832×97 videos. Speedup (T2V / TV2V) is relative to OmniWeaving.
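The reported speedups can be reproduced from the timing columns. The sketch below uses only the numbers listed in the inference speed comparison; the helper names are illustrative, and the editing-overhead ratio (TV2V time divided by T2V time) is an extra derived quantity that the page does not report.

```python
# Seconds per video, as (T2V, TV2V), taken from the inference speed comparison.
timings = {
    "UniVideo (hidden)": (1792.65, 6140.18),
    "OmniWeaving": (824.93, 899.32),
    "VINO": (2793.52, 9555.13),
    "LoomVideo": (132.23, 166.30),
}


def speedup(baseline_seconds: float, model_seconds: float) -> float:
    """How many times faster the model is than the baseline."""
    return baseline_seconds / model_seconds


base_t2v, base_tv2v = timings["OmniWeaving"]
loom_t2v, loom_tv2v = timings["LoomVideo"]

t2v_speedup = speedup(base_t2v, loom_t2v)
tv2v_speedup = speedup(base_tv2v, loom_tv2v)
print(f"LoomVideo vs OmniWeaving: {t2v_speedup:.2f}x (T2V), {tv2v_speedup:.2f}x (TV2V)")

# Derived quantity: how much slower editing is than generation for each model.
editing_overhead = {name: tv2v / t2v for name, (t2v, tv2v) in timings.items()}
for name, ratio in editing_overhead.items():
    print(f"{name}: TV2V takes {ratio:.2f}x the T2V time")
```

In these numbers, the two token-concatenation models spend well over three times as long on editing as on generation, while the channel-concatenation and Scale-and-Add models stay much closer to their generation time.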

BibTeX

If you found this work useful, please consider citing our paper as follows:

@article{wu2026loomvideo,
  title={LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing},
  author={Wu, Jianzong and Lian, Hao and Yang, Jiongfan and Hao, Dachao and Tian, Ye and Tong, Yunhai and Zhu, Jingyuan and Chen, Biaolong and Qi, Qiaosong and Zhang, Aixi and He, Wanggui and Liu, Mushui and Huang, Pipei and Jiang, Hao},
  journal={arXiv preprint arXiv:2606.06042},
  year={2026}
}