LoomVideo | Project Page

Abstract

Unified video generation and editing models that can interpret interleaved multimodal inputs are a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation doubles the sequence length, which quadruples the computational cost of self-attention and introduces prohibitive overhead. To address these bottlenecks, we present LoomVideo, an efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling the clean source video latent and adding it directly to the noised target latent, this design eliminates the need for token concatenation, substantially reducing computational cost while retaining the ability to perform complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with particularly strong results in e-commerce and fashion generation scenarios. Owing to the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41× acceleration in inference speed compared to models of similar capabilities, paving the way for practical and efficient video foundation models.
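
The quadrupling mentioned in the abstract follows from self-attention's quadratic dependence on sequence length. The arithmetic can be sketched as below. The token count `n_target` is an illustrative assumption, not a figure from LoomVideo, and the helper `attention_pairs` is hypothetical.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key score entries in full self-attention."""
    return num_tokens * num_tokens

# Hypothetical latent token count for the target video (illustrative only).
n_target = 10_000

# Token concatenation appends an equally long source-video sequence.
n_concat = n_target + n_target

# Scale-and-Add keeps the sequence length of the target alone.
n_scale_add = n_target

ratio = attention_pairs(n_concat) / attention_pairs(n_scale_add)
print(f"concat tokens: {n_concat}, scale-and-add tokens: {n_scale_add}")
print(f"attention score entries ratio: {ratio:.1f}x")
```

Only the attention term grows quadratically. Per-token layers such as the feed-forward blocks grow linearly, so the end-to-end cost of concatenation depends on the mix of the two. The reported 5.41× figure is a measured comparison against other models and is not derived from this arithmetic.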

Model Architecture

Overall architecture of LoomVideo. It employs Deepstack injection, Scale-and-Add conditioning, and Negative Temporal RoPE indexing for efficient unified video generation and editing.

LoomVideo is an integrated MLLM + DiT + VAE framework. We replace the T5 text encoder with Qwen3-VL-8B to process interleaved multimodal inputs (text, reference images, and source videos). The VAE compresses visual inputs into latents, and the DiT generates high-fidelity videos from these conditions. Three key designs are introduced:

(1) Deepstack Injection: We extract hidden states from every MLLM layer and inject them into corresponding DiT layers via cross-attention, ensuring deep layer-to-layer semantic alignment and enhancing instruction-following without heavy adapters.
(2) Scale-and-Add Conditioning: We scale the clean source video latent by a timestep-dependent factor and add it directly to the noised target latent. This introduces zero additional tokens and yields at least a 5.41× inference speedup over models of similar capabilities, while still supporting complex non-rigid edits.
(3) Negative Temporal RoPE: Reference images are assigned negative temporal indices (−τ, −2τ, ...) while target frames use positive indices, enabling the model to robustly distinguish references from targets with minimal extra tokens.
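
Deepstack injection (design 1) can be pictured as each DiT layer cross-attending to the hidden states of one MLLM layer. The following is a minimal NumPy sketch under stated assumptions, not the LoomVideo implementation. The layer counts are toy values, the even-spacing `layer_map` is a hypothetical way to pair layers when the two depths differ, and the learned query/key/value projections a real model would use are omitted.

```python
import numpy as np

def cross_attention(x, context):
    """Single-head cross-attention without learned projections (for brevity)."""
    scores = x @ context.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context

def layer_map(num_dit_layers, num_mllm_layers):
    """Assumed pairing: spread the DiT layers evenly over the MLLM depth."""
    span = max(num_dit_layers - 1, 1)
    return [(i * (num_mllm_layers - 1)) // span for i in range(num_dit_layers)]

rng = np.random.default_rng(0)
num_mllm_layers, num_dit_layers = 8, 4  # toy depths, not the real models
dim, num_video_tokens, num_cond_tokens = 16, 6, 5

# One hidden-state matrix per MLLM layer: (num_cond_tokens, dim).
mllm_hidden = [rng.normal(size=(num_cond_tokens, dim))
               for _ in range(num_mllm_layers)]

x = rng.normal(size=(num_video_tokens, dim))  # DiT video tokens
for mllm_idx in layer_map(num_dit_layers, num_mllm_layers):
    # Residual cross-attention onto the paired MLLM layer's features.
    x = x + cross_attention(x, mllm_hidden[mllm_idx])

print(x.shape)
```

The point of the sketch is that shallow DiT layers see shallow MLLM features and deep DiT layers see deep ones, instead of every DiT layer attending to the final MLLM layer only.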
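
Scale-and-Add conditioning (design 2) can be contrasted with token concatenation in a few lines. This is a sketch under assumptions: the page does not give the form of the timestep-dependent factor, so `scale_factor(t) = t` is a placeholder, and the linear noising schedule (t = 1 is pure noise) is likewise assumed. All function names are hypothetical.

```python
import numpy as np

def scale_factor(t: float) -> float:
    """Placeholder timestep-dependent scale; the actual form is not stated."""
    return t

def noise_target(target_clean, noise, t):
    # Assumed linear interpolation: t=0 is the clean latent, t=1 is pure noise.
    return (1.0 - t) * target_clean + t * noise

def scale_and_add(target_clean, source_clean, noise, t):
    # Add the scaled clean source latent elementwise; the shape is unchanged.
    return noise_target(target_clean, noise, t) + scale_factor(t) * source_clean

def concat_conditioning(target_clean, source_clean, noise, t):
    # Baseline: append the source latent along the token axis (length doubles).
    return np.concatenate([noise_target(target_clean, noise, t), source_clean], axis=0)

rng = np.random.default_rng(0)
tokens, channels = 12, 4  # toy latent laid out as (tokens, channels)
target = rng.normal(size=(tokens, channels))
source = rng.normal(size=(tokens, channels))
noise = rng.normal(size=(tokens, channels))

added = scale_and_add(target, source, noise, t=0.7)
concatenated = concat_conditioning(target, source, noise, t=0.7)
print(added.shape, concatenated.shape)
```

The added input keeps the target's token count, while the concatenated input is twice as long, which is where the attention savings come from.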
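
The index assignment in Negative Temporal RoPE (design 3) can be sketched as follows. The stride value `tau = 4` and the choice to start target frames at index 0 are illustrative assumptions, and `temporal_indices` is a hypothetical helper.

```python
def temporal_indices(num_refs: int, num_frames: int, tau: int = 1):
    """Return (reference_indices, target_indices) for the temporal RoPE axis."""
    # Reference images sit at -tau, -2*tau, ... on the temporal axis.
    refs = [-tau * (k + 1) for k in range(num_refs)]
    # Target frames take consecutive indices; starting at 0 is an assumption.
    targets = list(range(num_frames))
    return refs, targets

refs, targets = temporal_indices(num_refs=3, num_frames=5, tau=4)
print("reference indices:", refs)
print("target indices:   ", targets)
```

Because the two index ranges never overlap, the rotary phase of a reference token cannot coincide with that of any target frame, and each reference image costs only its own tokens.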

Qualitative Examples

FashionVideoBench

text_video_to_video / product_edit

Input

Prompt: Remove the large white lace collar from the shirt.

Input

Prompt: Change the white shirt to a red polka-dot top.

Input

Prompt: Replace the person's white slippers with black boots.

Input

Prompt: Remove the vase from the cabinet.

Input

Prompt: Remove the logo from the clothing.

Input

Prompt: Remove the pearl bag from the person's hand.

Input

Prompt: Add a silver necklace around the person's neck.

Input

Prompt: Remove the necklace from the person's neck.

Input

Prompt: Remove the fur collar from the clothing.

Input

Prompt: Add a delicate butterfly brooch to the left shoulder of the jacket.

Input

Prompt: Change the subject's floral dress to a black blazer dress.

Input

Prompt: Hang a small brown leather handbag on the woman's left shoulder.

text_video_to_video / model_edit

Input

Prompt: Replace the person: A man with a tanned complexion, dark hair, smiling and showing his teeth. He is wearing a wide-brimmed woven straw hat with a yellow cord chin strap, and a solid black long-sleeved shirt with a small logo on the upper left sleeve.

Input

Prompt: Replace the person: A woman with short, wavy brown hair, fair skin, and bright red lipstick with drop earrings. She wears a dark green short-sleeved blouse with button-down front and a knee-length black A-line skirt with dark green floral pattern, carrying a small structured black handbag.

Input

Prompt: Replace the person: A young man with fair skin and styled black hair, wearing dark sunglasses with thick black frames. He is dressed in a black leather jacket over a plain white t-shirt, with white stripes on the sleeves, black pants with a western-style silver buckle belt, a silver geometric pendant necklace, and holding the strap of a quilted black bag.

Input

Prompt: Replace the person: A woman with long, wavy dark brown hair, pearl stud earrings and a pearl necklace. She wears an off-white blouse with vertical pintuck pleating and scalloped lace collar, tucked into medium-blue flared denim jeans, carrying a book under her right arm.

Input

Prompt: Replace the person: A woman with fair skin and brown hair styled in a high textured bun with braided elements. She wears a solid black zip-up vest over a plain white long-sleeved undershirt, with a stand-up collar and subtle zippered pockets, displaying a bright smile.

Input

Prompt: Replace the person: A highly muscular man with short brown hair and light stubble. He wears a tight-fitting olive green short-sleeved t-shirt and tan cargo shorts, equipped with an olive green tactical chest rig with black adjustable straps and buckles, and a black wristwatch.

Input

Prompt: Replace the person: A woman with long straight dark hair wearing dark sunglasses. She is dressed in a dark olive green quilted coat with diamond-stitched pattern, a fitted white turtleneck, off-white flared pants, and pointy-toed white boots with dark soles.

Input

Prompt: Replace the person: A man with short dark hair and stubble, wearing aviator-style sunglasses with gold frames and brown lenses. He is dressed in a long-sleeved burgundy V-neck cardigan with buttons, ribbed cuffs, two square patch pockets, a beige undershirt visible beneath, and standard-fit dark blue denim jeans.

freeform_edit

Input

Prompt: Edit this video: The video features the same man as the original video, who has short, graying hair, a beard, and wears brown-rimmed glasses. He is wearing the same white short-sleeved t-shirt with a rectangular striped graphic and a small sailboat design as the original video, along with the same dark blue cargo shorts and the same silver watch on his left wrist as the original video. The background is an outdoor urban setting with dark walls, deep shadows, and a paved street surface featuring a solid white line. The camera angle is a medium shot that remains static as the man, initially facing away with his hands in his pockets, turns around to face the camera.

Input

Prompt: Edit this video: The video features the same woman as the original video, wearing a light blue short-sleeved polo shirt with white stripes on the collar and cuffs, paired with a dark brown skirt. She has long brown hair and is wearing small stud earrings. The background shows a bright room with white walls and a stack of books on a surface to the left. The same man as the original video walks across the background from right to left, wearing a matching light blue polo shirt and brown pants. The camera remains static throughout the video, capturing the scene in a medium shot.

Input

Prompt: Edit this video: The video features the same man and woman as the original video, wearing the same blue denim chef coats and aprons. They are standing side-by-side in the same modern kitchen background as the original video. The camera remains static throughout the video, capturing them from the waist up in a medium shot. The man stands on the left, initially adjusting his apron and then resting his hands near his pockets. The woman stands on the right, starting with her hands clasped in front of her. She then raises her right arm, bending it at the elbow, and rests her left hand on her right arm. Both individuals look towards the camera and then slightly off to the side.

Input

Prompt: Edit this video: The video features the same woman as the original video, walking forward on an outdoor sidewalk. She is wearing the same light blue long-sleeved blouse with dark blue trim along the collar, cuffs, and a large bow tie at the neck as the original video. She pairs this top with black trousers and carries a red quilted handbag in her left hand. Her dark hair is pulled back, and she wears red lipstick, a delicate necklace, rings, and the same dangling earrings and watch as the original video. The background consists of a building exterior with a wooden door on the left, a white wall with blue signs, and a large white cylindrical structure overhead, with a blurred street and greenery visible behind her. The camera captures her in a medium shot, tracking backward to keep her centered in the frame as she walks towards the lens.

text_video_product_image_to_video

Input

Prompt: Replace the black ankle boots with white high-top sneakers

Input

Prompt: Replace the handbag with a tote bag.

Input

Prompt: Change the color of the shoes to navy blue with white soles

Input

Prompt: Change the white platform sneakers on feet to black leather loafers

Input

Prompt: Replace the baseball cap with a hat.

Input

Prompt: Replace the straw hat with a hat.

Input

Prompt: Change the blue gradient knit sneakers to red leather sneakers.

Input

Prompt: Replace the baseball cap with a hat.

Input

Prompt: Replace the black sneakers worn on the feet with red running shoes

Input

Prompt: Replace the floral embroidered boots with classic black leather ankle boots

Input

Prompt: Change the straw sun hat to a hat.

Input

Prompt: Change the boots to classic black leather shoes