1

RecTok: Reconstruction Distillation along Rectified Flow

Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between …

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically …

Direct Preference Optimization for LLM-Enhanced Recommendation Systems

The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are …

Vmoba: Mixture-of-block attention for video diffusion models

The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are …

Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow …

Conditional panoramic image generation via masked autoregressive modeling

Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to …

Mmada: Multimodal large diffusion language models

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is …

Training-free heterogeneous graph condensation via data selection

Efficient training of large-scale heterogeneous graphs is of paramount importance in real-world applications. However, existing approaches typically explore simplified models to mitigate resource and time overhead, neglecting the crucial aspect of …

Towards scalable and deep graph neural networks via noise masking

In recent years, Graph Neural Networks (GNNs) have achieved remarkable success in many graph mining tasks. However, scaling them to large graphs is challenging due to the high computational and storage costs of repeated feature propagation and …

Decouple and track: Benchmarking and improving video diffusion transformers for motion transfer

The motion transfer task aims to transfer motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention …