Xiangtai Li

Latest

RecTok: Reconstruction Distillation along Rectified Flow
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
CyberV: Cybernetics for test-time scaling in video understanding
Mixed-R1: Unified reward perspective for reasoning capability in multimodal large language models
Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model
Conditional panoramic image generation via masked autoregressive modeling
Decouple and track: Benchmarking and improving video diffusion transformers for motion transfer
Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
DreamRelation: Bridging Customization and Relation Generation
You Can't Ignore Either: Unifying Structure and Feature Denoising for Robust Graph Learning
RLRF4Rec: Reinforcement Learning from Recsys Feedback for Enhanced Recommendation Reranking
LLAVADI: What Matters For Multimodal Large Language Models Distillation
MotionBooth: Motion-aware customized text-to-video generation
SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow
VG4D: Vision-Language Model Goes 4D Video Recognition
Explore In-Context Segmentation via Latent Diffusion Models
Towards robust referring image segmentation
SFNet: Faster and accurate semantic segmentation via semantic flow
Towards open vocabulary learning: A survey
RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything
Towards language-driven video inpainting via multimodal large language models
DST-Det: Simple dynamic self-training for open-vocabulary object detection
Multi-task learning with multi-query transformer for dense prediction
Betrayed by captions: Joint caption grounding and generation for open vocabulary instance segmentation
PanopticPartFormer++: A unified and decoupled view for panoptic part segmentation
Convolution-enhanced evolving attention networks
TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers
Panoptic-PartFormer: Learning a unified model for panoptic part segmentation
PolyphonicFormer: Unified query learning for depth-aware video panoptic segmentation
Fashionformer: A simple, effective and unified baseline for human fashion segmentation and recognition
Query Learning of Both Thing and Stuff for Panoptic Segmentation
Improving Video Instance Segmentation via Temporal Pyramid Routing
Video K-Net: A simple, strong, and unified baseline for video segmentation
BoundarySqueeze: Image Segmentation as Boundary Squeezing
End-to-end video object detection with spatial-temporal transformers
Dynamic Dual Sampling Module For Fine-Grained Semantic Segmentation
Fast and accurate scene parsing via bi-direction alignment networks
Global aggregation then local distribution for scene parsing
Towards efficient scene understanding via squeeze reasoning
PointFlow: Flowing Semantics Through Points for Aerial Image Segmentation
Enhanced boundary learning for glass-like object segmentation
Improving semantic segmentation via decoupled body and edge supervision
Semantic flow for fast and accurate scene parsing
Gated fully fusion for semantic segmentation
Global aggregation then local distribution in fully convolutional networks
Dual graph convolutional network for semantic segmentation
Flow2Seg: Motion-aided semantic segmentation