RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything

Advanced by transformer architecture, vision foundation models (VFMs) achieve remarkable progress in performance and generalization ability. Segment Anything Model (SAM) is one remarkable model that can achieve generalized segmentation. However, most …

Towards language-driven video inpainting via multimodal large language models

We introduce a new task -- language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled …

Mitigating Semantic Confusion from Hostile Neighborhood for Graph Active Learning

Graph Active Learning (GAL), which aims to find the most informative nodes in a graph for annotation so as to maximize the performance of Graph Neural Networks (GNNs), has attracted many research efforts but still poses non-trivial challenges. One major challenge …

Multiple Connectivity Views for Session-based Recommendation

Session-based recommendation (SBR), which recommends the next item based on previous anonymous actions, has drawn increasing attention. The last decade has seen multiple deep learning-based modeling choices successfully applied to SBR, …

Multi-task learning with multi-query transformer for dense prediction

Previous multi-task dense prediction studies developed complex pipelines such as multi-modal distillations in multiple stages or searching for task relational contexts for each task. The core insight behind these methods is to maximize the mutual …

Betrayed by captions: Joint caption grounding and generation for open vocabulary instance segmentation

In this work, we focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories. Previous approaches have relied on massive caption datasets and complex pipelines to establish …

Panoptic-partformer: Learning a unified model for panoptic part segmentation

Panoptic Part Segmentation (PPS) aims to unify panoptic segmentation and part segmentation into one task. Previous work mainly uses separate approaches to handle thing, stuff, and part predictions individually without performing any shared …

Polyphonicformer: Unified query learning for depth-aware video panoptic segmentation

Depth-aware Video Panoptic Segmentation (DVPS) is a new, challenging vision problem that aims to predict panoptic segmentation and depth in a video simultaneously. The previous work solves this task by extending the existing panoptic segmentation …

Fashionformer: A simple, effective and unified baseline for human fashion segmentation and recognition

Human fashion understanding is a crucial computer vision task since it provides comprehensive information for real-world applications. This work focuses on joint human fashion segmentation and attribute recognition. Contrary to the previous works that …

Query Learning of Both Thing and Stuff for Panoptic Segmentation

Starting from DETR, query-based detection and segmentation methods achieve results comparable to previous works with a simplified and elegant pipeline. In this work, a novel, simple and unified baseline, named QueryPanSeg, is proposed for panoptic …