Walker: Self-supervised Multiple Object Tracking by Walking on Temporal Object Appearance Graphs.- Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition.- DiffiT: Diffusion Vision Transformers for Image Generation.- WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation.- GPSFormer: A Global Perception and Local Structure Fitting-based Transformer for Point Cloud Understanding.- FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis.- FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection.- SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs.
- ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities.- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?.- See and Think: Embodied Agent in Virtual Environment.- PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects.- Bridging the Gap Between Human Motion and Action Semantics via Kinematics Phrases.- VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding.- Masked Angle-Aware Autoencoder for Remote Sensing Images.- Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm.
- MultiGen: Zero-shot Image Generation from Multi-modal Prompts.- GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths.- Learning Chain of Counterfactual Thought for Bias-Robust Vision-Language Reasoning.- SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis.- Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets.- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition.- Elegantly Written: Disentangling Writer and Character Styles for Enhancing Online Chinese Handwriting.- UniCode: Learning a Unified Codebook for Multimodal Large Language Models.
- When Do We Not Need Larger Vision Models?.- GVGEN: Text-to-3D Generation with Volumetric Representation.- Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model.