Modeling and Driving Human Body Soundfields through Acoustic Primitives
m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
High-Fidelity 3D Textured Shapes Generation by Sparse Encoding and Adversarial Decoding
Semi-Supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization
I-MedSAM: Implicit Medical Image Segmentation with Segment Anything
ReMamber: Referring Image Segmentation with Mamba Twister
TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Segmentation-guided Layer-wise Image Vectorization with Gradient Fills
Implicit Style-Content Separation using B-LoRA
OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
ActionVOS: Actions as Prompts for Video Object Segmentation
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
U-COPE: Taking a Further Step to Universal 9D Category-level Object Pose Estimation
Integrating Markov Blanket Discovery into Causal Representation Learning for Domain Generalization
Rotary Position Embedding for Vision Transformer
Local All-Pair Correspondence for Point Tracking
MonoWAD: Weather-Adaptive Diffusion Model for Robust Monocular 3D Object Detection
ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis
ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos
Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos
PQ-SAM: Post-training Quantization for Segment Anything Model
CPM: Class-conditional Prompting Machine for Audio-visual Segmentation
Optimizing Factorized Encoder Models: Time and Memory Reduction for Scalable and Efficient Action Recognition
DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment