ST-LLM: Large Language Models Are Effective Temporal Learners.- Exact Diffusion Inversion via Bidirectional Integration Approximation.- Textual Query-Driven Mask Transformer for Domain Generalized Segmentation.- EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head.- Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors.- Object-Centric Diffusion for Efficient Video Editing.- Single-Mask Inpainting for Voxel-based Neural Radiance Fields.- McGrids: Monte Carlo-Driven Adaptive Grids for Iso-Surface Extraction.
- Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval.- Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts.- Diffusion for Natural Image Matting.- Agglomerative Token Clustering.- CMD: A Cross Mechanism Domain Adaptation Dataset for 3D Object Detection.- Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning.- ClusteringSDF: Self-Organized Neural Implicit Surfaces for 3D Decomposition.- NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition.
- GIVT: Generative Infinite-Vocabulary Transformers.- Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment.- Regulating Model Reliance on Non-Robust Features by Smoothing Input Marginal Density.- Multi-Modal Video Dialog State Tracking in the Wild.- Factorized Diffusion: Perceptual Illusions by Noise Decomposition.- To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now.- Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions.
- StereoGlue: Joint Feature Matching and Robust Estimation.- Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory.- Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction.- Robust Zero-Shot Crowd Counting and Localization with Adaptive Resolution SAM.