GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation.- IRGen: Generative Modeling for Image Retrieval.- Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality.- FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos.- A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting.- VISA: Reasoning Video Object Segmentation via Large Language Model.- Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models.- IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation.
- Scaling Backwards: Minimal Synthetic Pre-training?.- BAMM: Bidirectional Autoregressive Motion Model.- Event-based Head Pose Estimation: Benchmark and Method.- Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos.- Towards Multi-modal Transformers in Federated Learning.- Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning.- QueryCDR: Query-based Controllable Distortion Rectification Network for Fisheye Images.- Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics.
- DCDM: Diffusion-Conditioned-Diffusion Model for Scene Text Image Super-Resolution.- Do not move together: per-Gaussian Deformation for 4DGS.- DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion.- CoLA: Conditional Dropout and Language-driven Robust Dual-modal Salient Object Detection.- Image-Feature Weak-to-Strong Consistency: An Enhanced Paradigm for Semi-Supervised Learning.- RPBG: Towards Robust Neural Point-based Graphics in the Wild.- GaussReg: Fast 3D Registration with Gaussian Splatting.- Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators.
- Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation.- IAM-VFI: Interpolate Any Motion for Video Frame Interpolation with motion complexity map.- TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data.