SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature Paper • 2601.10108 • Published 11 days ago • 7
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance Paper • 2512.08765 • Published Dec 9, 2025 • 132
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning Paper • 2509.20360 • Published Sep 24, 2025 • 18
DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation Paper • 2307.01831 • Published Jul 4, 2023 • 8
Mask-Attention-Free Transformer for 3D Instance Segmentation Paper • 2309.01692 • Published Sep 4, 2023 • 1
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models Paper • 2403.18814 • Published Mar 27, 2024 • 47
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning Paper • 2507.12841 • Published Jul 17, 2025 • 42
Training-Free Efficient Video Generation via Dynamic Token Carving Paper • 2505.16864 • Published May 22, 2025 • 24
Wan: Open and Advanced Large-Scale Video Generative Models Paper • 2503.20314 • Published Mar 26, 2025 • 57
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published Mar 16, 2025 • 35
IterPref: Focal Preference Learning for Code Generation via Iterative Debugging Paper • 2503.02783 • Published Mar 4, 2025 • 7
Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code Paper • 2310.01506 • Published Oct 2, 2023
RL-GPT: Integrating Reinforcement Learning and Code-as-policy Paper • 2402.19299 • Published Feb 29, 2024 • 2
Multi-modal Cooking Workflow Construction for Food Recipes Paper • 2008.09151 • Published Aug 20, 2020 • 1
Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers Paper • 2501.03931 • Published Jan 7, 2025 • 15