Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities Paper • 2401.14405 • Published Jan 25, 2024 • 13
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs Paper • 2406.18521 • Published Jun 26, 2024 • 31
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations Paper • 2408.12590 • Published Aug 22, 2024 • 35
CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published Aug 29, 2024 • 58
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 134
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution Paper • 2409.12961 • Published Sep 19, 2024 • 25
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • 2410.02740 • Published Oct 3, 2024 • 54
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark Paper • 2410.03051 • Published Oct 4, 2024 • 6
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models Paper • 2410.03290 • Published Oct 4, 2024 • 7
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation Paper • 2410.01912 • Published Oct 2, 2024 • 14
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper • 2409.20566 • Published Sep 30, 2024 • 54
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper • 2410.13848 • Published Oct 17, 2024 • 37
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders Paper • 2408.15998 • Published Aug 28, 2024 • 89
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos Paper • 2411.04923 • Published Nov 7, 2024 • 23
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval Paper • 2412.01558 • Published Dec 2, 2024 • 4
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? Paper • 2412.02611 • Published Dec 3, 2024 • 25
VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper • 2412.04467 • Published Dec 5, 2024 • 119
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper • 2412.04424 • Published Dec 5, 2024 • 63
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions Paper • 2412.08737 • Published Dec 11, 2024 • 54
Multimodal Latent Language Modeling with Next-Token Diffusion Paper • 2412.08635 • Published Dec 11, 2024 • 50
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation Paper • 2412.09585 • Published Dec 12, 2024 • 11
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption Paper • 2412.09283 • Published Dec 12, 2024 • 19
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval Paper • 2412.14475 • Published Dec 19, 2024 • 59
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks Paper • 2412.15204 • Published Dec 19, 2024 • 39
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception Paper • 2412.14233 • Published Dec 18, 2024 • 7
ABC: Achieving Better Control of Multimodal Embeddings using VLMs Paper • 2503.00329 • Published Mar 1, 2025 • 20
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model Paper • 2504.08685 • Published Apr 11, 2025 • 130
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models Paper • 2504.17789 • Published Apr 24, 2025 • 23
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding Paper • 2507.07984 • Published Jul 10, 2025 • 43
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Paper • 2507.07999 • Published Jul 10, 2025 • 51
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning Paper • 2507.12841 • Published Jul 17, 2025 • 43
Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence Paper • 2606.15932 • Published 11 days ago • 33
PhysiFormer: Learning to Simulate Mechanics in World Space Paper • 2606.27364 • Published 2 days ago • 3