General Multimodal Learning
• Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities (arXiv:2401.14405)
• CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (arXiv:2406.18521)
• xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations (arXiv:2408.12590)
• Law of Vision Representation in MLLMs (arXiv:2408.16357)
• CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500)
• Building and Better Understanding Vision-Language Models: Insights and Future Directions (arXiv:2408.12637)
• Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution (arXiv:2409.12961)
• Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models (arXiv:2410.02740)
• Video Instruction Tuning With Synthetic Data (arXiv:2410.02713)
• AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark (arXiv:2410.03051)
• Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models (arXiv:2410.03290)
• A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation (arXiv:2410.01912)
• MIO: A Foundation Model on Multimodal Tokens (arXiv:2409.17692)
• Emu3: Next-Token Prediction is All You Need (arXiv:2409.18869)
• MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (arXiv:2409.20566)
• Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (arXiv:2410.13848)
• Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (arXiv:2408.15998)
• VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos (arXiv:2411.04923)
• VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval (arXiv:2412.01558)
• AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? (arXiv:2412.02611)
• VisionZip: Longer is Better but Not Necessary in Vision Language Models (arXiv:2412.04467)
• Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (arXiv:2412.04424)
• Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions (arXiv:2412.08737)
• Multimodal Latent Language Modeling with Next-Token Diffusion (arXiv:2412.08635)
• OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation (arXiv:2412.09585)
• InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (arXiv:2412.09283)
• MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval (arXiv:2412.14475)
• LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks (arXiv:2412.15204)
• Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception (arXiv:2412.14233)
• ABC: Achieving Better Control of Multimodal Embeddings using VLMs (arXiv:2503.00329)
• Scaling Vision Pre-Training to 4K Resolution (arXiv:2503.19903)
• Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model (arXiv:2504.08685)
• Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models (arXiv:2504.17789)
• OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding (arXiv:2507.07984)
• Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology (arXiv:2507.07999)
• AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (arXiv:2507.12841)