Multimodal Dataset
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers • arXiv:2407.09413
MAVIS: Mathematical Visual Instruction Tuning • arXiv:2407.08739
Kvasir-VQA: A Text-Image Pair GI Tract Dataset • arXiv:2409.01437
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct • arXiv:2409.05840
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning • arXiv:2409.12568
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions • arXiv:2410.10816
Personalized Visual Instruction Tuning • arXiv:2410.07113
Harnessing Webpage UIs for Text-Rich Visual Understanding • arXiv:2410.13824
EMMA: End-to-End Multimodal Model for Autonomous Driving • arXiv:2410.23262
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions • arXiv:2411.07461
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation • arXiv:2411.08380
LLaVA-o1: Let Vision Language Models Reason Step-by-Step • arXiv:2411.10440
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection • arXiv:2411.14794
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format • arXiv:2411.17991
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation • arXiv:2411.18499
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation • arXiv:2412.00927
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale • arXiv:2412.05237
CompCap: Improving Multimodal Large Language Models with Composite Captions • arXiv:2412.05243
Maya: An Instruction Finetuned Multilingual Multimodal Model • arXiv:2412.07112
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation • arXiv:2412.07147
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption • arXiv:2412.09283
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding • arXiv:2412.17295
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search • arXiv:2412.18319
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining • arXiv:2501.00958
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics • arXiv:2501.04686
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature • arXiv:2501.07171
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks • arXiv:2501.08326
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation • arXiv:2502.08168
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm • arXiv:2502.12513
Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions • arXiv:2503.00501
Unified Reward Model for Multimodal Understanding and Generation • arXiv:2503.05236
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning • arXiv:2503.07002
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia • arXiv:2503.07920
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing • arXiv:2503.10639
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning • arXiv:2503.10291
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search • arXiv:2503.10582
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions • arXiv:2503.13369
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization • arXiv:2503.10615
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models • arXiv:2502.16033
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL • arXiv:2503.07536
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems • arXiv:2503.16549
ViLBench: A Suite for Vision-Language Process Reward Modeling • arXiv:2503.20271
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks • arXiv:2503.21696
LiveVQA: Live Visual Knowledge Seeking • arXiv:2504.05288
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding • arXiv:2504.09925
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning • arXiv:2504.09641
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning • arXiv:2504.09081
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling • arXiv:2504.13169
FocusedAD: Character-centric Movie Audio Description • arXiv:2504.12157
MM-IFEngine: Towards Multimodal Instruction Following • arXiv:2504.07957
TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving • arXiv:2504.15780
RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning • arXiv:2504.18904
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks • arXiv:2412.04626
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant • arXiv:2505.05467
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset • arXiv:2505.09568
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward • arXiv:2505.17018
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models • arXiv:2505.16854
GRIT: Teaching MLLMs to Think with Images • arXiv:2505.15879
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis • arXiv:2506.02096
MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs • arXiv:2506.01674
MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos • arXiv:2506.12623
Sekai: A Video Dataset towards World Exploration • arXiv:2506.15675
Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset • arXiv:2506.18851
MMSearch-R1: Incentivizing LMMs to Search • arXiv:2506.20670
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset • arXiv:2507.03483
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation • arXiv:2507.09862
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning • arXiv:2507.16746
GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset • arXiv:2507.21033
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning • arXiv:2508.10433
VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models • arXiv:2508.09945
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling • arXiv:2509.12201
InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts • arXiv:2509.10813
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits • arXiv:2509.11362
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing • arXiv:2509.24900
HoneyBee: Data Recipes for Vision-Language Reasoners • arXiv:2510.12225
Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering • arXiv:2510.14605
Train a Unified Multimodal Data Quality Classifier with Synthetic Data • arXiv:2510.15162
FineVision: Open Data Is All You Need • arXiv:2510.17269
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning • arXiv:2510.27492
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models • arXiv:2511.01618
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation • arXiv:2511.11002
MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model • arXiv:2511.11407
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset • arXiv:2511.15186
AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser • arXiv:2511.16397
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence • arXiv:2512.16793
MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation • arXiv:2512.10945
Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding • arXiv:2512.21643
Action100M: A Large-scale Video Action Dataset • arXiv:2601.10592