General Multimodal Learning
• Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities (arXiv:2401.14405)
• CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (arXiv:2406.18521)
• xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations (arXiv:2408.12590)
• Law of Vision Representation in MLLMs (arXiv:2408.16357)
• CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500)
• Building and Better Understanding Vision-Language Models: Insights and Future Directions (arXiv:2408.12637)
• Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution (arXiv:2409.12961)
• Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models (arXiv:2410.02740)
• Video Instruction Tuning With Synthetic Data (arXiv:2410.02713)
• AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark (arXiv:2410.03051)
• Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models (arXiv:2410.03290)
• A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation (arXiv:2410.01912)
• MIO: A Foundation Model on Multimodal Tokens (arXiv:2409.17692)
• Emu3: Next-Token Prediction is All You Need (arXiv:2409.18869)
• MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (arXiv:2409.20566)
• Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (arXiv:2410.13848)
• Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (arXiv:2408.15998)
• VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos (arXiv:2411.04923)
• VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval (arXiv:2412.01558)
• AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? (arXiv:2412.02611)
• VisionZip: Longer is Better but Not Necessary in Vision Language Models (arXiv:2412.04467)
• Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (arXiv:2412.04424)
• Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions (arXiv:2412.08737)
• Multimodal Latent Language Modeling with Next-Token Diffusion (arXiv:2412.08635)
• OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation (arXiv:2412.09585)
• InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (arXiv:2412.09283)
• MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval (arXiv:2412.14475)
• LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks (arXiv:2412.15204)
• Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception (arXiv:2412.14233)
• ABC: Achieving Better Control of Multimodal Embeddings using VLMs (arXiv:2503.00329)
• Scaling Vision Pre-Training to 4K Resolution (arXiv:2503.19903)
• Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model (arXiv:2504.08685)
• Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models (arXiv:2504.17789)
• OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding (arXiv:2507.07984)
• Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology (arXiv:2507.07999)
• AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning (arXiv:2507.12841)