MLLM
• TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones (arXiv:2312.16862, 31 upvotes)
• Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (arXiv:2312.17172, 30 upvotes)
• Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers (arXiv:2401.01974, 7 upvotes)
• From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations (arXiv:2401.01885, 28 upvotes)
• Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (arXiv:2401.01335, 69 upvotes)
• Improving Text Embeddings with Large Language Models (arXiv:2401.00368, 82 upvotes)
• Distilling Vision-Language Models on Millions of Videos (arXiv:2401.06129, 18 upvotes)
• Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk (arXiv:2401.05033, 18 upvotes)
• LEGO: Language Enhanced Multi-modal Grounding Model (arXiv:2401.06071, 12 upvotes)
• Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding (arXiv:2401.04575, 18 upvotes)
• Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers (arXiv:2401.04695, 13 upvotes)
• Mixtral of Experts (arXiv:2401.04088, 160 upvotes)
• Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively (arXiv:2401.02955, 23 upvotes)
• Understanding LLMs: A Comprehensive Overview from Training to Inference (arXiv:2401.02038, 65 upvotes)
• Can Large Language Models Understand Context? (arXiv:2402.00858, 24 upvotes)
• StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis (arXiv:2401.17093, 20 upvotes)
• InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (arXiv:2401.16420, 55 upvotes)
• MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (arXiv:2401.15947, 53 upvotes)
• Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization (arXiv:2401.15914, 7 upvotes)
• MM-LLMs: Recent Advances in MultiModal Large Language Models (arXiv:2401.13601, 47 upvotes)
• Small Language Model Meets with Reinforced Vision Vocabulary (arXiv:2401.12503, 32 upvotes)
• Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment (arXiv:2401.12474, 36 upvotes)
• Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text (arXiv:2401.12070, 45 upvotes)
• SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities (arXiv:2401.12168, 29 upvotes)
• MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (arXiv:2403.09611, 129 upvotes)
• DeepSeek-VL: Towards Real-World Vision-Language Understanding (arXiv:2403.05525, 49 upvotes)
• Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (arXiv:2403.04132, 40 upvotes)
• FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models (arXiv:2402.10986, 82 upvotes)
• Linear Transformers with Learnable Kernel Functions are Better In-Context Models (arXiv:2402.10644, 81 upvotes)
• TravelPlanner: A Benchmark for Real-World Planning with Language Agents (arXiv:2402.01622, 38 upvotes)
• LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement (arXiv:2403.15042, 27 upvotes)
• When Do We Not Need Larger Vision Models? (arXiv:2403.13043, 26 upvotes)
• OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (arXiv:2404.07972, 52 upvotes)
• Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (arXiv:2404.07973, 32 upvotes)
• BRAVE: Broadening the visual encoding of vision-language models (arXiv:2404.07204, 19 upvotes)
• SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation (arXiv:2404.14396, 19 upvotes)
• PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation (arXiv:2404.13026, 24 upvotes)
• Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (arXiv:2404.12387, 40 upvotes)
• BLINK: Multimodal Large Language Models Can See but Not Perceive (arXiv:2404.12390, 26 upvotes)
• Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation (arXiv:2404.19752, 24 upvotes)
• SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension (arXiv:2404.16790, 10 upvotes)
• Many-Shot In-Context Learning in Multimodal Foundation Models (arXiv:2405.09798, 32 upvotes)
• ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325, 74 upvotes)
• Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models (arXiv:2406.09403, 23 upvotes)
• Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning (arXiv:2406.06469, 29 upvotes)
• Mixture-of-Agents Enhances Large Language Model Capabilities (arXiv:2406.04692, 59 upvotes)
• GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities (arXiv:2406.11768, 24 upvotes)
• OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding (arXiv:2406.19389, 54 upvotes)
• SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation (arXiv:2406.19215, 32 upvotes)
• Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning (arXiv:2406.15334, 9 upvotes)
• InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (arXiv:2407.03320, 94 upvotes)
• FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs (arXiv:2407.04051, 40 upvotes)
• HEMM: Holistic Evaluation of Multimodal Foundation Models (arXiv:2407.03418, 12 upvotes)
• Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models (arXiv:2407.01906, 46 upvotes)
• VITA: Towards Open-Source Interactive Omni Multimodal LLM (arXiv:2408.05211, 50 upvotes)
• Task-oriented Sequential Grounding in 3D Scenes (arXiv:2408.04034, 8 upvotes)
• Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (arXiv:2408.12528, 51 upvotes)
• Law of Vision Representation in MLLMs (arXiv:2408.16357, 95 upvotes)
• CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500, 57 upvotes)
• Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (arXiv:2408.15998, 86 upvotes)
• LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation (arXiv:2408.15881, 21 upvotes)
• Building and better understanding vision-language models: insights and future directions (arXiv:2408.12637, 133 upvotes)
• OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs (arXiv:2409.05152, 32 upvotes)
• LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture (arXiv:2409.02889, 54 upvotes)
• OLMoE: Open Mixture-of-Experts Language Models (arXiv:2409.02060, 80 upvotes)
• Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (arXiv:2408.16725, 53 upvotes)
• MIO: A Foundation Model on Multimodal Tokens (arXiv:2409.17692, 53 upvotes)
• Aria: An Open Multimodal Native Mixture-of-Experts Model (arXiv:2410.05993, 111 upvotes)
• MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (arXiv:2410.19168, 24 upvotes)