VLM Papers
• CompCap: Improving Multimodal Large Language Models with Composite Captions (arXiv:2412.05243)
• GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis (arXiv:2412.06089)
• SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation (arXiv:2412.05818)
• FLAIR: VLM with Fine-grained Language-informed Image Representations (arXiv:2412.03561)
• Active Data Curation Effectively Distills Large-Scale Multimodal Models (arXiv:2411.18674)
• COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training (arXiv:2412.01814)
• CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions (arXiv:2411.16828)
• FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training (arXiv:2411.11927)
• LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations (arXiv:2412.08580)
• V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding (arXiv:2412.09616)
• InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (arXiv:2412.09283)
• jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images (arXiv:2412.08802)
• ColPali: Efficient Document Retrieval with Vision Language Models (arXiv:2407.01449)
• HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models (arXiv:2412.20622)