VLM Papers
• CompCap: Improving Multimodal Large Language Models with Composite Captions (arXiv:2412.05243)
• GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis (arXiv:2412.06089)
• SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation (arXiv:2412.05818)
• FLAIR: VLM with Fine-grained Language-informed Image Representations (arXiv:2412.03561)
• Active Data Curation Effectively Distills Large-Scale Multimodal Models (arXiv:2411.18674)
• COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training (arXiv:2412.01814)
• CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions (arXiv:2411.16828)
• FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training (arXiv:2411.11927)
• LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations (arXiv:2412.08580)
• V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding (arXiv:2412.09616)
• InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (arXiv:2412.09283)
• jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images (arXiv:2412.08802)
• ColPali: Efficient Document Retrieval with Vision Language Models (arXiv:2407.01449)
• HALLUCINOGEN: A Benchmark for Evaluating Object Hallucination in Large Visual-Language Models (arXiv:2412.20622)