ColPali: Efficient Document Retrieval with Vision Language Models (Paper • 2407.01449 • Published Jun 27, 2024)
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images (Paper • 2412.08802 • Published Dec 11, 2024)
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (Paper • 2412.09283 • Published Dec 12, 2024)