SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning Paper • 2606.10804 • Published 16 days ago • 49
Preference Learning Unlocks LLMs' Psycho-Counseling Skills Paper • 2502.19731 • Published Feb 27, 2025 • 8
OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains Paper • 2606.14702 • Published 13 days ago • 31
OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources Paper • 2605.29250 • Published 28 days ago • 78
Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini Paper • 2605.27295 • Published 30 days ago • 23
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV Paper • 2605.26244 • Published about 1 month ago • 38
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding Paper • 2605.27365 • Published 30 days ago • 144
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding Paper • 2605.09874 • Published May 11 • 2
jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition Paper • 2605.08384 • Published May 8 • 11
jina-embeddings-v5-omni Collection Multimodal (text + image + video + audio) embedding models aligned with jina-embeddings-v5-text-*. Two sizes, four task variants each. • 27 items • Updated May 12 • 36
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models Paper • 2605.08735 • Published May 9 • 71
Article Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents nvidia • Apr 28 • 62
VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph Paper • 2602.12735 • Published Feb 13 • 8
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM Paper • 2509.21990 • Published Sep 26, 2025 • 1