AlpachinoNLP/QTSplus-LLaVA-Video-7B-Qwen2
Image-Text-to-Text
Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models