---
title: ASID-Caption
emoji: 🦉
colorFrom: indigo
colorTo: gray
sdk: static
pinned: false
---
# ASID-Caption

We build **ASID-Caption**, a data-and-model suite for **fine-grained audiovisual video understanding**.

Our goal is to move beyond "one video → one generic caption" by providing **attribute-structured supervision** and **quality-verified annotations**, enabling models to produce **more complete, more controllable, and more temporally consistent** descriptions that cover both **visual content** and **audio cues**.
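To make "attribute-structured" concrete, here is a minimal sketch of what one annotation record might look like. The attribute names and field layout are illustrative assumptions, not the released ASID-1M schema:

```python
# Hypothetical attribute-structured annotation record.
# Attribute names and field layout are illustrative assumptions,
# not the released ASID-1M schema.
record = {
    "video_id": "example_0001",
    "attributes": {
        "subjects": "An owl perched on a mossy branch.",
        "actions":  "The owl turns its head and ruffles its feathers.",
        "scene":    "A dim forest clearing at dusk.",
        "audio":    "Wind through leaves; a distant owl call answers.",
        "temporal": "The head turn happens before the answering call.",
    },
    "verified": True,  # passed the quality-verification stage
}
```

Keeping each attribute in its own field is what makes both the single-attribute and all-attributes training formats described below possible.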
## What we release

- **ASID-1M**: a large-scale collection of **attribute-structured** audiovisual instructions in both *single-attribute* and *all-attributes* training formats (sketched after this list).
- **ASID-Verify**: a scalable curation pipeline that generates, ensembles, verifies, and refines annotations to improve their semantic and temporal consistency (see the pipeline sketch below).
- **ASID-Captioner**: audiovisual captioning models built on Qwen2.5-Omni and fine-tuned on ASID-1M (an inference sketch follows).
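A hedged sketch of how one attribute-structured record could expand into the two ASID-1M training formats named above; the prompt wording and helper names are hypothetical, assuming the record layout shown earlier:

```python
# Hypothetical expansion of one record into the two training formats.
# Prompt wording and helper names are assumptions, not the released format.
def to_single_attribute_samples(record):
    """One (instruction, answer) pair per attribute."""
    return [
        {"instruction": f"Describe the {name} of this video.", "answer": text}
        for name, text in record["attributes"].items()
    ]

def to_all_attributes_sample(record):
    """One pair whose answer covers every attribute in a fixed order."""
    answer = " ".join(record["attributes"][name] for name in record["attributes"])
    return {
        "instruction": "Describe this video in detail, covering all attributes.",
        "answer": answer,
    }
```

The single-attribute format supports controllable, per-aspect querying at inference time, while the all-attributes format trains the model to produce one complete description.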
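The four stages named for ASID-Verify (generate, ensemble, verify, refine) suggest a loop like the following. This is a minimal sketch; every callable here is a hypothetical placeholder for a model or consistency checker not specified in this card:

```python
# Hedged sketch of the generate -> ensemble -> verify -> refine loop that
# ASID-Verify is described as running. The annotators, verifier, and
# refiner callables are hypothetical placeholders.
def asid_verify(video, annotators, verifier, refiner, max_rounds=2):
    # 1. Generate: draft annotations from several captioning models.
    drafts = [annotate(video) for annotate in annotators]
    # 2. Ensemble: merge the drafts into one candidate annotation.
    candidate = merge_drafts(drafts)
    # 3. Verify / 4. Refine: re-check semantic and temporal consistency,
    #    refining until the candidate passes or the round budget runs out.
    for _ in range(max_rounds):
        report = verifier(video, candidate)
        if report["consistent"]:
            return candidate
        candidate = refiner(video, candidate, report)
    return None  # discard annotations that never pass verification

def merge_drafts(drafts):
    # Placeholder merge: keep the longest draft as the candidate.
    return max(drafts, key=len)
```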
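Since ASID-Captioner models are Qwen2.5-Omni fine-tunes, inference presumably follows the standard Qwen2.5-Omni recipe. A minimal sketch, assuming a recent `transformers` with Qwen2.5-Omni support and the `qwen-omni-utils` helper package; the base-model checkpoint id is shown because this card does not give an ASID-Captioner repository id, and `clip.mp4` is a placeholder input:

```python
# Hedged inference sketch following the public Qwen2.5-Omni recipe;
# swap in an ASID-Captioner checkpoint id once released.
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # pip install qwen-omni-utils

ckpt = "Qwen/Qwen2.5-Omni-7B"  # placeholder base model, not an ASID repo
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(ckpt)

conversation = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},  # placeholder video path
        {"type": "text", "text": "Describe this video, covering both "
                                 "visual content and audio cues."},
    ],
}]
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
# Extract audio/image/video inputs, keeping the video's audio track.
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

# return_audio=False: we only want the text caption, not synthesized speech.
text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```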
## Research interests
| - Video understanding & video captioning | |
| - Audio-visual learning | |
| - Multimodal LLMs / instruction tuning | |
| - Data curation, verification, and quality control | |