|
|
--- |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
tags: |
|
|
- video-captioning |
|
|
- audiovisual |
|
|
- qwen2.5-omni |
|
|
- instruction-tuning |
|
|
- attribute-structured |
|
|
- quality-verified |
|
|
pipeline_tag: image-text-to-text |
|
|
model-index: |
|
|
- name: ASID-Captioner-7B |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
# ASID-Captioner-7B |
|
|
|
|
|
ASID-Captioner-7B is an audiovisual captioning model (based on Qwen2.5-Omni) fine-tuned for attribute-structured and quality-verified video understanding. It is designed to generate fine-grained captions that cover both visual and audio signals, with controllable prompting over multiple attributes. |
|
|
|
|
|
[[🏠 Homepage](https://asid-caption.github.io/)] [[📖 Arxiv Paper](https://arxiv.org/pdf/2602.13013)] [[🤗 Models & Datasets](https://huggingface.co/AudioVisual-Caption)] [[💻 Code](https://github.com/)]
|
|
|
|
|
## Introduction |
|
|
|
|
|
Modern video MLLMs often describe long and complex audiovisual content with a single caption, which can be incomplete (missing audio or camera details), unstructured, and weakly controllable. |
|
|
|
|
|
ASID-Captioner-7B is trained to follow attribute-specific instructions and produce more organized, fine-grained descriptions. It is built upon Qwen2.5-Omni and fine-tuned on ASID-1M, which provides structured supervision over multiple attributes (scene, characters, objects, actions, narrative elements, speech, camera, emotions) with quality verification and refinement. |
|
|
|
|
|
## Key Features |
|
|
|
|
|
- Audiovisual captioning: uses both video frames and audio (when available). |
|
|
- Attribute-structured instruction following: supports prompts targeting specific attributes (e.g., speech-only, camera-only). |
|
|
- High-quality supervision: trained on attribute-structured, quality-verified instructions from ASID-1M. |
|
|
- Standard Transformers interface: load with `transformers` using the Qwen2.5-Omni processor/model classes (see the minimal sketch after this list).
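
A minimal loading sketch (the complete, runnable inference example is in the Usage section below):

```python
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model_id = "AudioVisual-Caption/ASID-Captioner-7B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)
```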
|
|
|
|
|
## What’s in this repo |
|
|
|
|
|
Typical files include: |
|
|
|
|
|
- config.json |
|
|
- generation_config.json |
|
|
- preprocessor_config.json |
|
|
- chat_template.jinja |
|
|
- added_tokens.json / special_tokens_map.json |
|
|
- model-*.safetensors and model.safetensors.index.json |
|
|
|
|
|
## Prompting (recommended) |
|
|
|
|
|
ASID-Captioner-7B works best with explicit attribute prompts, for example: |
|
|
|
|
|
- Describe the scene in the video in detail. Write your answer as one coherent paragraph. |
|
|
- Describe the characters in the video in detail. Write your answer as one coherent paragraph. |
|
|
- Provide a comprehensive description of all the content in the video, leaving out no details, and naturally covering the scene, characters, objects, actions, narrative elements, speech, camera, and emotions in a single coherent account. |
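
Any of these prompts can be placed in the user turn of the chat conversation. A minimal sketch (the dictionary keys and the `user_turn` name are only illustrative; the full message format is shown in the Usage section below):

```python
# Attribute-specific prompts; pick one per request.
ATTRIBUTE_PROMPTS = {
    "scene": "Describe the scene in the video in detail. Write your answer as one coherent paragraph.",
    "characters": "Describe the characters in the video in detail. Write your answer as one coherent paragraph.",
}

# User turn for the chat template, mirroring the inference example below.
user_turn = {
    "role": "user",
    "content": [
        {"type": "video", "video": "/path/to/video.mp4"},
        {"type": "text", "text": ATTRIBUTE_PROMPTS["scene"]},
    ],
}
```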
|
|
|
|
|
## Usage (minimal, single GPU) |
|
|
|
|
|
### Install |
|
|
|
|
|
```bash |
|
|
pip install -U transformers accelerate |
|
|
``` |
|
|
|
|
|
Optional: for faster attention, install FlashAttention-2 following its official instructions. If it is not installed, remove the `attn_implementation="flash_attention_2"` argument from the loading call below.
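
If you prefer to detect FlashAttention-2 at runtime instead of editing the example, here is a minimal sketch (the `attn_impl` name is introduced only for illustration; it assumes the library imports as `flash_attn`):

```python
import importlib.util

# Use FlashAttention-2 when the flash_attn package is importable, otherwise fall back to SDPA.
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
print(f"Using attn_implementation={attn_impl!r}")
```

Pass `attn_implementation=attn_impl` to `from_pretrained` in the inference example below.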
|
|
|
|
|
You also need `qwen_omni_utils.process_mm_info` available in your environment; it is provided by the `qwen-omni-utils` package used in the official Qwen2.5-Omni examples (e.g., `pip install qwen-omni-utils`).
|
|
|
|
|
### Run inference |
|
|
|
|
|
```python |
|
|
import os |
|
|
import torch |
|
|
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor |
|
|
from qwen_omni_utils import process_mm_info |
|
|
|
|
|
# Video pixel budgets (per-frame and total across sampled frames)
|
|
VIDEO_MAX_PIXELS = 401408 # 512*28*28 |
|
|
VIDEO_TOTAL_PIXELS = 20070400 # 512*28*28*50 |
|
|
USE_AUDIO_IN_VIDEO = True |
|
|
|
|
|
# Some pipelines use this env var |
|
|
os.environ["VIDEO_MAX_PIXELS"] = str(VIDEO_TOTAL_PIXELS) |
|
|
|
|
|
model_id = "AudioVisual-Caption/ASID-Captioner-7B" |
|
|
|
|
|
model = Qwen2_5OmniForConditionalGeneration.from_pretrained( |
|
|
model_id, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="cuda", |
|
|
attn_implementation="flash_attention_2", # optional; remove if not available |
|
|
low_cpu_mem_usage=True, |
|
|
) |
|
|
model.disable_talker() |
|
|
|
|
|
processor = Qwen2_5OmniProcessor.from_pretrained(model_id) |
|
|
|
|
|
file_path = "/path/to/video.mp4" |
|
|
prompt = "Provide a comprehensive description of all the content in the video, leaving out no details, and naturally covering the scene, characters, objects, actions, narrative elements, speech, camera, and emotions in a single coherent account." |
|
|
|
|
|
conversation = [ |
|
|
{ |
|
|
"role": "system", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "text", |
|
|
"text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech." |
|
|
} |
|
|
], |
|
|
}, |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "video", "video": file_path, "max_pixels": VIDEO_MAX_PIXELS}, |
|
|
{"type": "text", "text": prompt}, |
|
|
], |
|
|
}, |
|
|
] |
|
|
|
|
|
text = processor.apply_chat_template( |
|
|
conversation, |
|
|
add_generation_prompt=True, |
|
|
tokenize=False, |
|
|
) |
|
|
|
|
|
# IMPORTANT: reference-style multimodal extraction |
|
|
audios, images, videos = process_mm_info( |
|
|
conversation, |
|
|
use_audio_in_video=USE_AUDIO_IN_VIDEO, |
|
|
) |
|
|
|
|
|
inputs = processor( |
|
|
text=text, |
|
|
audio=audios, |
|
|
images=images, |
|
|
videos=videos, |
|
|
return_tensors="pt", |
|
|
padding=True, |
|
|
use_audio_in_video=USE_AUDIO_IN_VIDEO, |
|
|
) |
|
|
|
|
|
device = "cuda" |
|
|
inputs = inputs.to(device).to(model.dtype) |
|
|
|
|
|
with torch.no_grad(): |
|
|
text_ids = model.generate( |
|
|
**inputs, |
|
|
use_audio_in_video=USE_AUDIO_IN_VIDEO, |
|
|
do_sample=False, |
|
|
thinker_max_new_tokens=4096, |
|
|
repetition_penalty=1.1, |
|
|
use_cache=True, |
|
|
) |
|
|
|
|
|
decoded = processor.batch_decode( |
|
|
text_ids, |
|
|
skip_special_tokens=True, |
|
|
clean_up_tokenization_spaces=False, |
|
|
)[0] |
|
|
|
|
|
answer = decoded.split("\nassistant\n")[-1].strip() |
|
|
print(answer) |
|
|
``` |
|
|
|
|
|
### Notes (important) |
|
|
|
|
|
- If you do **not** use `process_mm_info`, you may get missing/incorrect audiovisual inputs in some environments. |
|
|
- `use_audio_in_video=True` enables audio-conditioned captioning when your runtime supports extracting audio from the video container. |
|
|
- `thinker_max_new_tokens` is specific to the Qwen2.5-Omni generation interface. If your Transformers version does not recognize it, replace it with `max_new_tokens` (see the fallback sketch below).
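
If you are unsure which keyword your installed version accepts, a minimal fallback sketch that reuses `model`, `inputs`, and `USE_AUDIO_IN_VIDEO` from the example above:

```python
# Shared generation settings from the example above.
gen_kwargs = dict(
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
    do_sample=False,
    repetition_penalty=1.1,
    use_cache=True,
)

try:
    # Qwen2.5-Omni-specific keyword first.
    text_ids = model.generate(**inputs, thinker_max_new_tokens=4096, **gen_kwargs)
except (TypeError, ValueError):
    # Fall back to the generic keyword if the specific one is not recognized.
    text_ids = model.generate(**inputs, max_new_tokens=4096, **gen_kwargs)
```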
|
|
|
|
|
|
|
|
## Training Data |
|
|
|
|
|
This model is fine-tuned using ASID-1M (attribute-structured and quality-verified audiovisual instructions). |
|
|
Dataset: AudioVisual-Caption/ASID-1M |
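
A minimal sketch for peeking at the dataset with the `datasets` library, assuming it is publicly hosted on the Hub under that name with a `train` split (the exact fields depend on the dataset schema):

```python
from datasets import load_dataset

# Stream the dataset to avoid downloading everything up front.
ds = load_dataset("AudioVisual-Caption/ASID-1M", split="train", streaming=True)

# Inspect one record; field names depend on the actual schema.
print(next(iter(ds)))
```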
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use our model in your research, please cite our paper: |
|
|
|
|
|
```bibtex
@misc{asid2026,
  title={Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions},
  author={Yunheng Li and Hengrui Zhang and Meng-Hao Guo and Wenzhao Gao and Shaoyong Jia and Shaohui Jiao and Qibin Hou and Ming-Ming Cheng},
  year={2026}
}
```
|
|
|
|
|
## Contact |
|
|
|
|
|
Please open a Discussion on the Hugging Face page for usage questions or issues. |
|
|
|
|
|