---
language:
- en
library_name: transformers
tags:
- video-captioning
- audiovisual
- qwen2.5-omni
- instruction-tuning
- attribute-structured
- quality-verified
pipeline_tag: image-text-to-text
model-index:
- name: ASID-Captioner-7B
  results: []
---

# ASID-Captioner-7B

ASID-Captioner-7B is an audiovisual captioning model, based on Qwen2.5-Omni, fine-tuned for attribute-structured and quality-verified video understanding. It generates fine-grained captions that cover both visual and audio signals, with controllable prompting over multiple attributes.

[[🏠 Homepage](https://)] [[📖 Arxiv Paper](https://arxiv.org/pdf/)] [[🤗 Models & Datasets](https://huggingface.co/AudioVisual-Caption)] [[💻 Code](https://github.com/)]

## Introduction

Modern video MLLMs often describe long, complex audiovisual content with a single caption, which can be incomplete (missing audio or camera details), unstructured, and weakly controllable.

ASID-Captioner-7B is trained to follow attribute-specific instructions and produce more organized, fine-grained descriptions. It is built upon Qwen2.5-Omni and fine-tuned on ASID-1M, which provides structured supervision over multiple attributes (scene, characters, objects, actions, narrative elements, speech, camera, emotions) with quality verification and refinement.

## Key Features

- Audiovisual captioning: uses both video frames and audio (when available).
- Attribute-structured instruction following: supports prompts targeting specific attributes (e.g., speech-only, camera-only).
- High-quality supervision: trained on attribute-structured, quality-verified instructions from ASID-1M.
- Standard Transformers interface: load with `transformers` and the Qwen2.5-Omni processor/model classes.

## What’s in this repo

Typical files include:

- `config.json`
- `generation_config.json`
- `preprocessor_config.json`
- `chat_template.jinja`
- `added_tokens.json` / `special_tokens_map.json`
- `model-*.safetensors` and `model.safetensors.index.json`

## Prompting (recommended)

ASID-Captioner-7B works best with explicit attribute prompts, for example:

- Describe the scene in the video in detail. Write your answer as one coherent paragraph.
- Describe the characters in the video in detail. Write your answer as one coherent paragraph.
- Provide a comprehensive description of all the content in the video, leaving out no details, and naturally covering the scene, characters, objects, actions, narrative elements, speech, camera, and emotions in a single coherent account.

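The single-attribute prompts above all follow one pattern, so they can be generated programmatically. A minimal sketch (the `attribute_prompt` helper and `ATTRIBUTES` list are illustrative conveniences, not part of the model's API):

```python
# Attributes supervised in ASID-1M, as listed in the introduction
ATTRIBUTES = ["scene", "characters", "objects", "actions",
              "narrative elements", "speech", "camera", "emotions"]

def attribute_prompt(attribute: str) -> str:
    """Build a single-attribute prompt in the recommended format."""
    return (f"Describe the {attribute} in the video in detail. "
            "Write your answer as one coherent paragraph.")

print(attribute_prompt("speech"))
# Describe the speech in the video in detail. Write your answer as one coherent paragraph.
```

Each generated string is then used as the text part of the user turn in the conversation format shown under "Run inference".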
## Usage (minimal, single GPU)

### Install

```bash
pip install -U transformers accelerate
```

Optional: for faster attention, install FlashAttention2 following its official instructions, then pass `attn_implementation="flash_attention_2"` when loading the model.

You also need `qwen_omni_utils.process_mm_info` available in your environment (typically installed via `pip install qwen-omni-utils`).

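The two pixel constants in the inference script set a per-frame cap and an overall budget across sampled frames; at the per-frame cap, the budget admits roughly 50 frames. A quick sanity check of the arithmetic (28 is the spatial unit that Qwen-style vision preprocessing rounds frame dimensions to; this breakdown is an interpretation, not an official spec):

```python
PATCH = 28  # spatial rounding unit in Qwen-style vision preprocessing
VIDEO_MAX_PIXELS = 512 * PATCH * PATCH      # per-frame cap
VIDEO_TOTAL_PIXELS = VIDEO_MAX_PIXELS * 50  # total budget across frames

print(VIDEO_MAX_PIXELS)                         # 401408
print(VIDEO_TOTAL_PIXELS)                       # 20070400
print(VIDEO_TOTAL_PIXELS // VIDEO_MAX_PIXELS)   # 50 frames at the cap
```

Lowering `VIDEO_MAX_PIXELS` trades per-frame resolution for headroom to sample more frames within the same total budget.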
### Run inference

```python
import os

import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Pixel budgets: per-frame cap and total budget across sampled frames
VIDEO_MAX_PIXELS = 401408       # 512 * 28 * 28
VIDEO_TOTAL_PIXELS = 20070400   # 512 * 28 * 28 * 50
USE_AUDIO_IN_VIDEO = True

# Some preprocessing pipelines read this env var for the total pixel budget
os.environ["VIDEO_MAX_PIXELS"] = str(VIDEO_TOTAL_PIXELS)

model_id = "AudioVisual-Caption/ASID-Captioner-7B"

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # optional; remove if FlashAttention2 is not installed
    low_cpu_mem_usage=True,
)
model.disable_talker()  # text-only generation; skips the speech decoder

processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

file_path = "/path/to/video.mp4"
prompt = "Provide a comprehensive description of all the content in the video, leaving out no details, and naturally covering the scene, characters, objects, actions, narrative elements, speech, camera, and emotions in a single coherent account."

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": file_path, "max_pixels": VIDEO_MAX_PIXELS},
            {"type": "text", "text": prompt},
        ],
    },
]

text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)

# IMPORTANT: extract audio, image, and video inputs from the conversation
audios, images, videos = process_mm_info(
    conversation,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
)

inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
)
inputs = inputs.to("cuda").to(model.dtype)

with torch.no_grad():
    text_ids = model.generate(
        **inputs,
        use_audio_in_video=USE_AUDIO_IN_VIDEO,
        do_sample=False,
        thinker_max_new_tokens=4096,
        repetition_penalty=1.1,
        use_cache=True,
    )

decoded = processor.batch_decode(
    text_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

answer = decoded.split("\nassistant\n")[-1].strip()
print(answer)
```

### Notes (important)

- If you do **not** use `process_mm_info`, you may get missing or incorrect audiovisual inputs in some environments.
- `use_audio_in_video=True` enables audio-conditioned captioning when your runtime can extract the audio track from the video container.
- `thinker_max_new_tokens` follows the reference script; if your environment does not recognize it, replace it with `max_new_tokens`.

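The final decoding step keeps only the assistant's reply by splitting the decoded chat transcript on the turn marker. The same logic in isolation (the `demo` transcript is a made-up example, not real model output):

```python
def extract_answer(decoded: str) -> str:
    """Keep only the text after the last assistant turn marker."""
    return decoded.split("\nassistant\n")[-1].strip()

demo = ("system\nYou are Qwen...\n"
        "user\nDescribe the scene in the video in detail.\n"
        "assistant\nA quiet beach at sunset, waves rolling in.")
print(extract_answer(demo))  # A quiet beach at sunset, waves rolling in.
```

If no marker is present (e.g. when decoding with a different chat template), the function returns the whole string unchanged, so it degrades gracefully.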
## Training Data

This model is fine-tuned on ASID-1M, a dataset of attribute-structured, quality-verified audiovisual instructions.
Dataset: `AudioVisual-Caption/ASID-1M`

## Citation

If you use our model in your research, please cite our paper:

```bibtex
@misc{asid2026,
  title={Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions},
  author={Yunheng Li and Hengrui Zhang and Meng-Hao Guo and Wenzhao Gao and Shaoyong Jia and Shaohui Jiao and Qibin Hou and Ming-Ming Cheng},
  year={2026}
}
```

## Contact

Please open a Discussion on the Hugging Face model page for usage questions or issues.