Commit 0b1dd39 by lyhisme (verified) · 1 Parent(s): 393feb7

Update README.md

Files changed (1)
  1. README.md +195 -3
README.md CHANGED
@@ -1,3 +1,195 @@
- ---
- license: apache-2.0
- ---
+ ```markdown
+ ---
+ language:
+ - en
+ library_name: transformers
+ tags:
+ - video-captioning
+ - audiovisual
+ - qwen2.5-omni
+ - instruction-tuning
+ - attribute-structured
+ - quality-verified
+ pipeline_tag: image-text-to-text
+ model-index:
+ - name: ASID-Captioner-3B
+   results: []
+ ---
+
+ # ASID-Captioner-3B
+
+ ASID-Captioner-3B is an audiovisual captioning model (based on Qwen2.5-Omni) fine-tuned for attribute-structured and quality-verified video understanding. It is designed to generate fine-grained captions that cover both visual and audio signals, with controllable prompting over multiple attributes.
+
+ [[🏠 Homepage](https://)] [[📖 Arxiv Paper](https://arxiv.org/pdf/)] [[🤗 Models & Datasets](https://huggingface.co/AudioVisual-Caption)] [[💻 Code](https://github.com/)]
+
+ ## Introduction
+
+ Modern video MLLMs often describe long and complex audiovisual content with a single caption, which can be incomplete (missing audio or camera details), unstructured, and weakly controllable.
+
+ ASID-Captioner-3B is trained to follow attribute-specific instructions and produce more organized, fine-grained descriptions. It is built upon Qwen2.5-Omni and fine-tuned on ASID-1M, which provides structured supervision over multiple attributes (scene, characters, objects, actions, narrative elements, speech, camera, emotions) with quality verification and refinement.
+
+ ## Key Features
+
+ - Audiovisual captioning: uses both video frames and audio (when available).
+ - Attribute-structured instruction following: supports prompts targeting specific attributes (e.g., speech-only, camera-only).
+ - High-quality supervision: trained on attribute-structured, quality-verified instructions from ASID-1M.
+ - Standard Transformers interface: load with `transformers` and the Qwen2.5-Omni processor/model classes.
+
+ ## What’s in this repo
+
+ Typical files include:
+
+ - `config.json`
+ - `generation_config.json`
+ - `preprocessor_config.json`
+ - `chat_template.jinja`
+ - `added_tokens.json` / `special_tokens_map.json`
+ - `model-*.safetensors` and `model.safetensors.index.json`
+
+ ## Prompting (recommended)
+
+ ASID-Captioner-3B works best with explicit attribute prompts, for example:
+
+ - Describe the scene in the video in detail. Write your answer as one coherent paragraph.
+ - Describe the characters in the video in detail. Write your answer as one coherent paragraph.
+ - Provide a comprehensive description of all the content in the video, leaving out no details, and naturally covering the scene, characters, objects, actions, narrative elements, speech, camera, and emotions in a single coherent account.
+
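The two single-attribute prompts above share one sentence pattern. The sketch below assumes that pattern extends to the other attributes named in the Introduction; `ATTRIBUTES`, `ATTRIBUTE_TEMPLATE`, and `build_attribute_prompt` are illustrative names, not part of the model or its repository.

```python
# Attribute names come from the model card's Introduction. The template wording
# is an assumption, generalised from the "scene" and "characters" examples.
ATTRIBUTES = (
    "scene",
    "characters",
    "objects",
    "actions",
    "narrative elements",
    "speech",
    "camera",
    "emotions",
)

ATTRIBUTE_TEMPLATE = (
    "Describe the {attribute} in the video in detail. "
    "Write your answer as one coherent paragraph."
)


def build_attribute_prompt(attribute: str) -> str:
    """Return a single-attribute prompt, rejecting unknown attribute names."""
    if attribute not in ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute!r}")
    return ATTRIBUTE_TEMPLATE.format(attribute=attribute)
```

Calling `build_attribute_prompt("scene")` reproduces the first example prompt verbatim. Whether the remaining attributes respond best to this exact wording is not confirmed by the card.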
+ ## Usage (minimal, single GPU)
+
+ ### Install
+
+ ```bash
+ pip install -U transformers accelerate
+ ```
+
+ Optional: for faster attention, install FlashAttention2 following its official instructions.
+
+ The inference script also imports `process_mm_info` from `qwen_omni_utils` (published on PyPI as `qwen-omni-utils`), so that package must be available in your environment.
+
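Since FlashAttention2 is optional, one way to avoid a load-time failure is to choose the attention backend at runtime. This is a minimal sketch: `pick_attn_implementation` is a hypothetical helper, and it assumes `"sdpa"` is an acceptable fallback for this model in your `transformers` version (drop the argument entirely if it is not).

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa".

    Assumption: "sdpa" is a valid fallback for the model being loaded.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"
```

The returned string can be passed as `attn_implementation=` to `from_pretrained`. An importable package does not guarantee that the installed build matches your GPU and CUDA version, so a load error is still possible.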
+ ### Run inference
+
+ ```python
+ import os
+
+ import torch
+ from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
+
+ # Pixel budgets for video preprocessing
+ VIDEO_MAX_PIXELS = 401408  # 512*28*28
+ VIDEO_TOTAL_PIXELS = 20070400  # 512*28*28*50
+ USE_AUDIO_IN_VIDEO = True
+
+ # Some pipelines read this env var, possibly at import time,
+ # so set it before importing qwen_omni_utils
+ os.environ["VIDEO_MAX_PIXELS"] = str(VIDEO_TOTAL_PIXELS)
+
+ from qwen_omni_utils import process_mm_info  # noqa: E402
+
+ model_id = "AudioVisual-Caption/ASID-Captioner-3B"
+
+ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="cuda",
+     attn_implementation="flash_attention_2",  # optional; remove if not available
+     low_cpu_mem_usage=True,
+ )
+ # Text-only output: skip the speech-generation component
+ model.disable_talker()
+
+ processor = Qwen2_5OmniProcessor.from_pretrained(model_id)
+
+ file_path = "/path/to/video.mp4"
+ prompt = "Provide a comprehensive description of all the content in the video, leaving out no details, and naturally covering the scene, characters, objects, actions, narrative elements, speech, camera, and emotions in a single coherent account."
+
+ conversation = [
+     {
+         "role": "system",
+         "content": [
+             {
+                 "type": "text",
+                 "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
+             }
+         ],
+     },
+     {
+         "role": "user",
+         "content": [
+             {"type": "video", "video": file_path, "max_pixels": VIDEO_MAX_PIXELS},
+             {"type": "text", "text": prompt},
+         ],
+     },
+ ]
+
+ text = processor.apply_chat_template(
+     conversation,
+     add_generation_prompt=True,
+     tokenize=False,
+ )
+
+ # Extract the audio, image, and video inputs referenced in the conversation
+ audios, images, videos = process_mm_info(
+     conversation,
+     use_audio_in_video=USE_AUDIO_IN_VIDEO,
+ )
+
+ inputs = processor(
+     text=text,
+     audio=audios,
+     images=images,
+     videos=videos,
+     return_tensors="pt",
+     padding=True,
+     use_audio_in_video=USE_AUDIO_IN_VIDEO,
+ )
+
+ device = "cuda"
+ inputs = inputs.to(device).to(model.dtype)
+
+ with torch.no_grad():
+     text_ids = model.generate(
+         **inputs,
+         use_audio_in_video=USE_AUDIO_IN_VIDEO,
+         do_sample=False,
+         thinker_max_new_tokens=4096,
+         repetition_penalty=1.1,
+         use_cache=True,
+     )
+
+ decoded = processor.batch_decode(
+     text_ids,
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=False,
+ )[0]
+
+ # The decoded text includes the prompt; keep only the assistant's reply
+ answer = decoded.split("\nassistant\n")[-1].strip()
+ print(answer)
+ ```
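To run several prompts against one video, the message structure and the answer extraction from the script above can be factored into small helpers. This is a sketch under the same assumptions as the script: `build_conversation` and `extract_answer` are hypothetical names, and the `"\nassistant\n"` marker is taken from the script rather than from any documented guarantee about the chat template.

```python
from typing import Any

# System prompt copied from the inference script above.
SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)


def build_conversation(
    video_path: str, prompt: str, max_pixels: int = 401408
) -> list[dict[str, Any]]:
    """Build the system + user messages for one video and one text prompt."""
    return [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path, "max_pixels": max_pixels},
                {"type": "text", "text": prompt},
            ],
        },
    ]


def extract_answer(decoded: str, marker: str = "\nassistant\n") -> str:
    """Keep the text after the last assistant marker, stripped of whitespace.

    If the marker is absent, the whole decoded string is returned stripped.
    """
    return decoded.split(marker)[-1].strip()
```

The list returned by `build_conversation` can be passed to both `processor.apply_chat_template` and `process_mm_info`, exactly as `conversation` is in the script.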
+
+ ### Notes (important)
+
+ - If you skip `process_mm_info`, the audio and video inputs may be missing or incorrect in some environments.
+ - `use_audio_in_video=True` enables audio-conditioned captioning when your runtime supports extracting audio from the video container.
+ - `thinker_max_new_tokens` limits the length of the generated text. If your environment does not recognize it, replace it with `max_new_tokens`.
+
+
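The fallback described in the last note can be automated by inspecting the signature of the `generate` callable. The helper name `length_kwargs` is hypothetical, and the check assumes the model exposes `thinker_max_new_tokens` as a named parameter rather than accepting it only through `**kwargs`.

```python
import inspect
from typing import Callable


def length_kwargs(generate_fn: Callable, max_tokens: int) -> dict:
    """Pick the keyword that limits generated length for this generate function.

    Assumption: a model that supports thinker_max_new_tokens lists it by name
    in its generate() signature; otherwise max_new_tokens is used.
    """
    params = inspect.signature(generate_fn).parameters
    if "thinker_max_new_tokens" in params:
        return {"thinker_max_new_tokens": max_tokens}
    return {"max_new_tokens": max_tokens}
```

In the inference script this would replace the hard-coded keyword, for example `model.generate(**inputs, **length_kwargs(model.generate, 4096), do_sample=False)`.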
+ ## Training Data
+
+ This model is fine-tuned using ASID-1M (attribute-structured and quality-verified audiovisual instructions).
+
+ Dataset: AudioVisual-Caption/ASID-1M
+
+
+ ## Citation
+
+ If you use our model in your research, please cite our paper:
+
+ ~~~bibtex
+ @misc{asid2026,
+   title={Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions},
+   author={Yunheng Li and Hengrui Zhang and Meng-Hao Guo and Wenzhao Gao and Shaoyong Jia and Shaohui Jiao and Qibin Hou and Ming-Ming Cheng},
+   year={2026}
+ }
+ ~~~
+
+ ## Contact
+
+ Please open a Discussion on the Hugging Face page for usage questions or issues.
+ ```