---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
- video-understanding
- video-audio understanding
- video-qa
- video-captioning
- video-grounding
- video-reasoning
- short video understanding
---
# ARC-Qwen-Video-7B
[Paper](https://arxiv.org/abs/2507.20939)
[Demo](https://arc.tencent.com/en/ai-demos/multimodal)
[Code](https://github.com/TencentARC/ARC-Hunyuan-Video-7B/tree/arc-qwen-video)
[ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B)
[ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B)
[ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator)
[Blog](https://tencentarc.github.io/posts/arc-video-announcement/)
[ShortVid-Bench](https://huggingface.co/datasets/TencentARC/ShortVid-Bench)
In this version, we switch the base model from the Hunyuan VLM used in [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) to [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), and introduce [ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B) for understanding real-world short videos. We use the same training data and training stages; for a detailed introduction, please refer to [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B). The main distinctions are listed below:
| Feature | `ARC-Hunyuan-Video-7B` | `ARC-Qwen-Video-7B` |
| ---------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Base VLM** | Hunyuan-VL-7B-Pretrain | Qwen2.5-VL-7B-Instruct |
| **Frame Resolution** <br> <small>*Each model uses a fixed frame resolution to maintain audio-video synchronization.*</small> | Fixed at `640 x 640` | Fixed at `392 x 292` |
| **Frame Sampling** | • < 150s: 1 FPS <br> • > 150s: Uniformly sample 150 frames. | • < 300s: 1 FPS <br> • > 300s: Uniformly sample 300 frames. |
| **Audio-Video Synchronization** | • < 150s: Sum tokens from 1s audio + 1s video frame. <br> • 150-300s: Sum tokens from corresponding audio segment + video frame. <br> • > 300s: Split audio into 300 segments, use first 2s of each. | • < 300s: Sum tokens from 1s audio + 1s video. <br> • > 300s: Split audio into 300 segments, use middle 1s of each. |
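For reference, the frame-sampling rule for ARC-Qwen-Video-7B can be expressed in a few lines of Python. This is our own illustrative sketch, not code from the repository:

```python
def sample_frame_times(duration_s: float, fps: float = 1.0, max_frames: int = 300) -> list[float]:
    """Frame timestamps per the ARC-Qwen-Video-7B rule: 1 FPS for videos
    under 300 s, otherwise 300 uniformly spaced frames."""
    n = int(duration_s * fps)
    if n <= max_frames:
        return [i / fps for i in range(n)]
    # Longer videos: spread max_frames timestamps uniformly over the duration.
    return [i * duration_s / max_frames for i in range(max_frames)]

print(len(sample_frame_times(120.0)))  # 120 frames at 1 FPS
print(len(sample_frame_times(600.0)))  # capped at 300 uniform frames
```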
We are also introducing a new model, [ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator). It can output **timestamped video descriptions, speaker identities, and the specific ASR (Automatic Speech Recognition) content**. By processing its output with an external LLM, you can obtain more comprehensive structured information as follows (click the thumbnail to watch the video):
[<img src="https://img.youtube.com/vi/Bz1T4wCuWc8/maxresdefault.jpg" alt="video" width="300">](https://www.youtube.com/watch?v=Bz1T4wCuWc8)
<table border="1" style="width:100%; border-collapse: collapse;">
<tr>
<td style="padding: 15px;">
### Video Overview
This is a comedy short about a husband whose secret stash of money, hidden in a padded coat, is accidentally discovered by his wife, who mistakes it for a "surprise" gift he prepared for her. Through a single phone call between the couple, the video vividly traces the husband's journey from carefree ease, to stunned disbelief, to helpless resignation, full of dramatic reversals and humor.
### Plot Breakdown
The plot unfolds around one phone call. Below is a detailed timeline with scenes, speakers, and dialogue:
<table>
<thead>
<tr>
<th>Timestamp</th>
<th>Scene Description</th>
<th>Speaker</th>
<th>Dialogue (ASR)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0:00 - 0:05</td>
<td>Wearing a shower cap and wrapped in a bath towel, the husband leisurely takes selfies beside an indoor pool.</td>
<td>None</td>
<td>(no dialogue)</td>
</tr>
<tr>
<td>0:05 - 0:10</td>
<td><b>Cut</b>: In a clothing store, the wife, beaming with happiness, calls her husband.</td>
<td>Wife</td>
<td>"Hey, honey, honey, I love you, I love you, I love you to death, mwah mwah mwah."</td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: top;">0:10 - 0:18</td>
<td rowspan="2" style="vertical-align: top;">The husband answers the phone, curious about his wife's enthusiasm; she excitedly reveals the "surprise."</td>
<td>Husband</td>
<td>"Hey, what's going on with you? Why so happy?"</td>
</tr>
<tr>
<td>Wife</td>
<td>"Today I found the surprise you left me in my padded-coat pocket: ten thousand yuan!"</td>
</tr>
<tr>
<td>0:18 - 0:27</td>
<td>Hearing "ten thousand yuan," the husband's expression freezes for an instant, shifting from confusion to shock and regret, though he forces himself to appear calm.</td>
<td>Husband</td>
<td>"Huh? G-great... as long as you're happy."</td>
</tr>
<tr>
<td>0:27 - 0:34</td>
<td>The wife happily explains what the money was used for; the husband's expression stiffens completely as his shock deepens.</td>
<td>Wife</td>
<td>"Of course I'm happy! I used it to buy a new outfit. I'll wear it for you when I get home tonight."</td>
</tr>
<tr>
<td rowspan="3" style="vertical-align: top;">0:34 - 0:46</td>
<td rowspan="3" style="vertical-align: top;">The husband confirms the money has been spent and breaks down; the wife believes he had authorized it, and he can't help cursing.</td>
<td>Husband</td>
<td>"You already spent it on clothes?"</td>
</tr>
<tr>
<td>Wife</td>
<td>"Of course! Didn't you tell me to buy whatever I like? Honey, you're the best."</td>
</tr>
<tr>
<td>Husband</td>
<td>"You really are a spendthrift."</td>
</tr>
<tr>
<td rowspan="4" style="vertical-align: top;">0:46 - 0:59</td>
<td rowspan="4" style="vertical-align: top;">Sensing something off in her husband's tone, the wife asks what he said; he immediately changes his tune to cover it up and urges his wife to come home early.</td>
<td>Wife</td>
<td>"What? Honey, what did you say?"</td>
</tr>
<tr>
<td>Husband</td>
<td>"Huh? I said great! As long as you look pretty, I'm happy."</td>
</tr>
<tr>
<td>Wife</td>
<td>"You said it, honey. Make sure you come home early today, I'll be waiting."</td>
</tr>
<tr>
<td>Husband</td>
<td>"Fine, fine, fine."</td>
</tr>
</tbody>
</table>
### Characters and Core Conflict
#### 1. Character Analysis
**Husband:**
* Behavior: Hides a secret stash of money; after it is discovered, tries hard to conceal his true feelings (heartache, regret).
* Psychological arc: Relaxed -> puzzled -> shocked -> devastated -> resigned.
* Traits: Keeps up appearances; feels both affection for his wife and helplessness; a classic "henpecked husband."

**Wife:**
* Behavior: After finding the money, interprets it as an expression of her husband's love and promptly spends it.
* Psychological arc: Stays immersed in the happiness and joy of discovering the "surprise" throughout.
* Traits: Naive and a decisive shopper, full of trust in and love for her husband.
#### 2. Core Conflict
The core conflict lies in the dramatic misunderstanding created by a severe information asymmetry:
* Husband's perspective: The 10,000 yuan he painstakingly stashed away is accidentally discovered and spent, a "shock."
* Wife's perspective: A 10,000-yuan romance fund thoughtfully prepared by her husband, a huge "surprise."
This misunderstanding drives the whole story: the husband "swallowing his pain" and the wife's "matter-of-fact happiness" form a sharp comic contrast and deliver a dense string of laughs.
### Summary
Built around the familiar domestic scenario of "secret stash money," the video cleverly constructs a story full of reversals and humor. Using dramatic irony (the audience and the husband know the truth while the wife is kept in the dark), it precisely captures the husband's complex inner state in an unexpected situation. The result is not only consistently funny but also a subtle look at communication, trust, and attitudes toward money within a marriage, making it easy for viewers to relate and discuss.
</td>
</tr>
</table>
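The structured report above was obtained by post-processing the Narrator's raw timestamped output with an external LLM. A minimal sketch of that step is shown below; the `chat_with_llm` helper is hypothetical, so substitute whichever LLM API you use:

```python
def build_structuring_prompt(narrator_output: str) -> str:
    """Wrap the Narrator's raw timestamped output in an instruction for an
    external LLM (illustrative only; tune the wording to your use case)."""
    return (
        "Below is a timestamped narration of a short video with scene "
        "descriptions, speaker identities, and ASR content. Rewrite it as a "
        "structured report with: an overview, a timeline table "
        "(timestamp / scene / speaker / dialogue), a character and conflict "
        "analysis, and a summary.\n\n" + narrator_output
    )

# narrator_output = <output of ARC-Qwen-Video-7B-Narrator on your video>
# report = chat_with_llm(build_structuring_prompt(narrator_output))  # hypothetical helper
```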
## Usage
### Dependencies
The installation has been tested and verified on the following environments:
* NVIDIA H20 with CUDA 12.4
* NVIDIA A100 with CUDA 12.1
### Installation
Clone the repo and install the required packages:
```bash
git clone -b arc-qwen-video https://github.com/TencentARC/ARC-Hunyuan-Video-7B.git
cd ARC-Hunyuan-Video-7B
# Install torch 2.6.0 based on your CUDA version
# CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# CUDA 12.6
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install librosa decord av accelerate
pip uninstall transformers
pip install git+https://github.com/geyuying/transformers.git@arc-qwen-video
pip install flash_attn==2.7.1.post4
# Install FFmpeg according to your system, and ensure that the following command produces a normal version output:
ffmpeg -version
# (Optional) For vllm, please follow the instructions below,
pip uninstall vllm
pip install git+https://github.com/geyuying/vllm.git@arc-qwen-video
```
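After installing, a quick import check can catch a broken environment early; a minimal sketch:

```python
# Sanity-check the key dependencies installed above.
import torch, torchvision, librosa, decord, av, flash_attn
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # should come from the arc-qwen-video fork
print("flash_attn:", flash_attn.__version__)
```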
#### An 'Ugly' Workaround for vLLM Installation
If you are unable to install our provided vLLM package, we offer an alternative "ugly" method:
1. Install vLLM with Qwen2.5-VL support.
2. Modify `config.json`. In your model weights directory, open `config.json` and change the `architectures` field to `"Qwen2_5_VLForConditionalGeneration"`.
3. Patch the vLLM source code. Locate the file `vllm/model_executor/models/qwen2_5_vl.py` in your vLLM installation path and add the following code inside the `__init__` method of the `Qwen2_5_VLForConditionalGeneration` class:
```python
# Prerequisite imports at the top of qwen2_5_vl.py (if not already present):
#   from torch import nn
#   from transformers import WhisperModel

# ARC-Qwen-Video's audio branch: a Whisper encoder plus an MLP projector
# that maps speech features into the LLM's embedding space.
whisper_path = 'openai/whisper-large-v3'
speech_encoder = WhisperModel.from_pretrained(whisper_path).encoder
self.speech_encoder = speech_encoder
speech_dim = speech_encoder.config.d_model
llm_hidden_size = config.vision_config.out_hidden_size
self.mlp_speech = nn.Sequential(
    nn.LayerNorm(speech_dim),
    nn.Linear(speech_dim, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size)
)
```
**Why this works**: Our model is based on the Qwen2.5-VL architecture, with the addition of an audio encoder and a corresponding MLP. During vLLM inference, the multi-modal encoders process inputs sequentially, while the LLM performs batch inference. Since we only need to pass the final multi-modal embeddings to the LLM, we can reuse the existing Qwen2.5-VL code.
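Step 2 above amounts to a one-line change in the checkpoint's `config.json`; here is a small sketch of doing it programmatically (the weights path is a placeholder):

```python
import json

cfg_path = "path/to/ARC-Qwen-Video-7B/config.json"  # placeholder: your local weights dir

# Point the checkpoint at vLLM's stock Qwen2.5-VL implementation (step 2 above).
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["architectures"] = ["Qwen2_5_VLForConditionalGeneration"]
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```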
### Inference
**Note:** Our model currently excels at processing short videos of up to 5 minutes. If your video is longer, we recommend following the approach used in our demo and API: split the video into segments, run inference on each, and then use an LLM to integrate the results (a segmentation sketch follows below).
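One simple way to produce such segments is FFmpeg's segment muxer; here is a small sketch (our own, assuming `ffmpeg` is on your PATH) that cuts a video into 5-minute chunks without re-encoding:

```python
import subprocess

def split_video(src: str, segment_s: int = 300, pattern: str = "segment_%03d.mp4") -> None:
    """Split a video into fixed-length chunks via FFmpeg's segment muxer
    (stream copy, so no quality loss and near-instant runtime)."""
    subprocess.run(
        ["ffmpeg", "-i", src, "-c", "copy", "-map", "0",
         "-f", "segment", "-segment_time", str(segment_s),
         "-reset_timestamps", "1", pattern],
        check=True,
    )

# split_video("long_video.mp4")  # then run inference per segment and merge with an LLM
```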
To quickly verify that your environment is set up correctly and that video and audio information are being processed as expected, you can run the following test case with ARC-Qwen-Video-7B.
```python
video_path = "examples/猪排.mp4"
task = "QA"
question = "What did the man say at the beginning of the video after measuring the thickness of the fried pork cutlet?"
```
**Expected result:** If the model's output contains the phrase "So thin", your installation is working correctly.
#### Inference without vLLM
```bash
cd ARC-Hunyuan-Video-7B
# For ARC-Qwen-Video-7B
python3 inference_arc_qwen_video.py
# For ARC-Qwen-Video-7B-Narrator
python3 inference_arc_qwen_video_narrator.py
```
#### Inference with vLLM
```bash
cd ARC-Hunyuan-Video-7B
# For ARC-Qwen-Video-7B
python3 vllm_arc_qwen_vl_video_batch.py --batch_inference
# For ARC-Qwen-Video-7B-Narrator
python3 vllm_arc_qwen_vl_video_batch_narrator.py --batch_inference
```
## Benchmark Performance
| Model | Video-MMMU | MMVU | Temp-Compass | Video-Holmes | Video-MME | VCR-Bench | MV-Bench | ShortVid-Bench | Charades-STA |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ARC-Hunyuan-Video-7B | 31.1 | 49.1 | 66.0 | 40.9 | 58.7 | 50.5 | **62.6** | **73.0** | **54.8** |
| ARC-Qwen-Video-7B | **41.3** | **55.5** | **68.7** | **51.1** | **61.0** | **52.3** | 60.8 | 72.6 | 52.8 |
Quantitative evaluation is performed on different benchmarks using accuracy as the evaluation metric, except for the grounding task on Charades-STA, which uses mIoU. For all benchmarks other than Video-MMMU and Charades-STA, we only evaluated the multiple-choice questions.
## Citation
If you find the work helpful, please consider citing:
```bibtex
@article{ge2025arc,
title={ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts},
author={Ge, Yuying and Ge, Yixiao and Li, Chen and Wang, Teng and Pu, Junfu and Li, Yizhuo and Qiu, Lu and Ma, Jin and Duan, Lisheng and Zuo, Xinyu and others},
journal={arXiv preprint arXiv:2507.20939},
year={2025}
}
```