---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
- video-understanding
- video-audio understanding
- video-qa
- video-captioning
- video-grounding
- video-reasoning
- short video understanding
---

# ARC-Qwen-Video-7B

[Paper](https://arxiv.org/abs/2507.20939)
[Demo](https://arc.tencent.com/en/ai-demos/multimodal)
[Code](https://github.com/TencentARC/ARC-Hunyuan-Video-7B/tree/arc-qwen-video)
[ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B)
[ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B)
[ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator)
[Blog](https://tencentarc.github.io/posts/arc-video-announcement/)
[ShortVid-Bench](https://huggingface.co/datasets/TencentARC/ShortVid-Bench)

In this version, we switch the base model from the Hunyuan VLM used in [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) to [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and introduce [ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B) for understanding real-world short videos. We use the same training data and training stages; for a detailed introduction, please refer to [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B). The main distinctions are listed below:

| Feature | `ARC-Hunyuan-Video-7B` | `ARC-Qwen-Video-7B` |
| --- | --- | --- |
| **Base VLM** | Hunyuan-VL-7B-Pretrain | Qwen2.5-VL-7B-Instruct |
| **Frame Resolution** <br> <small>*Each model uses a fixed frame resolution to maintain audio-video synchronization.*</small> | Fixed at `640 x 640` | Fixed at `392 x 292` |
| **Frame Sampling** | • < 150s: 1 FPS <br> • > 150s: Uniformly sample 150 frames. | • < 300s: 1 FPS <br> • > 300s: Uniformly sample 300 frames. |
| **Audio-Video Synchronization** | • < 150s: Sum tokens from 1s audio + 1s video frame. <br> • 150-300s: Sum tokens from corresponding audio segment + video frame. <br> • > 300s: Split audio into 300 segments, use first 2s of each. | • < 300s: Sum tokens from 1s audio + 1s video. <br> • > 300s: Split audio into 300 segments, use middle 1s of each. |
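
The frame-sampling rule for `ARC-Qwen-Video-7B` can be sketched in code. This is an illustrative helper under our own naming, not code from the repository:

```python
# Hypothetical sketch of the sampling rule described above:
# videos up to 300 s are sampled at 1 FPS; longer videos get
# 300 uniformly spaced frames. Function name and return format are ours.

def sample_frame_timestamps(duration_s: float, max_frames: int = 300) -> list[float]:
    """Return the timestamps (in seconds) at which frames would be sampled."""
    if duration_s <= max_frames:
        # 1 FPS: one frame per second of video.
        return [float(t) for t in range(int(duration_s))]
    # Longer videos: max_frames frames spread uniformly across the duration.
    step = duration_s / max_frames
    return [i * step for i in range(max_frames)]

print(len(sample_frame_timestamps(120)))  # 1 FPS -> 120 timestamps
print(len(sample_frame_timestamps(900)))  # capped at 300 timestamps
```

The same sketch applies to `ARC-Hunyuan-Video-7B` with `max_frames=150`.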

We are also introducing a new model, [ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator). It outputs **timestamped video descriptions, speaker identities, and the corresponding ASR (Automatic Speech Recognition) content**. By processing its output with an external LLM, you can obtain more comprehensive structured information, as in the example below (click to watch the video):

[<img src="https://img.youtube.com/vi/Bz1T4wCuWc8/maxresdefault.jpg" alt="Video" width="300">](https://www.youtube.com/watch?v=Bz1T4wCuWc8)

<table border="1" style="width:100%; border-collapse: collapse;">
<tr>
<td style="padding: 15px;">

### Video Overview

This is a comedy short about a husband whose secret stash of money, hidden in a padded coat, is accidentally discovered by his wife, who mistakes it for a "surprise" gift from him. Through a single phone call between the couple, the video vividly portrays the husband going from carefree ease, to stunned disbelief, to resigned despair, with plenty of dramatic reversals and humor.

### Plot Breakdown

The plot unfolds around one phone call. Below is a detailed timeline with scenes, speakers, and dialogue:

<table>
<thead>
<tr>
<th>Timestamp</th>
<th>Scene</th>
<th>Speaker</th>
<th>Dialogue (ASR)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0:00 - 0:05</td>
<td>The husband, wearing a shower cap and wrapped in a towel, leisurely takes selfies beside an indoor pool.</td>
<td>None</td>
<td>(no dialogue)</td>
</tr>
<tr>
<td>0:05 - 0:10</td>
<td><b>Cut</b>: The wife, in a clothing store, calls her husband with a look of pure happiness.</td>
<td>Wife</td>
<td>"Hey, honey, honey, I love you, I love you, love you to death, mwah mwah mwah."</td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: top;">0:10 - 0:18</td>
<td rowspan="2" style="vertical-align: top;">The husband answers the phone, puzzled by his wife's enthusiasm; she excitedly reveals the "surprise".</td>
<td>Husband</td>
<td>"Hey, what's going on? Why so happy?"</td>
</tr>
<tr>
<td>Wife</td>
<td>"Today I found the surprise you left in my padded-coat pocket: ten thousand yuan!"</td>
</tr>
<tr>
<td>0:18 - 0:27</td>
<td>Hearing "ten thousand yuan", the husband's expression freezes, shifting from confusion to shock and regret, though he forces himself to stay calm.</td>
<td>Husband</td>
<td>"Huh? Great... as long as you're happy."</td>
</tr>
<tr>
<td>0:27 - 0:34</td>
<td>The wife happily explains what she did with the money; the husband's face freezes completely as his shock deepens.</td>
<td>Wife</td>
<td>"Of course I'm happy. I used it to buy a new outfit. I'll wear it for you when you get home tonight."</td>
</tr>
<tr>
<td rowspan="3" style="vertical-align: top;">0:34 - 0:46</td>
<td rowspan="3" style="vertical-align: top;">The husband confirms the money has been spent and breaks down; the wife believes he had approved it, and he can't help cursing.</td>
<td>Husband</td>
<td>"You already spent it on clothes?"</td>
</tr>
<tr>
<td>Wife</td>
<td>"Of course. Isn't that what you said? To buy something I like. Honey, you're the best."</td>
</tr>
<tr>
<td>Husband</td>
<td>"What a spendthrift you are!"</td>
</tr>
<tr>
<td rowspan="4" style="vertical-align: top;">0:46 - 0:59</td>
<td rowspan="4" style="vertical-align: top;">The wife senses something off in her husband's tone; he immediately backpedals to cover it up and urges her to come home early.</td>
<td>Wife</td>
<td>"What? Honey, what did you say?"</td>
</tr>
<tr>
<td>Husband</td>
<td>"Huh? I said great. If you look pretty, I'm happy."</td>
</tr>
<tr>
<td>Wife</td>
<td>"You said it, honey. Be sure to come home early today. I'll be waiting."</td>
</tr>
<tr>
<td>Husband</td>
<td>"Fine, fine, fine."</td>
</tr>
</tbody>
</table>

### Characters and Core Conflict

#### 1. Character Analysis

Husband:
Behavior: Hides a secret stash of money; once it is discovered, he does his best to mask his true feelings (heartache and regret).
Emotional arc: Carefree -> puzzled -> shocked -> devastated -> resigned acceptance.
Traits: Keeps up appearances; loves his wife yet feels helpless; a classic "henpecked husband".

Wife:
Behavior: Takes the money she finds as an expression of her husband's love and promptly spends it.
Emotional arc: Stays immersed in the happiness and joy of discovering the "surprise" throughout.
Traits: Naive, decisive about spending, full of trust and affection for her husband.

#### 2. Core Conflict

The core conflict is a dramatic misunderstanding created by a severe information asymmetry:

* Husband's perspective: The 10,000 yuan he painstakingly saved in secret is accidentally discovered and spent; a "shock".
* Wife's perspective: A 10,000-yuan romance fund thoughtfully prepared by her husband; a huge "surprise".

This misunderstanding drives the whole story. The husband's forced silence over his loss and the wife's taken-for-granted happiness form a sharp comic contrast, producing a steady stream of laughs.

### Summary

Through a familiar domestic scenario about secret savings, the video cleverly builds a story full of reversals and humor. Using dramatic irony (the audience and the husband know the truth while the wife remains in the dark), it precisely captures the husband's complex state of mind in a sudden crisis. The result is not only consistently funny but also implicitly touches on communication, trust, and attitudes toward money within a marriage, making it easy for viewers to relate and discuss.

</td>
</tr>
</table>

## Usage

### Dependencies

The installation has been tested and verified on the following environments:
* NVIDIA H20 with CUDA 12.4
* NVIDIA A100 with CUDA 12.1

### Installation

Clone the repo and install the required packages:

```bash
git clone -b arc-qwen-video https://github.com/TencentARC/ARC-Hunyuan-Video-7B.git
cd ARC-Hunyuan-Video-7B

# Install torch 2.6.0 based on your CUDA version
# CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# CUDA 12.6
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

pip install librosa decord av accelerate
pip uninstall transformers
pip install git+https://github.com/geyuying/transformers.git@arc-qwen-video
pip install flash_attn==2.7.1.post4

# Install FFmpeg for your system, and make sure the following command prints a version string:
ffmpeg -version

# (Optional) For vllm, follow the instructions below:
pip uninstall vllm
pip install git+https://github.com/geyuying/vllm.git@arc-qwen-video
```

#### An "Ugly" Workaround for vLLM Installation

If you are unable to install our provided vllm package, we offer an alternative "ugly" method:

1. Install a vllm version with Qwen2.5-VL support.

2. Modify `config.json`. In your model weights directory, open `config.json` and change the `architectures` field to `"Qwen2_5_VLForConditionalGeneration"`.

3. Patch the vllm source code. Locate the file `vllm/model_executor/models/qwen2_5_vl.py` in your vllm installation path and add the following code inside the `__init__` method of the `Qwen2_5_VLForConditionalGeneration` class:

```python
# Make sure these imports are available at the top of the file:
#   from torch import nn
#   from transformers import WhisperModel
whisper_path = 'openai/whisper-large-v3'
speech_encoder = WhisperModel.from_pretrained(whisper_path).encoder
self.speech_encoder = speech_encoder
speech_dim = speech_encoder.config.d_model
llm_hidden_size = config.vision_config.out_hidden_size
self.mlp_speech = nn.Sequential(
    nn.LayerNorm(speech_dim),
    nn.Linear(speech_dim, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size)
)
```
**Why this works**: Our model is based on the Qwen2.5-VL architecture, with the addition of an audio encoder and a corresponding MLP. During vllm inference, the multi-modal encoder processes inputs sequentially, while the LLM performs batch inference. Since we only need to pass the final multi-modal embeddings to the LLM, we can reuse the existing Qwen2.5-VL code.

### Inference

```bash
# Our model currently excels at processing short videos of up to 5 minutes.
# If your video is longer, we recommend following the approach used in our demo and API:
# split the video into segments for inference, and then use an LLM to integrate the results.
```
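
For reference, the splitting step can be done with FFmpeg's segment muxer. This is a hedged sketch under the assumption that roughly 5-minute (300 s) chunks are acceptable; the helper name and file names are illustrative, not from our demo code:

```python
# Illustrative only: build an ffmpeg command that splits a long video into
# ~300 s chunks without re-encoding. Run each chunk through the model, then
# merge the per-chunk outputs with an external LLM.
def build_split_cmd(video_path: str, out_pattern: str = "segment_%03d.mp4",
                    segment_seconds: int = 300) -> list[str]:
    return [
        "ffmpeg", "-i", video_path,
        "-c", "copy",                      # stream copy: fast, no quality loss
        "-f", "segment",                   # ffmpeg's segment muxer
        "-segment_time", str(segment_seconds),
        "-reset_timestamps", "1",          # each chunk starts at t=0
        out_pattern,
    ]

# Example (requires ffmpeg on PATH):
# import subprocess
# subprocess.run(build_split_cmd("long_video.mp4"), check=True)
```

Stream copy keeps splitting fast, but cuts land on keyframes, so chunk lengths are approximate.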
To quickly verify that your environment is set up correctly and that video and audio information are processed as expected, run the following test case with ARC-Qwen-Video-7B.

```python
video_path = "examples/猪排.mp4"
task = "QA"
question = "What did the man say at the beginning of the video after measuring the thickness of the fried pork cutlet?"
```

Expected result: if the model's output contains the phrase "So thin", your installation is successful.

#### Inference without vllm

```bash
cd ARC-Hunyuan-Video-7B

# For ARC-Qwen-Video-7B
python3 inference_arc_qwen_video.py

# For ARC-Qwen-Video-7B-Narrator
python3 inference_arc_qwen_video_narrator.py
```

#### Inference with vllm

```bash
cd ARC-Hunyuan-Video-7B

# For ARC-Qwen-Video-7B
python3 vllm_arc_qwen_vl_video_batch.py --batch_inference

# For ARC-Qwen-Video-7B-Narrator
python3 vllm_arc_qwen_vl_video_batch_narrator.py --batch_inference
```

## Benchmark Performance

| | Video-MMMU | MMVU | TempCompass | Video-Holmes | Video-MME | VCR-Bench | MV-Bench | ShortVid-Bench | Charades-STA |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ARC-Hunyuan-Video-7B | 31.1 | 49.1 | 66.0 | 40.9 | 58.7 | 50.5 | **62.6** | **73.0** | **54.8** |
| ARC-Qwen-Video-7B | **41.3** | **55.5** | **68.7** | **51.1** | **61.0** | **52.3** | 60.8 | 72.6 | 52.8 |

Quantitative evaluation is performed on each benchmark using accuracy as the metric, except for the grounding task on Charades-STA, which uses mIoU. For all benchmarks other than Video-MMMU and Charades-STA, we evaluated only the multiple-choice questions.

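
For reference, the mIoU metric used for Charades-STA grounding is the temporal intersection-over-union between predicted and ground-truth time spans, averaged over examples. The intervals below are made-up toy values, not benchmark data:

```python
# Toy illustration of temporal IoU and its mean (mIoU).

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two time intervals given as (start_s, end_s)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # For overlapping intervals, the union equals the enclosing span.
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

preds = [(2.0, 8.0), (10.0, 20.0)]   # predicted spans (toy values)
gts   = [(4.0, 8.0), (12.0, 18.0)]   # ground-truth spans (toy values)
miou = sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
print(round(miou, 3))  # 0.633
```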
## Citation

If you find the work helpful, please consider citing:

```bibtex
@article{ge2025arc,
  title={ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts},
  author={Ge, Yuying and Ge, Yixiao and Li, Chen and Wang, Teng and Pu, Junfu and Li, Yizhuo and Qiu, Lu and Ma, Jin and Duan, Lisheng and Zuo, Xinyu and others},
  journal={arXiv preprint arXiv:2507.20939},
  year={2025}
}
```