---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- multimodal
- video-understanding
- video-audio understanding
- video-qa
- video-captioning
- video-grounding
- video-reasoning
- short video understanding
---

# ARC-Qwen-Video-7B

[Paper](https://arxiv.org/abs/2507.20939)
[Demo](https://arc.tencent.com/en/ai-demos/multimodal)
[Code](https://github.com/TencentARC/ARC-Hunyuan-Video-7B/tree/arc-qwen-video)
[ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B)
[ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B)
[ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator)
[Blog](https://tencentarc.github.io/posts/arc-video-announcement/)
[ShortVid-Bench](https://huggingface.co/datasets/TencentARC/ShortVid-Bench)

In this version, we switch the base model from the Hunyuan VLM used in [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B) to [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and introduce [ARC-Qwen-Video-7B](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B) for understanding real-world short videos. We use the same training data and training stages; for a detailed introduction, please refer to [ARC-Hunyuan-Video-7B](https://huggingface.co/TencentARC/ARC-Hunyuan-Video-7B). The main distinctions are listed below:

| Feature | `ARC-Hunyuan-Video-7B` | `ARC-Qwen-Video-7B` |
| --- | --- | --- |
| **Base VLM** | Hunyuan-VL-7B-Pretrain | Qwen2.5-VL-7B-Instruct |
| **Frame Resolution** <br> <small>*Each model uses a fixed frame resolution to maintain audio-video synchronization.*</small> | Fixed at `640 x 640` | Fixed at `392 x 292` |
| **Frame Sampling** | • < 150s: 1 FPS <br> • > 150s: Uniformly sample 150 frames. | • < 300s: 1 FPS <br> • > 300s: Uniformly sample 300 frames. |
| **Audio-Video Synchronization** | • < 150s: Sum tokens from 1s audio + 1s video frame. <br> • 150-300s: Sum tokens from corresponding audio segment + video frame. <br> • > 300s: Split audio into 300 segments, use first 2s of each. | • < 300s: Sum tokens from 1s audio + 1s video. <br> • > 300s: Split audio into 300 segments, use middle 1s of each. |
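
The frame-sampling rule for `ARC-Qwen-Video-7B` can be sketched in code. This is an illustrative helper under our own naming, not code from the repository:

```python
# Hypothetical sketch of the sampling rule described above:
# videos up to 300 s are sampled at 1 FPS; longer videos get
# 300 uniformly spaced frames. Function name and return format are ours.

def sample_frame_timestamps(duration_s: float, max_frames: int = 300) -> list[float]:
    """Return the timestamps (in seconds) at which frames would be sampled."""
    if duration_s <= max_frames:
        # 1 FPS: one frame per second of video.
        return [float(t) for t in range(int(duration_s))]
    # Longer videos: max_frames frames spread uniformly across the duration.
    step = duration_s / max_frames
    return [i * step for i in range(max_frames)]

print(len(sample_frame_timestamps(120)))  # 1 FPS -> 120 timestamps
print(len(sample_frame_timestamps(900)))  # capped at 300 timestamps
```

The same sketch applies to `ARC-Hunyuan-Video-7B` with `max_frames=150`.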

We are also introducing a new model, [ARC-Qwen-Video-7B-Narrator](https://huggingface.co/TencentARC/ARC-Qwen-Video-7B-Narrator). It outputs **timestamped video descriptions, speaker identities, and the corresponding ASR (Automatic Speech Recognition) content**. By processing its output with an external LLM, you can obtain more comprehensive structured information, as in the example below (click to watch the video):

[<img src="https://img.youtube.com/vi/Bz1T4wCuWc8/maxresdefault.jpg" alt="Video" width="300">](https://www.youtube.com/watch?v=Bz1T4wCuWc8)

<table border="1" style="width:100%; border-collapse: collapse;">
<tr>
<td style="padding: 15px;">

### Video Overview

This is a comedy short about a husband whose secret stash of money, hidden in a padded coat, is accidentally discovered by his wife, who mistakes it for a "surprise" gift from him. Through a single phone call between the couple, the video vividly portrays the husband going from carefree ease, to stunned disbelief, to resigned despair, with plenty of dramatic reversals and humor.

### Plot Breakdown

The plot unfolds around one phone call. Below is a detailed timeline with scenes, speakers, and dialogue:

<table>
<thead>
<tr>
<th>Timestamp</th>
<th>Scene</th>
<th>Speaker</th>
<th>Dialogue (ASR)</th>
</tr>
</thead>
<tbody>
<tr>
<td>0:00 - 0:05</td>
<td>The husband, wearing a shower cap and wrapped in a towel, leisurely takes selfies beside an indoor pool.</td>
<td>None</td>
<td>(no dialogue)</td>
</tr>
<tr>
<td>0:05 - 0:10</td>
<td><b>Cut</b>: The wife, in a clothing store, calls her husband with a look of pure happiness.</td>
<td>Wife</td>
<td>"Hey, honey, honey, I love you, I love you, love you to death, mwah mwah mwah."</td>
</tr>
<tr>
<td rowspan="2" style="vertical-align: top;">0:10 - 0:18</td>
<td rowspan="2" style="vertical-align: top;">The husband answers the phone, puzzled by his wife's enthusiasm; she excitedly reveals the "surprise".</td>
<td>Husband</td>
<td>"Hey, what's going on? Why so happy?"</td>
</tr>
<tr>
<td>Wife</td>
<td>"Today I found the surprise you left in my padded-coat pocket: ten thousand yuan!"</td>
</tr>
<tr>
<td>0:18 - 0:27</td>
<td>Hearing "ten thousand yuan", the husband's expression freezes, shifting from confusion to shock and regret, though he forces himself to stay calm.</td>
<td>Husband</td>
<td>"Huh? Great... as long as you're happy."</td>
</tr>
<tr>
<td>0:27 - 0:34</td>
<td>The wife happily explains what she did with the money; the husband's face freezes completely as his shock deepens.</td>
<td>Wife</td>
<td>"Of course I'm happy. I used it to buy a new outfit. I'll wear it for you when you get home tonight."</td>
</tr>
<tr>
<td rowspan="3" style="vertical-align: top;">0:34 - 0:46</td>
<td rowspan="3" style="vertical-align: top;">The husband confirms the money has been spent and breaks down; the wife believes he had approved it, and he can't help cursing.</td>
<td>Husband</td>
<td>"You already spent it on clothes?"</td>
</tr>
<tr>
<td>Wife</td>
<td>"Of course. Isn't that what you said? To buy something I like. Honey, you're the best."</td>
</tr>
<tr>
<td>Husband</td>
<td>"What a spendthrift you are!"</td>
</tr>
<tr>
<td rowspan="4" style="vertical-align: top;">0:46 - 0:59</td>
<td rowspan="4" style="vertical-align: top;">The wife senses something off in her husband's tone; he immediately backpedals to cover it up and urges her to come home early.</td>
<td>Wife</td>
<td>"What? Honey, what did you say?"</td>
</tr>
<tr>
<td>Husband</td>
<td>"Huh? I said great. If you look pretty, I'm happy."</td>
</tr>
<tr>
<td>Wife</td>
<td>"You said it, honey. Be sure to come home early today. I'll be waiting."</td>
</tr>
<tr>
<td>Husband</td>
<td>"Fine, fine, fine."</td>
</tr>
</tbody>
</table>

### Characters and Core Conflict

#### 1. Character Analysis

Husband:
Behavior: Hides a secret stash of money; once it is discovered, he does his best to mask his true feelings (heartache and regret).
Emotional arc: Carefree -> puzzled -> shocked -> devastated -> resigned acceptance.
Traits: Keeps up appearances; loves his wife yet feels helpless; a classic "henpecked husband".

Wife:
Behavior: Takes the money she finds as an expression of her husband's love and promptly spends it.
Emotional arc: Stays immersed in the happiness and joy of discovering the "surprise" throughout.
Traits: Naive, decisive about spending, full of trust and affection for her husband.

#### 2. Core Conflict

The core conflict is a dramatic misunderstanding created by a severe information asymmetry:

* Husband's perspective: The 10,000 yuan he painstakingly saved in secret is accidentally discovered and spent; a "shock".
* Wife's perspective: A 10,000-yuan romance fund thoughtfully prepared by her husband; a huge "surprise".

This misunderstanding drives the whole story. The husband's forced silence over his loss and the wife's taken-for-granted happiness form a sharp comic contrast, producing a steady stream of laughs.

### Summary

Through a familiar domestic scenario about secret savings, the video cleverly builds a story full of reversals and humor. Using dramatic irony (the audience and the husband know the truth while the wife remains in the dark), it precisely captures the husband's complex state of mind in a sudden crisis. The result is not only consistently funny but also implicitly touches on communication, trust, and attitudes toward money within a marriage, making it easy for viewers to relate and discuss.

</td>
</tr>
</table>

## Usage

### Dependencies

The installation has been tested and verified on the following environments:
* NVIDIA H20 with CUDA 12.4
* NVIDIA A100 with CUDA 12.1

### Installation

Clone the repo and install the required packages:

```bash
git clone -b arc-qwen-video https://github.com/TencentARC/ARC-Hunyuan-Video-7B.git
cd ARC-Hunyuan-Video-7B

# Install torch 2.6.0 based on your CUDA version
# CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.4
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# CUDA 12.6
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

pip install librosa decord av accelerate
pip uninstall transformers
pip install git+https://github.com/geyuying/transformers.git@arc-qwen-video
pip install flash_attn==2.7.1.post4

# Install FFmpeg for your system, and make sure the following command prints a version string:
ffmpeg -version

# (Optional) For vllm, follow the instructions below:
pip uninstall vllm
pip install git+https://github.com/geyuying/vllm.git@arc-qwen-video
```

#### An "Ugly" Workaround for vLLM Installation

If you are unable to install our provided vllm package, we offer an alternative "ugly" method:

1. Install a vllm version with Qwen2.5-VL support.

2. Modify `config.json`. In your model weights directory, open `config.json` and change the `architectures` field to `"Qwen2_5_VLForConditionalGeneration"`.

3. Patch the vllm source code. Locate the file `vllm/model_executor/models/qwen2_5_vl.py` in your vllm installation path and add the following code inside the `__init__` method of the `Qwen2_5_VLForConditionalGeneration` class:

```python
# Make sure these imports are available at the top of the file:
#   from torch import nn
#   from transformers import WhisperModel
whisper_path = 'openai/whisper-large-v3'
speech_encoder = WhisperModel.from_pretrained(whisper_path).encoder
self.speech_encoder = speech_encoder
speech_dim = speech_encoder.config.d_model
llm_hidden_size = config.vision_config.out_hidden_size
self.mlp_speech = nn.Sequential(
    nn.LayerNorm(speech_dim),
    nn.Linear(speech_dim, llm_hidden_size),
    nn.GELU(),
    nn.Linear(llm_hidden_size, llm_hidden_size)
)
```
**Why this works**: Our model is based on the Qwen2.5-VL architecture, with the addition of an audio encoder and a corresponding MLP. During vllm inference, the multi-modal encoder processes inputs sequentially, while the LLM performs batch inference. Since we only need to pass the final multi-modal embeddings to the LLM, we can reuse the existing Qwen2.5-VL code.

### Inference

```bash
# Our model currently excels at processing short videos of up to 5 minutes.
# If your video is longer, we recommend following the approach used in our demo and API:
# split the video into segments for inference, and then use an LLM to integrate the results.
```
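
For reference, the splitting step can be done with FFmpeg's segment muxer. This is a hedged sketch under the assumption that roughly 5-minute (300 s) chunks are acceptable; the helper name and file names are illustrative, not from our demo code:

```python
# Illustrative only: build an ffmpeg command that splits a long video into
# ~300 s chunks without re-encoding. Run each chunk through the model, then
# merge the per-chunk outputs with an external LLM.
def build_split_cmd(video_path: str, out_pattern: str = "segment_%03d.mp4",
                    segment_seconds: int = 300) -> list[str]:
    return [
        "ffmpeg", "-i", video_path,
        "-c", "copy",                      # stream copy: fast, no quality loss
        "-f", "segment",                   # ffmpeg's segment muxer
        "-segment_time", str(segment_seconds),
        "-reset_timestamps", "1",          # each chunk starts at t=0
        out_pattern,
    ]

# Example (requires ffmpeg on PATH):
# import subprocess
# subprocess.run(build_split_cmd("long_video.mp4"), check=True)
```

Stream copy keeps splitting fast, but cuts land on keyframes, so chunk lengths are approximate.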
To quickly verify that your environment is set up correctly and that video and audio information are processed as expected, run the following test case with ARC-Qwen-Video-7B.

```python
video_path = "examples/猪排.mp4"
task = "QA"
question = "What did the man say at the beginning of the video after measuring the thickness of the fried pork cutlet?"
```

Expected result: if the model's output contains the phrase "So thin", your installation is successful.

#### Inference without vllm

```bash
cd ARC-Hunyuan-Video-7B

# For ARC-Qwen-Video-7B
python3 inference_arc_qwen_video.py

# For ARC-Qwen-Video-7B-Narrator
python3 inference_arc_qwen_video_narrator.py
```

#### Inference with vllm

```bash
cd ARC-Hunyuan-Video-7B

# For ARC-Qwen-Video-7B
python3 vllm_arc_qwen_vl_video_batch.py --batch_inference

# For ARC-Qwen-Video-7B-Narrator
python3 vllm_arc_qwen_vl_video_batch_narrator.py --batch_inference
```

## Benchmark Performance

| | Video-MMMU | MMVU | TempCompass | Video-Holmes | Video-MME | VCR-Bench | MV-Bench | ShortVid-Bench | Charades-STA |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ARC-Hunyuan-Video-7B | 31.1 | 49.1 | 66.0 | 40.9 | 58.7 | 50.5 | **62.6** | **73.0** | **54.8** |
| ARC-Qwen-Video-7B | **41.3** | **55.5** | **68.7** | **51.1** | **61.0** | **52.3** | 60.8 | 72.6 | 52.8 |

Quantitative evaluation is performed on each benchmark using accuracy as the metric, except for the grounding task on Charades-STA, which uses mIoU. For all benchmarks other than Video-MMMU and Charades-STA, we evaluated only the multiple-choice questions.

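
For reference, the mIoU metric used for Charades-STA grounding is the temporal intersection-over-union between predicted and ground-truth time spans, averaged over examples. The intervals below are made-up toy values, not benchmark data:

```python
# Toy illustration of temporal IoU and its mean (mIoU).

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two time intervals given as (start_s, end_s)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # For overlapping intervals, the union equals the enclosing span.
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

preds = [(2.0, 8.0), (10.0, 20.0)]   # predicted spans (toy values)
gts   = [(4.0, 8.0), (12.0, 18.0)]   # ground-truth spans (toy values)
miou = sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
print(round(miou, 3))  # 0.633
```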
## Citation

If you find the work helpful, please consider citing:

```bibtex
@article{ge2025arc,
  title={ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts},
  author={Ge, Yuying and Ge, Yixiao and Li, Chen and Wang, Teng and Pu, Junfu and Li, Yizhuo and Qiu, Lu and Ma, Jin and Duan, Lisheng and Zuo, Xinyu and others},
  journal={arXiv preprint arXiv:2507.20939},
  year={2025}
}
```