Improve model card with detailed architecture, installation and inference instructions

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +53 -31
README.md CHANGED
@@ -1,8 +1,11 @@
 ---
-license: apache-2.0
 language:
 - en
 pipeline_tag: video-text-to-text
 tags:
 - vision-language-model
 - long-video-understanding
@@ -13,22 +16,15 @@ tags:

 # 🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding

-[![Project Page](https://img.shields.io/badge/Project-Page-green?logo=googlechrome&logoColor=white)](https://feielysia.github.io/tempo-page/)
-[![Demo](https://img.shields.io/badge/🤗_Space-Demo-yellow)](https://huggingface.co/spaces/Vision-CAIR/Tempo)
-[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/FeiElysia/Tempo)
-[![Paper](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv)](https://arxiv.org/abs/2604.08120)
-[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-
-## 📖 Model Overview
-
-**Tempo-6B** is an efficient, query-aware Multimodal Large Language Model (MLLM) designed explicitly for extreme-long video understanding. It effectively resolves the structural mismatch between massive video streams and bounded LLM context windows.
-
-Instead of relying on query-agnostic heuristics like sparse sampling or uniform pooling, which often discard decisive moments, Tempo acts as a smart temporal compressor. It performs early cross-modal distillation, generating highly compact, intent-aligned video representations in a single forward pass.
-
-### ✨ Key Features
-- **Adaptive Token Allocation (ATA):** Acts as a training-free, *O(1)* dynamic router. Exploiting the zero-shot relevance prior of the SVLM, it allocates dense representational bandwidth only to query-critical segments.
-- **Token Efficiency:** Achieves aggressive dynamic compression (0.5–16 tokens/frame), actively compressing redundancies into minimal *temporal anchors* to maintain overarching causality.
-- **Hour-Long Video Capability:** Effectively processes and answers complex queries for videos over an hour long without hitting context limits or suffering from the *lost-in-the-middle* phenomenon.

 ## 🏗️ Architecture
 Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).
@@ -36,44 +32,70 @@ Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global L
 - **Global LLM:** Qwen/Qwen3-4B
 - **Total Parameters:** ~6B

 ## 🚀 Quick Start

-To use Tempo-6B for inference, please rely on our official GitHub repository, which contains the necessary custom code and interactive Gradio UI.

 ```bash
-# 1. Clone the repository
 git clone https://github.com/FeiElysia/Tempo.git
 cd Tempo

-# 2. Create environment
 conda create -n tempo python=3.12 -y
 conda activate tempo

-# 3. Install all packages (PyTorch 2.6.0 + CUDA 12.4)
 pip install -r requirements.txt

-# 4. Run inference (example)
-# Check our GitHub for detailed inference scripts
-python inference.py --model_path Vision-CAIR/Tempo-6B --video_path /path/to/video.mp4 --query "Your question"
 ```
-*(Note: Since Tempo relies on custom routing mechanisms, downloading the weights directly via `transformers` without our codebase will not work out-of-the-box.)*

-## 🏆 Performance

-Extensive experiments demonstrate that our compact 6B architecture achieves state-of-the-art performance on extreme-long video tasks.

-**LVBench (Extreme-Long, 4101s avg):**
-- **Tempo-6B (8K Budget):** **52.3**
-- **Tempo-6B (Scaled to 2048 frames):** **53.7**

-## 📑 Citation

-If you find Tempo or our code useful for your research, please consider citing our paper:

 ```bibtex
 @article{fei2026small,
   title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
-  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
   journal={arXiv preprint arXiv:2604.08120},
   year={2026}
 }

 ---
 language:
 - en
+license: apache-2.0
 pipeline_tag: video-text-to-text
+base_model:
+- Qwen/Qwen3-VL-2B-Instruct
+- Qwen/Qwen3-4B
 tags:
 - vision-language-model
 - long-video-understanding

 # 🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding

+[![Project Page](https://img.shields.io/badge/Project-Page-blue?style=flat-square)](https://feielysia.github.io/tempo-page/)
+[![Paper](https://img.shields.io/badge/arXiv-Paper-b31b1b?style=flat-square)](https://huggingface.co/papers/2604.08120)
+[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-yellow?style=flat-square)](https://huggingface.co/spaces/Vision-CAIR/Tempo)
+[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?style=flat-square)](https://github.com/FeiElysia/Tempo)
+[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg?style=flat-square)](https://opensource.org/licenses/Apache-2.0)

+**Tempo-6B** is an efficient, query-aware Multimodal Large Language Model (MLLM) designed for extreme-long video understanding. It was presented in the paper [Small Vision-Language Models are Smart Compressors for Long Video Understanding](https://huggingface.co/papers/2604.08120).

+Tempo resolves the structural mismatch between massive video streams and bounded LLM context windows by acting as a smart temporal compressor: it performs early cross-modal distillation, generating compact, intent-aligned video representations in a single forward pass.

 ## 🏗️ Architecture
 Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).
 - **Global LLM:** Qwen/Qwen3-4B
 - **Total Parameters:** ~6B

+### ✨ Key Features
+- **Adaptive Token Allocation (ATA):** A training-free, *O(1)* dynamic router that allocates dense representational bandwidth only to query-critical segments.
+- **Token Efficiency:** Aggressive dynamic compression (0.5–16 tokens/frame), maintaining global causality while discarding redundancies.
+- **Hour-Long Video Capability:** Processes and answers complex queries for videos over an hour long without hitting context limits.
+
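+The ATA idea can be made concrete with a small sketch. This is only an illustration of relevance-proportional budgeting under the stated 0.5–16 tokens/frame range; the function name and the simple proportional split are assumptions, not code from the Tempo repository.
+
+```python
+import numpy as np
+
+def allocate_tokens(relevance, total_budget, min_tpf=0.5, max_tpf=16.0):
+    """Illustrative query-aware budgeting: split a fixed token budget
+    across video segments in proportion to their relevance scores,
+    clamped to a per-frame floor and ceiling."""
+    relevance = np.asarray(relevance, dtype=float)
+    weights = relevance / relevance.sum()      # normalize scores to a distribution
+    raw = weights * total_budget               # proportional share of the budget
+    return np.clip(raw, min_tpf, max_tpf)      # enforce tokens-per-frame bounds
+
+# A query-critical segment (relevance 2.0) receives twice the bandwidth
+# of a background segment (relevance 1.0); near-zero segments still get
+# the 0.5 tokens/frame floor so global causality is not lost.
+```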
 ## 🚀 Quick Start

+### 1. Installation

 ```bash
+# Clone the repository
 git clone https://github.com/FeiElysia/Tempo.git
 cd Tempo

+# Create environment
 conda create -n tempo python=3.12 -y
 conda activate tempo

+# Install dependencies
 pip install -r requirements.txt
+```
+
+### 2. Prepare Checkpoints
+
+To run the inference script, download both the Tempo-6B weights and the base Qwen3-VL model used for architecture initialization.
+
+```bash
+mkdir -p checkpoints
+
+# 1. Download the final Tempo-6B model
+huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B
+
+# 2. Download the base Qwen3-VL model
+huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
 ```
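+After both downloads finish, the `checkpoints/` directory should contain the two folders named by the `--local-dir` flags above:
+
+```
+checkpoints/
+├── Tempo-6B/
+└── Qwen3-VL-2B-Instruct/
+```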
 

+### 3. Inference
+
+**Launch Gradio Web UI:**
+```bash
+python app.py
+```
+
+**CLI Inference:**
+```bash
+python infer.py \
+    --model_path "./checkpoints/Tempo-6B" \
+    --video_path "/path/to/your/video.mp4" \
+    --query "Describe the video in detail."
+```
+
+*(Note: Since Tempo relies on custom routing mechanisms, downloading the weights directly via `transformers` without the official codebase will not work out-of-the-box.)*
+
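+For scripted batch runs, the CLI call above can be assembled from Python. A minimal sketch: `infer.py` and its flags come from the example above; everything else (the helper name, the batch loop) is illustrative.
+
+```python
+import subprocess
+
+def build_infer_command(model_path, video_path, query):
+    """Assemble the Tempo CLI invocation as an argument list, ready for
+    subprocess.run(cmd, check=True) from the repository root."""
+    return [
+        "python", "infer.py",
+        "--model_path", model_path,
+        "--video_path", video_path,
+        "--query", query,
+    ]
+
+# Example: one command per clip for a batch of videos.
+videos = ["/path/to/your/video.mp4"]
+commands = [
+    build_infer_command("./checkpoints/Tempo-6B", v, "Describe the video in detail.")
+    for v in videos
+]
+```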
+ ## πŸ† Performance
89
 
90
+ Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On **LVBench** (average video length 4101s), Tempo-6B scores **52.3**, outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.
91
+
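+The compression figures above can be sanity-checked with back-of-the-envelope arithmetic (an illustration from the numbers in this card, not repository code):
+
+```python
+def avg_tokens_per_frame(token_budget, num_frames):
+    """Average visual-token bandwidth each frame receives under a fixed budget."""
+    return token_budget / num_frames
+
+# Spreading the 8K-token budget over 2048 sampled frames gives each frame
+# 4 tokens on average, well inside the 0.5-16 tokens/frame dynamic range;
+# a uniform 16-tokens/frame encoding would need 4x that budget.
+avg = avg_tokens_per_frame(8192, 2048)   # 4.0
+dense_budget = 2048 * 16                 # 32768 tokens uncompressed
+```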
+## 📑 Citation

 ```bibtex
 @article{fei2026small,
   title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
+  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and Liu, Shuming and Wu, Lemeng and Krishnamoorthi, Raghuraman and Chandra, Vikas and Elhoseiny, Mohamed and Zhu, Chenchen},
   journal={arXiv preprint arXiv:2604.08120},
   year={2026}
 }
+```