Improve model card with detailed architecture, installation and inference instructions
#1
by nielsr HF Staff - opened
README.md
CHANGED
---
language:
- en
license: apache-2.0
pipeline_tag: video-text-to-text
base_model:
- Qwen/Qwen3-VL-2B-Instruct
- Qwen/Qwen3-4B
tags:
- vision-language-model
- long-video-understanding
---
|
# 🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding

[Project Page](https://feielysia.github.io/tempo-page/) |
[Paper](https://huggingface.co/papers/2604.08120) |
[Demo](https://huggingface.co/spaces/Vision-CAIR/Tempo) |
[Code](https://github.com/FeiElysia/Tempo) |
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
**Tempo-6B** is an efficient, query-aware Multimodal Large Language Model (MLLM) designed explicitly for extreme-long video understanding. It was presented in the paper [Small Vision-Language Models are Smart Compressors for Long Video Understanding](https://huggingface.co/papers/2604.08120).

Tempo resolves the structural mismatch between massive video streams and bounded LLM context windows by acting as a smart temporal compressor: it performs early cross-modal distillation, generating compact, intent-aligned video representations in a single forward pass.
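To make the context mismatch concrete, here is a back-of-envelope calculation. The 0.5 and 16 tokens/frame bounds are Tempo's stated compression range; the 256-token dense figure and the 1 fps sampling rate are illustrative assumptions, not numbers from this card:

```python
def video_tokens(duration_s: float, fps: float, tokens_per_frame: float) -> int:
    """Visual tokens the LLM must ingest for a video sampled at `fps`."""
    return round(duration_s * fps * tokens_per_frame)

HOUR = 3600  # seconds

# Hypothetical dense encoder emitting 256 tokens per frame (assumed value).
dense = video_tokens(HOUR, fps=1.0, tokens_per_frame=256)

# Tempo's dynamic compression range: 0.5-16 tokens per frame.
lo = video_tokens(HOUR, fps=1.0, tokens_per_frame=0.5)
hi = video_tokens(HOUR, fps=1.0, tokens_per_frame=16)

print(dense, lo, hi)  # 921600 1800 57600
```

At these assumed rates, an hour of densely encoded video overflows any practical context window, while Tempo's range keeps even the upper bound manageable.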
## 🏗️ Architecture
Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).

- **Local SVLM:** Qwen/Qwen3-VL-2B-Instruct
- **Global LLM:** Qwen/Qwen3-4B
- **Total Parameters:** ~6B

### ✨ Key Features

- **Adaptive Token Allocation (ATA):** Acts as a training-free, *O(1)* dynamic router. Exploiting the zero-shot relevance prior of the SVLM, it allocates dense representational bandwidth only to query-critical segments.
- **Token Efficiency:** Achieves aggressive dynamic compression (0.5–16 tokens/frame), compressing redundancies into minimal *temporal anchors* while maintaining global causality.
- **Hour-Long Video Capability:** Effectively processes and answers complex queries for videos over an hour long without hitting context limits or suffering from the *lost-in-the-middle* phenomenon.
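The card ships no reference code for ATA, so the sketch below is a hypothetical illustration only: the softmax weighting, uniform segment sizes, and all names are assumptions, and the official router lives in the Tempo codebase. It spreads a fixed token budget across segments in proportion to their query relevance, clamped to the stated 0.5–16 tokens/frame range:

```python
import math

MIN_TPF, MAX_TPF = 0.5, 16.0  # Tempo's stated tokens-per-frame range

def allocate_tokens_per_frame(relevance, total_budget, frames_per_segment):
    """Split `total_budget` visual tokens across segments by softmax of
    per-segment query-relevance scores (e.g. zero-shot scores from an SVLM),
    then clamp each segment's tokens-per-frame to [MIN_TPF, MAX_TPF]."""
    m = max(relevance)
    exps = [math.exp(r - m) for r in relevance]
    total = sum(exps)
    tpf = []
    for e, n_frames in zip(exps, frames_per_segment):
        raw = total_budget * (e / total) / n_frames
        tpf.append(min(MAX_TPF, max(MIN_TPF, raw)))
    return tpf

# Toy example: the middle segment is highly relevant to the query.
tpf = allocate_tokens_per_frame(
    relevance=[0.1, 3.0, 0.2],
    total_budget=2048,
    frames_per_segment=[120, 120, 120],
)
print([round(t, 2) for t in tpf])  # relevant segment gets ~18x more tokens/frame
```

The lower clamp guarantees that even query-irrelevant segments keep at least a minimal temporal anchor, which is how the feature list above describes causality being preserved.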
|
## 🚀 Quick Start

### 1. Installation

```bash
# Clone the repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo

# Install dependencies
pip install -r requirements.txt
```

### 2. Prepare Checkpoints

To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.

```bash
mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
```
### 3. Inference

**Launch Gradio Web UI:**
```bash
python app.py
```

**CLI Inference:**
```bash
python infer.py \
    --model_path "./checkpoints/Tempo-6B" \
    --video_path "/path/to/your/video.mp4" \
    --query "Describe the video in detail."
```
|
*(Note: Since Tempo relies on custom routing mechanisms, downloading the weights directly via `transformers` without the official codebase will not work out-of-the-box.)*

## 📊 Performance

Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On **LVBench** (average video length 4,101 s), Tempo-6B scores **52.3**, outperforming proprietary baselines such as GPT-4o and Gemini 1.5 Pro.
|
## 📚 Citation

```bibtex
@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and Liu, Shuming and Wu, Lemeng and Krishnamoorthi, Raghuraman and Chandra, Vikas and Elhoseiny, Mohamed and Zhu, Chenchen},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}
```