Improve model card with detailed architecture, installation and inference instructions

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +53 -31
README.md CHANGED
@@ -1,8 +1,11 @@
 ---
-license: apache-2.0
 language:
 - en
 pipeline_tag: video-text-to-text
 tags:
 - vision-language-model
 - long-video-understanding
@@ -13,22 +16,15 @@ tags:

 # 🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding

-[![Project Page](https://img.shields.io/badge/Project-Page-green?logo=googlechrome&logoColor=white)](https://feielysia.github.io/tempo-page/)
-[![Demo](https://img.shields.io/badge/🤗_Space-Demo-yellow)](https://huggingface.co/spaces/Vision-CAIR/Tempo)
-[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?logo=github)](https://github.com/FeiElysia/Tempo)
-[![Paper](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv)](https://arxiv.org/abs/2604.08120)
-[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-
-## 📖 Model Overview
-
-**Tempo-6B** is an efficient, query-aware Multimodal Large Language Model (MLLM) designed explicitly for extreme-long video understanding. It effectively resolves the structural mismatch between massive video streams and bounded LLM context windows.
-
-Instead of relying on query-agnostic heuristics like sparse sampling or uniform pooling, which often discard decisive moments, Tempo acts as a smart temporal compressor. It performs early cross-modal distillation, generating highly compact, intent-aligned video representations in a single forward pass.
-
-### ✨ Key Features
-- **Adaptive Token Allocation (ATA):** Acts as a training-free, *O(1)* dynamic router. Exploiting the zero-shot relevance prior of the SVLM, it allocates dense representational bandwidth only to query-critical segments.
-- **Token Efficiency:** Achieves aggressive dynamic compression (0.5–16 tokens/frame), actively compressing redundancies into minimal *temporal anchors* to maintain overarching causality.
-- **Hour-Long Video Capability:** Effectively processes and answers complex queries for videos over an hour long without hitting context limits or suffering from the *lost-in-the-middle* phenomenon.

 ## 🏗️ Architecture
 Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).
@@ -36,44 +32,70 @@ Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global L
 - **Global LLM:** Qwen/Qwen3-4B
 - **Total Parameters:** ~6B

 ## 🚀 Quick Start

-To use Tempo-6B for inference, please rely on our official GitHub repository, which contains the necessary custom code and interactive Gradio UI.

 ```bash
-# 1. Clone the repository
 git clone https://github.com/FeiElysia/Tempo.git
 cd Tempo

-# 2. Create environment
 conda create -n tempo python=3.12 -y
 conda activate tempo

-# 3. Install all packages (PyTorch 2.6.0 + CUDA 12.4)
 pip install -r requirements.txt

-# 4. Run inference (example)
-# Check our GitHub for detailed inference scripts
-python inference.py --model_path Vision-CAIR/Tempo-6B --video_path /path/to/video.mp4 --query "Your question"
 ```
-*(Note: Since Tempo relies on custom routing mechanisms, downloading the weights directly via `transformers` without our codebase will not work out-of-the-box.)*

-## 🏆 Performance

-Extensive experiments demonstrate that our compact 6B architecture achieves state-of-the-art performance on extreme-long video tasks.

-**LVBench (Extreme-Long, 4101s avg):**
-- **Tempo-6B (8K Budget):** **52.3**
-- **Tempo-6B (Scaled to 2048 frames):** **53.7**

-## 📑 Citation

-If you find Tempo or our code useful for your research, please consider citing our paper:

 ```bibtex
 @article{fei2026small,
   title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
-  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
   journal={arXiv preprint arXiv:2604.08120},
   year={2026}
 }

 ---
 language:
 - en
+license: apache-2.0
 pipeline_tag: video-text-to-text
+base_model:
+- Qwen/Qwen3-VL-2B-Instruct
+- Qwen/Qwen3-4B
 tags:
 - vision-language-model
 - long-video-understanding

 # 🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding

+[![Project Page](https://img.shields.io/badge/Project-Page-blue?style=flat-square)](https://feielysia.github.io/tempo-page/)
+[![Paper](https://img.shields.io/badge/arXiv-Paper-b31b1b?style=flat-square)](https://huggingface.co/papers/2604.08120)
+[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo-yellow?style=flat-square)](https://huggingface.co/spaces/Vision-CAIR/Tempo)
+[![GitHub](https://img.shields.io/badge/GitHub-Repository-black?style=flat-square)](https://github.com/FeiElysia/Tempo)
+[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg?style=flat-square)](https://opensource.org/licenses/Apache-2.0)

+**Tempo-6B** is an efficient, query-aware Multimodal Large Language Model (MLLM) designed for extreme-long video understanding. It was presented in the paper [Small Vision-Language Models are Smart Compressors for Long Video Understanding](https://huggingface.co/papers/2604.08120).

+Tempo resolves the structural mismatch between massive video streams and bounded LLM context windows by acting as a smart temporal compressor: it performs early cross-modal distillation, generating compact, intent-aligned video representations in a single forward pass.

 ## 🏗️ Architecture
 Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).
 - **Global LLM:** Qwen/Qwen3-4B
 - **Total Parameters:** ~6B

+### ✨ Key Features
+- **Adaptive Token Allocation (ATA):** A training-free, *O(1)* dynamic router that allocates dense representational bandwidth only to query-critical segments.
+- **Token Efficiency:** Aggressive dynamic compression (0.5–16 tokens/frame), maintaining global causality while discarding redundancies.
+- **Hour-Long Video Capability:** Processes and answers complex queries for videos over an hour long without hitting context limits.
+
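+The ATA idea can be made concrete with a small sketch. This is only an illustration of relevance-proportional budgeting under the stated 0.5–16 tokens/frame range; the function name and the simple proportional split are assumptions, not code from the Tempo repository.
+
+```python
+import numpy as np
+
+def allocate_tokens(relevance, total_budget, min_tpf=0.5, max_tpf=16.0):
+    """Illustrative query-aware budgeting: split a fixed token budget
+    across video segments in proportion to their relevance scores,
+    clamped to a per-frame floor and ceiling."""
+    relevance = np.asarray(relevance, dtype=float)
+    weights = relevance / relevance.sum()      # normalize scores to a distribution
+    raw = weights * total_budget               # proportional share of the budget
+    return np.clip(raw, min_tpf, max_tpf)      # enforce tokens-per-frame bounds
+
+# A query-critical segment (relevance 2.0) receives twice the bandwidth
+# of a background segment (relevance 1.0); near-zero segments still get
+# the 0.5 tokens/frame floor so global causality is not lost.
+```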
 ## 🚀 Quick Start

+### 1. Installation

 ```bash
+# Clone the repository
 git clone https://github.com/FeiElysia/Tempo.git
 cd Tempo

+# Create environment
 conda create -n tempo python=3.12 -y
 conda activate tempo

+# Install dependencies
 pip install -r requirements.txt
+```
+
+### 2. Prepare Checkpoints
+
+To run the inference script, download both the Tempo-6B weights and the base Qwen3-VL model used for architecture initialization.
+
+```bash
+mkdir -p checkpoints
+
+# 1. Download the final Tempo-6B model
+huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B
+
+# 2. Download the base Qwen3-VL model
+huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
 ```
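+After both downloads finish, the `checkpoints/` directory should contain the two folders named by the `--local-dir` flags above:
+
+```
+checkpoints/
+├── Tempo-6B/
+└── Qwen3-VL-2B-Instruct/
+```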
 

+### 3. Inference
+
+**Launch Gradio Web UI:**
+```bash
+python app.py
+```
+
+**CLI Inference:**
+```bash
+python infer.py \
+    --model_path "./checkpoints/Tempo-6B" \
+    --video_path "/path/to/your/video.mp4" \
+    --query "Describe the video in detail."
+```
+
+*(Note: Since Tempo relies on custom routing mechanisms, downloading the weights directly via `transformers` without the official codebase will not work out-of-the-box.)*
+
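+For scripted batch runs, the CLI call above can be assembled from Python. A minimal sketch: `infer.py` and its flags come from the example above; everything else (the helper name, the batch loop) is illustrative.
+
+```python
+import subprocess
+
+def build_infer_command(model_path, video_path, query):
+    """Assemble the Tempo CLI invocation as an argument list, ready for
+    subprocess.run(cmd, check=True) from the repository root."""
+    return [
+        "python", "infer.py",
+        "--model_path", model_path,
+        "--video_path", video_path,
+        "--query", query,
+    ]
+
+# Example: one command per clip for a batch of videos.
+videos = ["/path/to/your/video.mp4"]
+commands = [
+    build_infer_command("./checkpoints/Tempo-6B", v, "Describe the video in detail.")
+    for v in videos
+]
+```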
+ ## πŸ† Performance
89
 
90
+ Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On **LVBench** (average video length 4101s), Tempo-6B scores **52.3**, outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.
91
+
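+The compression figures above can be sanity-checked with back-of-the-envelope arithmetic (an illustration from the numbers in this card, not repository code):
+
+```python
+def avg_tokens_per_frame(token_budget, num_frames):
+    """Average visual-token bandwidth each frame receives under a fixed budget."""
+    return token_budget / num_frames
+
+# Spreading the 8K-token budget over 2048 sampled frames gives each frame
+# 4 tokens on average, well inside the 0.5-16 tokens/frame dynamic range;
+# a uniform 16-tokens/frame encoding would need 4x that budget.
+avg = avg_tokens_per_frame(8192, 2048)   # 4.0
+dense_budget = 2048 * 16                 # 32768 tokens uncompressed
+```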
+## 📑 Citation

 ```bibtex
 @article{fei2026small,
   title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
+  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and Liu, Shuming and Wu, Lemeng and Krishnamoorthi, Raghuraman and Chandra, Vikas and Elhoseiny, Mohamed and Zhu, Chenchen},
   journal={arXiv preprint arXiv:2604.08120},
   year={2026}
 }
+```