# 🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding
Tempo-6B is an efficient, query-aware Multimodal Large Language Model (MLLM) designed specifically for extreme-long video understanding. It was introduced in the paper *Small Vision-Language Models are Smart Compressors for Long Video Understanding*.
Tempo effectively resolves the structural mismatch between massive video streams and bounded LLM context windows by acting as a smart temporal compressor. It performs early cross-modal distillation, generating highly compact, intent-aligned video representations in a single forward pass.
## 🏗️ Architecture
Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).
- Local Compressor: Qwen3-VL-2B-Instruct
- Global LLM: Qwen/Qwen3-4B
- Total Parameters: ~6B
## ✨ Key Features
- Adaptive Token Allocation (ATA): Acts as a training-free, O(1) dynamic router. It allocates dense representational bandwidth only to query-critical segments.
- Token Efficiency: Achieves aggressive dynamic compression (0.5–16 tokens/frame), maintaining global causality while discarding redundancies.
- Hour-Long Video Capability: Effectively processes and answers complex queries for videos over an hour long without hitting context limits.
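To make the Adaptive Token Allocation idea concrete, here is a minimal sketch of query-aware budget routing. Everything below (the function name, the linear relevance-to-budget mapping, the 64-frames-per-segment assumption) is hypothetical illustration, not the repository's actual mechanism:

```python
def allocate_tokens(relevance, min_tpf=0.5, max_tpf=16.0, budget=8192, frames_per_seg=64):
    """Map per-segment relevance scores (in [0, 1]) to per-frame token budgets.

    Hypothetical sketch: query-critical segments get dense budgets (up to
    max_tpf tokens/frame), irrelevant ones are compressed toward min_tpf,
    and the result is rescaled to respect the total visual-token budget.
    """
    # Linear interpolation between the two compression extremes.
    tpf = [min_tpf + r * (max_tpf - min_tpf) for r in relevance]
    total = sum(t * frames_per_seg for t in tpf)
    if total > budget:  # rescale down, but never below the compression floor
        scale = budget / total
        tpf = [max(min_tpf, t * scale) for t in tpf]
    return tpf

# Four segments; the third is most relevant to the query and
# therefore receives the densest per-frame budget.
budgets = allocate_tokens([0.1, 0.2, 0.9, 0.1])
print(budgets)
```

The key property this sketch preserves is that allocation is a cheap post-hoc routing decision over relevance scores, so it adds no training and negligible overhead per query.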
## 🚀 Quick Start
### 1. Installation
Create a new conda environment and install all required dependencies:
```shell
# Clone our repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo

# Install all packages (PyTorch 2.6.0 + CUDA 12.4)
pip install -r requirements.txt
```
### ⚡ Installing Flash-Attention
Since flash-attn installation can be highly environment-dependent, please install it manually using one of the methods below:
```shell
# Method 1: standard build
pip install flash-attn==2.7.4.post1

# Method 2: without build isolation
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Method 3: if you are unable to build from source, download and install the pre-built wheel
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
rm flash_attn*.whl
```
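Whichever method you use, a quick check confirms the package is importable (this only tests import, not the CUDA kernels themselves):

```python
import importlib.util

def flash_attn_available():
    """Return True if the flash_attn package is importable in this environment."""
    return importlib.util.find_spec("flash_attn") is not None

if flash_attn_available():
    import flash_attn
    print("flash-attn", flash_attn.__version__)
else:
    print("flash-attn not installed; attention may fall back to a slower implementation")
```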
### 2. Prepare Checkpoints
To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.
```shell
mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model (required for architecture initialization)
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
```

💡 Note: to avoid caching Qwen3-VL on the default system drive during inference, edit Tempo-6B's `config.json` and change `"Qwen/Qwen3-VL-2B-Instruct"` to `"./checkpoints/Qwen3-VL-2B-Instruct"`.
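The `config.json` edit mentioned above can also be scripted. The sketch below simply replaces every occurrence of the Hub model ID with the local path, without assuming which config key holds it (the function name is ours, not part of the repo):

```python
import json
from pathlib import Path

def point_to_local_base(config_path,
                        old="Qwen/Qwen3-VL-2B-Instruct",
                        new="./checkpoints/Qwen3-VL-2B-Instruct"):
    """Rewrite every occurrence of the Hub model ID in a config.json to a local path."""
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    # Round-trip through a string so the replacement reaches nested values too.
    patched = json.loads(json.dumps(cfg).replace(old, new))
    path.write_text(json.dumps(patched, indent=2))
    return patched
```

Usage: `point_to_local_base("./checkpoints/Tempo-6B/config.json")` after both downloads finish.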
### 3. Inference
Launch the Gradio Web UI:

```shell
python app.py
```

CLI inference:

```shell
python infer.py \
    --model_path "./checkpoints/Tempo-6B" \
    --video_path "/path/to/your/video.mp4" \
    --query "Describe the video in detail."
```
(Note: because Tempo relies on custom routing mechanisms, loading the weights directly via `transformers` without the official codebase will not work out of the box.)
## 📊 Performance
Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On LVBench (average video length 4,101 s), it scores 52.3 under a strict 8K visual-token budget (53.7 with a 12K budget), outperforming proprietary baselines such as GPT-4o and Gemini 1.5 Pro.
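To put the 8K budget in perspective, consider a back-of-the-envelope check (the 1 fps sampling rate below is our illustrative assumption, not a documented setting): an hour-plus video sampled at 1 fps yields thousands of frames, so fitting them into 8,192 visual tokens implies an average per-frame budget well inside Tempo's 0.5–16 tokens/frame compression range:

```python
def avg_tokens_per_frame(duration_s, fps, budget):
    """Average visual-token budget per sampled frame."""
    frames = int(duration_s * fps)
    return budget / frames

# LVBench-scale video (4,101 s), assumed 1 fps sampling, 8K visual-token budget.
print(round(avg_tokens_per_frame(4101, 1.0, 8192), 2))
```

Under these assumptions the average lands around 2 tokens/frame, which is only feasible because ATA compresses irrelevant segments toward the 0.5 tokens/frame floor while keeping dense budgets for query-critical ones.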
## 📖 Citation
```bibtex
@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}
```