# 🎬 Tempo-6B: Efficient Query-Aware Long Video Understanding
Tempo-6B is an efficient, query-aware Multimodal Large Language Model (MLLM) designed specifically for extreme-long video understanding. It was introduced in the paper *Small Vision-Language Models are Smart Compressors for Long Video Understanding*.
Tempo effectively resolves the structural mismatch between massive video streams and bounded LLM context windows by acting as a smart temporal compressor. It performs early cross-modal distillation, generating highly compact, intent-aligned video representations in a single forward pass.
## 🏗️ Architecture
Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global Large Language Model (LLM).
- Local Compressor: Qwen3-VL-2B-Instruct
- Global LLM: Qwen/Qwen3-4B
- Total Parameters: ~6B
## ✨ Key Features
- Adaptive Token Allocation (ATA): Acts as a training-free, O(1) dynamic router. It allocates dense representational bandwidth only to query-critical segments.
- Token Efficiency: Achieves aggressive dynamic compression (0.5–16 tokens/frame), maintaining global causality while discarding redundancies.
- Hour-Long Video Capability: Effectively processes and answers complex queries for videos over an hour long without hitting context limits.
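To make the Adaptive Token Allocation idea concrete, here is a minimal sketch of query-aware budget routing. Everything below (the function name, the linear relevance-to-budget mapping, the 64-frames-per-segment assumption) is hypothetical illustration, not the repository's actual mechanism:

```python
def allocate_tokens(relevance, min_tpf=0.5, max_tpf=16.0, budget=8192, frames_per_seg=64):
    """Map per-segment relevance scores (in [0, 1]) to per-frame token budgets.

    Hypothetical sketch: query-critical segments get dense budgets (up to
    max_tpf tokens/frame), irrelevant ones are compressed toward min_tpf,
    and the result is rescaled to respect the total visual-token budget.
    """
    # Linear interpolation between the two compression extremes.
    tpf = [min_tpf + r * (max_tpf - min_tpf) for r in relevance]
    total = sum(t * frames_per_seg for t in tpf)
    if total > budget:  # rescale down, but never below the compression floor
        scale = budget / total
        tpf = [max(min_tpf, t * scale) for t in tpf]
    return tpf

# Four segments; the third is most relevant to the query and
# therefore receives the densest per-frame budget.
budgets = allocate_tokens([0.1, 0.2, 0.9, 0.1])
print(budgets)
```

The key property this sketch preserves is that allocation is a cheap post-hoc routing decision over relevance scores, so it adds no training and negligible overhead per query.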
## 🚀 Quick Start
### 1. Installation
Create a new conda environment and install all required dependencies:
```shell
# Clone our repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo

# Install all packages (PyTorch 2.6.0 + CUDA 12.4)
pip install -r requirements.txt
```
### ⚡ Installing Flash-Attention
Since flash-attn installation can be highly environment-dependent, please install it manually using one of the methods below:
```shell
# Method 1: standard build
pip install flash-attn==2.7.4.post1

# Method 2: without build isolation
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Method 3: if you are unable to build from source, download and install the pre-built wheel
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
rm flash_attn*.whl
```
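Whichever method you use, a quick check confirms the package is importable (this only tests import, not the CUDA kernels themselves):

```python
import importlib.util

def flash_attn_available():
    """Return True if the flash_attn package is importable in this environment."""
    return importlib.util.find_spec("flash_attn") is not None

if flash_attn_available():
    import flash_attn
    print("flash-attn", flash_attn.__version__)
else:
    print("flash-attn not installed; attention may fall back to a slower implementation")
```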
### 2. Prepare Checkpoints
To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.
```shell
mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model (required for architecture initialization)
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
```

💡 Note: to avoid caching Qwen3-VL on the default system drive during inference, edit Tempo-6B's `config.json` and change `"Qwen/Qwen3-VL-2B-Instruct"` to `"./checkpoints/Qwen3-VL-2B-Instruct"`.
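The `config.json` edit mentioned above can also be scripted. The sketch below simply replaces every occurrence of the Hub model ID with the local path, without assuming which config key holds it (the function name is ours, not part of the repo):

```python
import json
from pathlib import Path

def point_to_local_base(config_path,
                        old="Qwen/Qwen3-VL-2B-Instruct",
                        new="./checkpoints/Qwen3-VL-2B-Instruct"):
    """Rewrite every occurrence of the Hub model ID in a config.json to a local path."""
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    # Round-trip through a string so the replacement reaches nested values too.
    patched = json.loads(json.dumps(cfg).replace(old, new))
    path.write_text(json.dumps(patched, indent=2))
    return patched
```

Usage: `point_to_local_base("./checkpoints/Tempo-6B/config.json")` after both downloads finish.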
### 3. Inference
Launch the Gradio Web UI:

```shell
python app.py
```

CLI inference:

```shell
python infer.py \
    --model_path "./checkpoints/Tempo-6B" \
    --video_path "/path/to/your/video.mp4" \
    --query "Describe the video in detail."
```
(Note: because Tempo relies on custom routing mechanisms, loading the weights directly via `transformers` without the official codebase will not work out of the box.)
## 📊 Performance
Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On LVBench (average video length 4,101 s), it scores 52.3 under a strict 8K visual-token budget (53.7 with a 12K budget), outperforming proprietary baselines such as GPT-4o and Gemini 1.5 Pro.
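To put the 8K budget in perspective, consider a back-of-the-envelope check (the 1 fps sampling rate below is our illustrative assumption, not a documented setting): an hour-plus video sampled at 1 fps yields thousands of frames, so fitting them into 8,192 visual tokens implies an average per-frame budget well inside Tempo's 0.5–16 tokens/frame compression range:

```python
def avg_tokens_per_frame(duration_s, fps, budget):
    """Average visual-token budget per sampled frame."""
    frames = int(duration_s * fps)
    return budget / frames

# LVBench-scale video (4,101 s), assumed 1 fps sampling, 8K visual-token budget.
print(round(avg_tokens_per_frame(4101, 1.0, 8192), 2))
```

Under these assumptions the average lands around 2 tokens/frame, which is only feasible because ATA compresses irrelevant segments toward the 0.5 tokens/frame floor while keeping dense budgets for query-critical ones.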
## 📖 Citation
```bibtex
@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}
```