### 1. Installation

Create a new conda environment and install all required dependencies:

```bash
# Clone our repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

conda create -n tempo python=3.12 -y
conda activate tempo

# Install all packages (PyTorch 2.6.0 + CUDA 12.4)
pip install -r requirements.txt
```
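Before moving on, it can help to confirm the expected stack is actually present in the new environment. A minimal sketch — the package names checked here are assumptions based on the comments above, not the full contents of `requirements.txt`:

```python
# Quick post-install sanity check (a sketch; the package list is an
# assumption based on the install comments, not the full requirements.txt).
from importlib.metadata import version, PackageNotFoundError

def check(pkg: str) -> str:
    """Return 'name version' if the package is installed, else flag it as missing."""
    try:
        return f"{pkg} {version(pkg)}"
    except PackageNotFoundError:
        return f"{pkg} NOT INSTALLED"

for pkg in ("torch", "transformers", "accelerate"):
    print(check(pkg))
```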
#### ⚡ Installing Flash-Attention

Since `flash-attn` installation can be highly environment-dependent, please install it manually using one of the methods below:

```bash
# Method 1: standard install
pip install flash-attn==2.7.4.post1

# Method 2: without build isolation
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Method 3: if you are unable to build from source, directly download and install the pre-built wheel
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
rm flash_attn*.whl
```
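Because the install is environment-dependent, a quick import check avoids crashing mid-inference. A hedged sketch — `"flash_attention_2"` and `"sdpa"` are the standard `transformers` attention option names; whether Tempo's inference script exposes such a switch is an assumption:

```python
# Detect whether flash-attn imported cleanly and pick an attention backend.
# "flash_attention_2" / "sdpa" are standard transformers option names;
# this fallback pattern is illustrative, not part of the Tempo repo.
try:
    import flash_attn  # noqa: F401
    ATTN_IMPL = "flash_attention_2"
except ImportError:
    ATTN_IMPL = "sdpa"  # PyTorch scaled-dot-product attention fallback

print("attention implementation:", ATTN_IMPL)
```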
### 2. Prepare Checkpoints

To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.

```bash
mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model (required for architecture initialization)
# 💡 Note: to avoid caching Qwen3-VL on the default system drive during inference,
# you can modify Tempo-6B's `config.json`, changing "Qwen/Qwen3-VL-2B-Instruct" to "./checkpoints/Qwen3-VL-2B-Instruct", and run:
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
```
## 🏆 Performance

Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On **LVBench** (average video length 4101s), Tempo-6B scores **52.3** under a strict 8K visual token budget (**53.7** with a 12K budget), outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.
## 📑 Citation