### 1. Installation

Create a new conda environment and install all required dependencies:

```bash
# Clone our repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

conda create -n tempo python=3.12 -y
conda activate tempo

# Install all packages (PyTorch 2.6.0 + CUDA 12.4)
pip install -r requirements.txt
```
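Before moving on, it can help to confirm the expected stack is actually present in the new environment. A minimal sketch — the package names checked here are assumptions based on the comments above, not the full contents of `requirements.txt`:

```python
# Quick post-install sanity check (a sketch; the package list is an
# assumption based on the install comments, not the full requirements.txt).
from importlib.metadata import version, PackageNotFoundError

def check(pkg: str) -> str:
    """Return 'name version' if the package is installed, else flag it as missing."""
    try:
        return f"{pkg} {version(pkg)}"
    except PackageNotFoundError:
        return f"{pkg} NOT INSTALLED"

for pkg in ("torch", "transformers", "accelerate"):
    print(check(pkg))
```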
#### ⚡ Installing Flash-Attention

Since `flash-attn` installation can be highly environment-dependent, please install it manually using one of the methods below:

```bash
# Method 1: standard install
pip install flash-attn==2.7.4.post1

# Method 2: without build isolation
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Method 3: if you are unable to build from source, directly download and install the pre-built wheel
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
rm flash_attn*.whl
```
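Because the install is environment-dependent, a quick import check avoids crashing mid-inference. A hedged sketch — `"flash_attention_2"` and `"sdpa"` are the standard `transformers` attention option names; whether Tempo's inference script exposes such a switch is an assumption:

```python
# Detect whether flash-attn imported cleanly and pick an attention backend.
# "flash_attention_2" / "sdpa" are standard transformers option names;
# this fallback pattern is illustrative, not part of the Tempo repo.
try:
    import flash_attn  # noqa: F401
    ATTN_IMPL = "flash_attention_2"
except ImportError:
    ATTN_IMPL = "sdpa"  # PyTorch scaled-dot-product attention fallback

print("attention implementation:", ATTN_IMPL)
```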
### 2. Prepare Checkpoints

To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.

```bash
mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model (required for architecture initialization)
# 💡 Note: to avoid caching Qwen3-VL on the default system drive during inference,
# you can modify Tempo-6B's `config.json`, changing "Qwen/Qwen3-VL-2B-Instruct" to "./checkpoints/Qwen3-VL-2B-Instruct", and run:
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
```
## 🏆 Performance

Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On **LVBench** (average video length 4101s), Tempo-6B scores **52.3** under a strict 8K visual token budget (**53.7** with a 12K budget), outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.
## 📑 Citation