Vision-CAIR-Admin committed commit 9c767a6 (verified) · Parent(s): 82eb796

Update README.md

Files changed (1): README.md (+26 −4)
README.md CHANGED
@@ -41,8 +41,10 @@ Tempo natively unifies a local Small Vision-Language Model (SVLM) and a global L
 
 ### 1. Installation
 
+Create a new conda environment and install all required dependencies:
+
 ```bash
-# Clone the repository
+# Clone our repository
 git clone https://github.com/FeiElysia/Tempo.git
 cd Tempo
 
@@ -50,10 +52,28 @@ cd Tempo
 conda create -n tempo python=3.12 -y
 conda activate tempo
 
-# Install dependencies
+# Install all packages (PyTorch 2.6.0 + CUDA 12.4)
 pip install -r requirements.txt
 ```
 
+#### ⚡ Installing Flash-Attention
+
+Since `flash-attn` installation can be highly environment-dependent, please install it manually using one of the methods below:
+
+```bash
+
+# Method 1: standard pip install
+pip install flash-attn==2.7.4.post1
+
+# Method 2: without build isolation
+pip install flash-attn==2.7.4.post1 --no-build-isolation
+
+# Method 3: if you are unable to build from source, directly download and install the pre-built wheel
+wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
+pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
+rm flash_attn*.whl
+```
+
 ### 2. Prepare Checkpoints
 
 To run the inference script successfully, you need to download both the Tempo-6B weights and the base Qwen3-VL model for architecture initialization.
@@ -64,7 +84,9 @@ mkdir -p checkpoints
 # 1. Download the final Tempo-6B model
 huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B
 
-# 2. Download the base Qwen3-VL model
+# 2. Download the base Qwen3-VL model (required for architecture initialization)
+# 💡 Note: to avoid caching Qwen3-VL in the default system drive during inference, you can
+# change "Qwen/Qwen3-VL-2B-Instruct" to "./checkpoints/Qwen3-VL-2B-Instruct" in Tempo-6B's `config.json`, then run:
 huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
 ```
 
@@ -87,7 +109,7 @@ python infer.py \
 
 ## 🏆 Performance
 
-Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On **LVBench** (average video length 4101s), Tempo-6B scores **52.3**, outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.
+Tempo-6B achieves state-of-the-art performance on extreme-long video tasks. On **LVBench** (average video length 4101s), Tempo-6B scores **52.3** under a strict 8K visual token budget (**53.7** with a 12K budget), outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.
 
 ## 📑 Citation
 
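
The diff's note about redirecting the base-model path in Tempo-6B's `config.json` can be sketched as a small script. This is a minimal sketch under assumptions: the key name `base_model` is a hypothetical placeholder, since the diff does not show which field in the real `config.json` holds `"Qwen/Qwen3-VL-2B-Instruct"` — check the actual file for the correct key before editing it.

```python
import json
import tempfile
from pathlib import Path

# Work in a throwaway directory; the real file lives at
# ./checkpoints/Tempo-6B/config.json after the download step.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "config.json"

    # Hypothetical config contents: "base_model" is a placeholder key name,
    # not confirmed by the diff.
    cfg_path.write_text(json.dumps({"base_model": "Qwen/Qwen3-VL-2B-Instruct"}))

    cfg = json.loads(cfg_path.read_text())
    # Redirect the base model from the Hub ID to the local checkpoint directory,
    # so inference resolves it from ./checkpoints instead of the default cache drive.
    if cfg.get("base_model") == "Qwen/Qwen3-VL-2B-Instruct":
        cfg["base_model"] = "./checkpoints/Qwen3-VL-2B-Instruct"
    cfg_path.write_text(json.dumps(cfg, indent=2))

    print(json.loads(cfg_path.read_text())["base_model"])
```

Running this prints the rewritten local path; applying the same one-line substitution to the real `config.json` keeps all model loading inside the repository's `checkpoints/` directory.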