SiyouLi committed 6334d67 (verified) · 1 parent: 172dbac

Add files using upload-large-folder tool

README.md CHANGED
---
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
language:
- en
tags:
- multimodal
- vision
- video
- long-video
- token-selection
- compression
- qwen2.5-vl
- qtsplus
---

<p>
<h1>
<img src="./assets/logo_with_glasses.svg" height=150px align="right"/>
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
</h1>
</p>

[![arXiv](https://img.shields.io/badge/arXiv-2511.11910-b31b1b.svg)](https://arxiv.org/abs/2511.11910)
[![Website](https://img.shields.io/badge/%F0%9F%8C%8E%20Website-Official%20Page-blue)](https://qtsplus.github.io/)
[![Github](https://img.shields.io/badge/github-repo-green?logo=github)](https://github.com/Siyou-Li/QTSplus)

## Model Description
![](./assets/qtsplus.svg)

QTSplus-3B-FT is a Qwen2.5-VL-based multimodal LLM fine-tuned with the Query-Aware Token Selector (QTS+), a lightweight visual token selection module that acts as an information gate between the vision encoder and the LLM.

- Query-aware selection: scores vision tokens via cross-attention against the input text query.
- Adaptive retention: predicts an instance-specific budget and keeps only the most relevant tokens.
- Temporal reasoning: a small re-encoder preserves temporal order with absolute time cues.
- Efficient long-video understanding: up to 89% vision token compression and 28% end-to-end latency reduction on long videos (see the paper for details).
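As a rough illustration of the first two bullets (query-aware scoring and adaptive retention), here is a toy PyTorch sketch. The function name `qts_select`, the tensor shapes, and the budget heuristic are invented for exposition; this is not the released QTS+ module.

```python
import torch

def qts_select(vision_tokens, query_tokens, rho_min=0.05, rho_max=0.5, tau_s=0.5):
    """Toy query-aware token selection: score, budget, keep top-k in order."""
    # Cross-attention relevance: each vision token attends over the query tokens.
    attn = torch.softmax(vision_tokens @ query_tokens.T / tau_s, dim=-1)  # (N, M)
    scores = attn.max(dim=-1).values  # (N,) peak relevance per vision token

    # Stand-in for the learned budget head: map mean relevance into [rho_min, rho_max].
    rho = rho_min + (rho_max - rho_min) * scores.mean().item()
    k = max(1, int(rho * vision_tokens.shape[0]))

    # Keep the top-k tokens, re-sorted by index so temporal order survives selection.
    keep = scores.topk(k).indices.sort().values
    return vision_tokens[keep], keep

vision = torch.randn(120, 16)  # 120 vision tokens, dim 16 (made-up sizes)
query = torch.randn(6, 16)     # 6 text-query tokens
kept, idx = qts_select(vision, query)
print(kept.shape[0])           # at most rho_max * 120 = 60 tokens survive
```

In the real model the retained tokens are then re-encoded with absolute time cues before reaching the LLM; the sketch only covers selection.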

### Model Sources
- Paper: Seeing the Forest and the Trees (arXiv:2511.11910)
- Website: https://qtsplus.github.io/
- Code: https://github.com/Siyou-Li/QTSplus

## Intended Uses & Limitations

### Intended Uses
- Long-video question answering and captioning
- Multi-image reasoning and story understanding
- Efficient multimodal chat with reduced latency on long inputs

### Limitations
- May miss fine details if the predicted retention budget is too small.
- Inherits biases and failure modes from the base Qwen2.5-VL model and training data.
- Not a safety-aligned system; outputs may be inaccurate or unsafe without human oversight.

## Quick Start

The repository is designed around a conda-based Python 3.11 environment with a CUDA-enabled GPU. The commands below are taken directly from `environment.sh` and provide a reproducible setup on recent Linux distributions.

1. **Create and activate the conda environment**

```bash
conda create -n qtsplus python=3.11 -y
conda activate qtsplus
```

2. **Install toolchain and CUDA toolkit**

```bash
conda install conda-forge::gcc=11 conda-forge::gxx=11 -y
conda install nvidia/label/cuda-12.8.1::cuda-toolkit -y
conda install av -c conda-forge -y
```

3. **Install PyTorch with CUDA 12.8 support**

```bash
pip3 install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu128
```

4. **Install core Python libraries**

```bash
pip install transformers==4.57.1
DS_BUILD_CUTLASS_OPS=0 DS_BUILD_RAGGED_DEVICE_OPS=0 DS_BUILD_EVOFORMER_ATTN=0 pip install deepspeed
pip install accelerate pandas wandb matplotlib scikit-learn datasets evaluate ftfy sentencepiece bitsandbytes
```

5. **Install FlashAttention (prebuilt wheel)**

```bash
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.4.22/flash_attn-2.8.1+cu128torch2.9-cp311-cp311-linux_x86_64.whl
```

This wheel is specific to Linux x86_64, CUDA 12.8, PyTorch 2.9.0, and Python 3.11; if you deviate from this configuration, install a compatible FlashAttention build instead.
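Those compatibility constraints can be read directly off the wheel filename, which follows the standard wheel naming convention. A quick illustrative parse (plain string handling, nothing QTS+-specific):

```python
# Standard wheel filename layout: name-version-pythontag-abitag-platformtag.whl
wheel = "flash_attn-2.8.1+cu128torch2.9-cp311-cp311-linux_x86_64.whl"
name, version, python_tag, abi_tag, platform_tag = wheel[:-4].split("-")

print(version)       # "2.8.1+cu128torch2.9": flash-attn 2.8.1 built against CUDA 12.8 / torch 2.9
print(python_tag)    # "cp311": CPython 3.11
print(platform_tag)  # "linux_x86_64"
```

If any of these tags disagree with your interpreter or platform, pip will refuse to install the wheel.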

6. **Verify installation**

After installation, you should be able to run:

```bash
python -c "import torch, transformers, deepspeed, accelerate; print(torch.cuda.is_available())"
```

which should print `True` on a correctly configured GPU machine.

### Video Example

```python
import torch, glob, os  # glob/os are used by the multi-image example below
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "AlpachinoNLP/QTSplus-3B-FT"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float16

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(dtype=dtype, device=device).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

question = "Summarize the key events in this video."
video_path = "/path/to/video.mp4"

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video_path, "max_pixels": 360 * 420, "fps": 1.0},
        {"type": "text", "text": question},
    ],
}]

chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

inputs = processor(text=[chat], images=None, videos=video_inputs, padding=True, return_tensors="pt", **video_kwargs)
inputs = inputs.to(dtype=dtype, device=device)  # match the model dtype chosen above

# Pack vision inputs for QTS+
pixel_values_videos = inputs.pop("pixel_values_videos", None)
video_grid_thw = inputs.pop("video_grid_thw", None)
inputs.pop("second_per_grid_ts", None)
vision_input = None
if pixel_values_videos is not None and video_grid_thw is not None:
    vision_input = {"pixel_values_videos": pixel_values_videos, "video_grid_thw": video_grid_thw}

# Text ids from the question only (exclude special/system/vision tokens)
question_ids = processor.tokenizer(question, return_tensors="pt", add_special_tokens=False).input_ids.to(dtype=torch.long, device=device)

out_ids = model.generate(vision_input=vision_input, input_ids=inputs.input_ids, question_input_ids=question_ids, max_new_tokens=256)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]
text = processor.batch_decode(trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(text[0])
```

### Multiple Images (treated as a video sequence)

```python
images_dir = "/path/to/images"
image_list = sorted(glob.glob(os.path.join(images_dir, "*.jpg"))) or sorted(glob.glob(os.path.join(images_dir, "*.jpeg")))
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": image_list},
        {"type": "text", "text": "What story do these images tell?"},
    ],
}]

chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=[chat], images=None, videos=video_inputs, padding=True, return_tensors="pt", **video_kwargs).to(dtype=dtype, device=device)

pixel_values_videos = inputs.pop("pixel_values_videos", None)
video_grid_thw = inputs.pop("video_grid_thw", None)
inputs.pop("second_per_grid_ts", None)
vision_input = {"pixel_values_videos": pixel_values_videos, "video_grid_thw": video_grid_thw}

out = model.generate(vision_input=vision_input, input_ids=inputs.input_ids, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))
```

### Notes
- The chat template is applied via `processor.apply_chat_template` and expects the messages schema shown above.
- QTS+ expects the vision payload under the `vision_input` keyword argument during generation.
- For fully offline use, pass `local_files_only=True` to `from_pretrained` calls once the files are cached locally.

## Efficiency & Controls

The following QTS+ hyperparameters in `config.json` control compression and selection behavior:
- `qts_plus_rho_min` / `qts_plus_rho_max`: minimum/maximum retention ratio bounds (defaults: 0.05 / 0.5).
- `qts_plus_tau_s`: scoring temperature for cross-attention (default: 0.5).
- `qts_plus_nmax`: hard cap on selected tokens per sample (default: 25600).

These settings trade off detail against speed and memory. See the paper for guidance, ablations, and latency/throughput measurements.
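To get a feel for how the bounds interact, here is a back-of-the-envelope helper (plain Python; the frame and per-frame token counts are made up for illustration and do not come from the model):

```python
def retained_tokens(n_vision_tokens, rho, rho_min=0.05, rho_max=0.5, nmax=25600):
    """Clamp a predicted retention ratio to [rho_min, rho_max], then apply the hard cap."""
    rho = min(max(rho, rho_min), rho_max)
    return min(int(rho * n_vision_tokens), nmax)

# Hypothetical long video: 1,000 sampled frames at ~100 vision tokens each.
n = 1000 * 100
print(retained_tokens(n, rho=0.8))   # clamped to rho_max=0.5 -> 50000, then capped at nmax -> 25600
print(retained_tokens(n, rho=0.02))  # clamped to rho_min=0.05 -> 5000
```

Note that on very long inputs the `qts_plus_nmax` cap, not the ratio, becomes the binding constraint.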
## Safety, Bias, and Limitations

- Outputs may be factually incorrect, biased, or unsafe. Do not use without human oversight.
- QTS+ compresses the vision stream; extremely small budgets may drop rare but important details.
- Inherits safety/bias characteristics from the underlying Qwen2.5-VL model and training data.

## Citation

If you find this work helpful, please cite:

```bibtex
@misc{li2025seeingforesttreesqueryaware,
  title         = {Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models},
  author        = {Siyou Li and Huanan Wu and Juexi Shao and Yinghao Ma and Yujian Gan and Yihao Luo and Yuwei Wang and Dong Nie and Lu Wang and Wengqing Wu and Le Zhang and Massimo Poesio and Juntao Yu},
  year          = {2025},
  eprint        = {2511.11910},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2511.11910}
}
```
Files added in this commit: assets/dataset.svg, assets/logo_with_glasses.svg, assets/qtsplus.svg, assets/system_load.svg, assets/training_process.svg