	---
	license: apache-2.0
	tags:
	- vision-language-model
	- inference-optimization
	- token-pruning
	- qwen2-vl
	library_name: sparsevlm
	---

	# SparseVLM

	[![PyPI](https://img.shields.io/pypi/v/sparsevlm)](https://pypi.org/project/sparsevlm/)
	[![Paper](https://img.shields.io/badge/ICML_2025-Paper-blue)](https://arxiv.org/abs/2410.04417)
	[![License](https://img.shields.io/badge/License-Apache_2.0-green)](LICENSE)
	[![Tests](https://github.com/aryanchauhan31/SparseVLM/actions/workflows/tests.yml/badge.svg)](https://github.com/aryanchauhan31/SparseVLM/actions)

	Training-free visual token pruning for Qwen2.5-VL. Scores visual tokens by how much text attends to them, prunes the unimportant ones from the KV cache, and decodes with the smaller cache.

	Based on [SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference](https://arxiv.org/abs/2410.04417) (ICML 2025).

	---

	## Install

	```bash
	pip install sparsevlm
	```

	Requirements: Python 3.10+, PyTorch 2.1+, transformers 4.49+

	---

	## Quick start

	```python
	import torch
	from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
	from sparsevlm import sparsevlm_generate
	from PIL import Image

	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    "Qwen/Qwen2.5-VL-7B-Instruct",
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    attn_implementation="eager",
	)
	processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

	image = Image.open("your_image.jpg")
	messages = [{"role": "user", "content": [
	    {"type": "image", "image": image},
	    {"type": "text", "text": "Describe this image in detail."},
	]}]
	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

	# count visual tokens
	n_vis = int((inputs["image_grid_thw"][0].prod() / 4).item())

	output = sparsevlm_generate(
	    model, processor, inputs,
	    n_vis=n_vis,
	    keep_n_vis=n_vis // 4,  # keep 25% of visual tokens
	    max_new_tokens=256,
	)
	print(processor.decode(output[0][1:], skip_special_tokens=True))
	```
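
The `/ 4` in the snippet above follows from patch merging: `image_grid_thw` holds the patch grid (temporal, height, width), and each visual token covers a 2×2 block of patches. Below is a minimal sketch of that arithmetic, assuming a spatial merge size of 2 and 14px patches. The helpers `count_visual_tokens` and `keep_count` are hypothetical names for illustration and are not part of `sparsevlm`.

```python
def count_visual_tokens(grid_thw, merge_size=2):
    """Visual tokens for one image, given its (t, h, w) patch grid."""
    t, h, w = (int(v) for v in grid_thw)
    return t * h * w // (merge_size * merge_size)


def keep_count(n_vis, ratio):
    """Number of visual tokens to keep for a ratio in (0, 1]; at least 1."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    return max(1, int(n_vis * ratio))


# An 896x504 px image gives a 64x36 grid of 14 px patches (assumed patch size),
# which merges 2x2 into 32x18 = 576 visual tokens.
n_vis = count_visual_tokens((1, 36, 64))
print(n_vis, keep_count(n_vis, 0.25))  # 576 144
```

The result of `keep_count` can be passed as `keep_n_vis` in place of the hard-coded `n_vis // 4`.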

	---

	## Benchmark results

	Measured on NVIDIA A100-SXM4-40GB, Qwen2.5-VL-7B-Instruct, bfloat16, SDPA attention.

	### Real photo: Mount Fuji and Milky Way (4928×2773px, 16320 visual tokens)

	| Config | Tokens kept | Time | Speedup | Output quality |
	|---|---|---|---|---|
	| Baseline | 16320 (100%) | 9738ms | 1.00× | Identifies Fuji, Milky Way, snow cap, star colors |
	| SparseVLM 50% | 8192 | 9441ms | 1.03× | Same quality |
	| SparseVLM 25% | 4080 | 9297ms | 1.05× | All key details preserved |
	| SparseVLM 10% | 1632 | 9425ms | 1.03× | Still correctly describes scene |

	> Key result: the full 4K image (16K visual tokens) runs without OOM. Without SparseVLM's hook-based scoring, scoring the 16K-token image requires materialising a roughly 15GB attention matrix, which crashes. The scorer computes only the text→visual submatrix: 35 × 16320 scores per head, about 32MB across heads in bfloat16, instead of 15GB.
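
The 32MB and 15GB figures can be reproduced with a short calculation. This is a sketch under two assumptions that are not stated above: 28 attention heads and 2 bytes per score (bfloat16). The function name `attn_bytes` is hypothetical.

```python
def attn_bytes(n_query, n_key, n_heads=28, bytes_per_elem=2):
    """Bytes needed to hold one layer's attention scores for all heads."""
    return n_query * n_key * n_heads * bytes_per_elem


n_vis, n_text = 16320, 35
full = attn_bytes(n_vis, n_vis)   # visual-to-visual block of the full matrix
sub = attn_bytes(n_text, n_vis)   # text-to-visual submatrix only
print(f"full: {full / 1e9:.1f} GB, submatrix: {sub / 1e6:.1f} MB")
# full: 14.9 GB, submatrix: 32.0 MB
```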

	### Resized photo (896×504px, 576 visual tokens), batch=1

	| Tokens kept | Time | Speedup |
	|---|---|---|
	| 576 (100%) | 2167ms | 1.00× |
	| 288 (50%) | 1685ms | 1.29× |
	| 144 (25%) | 1565ms | 1.39× |
	| 72 (12.5%) | 1620ms | 1.34× |

	### When to expect larger speedup

	Speedup grows when the KV cache is large relative to model weights:

	| Scenario | Expected speedup |
	|---|---|
	| Single image, short generation | ~1.1–1.4× |
	| Single image, 256+ output tokens | ~1.5–2.5× |
	| Batch=32, high-res images | ~2–4× |
	| Very long visual context (10K+ tokens) | ~2–4× |
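
To get a feel for when the KV cache becomes large relative to the weights, the cache size can be estimated directly. The sketch below assumes a configuration of 28 layers, 4 key/value heads, head dimension 128, and bfloat16 storage. These values are assumptions for illustration and are not taken from the benchmarks above, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, batch=1, n_layers=28, n_kv_heads=4,
                   head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size: keys plus values for every layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch


# Compare the visual part of the cache before and after keeping 25%.
for batch, n_vis in [(1, 576), (1, 16320), (32, 16320)]:
    full = kv_cache_bytes(n_vis, batch)
    pruned = kv_cache_bytes(n_vis // 4, batch)
    print(f"batch={batch:>2} n_vis={n_vis:>5}: "
          f"{full / 1e9:.2f} GB -> {pruned / 1e9:.2f} GB")
```

Under these assumptions a single 576-token image contributes only tens of megabytes of cache, which is small next to the model weights, while batched high-resolution inputs reach tens of gigabytes. That is in line with the scenarios in the table.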

	---

	## How it works

	### Token scoring (no extra parameters)

	At decoder layer 2, a lightweight hook intercepts the attention projection and computes:

	```
	A_tv = Q_text @ K_visual^T   # only the text→visual submatrix:
	                             # 35 × 16320 instead of 16320 × 16320
	score_i = sum over text tokens of attention to visual token i
	```

	Visual tokens with high scores are important to the text query. Low-score tokens are pruned from the KV cache before decoding starts.
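
The scoring step can be illustrated with a small NumPy sketch. It is not the library's implementation. It assumes a single attention head, scaling by the square root of the head dimension, and a softmax taken over the visual keys only. The function name `score_visual_tokens` is hypothetical.

```python
import numpy as np


def score_visual_tokens(q_text, k_visual):
    """Score each visual token by the attention it receives from text.

    q_text:   (n_text, d) query vectors of the text tokens
    k_visual: (n_vis, d)  key vectors of the visual tokens
    returns:  (n_vis,)    one score per visual token
    """
    d = q_text.shape[-1]
    logits = q_text @ k_visual.T / np.sqrt(d)      # (n_text, n_vis)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over visual keys
    return attn.sum(axis=0)                        # sum over text tokens


rng = np.random.default_rng(0)
q = rng.standard_normal((35, 64))    # 35 text tokens
k = rng.standard_normal((200, 64))   # 200 visual tokens
scores = score_visual_tokens(q, k)

# Indices of the 50 highest-scoring visual tokens, in original order.
top_k = np.sort(np.argsort(scores)[-50:])
print(scores.shape, top_k.shape)  # (200,) (50,)
```

Only the `(n_text, n_vis)` block is ever formed, which is what keeps the scoring step cheap for large images.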

	### KV cache pruning

	After scoring, the KV cache is sliced to keep only the top-K visual entries plus all text entries. The model then decodes with a smaller cache, so each decode step attends over fewer keys.

	```
	Prefill: build KV cache for all 16320 visual tokens
	Score: rank each visual token by text attention (32MB op)
	Prune: keep top-K, drop the rest
	Decode: attend over K + N_text keys instead of 16320 + N_text
	```
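
The prune step can be sketched as index selection over the cache. The NumPy example below works on one head of one layer and assumes the visual tokens form a single contiguous span starting at `vis_start`. Both the layout and the function name `prune_kv` are assumptions for illustration and do not reflect the library's internals.

```python
import numpy as np


def prune_kv(keys, values, scores, vis_start, n_vis, keep_n_vis):
    """Keep the top-scoring visual entries plus every non-visual entry.

    keys, values: (seq_len, d) cache for one head of one layer
    scores:       (n_vis,) importance of each visual token
    vis_start:    index of the first visual token in the sequence
    """
    if not 0 < keep_n_vis <= n_vis:
        raise ValueError("keep_n_vis must be in [1, n_vis]")
    seq_len = keys.shape[0]
    top = np.argsort(scores)[-keep_n_vis:]           # best visual tokens
    keep_vis = np.sort(top) + vis_start              # restore sequence order
    before = np.arange(0, vis_start)                 # entries before the image
    after = np.arange(vis_start + n_vis, seq_len)    # entries after the image
    keep = np.concatenate([before, keep_vis, after])
    return keys[keep], values[keep], keep


seq_len, d, vis_start, n_vis = 20, 8, 3, 12
keys = np.arange(seq_len * d, dtype=np.float32).reshape(seq_len, d)
values = keys.copy()
scores = np.linspace(0.0, 1.0, n_vis)                # later tokens score higher
k2, v2, keep = prune_kv(keys, values, scores, vis_start, n_vis, keep_n_vis=3)
print(k2.shape, keep.tolist())
# (11, 8) [0, 1, 2, 12, 13, 14, 15, 16, 17, 18, 19]
```

Sorting the kept indices preserves the original token order, so the remaining entries stay aligned with their positions.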

	### Position fix (`rope_deltas`)

	After pruning, Qwen2.5-VL's internal position counter (`rope_deltas`) is adjusted so decode tokens get correct positional embeddings despite the shorter cache.
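
A toy calculation shows why an adjustment is needed. It assumes that the position of the next decoded token is computed as the cache length plus `rope_deltas`. Under that assumption, removing entries from the cache would shift every later position unless the delta grows by the number of dropped tokens. The function names are hypothetical, and the exact adjustment performed by the library may differ.

```python
def next_position(cache_len, rope_delta):
    """Position id of the next decoded token under the assumed rule."""
    return cache_len + rope_delta


def adjust_rope_delta(rope_delta, n_vis, keep_n_vis):
    """Compensate for the cache entries removed by pruning."""
    return rope_delta + (n_vis - keep_n_vis)


# Toy numbers: 600 cached tokens, 576 of them visual, keep 144.
cache_len, rope_delta, n_vis, keep_n_vis = 600, -500, 576, 144
before = next_position(cache_len, rope_delta)

pruned_len = cache_len - (n_vis - keep_n_vis)
new_delta = adjust_rope_delta(rope_delta, n_vis, keep_n_vis)
after = next_position(pruned_len, new_delta)

print(before, after)  # 100 100
```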

	---

	## API

	### `sparsevlm_generate`

	```python
	from sparsevlm import sparsevlm_generate

	output = sparsevlm_generate(
	    model,               # Qwen2_5_VLForConditionalGeneration
	    processor,           # AutoProcessor
	    inputs,              # dict from processor(...)
	    n_vis,               # total visual tokens in the sequence
	    keep_n_vis,          # how many to keep (e.g. n_vis // 4 for 25%)
	    max_new_tokens=256,  # generation length
	    target_layer=2,      # which layer to score from (default 2)
	    device="cuda",       # primary device
	)
	# returns: token ids [B, max_new_tokens]
	```

	### `apply_sparsevlm` / `remove_hooks` (hook-based API)

	```python
	from sparsevlm import apply_sparsevlm, reset_n_vis, remove_hooks

	state = apply_sparsevlm(model, n_vis=256)
	reset_n_vis(state, n_vis=256) # call before each generate
	output = model.generate(...)
	remove_hooks(state)
	```
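
Because hooks stay attached until `remove_hooks` is called, it can be convenient to wrap the sequence above so that cleanup also happens when generation raises. The context manager below is a hypothetical convenience wrapper, not part of `sparsevlm`. It takes the three functions as arguments so the sketch runs standalone with stand-ins.

```python
from contextlib import contextmanager


@contextmanager
def pruning_hooks(model, n_vis, apply_fn, reset_fn, remove_fn):
    """Install hooks, reset the counter, and always remove hooks on exit."""
    state = apply_fn(model, n_vis=n_vis)
    try:
        reset_fn(state, n_vis=n_vis)
        yield state
    finally:
        remove_fn(state)


# Intended usage with the real functions (commented out to stay standalone):
# from sparsevlm import apply_sparsevlm, reset_n_vis, remove_hooks
# with pruning_hooks(model, 256, apply_sparsevlm, reset_n_vis, remove_hooks):
#     output = model.generate(**inputs, max_new_tokens=256)

# Stand-in functions that only record the call order.
calls = []
with pruning_hooks(
    "model", 256,
    apply_fn=lambda m, n_vis: calls.append("apply") or {"n_vis": n_vis},
    reset_fn=lambda s, n_vis: calls.append("reset"),
    remove_fn=lambda s: calls.append("remove"),
):
    calls.append("generate")
print(calls)  # ['apply', 'reset', 'generate', 'remove']
```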

	---

	## Model support

	| Model | Status |
	|---|---|
	| Qwen/Qwen2.5-VL-7B-Instruct | Tested |
	| Qwen/Qwen2.5-VL-3B-Instruct | Should work |
	| Qwen/Qwen2.5-VL-72B-Instruct | Should work |
	| Qwen/Qwen2-VL-* | Legacy support |

	---

	## Limitations

	- Requires `attn_implementation="eager"` or `"sdpa"`. Flash Attention 2 (separate package) is not required.
	- Speedup is modest (~1.1–1.4×) for single-image, short-generation use cases. The gain comes from long generations, high-resolution images, or batched serving.
	- Currently tested with Qwen2.5-VL. Other VLM families would need architecture-specific adaptation.

	---

	## Citation

	```bibtex
	@inproceedings{zhang2024sparsevlm,
	  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
	  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and
	          Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and
	          Nakata, Yohei and Keutzer, Kurt and Zhang, Shanghang},
	  booktitle={ICML},
	  year={2025}
	}
	```

	Apache 2.0 license.