Cloudriver committed · verified
Commit 7001cbd · 1 Parent(s): cdfe5d4

Update README: benchmark positioning and citations

Files changed (1): README.md (+47 -8)
README.md CHANGED
@@ -6,20 +6,23 @@ base_model:
   - Qwen/Qwen2.5-VL-7B-Instruct
 ---
 
-# ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
+# ViSpec-Qwen2.5-VL-7B-Instruct (Benchmark Release)
 
-*Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen*
+This model repo is part of a **multimodal speculative decoding benchmark suite**.
 
-<a href="https://arxiv.org/abs/2509.15235"><img src="https://img.shields.io/static/v1?label=arXiv&message=Paper&color=red&logo=arxiv"></a>
-<a href="https://github.com/KangJialiang/ViSpec"><img src="https://img.shields.io/static/v1?label=GitHub&message=Code&color=blue&logo=github"></a>
+## Why this repo exists
 
-## Overview
+We maintain a unified benchmark codebase that includes multiple methods (Baseline, EAGLE, EAGLE2, Lookahead, MSD, ViSpec) so users can run training/evaluation more easily under one setup.
 
-Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups ($<1.5\times$). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce **Vision-Aware Speculative Decoding (ViSpec)**, a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding.
+- The methods are aggregated here for **user convenience** (shared dataset format, scripts, and metrics).
+- The original ideas and implementations belong to their respective authors.
+- This specific Hugging Face repo hosts the **ViSpec-Qwen2.5-VL-7B-Instruct checkpoint** used in our benchmark runs.
 
 ## Citation
 
-If you find our work useful, please consider citing:
+If you use this checkpoint and benchmark, please cite ViSpec and the original methods you compare against.
+
+### ViSpec
 
 ```bibtex
 @inproceedings{vispec,
@@ -28,4 +31,40 @@ If you find our work useful, please consider citing:
   booktitle={Annual Conference on Neural Information Processing Systems},
   year={2025}
 }
-```
+```
+
+### EAGLE / EAGLE2 / EAGLE3
+
+```bibtex
+@inproceedings{li2024eagle,
+  author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
+  title = {{EAGLE}: Speculative Sampling Requires Rethinking Feature Uncertainty},
+  booktitle = {International Conference on Machine Learning},
+  year = {2024}
+}
+
+@inproceedings{li2024eagle2,
+  author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
+  title = {{EAGLE-2}: Faster Inference of Language Models with Dynamic Draft Trees},
+  booktitle = {Empirical Methods in Natural Language Processing},
+  year = {2024}
+}
+
+@inproceedings{li2025eagle3,
+  author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
+  title = {{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
+  booktitle = {Annual Conference on Neural Information Processing Systems},
+  year = {2025}
+}
+```
+
+### Other integrated baselines (links)
+
+- Lookahead Decoding: https://lmsys.org/blog/2023-11-21-lookahead-decoding/
+- MSD-LLaVA1.5-7B: https://huggingface.co/lucylyn/MSD-LLaVA1.5-7B
+- Medusa: https://github.com/FasterDecoding/Medusa
+
+## Notes
+
+- This model card focuses on benchmark usage and attribution.
+- For full benchmark code and scripts, please refer to the benchmark repository used in your experiment setup.
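
For readers coming from the Overview paragraph removed above: the loop ViSpec accelerates is plain speculative decoding, in which a small draft model, conditioned on a compressed view of the image tokens, proposes a few tokens cheaply, and the large target VLM verifies them. The sketch below is a toy illustration of that draft-and-verify loop, not the ViSpec implementation: the models are deterministic stand-ins, the "vision adaptor" is simple mean pooling rather than ViSpec's learned module, and verification runs token by token where a real system batches all proposals through the target in one forward pass.

```python
# Toy sketch of vision-aware speculative decoding (greedy acceptance).
# All names here are hypothetical stand-ins, not the ViSpec codebase.
import numpy as np

def compress_image_tokens(patches: np.ndarray, n_slots: int = 4) -> np.ndarray:
    # Hypothetical "vision adaptor": mean-pool patch embeddings into a few
    # slots. ViSpec's real adaptor is learned; pooling just stands in for it.
    return np.stack([g.mean(axis=0) for g in np.array_split(patches, n_slots)])

class ToyLM:
    """Deterministic toy model. `flip_pct` is the % of steps on which it
    deviates from a shared reference rule, mimicking draft/target mismatch."""
    def __init__(self, vocab: int, flip_pct: int):
        self.vocab, self.flip_pct = vocab, flip_pct

    def next_token(self, tokens: list, image_feats: np.ndarray) -> int:
        # Coarse image conditioning: mean pooling preserves the mean, so the
        # draft (pooled view) and target (full view) agree on this statistic.
        img_bit = int(float(image_feats.mean()) > 0.0)
        h = hash((tuple(tokens[-4:]), img_bit))   # shared reference rule
        if h % 100 < self.flip_pct:               # model-specific deviation
            h = hash((h, 1))
        return abs(h) % self.vocab

def speculative_decode(target: ToyLM, draft: ToyLM, prompt: list,
                       patches: np.ndarray, k: int = 4, max_new: int = 24) -> list:
    summary = compress_image_tokens(patches)  # draft sees the compressed image
    out, end = list(prompt), len(prompt) + max_new
    while len(out) < end:
        # 1) Draft proposes k tokens autoregressively (cheap).
        ctx, proposal = list(out), []
        for _ in range(k):
            t = draft.next_token(ctx, summary)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies; keep the longest agreeing prefix. (A real
        #    implementation scores all k positions in one batched pass.)
        ctx = list(out)
        for t in proposal:
            if target.next_token(ctx, patches) != t:
                break
            ctx.append(t)
        # 3) The target always appends one token of its own, so every round
        #    makes progress even if the whole proposal is rejected.
        ctx.append(target.next_token(ctx, patches))
        out = ctx
    return out[len(prompt):end]

rng = np.random.default_rng(0)
patches = rng.normal(loc=1.0, size=(64, 16))  # stand-in ViT patch embeddings
target = ToyLM(vocab=32000, flip_pct=0)       # "large" model: the reference
draft = ToyLM(vocab=32000, flip_pct=15)       # "small" model: mostly agrees
print(speculative_decode(target, draft, [101, 7, 42], patches))
```

Because the draft deviates from the target on only a fraction of steps, most proposed tokens are accepted and each target call yields several output tokens on average; raising that acceptance rate for multimodal inputs is exactly what ViSpec's vision adaptor and training recipe target.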
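
Separately, for wiring this checkpoint into an experiment harness: the target model named in this card's `base_model` field loads with the standard `transformers` Qwen2.5-VL classes, as in the minimal sketch below. Attaching the ViSpec draft weights is specific to the benchmark codebase and is not shown; the image URL and prompt are placeholders.

```python
# Minimal sketch: load the *target* model (Qwen/Qwen2.5-VL-7B-Instruct) with
# Hugging Face transformers and run plain (non-speculative) generation as a
# baseline. Attaching the ViSpec draft checkpoint is not shown here.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://example.com/cat.jpg"},  # placeholder
    {"type": "text", "text": "Describe this image."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```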