husj576 nielsr HF Staff committed on
Commit ad82d12 · 1 Parent(s): 0752ae4

Add model card for GTO draft model (#1)


- Add model card for GTO draft model (e4173e2681bb750b3766b20d59c9360069b4150a)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +60 -3
README.md CHANGED
@@ -1,3 +1,60 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - speculative-decoding
+ - gto
+ ---
+
+ # GTO: Group Tree Optimization for Speculative Decoding
+
+ This repository contains a draft model for speculative decoding trained using **Group Tree Optimization (GTO)**.
+
+ GTO is a framework designed to bridge the "draft policy misalignment" between training (which often focuses on single-token greedy paths) and inference (which uses tree-based re-ranking and verification). It introduces a **Draft Tree Reward** objective and a **Group-based Draft Policy Training** scheme to optimize acceptance lengths and inference speed.
+
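For intuition about the "acceptance length" that GTO optimizes, here is a minimal sketch of plain draft-and-verify speculative decoding under greedy verification. This is an illustration only, not the paper's tree-based GTO algorithm; the `acceptance_length` helper and all token ids are made up for the example:

```python
# Toy draft-and-verify loop (NOT the GTO implementation): a draft model
# proposes a few tokens, the target model verifies them greedily, and the
# "acceptance length" is how many consecutive draft tokens the target
# agrees with. GTO trains the draft policy to make this prefix longer.

def acceptance_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that match the target's own choices."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

# Hypothetical example: the draft guesses 5 tokens and the target agrees
# with the first 3, so one target forward pass commits 3 + 1 tokens
# (the accepted prefix plus the target's own correction token).
draft = [12, 7, 42, 9, 3]
target = [12, 7, 42, 8, 3]
print(acceptance_length(draft, target))  # 3
```

A longer accepted prefix means more tokens committed per (expensive) target-model forward pass, which is the quantity the Draft Tree Reward is built around.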
+ ## Paper
+
+ [Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding](https://arxiv.org/abs/2509.22134)
+
+ ## GitHub Repository
+
+ For implementation details, training scripts, and inference code, please visit the official repository:
+ [https://github.com/hsj576/GTO](https://github.com/hsj576/GTO)
+
+ ## Overview
+
+ GTO achieves significant performance improvements:
+ - **5.6x** faster than vanilla autoregressive decoding.
+ - **7%** faster than the prior state-of-the-art EAGLE-3.
+ - Improves acceptance length by aligning training with the decoding-time tree policy.
+
+ ## Inference
+
+ The official implementation provides a web interface for inference. To use this draft model with a base model, you can run the following command from the GTO repository:
+
+ ```bash
+ python -m application.webui --ea-model-path [path of GTO weight] \
+     --base-model-path [path of the original model] \
+     --model-type [vicuna|llama3|qwen] \
+     --total-token [int]
+ ```
+
+ The `total-token` parameter specifies the number of draft tokens. Adjust this value based on your specific hardware and model size for optimal results.
+
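As a rough tuning aid, the standard speculative-decoding analysis gives the expected number of tokens committed per target forward pass for a *linear* draft chain. GTO drafts trees rather than chains, so treat this as an approximation I am assuming carries over in spirit; `alpha` (the per-token acceptance probability) and the sample values are illustrative, not measured:

```python
# Back-of-envelope guide for choosing a draft budget (an assumption based
# on the standard speculative-decoding analysis, not a GTO-specific
# formula): with per-token acceptance probability alpha and a chain of k
# draft tokens, each target forward pass commits a geometric-series
# expected number of tokens.

def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward pass, chain of length k."""
    assert 0.0 <= alpha < 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# Diminishing returns: doubling the draft length helps only modestly
# once alpha**k is small.
print(round(expected_tokens_per_pass(0.8, 4), 2))  # 3.36
print(round(expected_tokens_per_pass(0.8, 8), 2))  # 4.33
```

Because the returns diminish quickly in `k`, enlarging the draft budget mainly adds draft-model compute past a point, which is why `total-token` is worth tuning per hardware rather than simply maximized.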
+ ## Citation
+
+ If you find GTO useful in your research, please cite the following paper:
+
+ ```bibtex
+ @article{hu2025bridging,
+   title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
+   author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
+   journal={arXiv preprint arXiv:2509.22134},
+   year={2025}
+ }
+ ```
+
+ ## Acknowledgements
+
+ The implementation is based on the open-source repository of [EAGLE](https://github.com/SafeAILab/EAGLE/tree/main) and has been influenced by projects in the LLM community such as [HASS](https://github.com/HArmonizedSS/HASS) and [GRIFFIN](https://github.com/hsj576/GRIFFIN).