Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
Group Tree Optimization (GTO) is a framework designed to address draft policy misalignment in speculative decoding. While standard methods optimize for a single greedy path, GTO aligns training with the actual tree-based decoding policy used during inference. This is achieved through a Draft Tree Reward objective and a stable Group-based Draft Policy Training scheme.
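To make the path-vs-tree distinction concrete, here is a toy sketch of how a draft *tree* is verified against the target model during speculative decoding. This is not the GTO implementation; the function names, the dict-based tree, and the stub target model are all illustrative assumptions.

```python
# Toy sketch: verifying a draft token tree against a target model.
# A chain draft commits to one greedy path; a tree keeps several
# candidate branches alive, and verification accepts the longest
# branch that matches the target model's own choices.

def verify_draft_tree(tree, target_next_token, prefix):
    """Return the longest draft path accepted by the target model.

    tree: nested dict mapping draft token -> subtree (hypothetical format).
    target_next_token: callable(prefix) -> token the target model would emit.
    """
    accepted = []
    node = tree
    while node:
        t = target_next_token(prefix + accepted)
        if t in node:        # a draft branch matches the target's choice
            accepted.append(t)
            node = node[t]
        else:                # mismatch: stop; the target's token takes over
            break
    return accepted

# Stub target model: deterministic next-token lookup table.
_target = {(): "the", ("the",): "cat", ("the", "cat"): "sat"}
next_tok = lambda prefix: _target.get(tuple(prefix), "<eos>")

# Draft tree branching after "the": only the "cat" -> "sat" branch survives.
tree = {"the": {"dog": {}, "cat": {"sat": {}}}}
print(verify_draft_tree(tree, next_tok, []))  # → ['the', 'cat', 'sat']
```

A greedy single-path draft that had committed to "the dog ..." would be rejected after one token, which is the misalignment GTO's Draft Tree Reward is designed to train against.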
- Paper: Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
- Repository: https://github.com/hsj576/GTO
Performance
GTO achieves state-of-the-art acceleration for LLM inference:
- 5.6x faster than vanilla autoregressive decoding.
- 7% faster than previous state-of-the-art methods like EAGLE-3.
Usage
To use this model for accelerated inference, please follow the setup instructions in the official GTO repository.
Inference via Web UI
The codebase provides a web interface for testing the acceleration. After setting up the environment and cloning the repo, you can run:
```shell
python -m application.webui --ea-model-path [path of GTO weight] \
    --base-model-path [path of the original model] \
    --model-type [vicuna/llama3/qwen] \
    --total-token [int]
```
The `--total-token` parameter sets the number of draft tokens per decoding step. Tuning it for your specific device and model can yield better speedups.
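To see why `--total-token` is worth tuning, here is a back-of-the-envelope estimate based on the standard chain-draft analysis of speculative decoding (an assumption for illustration; it is not GTO-specific, and tree drafts generalize it): with per-token acceptance rate `a` and draft length `g`, the expected number of tokens produced per target-model verification is `(1 - a**(g + 1)) / (1 - a)`.

```python
# Toy estimate of speculative-decoding yield (hypothetical numbers).
# More draft tokens help, but with diminishing returns, while the
# drafting cost per step keeps growing -- hence a device-dependent
# sweet spot for --total-token.

def expected_tokens_per_step(a: float, g: int) -> float:
    """Expected accepted tokens per verification for acceptance rate a,
    draft length g (geometric-series form of the chain-draft analysis)."""
    return (1 - a ** (g + 1)) / (1 - a)

for g in (2, 4, 8):
    print(g, round(expected_tokens_per_step(0.8, g), 2))
# → 2 2.44
# → 4 3.36
# → 8 4.33
```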
Citation
If you find this work useful, please cite:
@article{hu2025bridging,
title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
journal={arXiv preprint arXiv:2509.22134},
year={2025}
}
Acknowledgements
The implementation is based on the open-source repository of EAGLE, and the project draws on other work in the LLM community such as HASS and GRIFFIN.