GTO: Group Tree Optimization for Speculative Decoding
This repository contains a draft model for speculative decoding trained using Group Tree Optimization (GTO).
GTO is a framework designed to bridge the "draft policy misalignment" between training (which often focuses on single-token greedy paths) and inference (which uses tree-based re-ranking and verification). It introduces a Draft Tree Reward objective and a Group-based Draft Policy Training scheme to optimize acceptance lengths and inference speed.
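To make the training-time objective concrete, here is an illustrative sketch (not the paper's implementation; all function names are hypothetical): each branch of a draft tree is rewarded by the acceptance length it achieves under verification against the base model's tokens, and rewards are centered on the group mean to form a group-relative advantage, in the spirit of group-based policy training.

```python
# Illustrative sketch only -- not the GTO codebase. Branches and targets are
# token-id lists; the verifier here is greedy prefix matching.

def acceptance_length(branch, target):
    """Length of the longest prefix of `branch` matching the target tokens."""
    n = 0
    for d, t in zip(branch, target):
        if d != t:
            break
        n += 1
    return n

def group_advantages(branches, target):
    """Reward each branch by its acceptance length, centered on the group mean."""
    rewards = [acceptance_length(b, target) for b in branches]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Three candidate branches from one draft tree; target is the base model's output.
tree = [[5, 9, 2], [5, 9, 7], [5, 1, 1]]
target = [5, 9, 7, 3]
print(group_advantages(tree, target))  # branches matching longer prefixes get positive advantage
```

Branches whose tokens survive verification longer receive a positive advantage relative to their group, which is the alignment between training and tree-based decoding that the framework targets.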
Paper
Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
GitHub Repository
For implementation details, training scripts, and inference code, please visit the official repository: https://github.com/hsj576/GTO
Overview
GTO achieves significant performance improvements:
- 5.6x faster than vanilla autoregressive decoding.
- 7% faster than prior state-of-the-art EAGLE-3.
- Improves acceptance length by aligning training with the decoding-time tree policy.
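A back-of-the-envelope model (an assumption for intuition, not a formula from the paper) shows how acceptance length drives these speedups: if each verification step accepts `tau` tokens on average and costs `c` times a single autoregressive forward pass (draft overhead included), wall-clock speedup over vanilla decoding is roughly `tau / c`.

```python
# Rough speedup estimate for speculative decoding -- illustrative assumption,
# not a measurement from the GTO paper.

def estimated_speedup(mean_accepted: float, step_cost_ratio: float) -> float:
    """Tokens accepted per verification step, divided by the relative
    per-step cost versus one autoregressive forward pass."""
    return mean_accepted / step_cost_ratio

# e.g. accepting ~6.2 tokens per step at ~1.1x per-step cost
print(round(estimated_speedup(6.2, 1.1), 2))  # ~5.64x
```

Under this model, raising the mean acceptance length (as GTO's training objective does) translates almost directly into end-to-end throughput gains.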
Inference
The official implementation provides a web interface for inference. To use this draft model with a base model, you can run the following command from the GTO repository:
```shell
python -m application.webui --ea-model-path [path of GTO weight] \
    --base-model-path [path of the original model] \
    --model-type [vicuna/llama3/qwen] \
    --total-token [int]
```
The `total-token` parameter sets the number of draft tokens per decoding step. Tune it for your hardware and model size: larger values can increase acceptance length but also add per-step drafting cost.
Citation
If you find GTO useful in your research, please cite the following paper:
```
@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}
```
Acknowledgements
The implementation is based on the open-source repository of EAGLE and has been influenced by projects in the LLM community such as HASS and GRIFFIN.