GTO: Group Tree Optimization for Speculative Decoding
Group Tree Optimization (GTO) is a framework designed to bridge the gap between training objectives and decoding policies in speculative decoding. While standard speculative decoding uses a tree-based policy for token verification, typical training objectives only optimize for a single greedy path. GTO aligns these by introducing a Draft Tree Reward and Group-based Draft Policy Training.
- Paper: Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
- GitHub: https://github.com/hsj576/GTO
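To make the Draft Tree Reward concrete, here is a minimal sketch (not taken from the official implementation). It assumes each draft-tree node stores the probability that the target model keeps that token on the accepted path, conditioned on its parent being kept; by linearity of expectation, summing the root-to-node path probabilities over all nodes gives the expected acceptance length, with no sampling required.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DraftNode:
    # P(target keeps this token | parent was kept) -- an assumed representation
    accept_prob: float
    children: List["DraftNode"] = field(default_factory=list)

def expected_acceptance_length(root: DraftNode, path_prob: float = 1.0) -> float:
    """Sum of P(root-to-node path fully accepted) over all nodes.

    By linearity of expectation this equals the expected number of
    accepted draft tokens, computed without sampling.
    """
    p = path_prob * root.accept_prob
    return p + sum(expected_acceptance_length(child, p) for child in root.children)

# Tiny example: root kept w.p. 0.9, with two child candidates.
tree = DraftNode(0.9, [DraftNode(0.5), DraftNode(0.3)])
print(expected_acceptance_length(tree))  # 0.9 + 0.9*0.5 + 0.9*0.3 = 1.62
```

Because this objective is a smooth function of the draft model's probabilities, it can be optimized directly rather than estimated from rollouts.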
Overview
GTO addresses draft policy misalignment through two primary components:
- Draft Tree Reward: A sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance.
- Group-based Draft Policy Training: A stable optimization scheme that contrasts trees from the current and a frozen reference draft model, applying a PPO-style surrogate for robust updates.
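The group-based update above can be sketched with a PPO-style clipped surrogate over a group of draft trees. The code below is an illustrative simplification, not the paper's exact loss: it assumes each tree in the group has a scalar reward (e.g. its expected acceptance length) and log-probabilities under the current and frozen reference draft models, and it normalizes rewards within the group to form advantages.

```python
import math
from typing import List

def group_clipped_surrogate(logp_new: List[float],
                            logp_ref: List[float],
                            rewards: List[float],
                            clip_eps: float = 0.2) -> float:
    """PPO-style surrogate loss over a group of draft trees (illustrative).

    Advantages are group-relative: rewards are centered and scaled within
    the group, so trees are contrasted against each other rather than
    against an absolute baseline.
    """
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) + 1e-8
    advantages = [(r - mean) / std for r in rewards]

    losses = []
    for ln, lr, adv in zip(logp_new, logp_ref, advantages):
        ratio = math.exp(ln - lr)  # importance ratio vs. frozen reference
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (min) of clipped/unclipped objectives, negated as a loss.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

When the current and reference models agree (all ratios equal 1), the normalized advantages sum to zero and the loss vanishes, which is what makes updates stable near the reference policy.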
Performance
Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), GTO achieves significant acceleration:
- Up to 5.6x faster than vanilla autoregressive decoding.
- Yields an additional 7.7% speedup over prior state-of-the-art methods like EAGLE-3.
- Increases token acceptance length by 7.4%.
Inference
The inference code provided in the official repository automatically handles model weight allocation across multiple GPUs. You can launch a web interface using the following command:
```shell
python -m application.webui --ea-model-path [path of GTO weight] \
    --base-model-path [path of the original model] \
    --model-type [vicuna|llama3|qwen] \
    --total-token [int]
```
Note: `--total-token` sets the number of draft tokens. Tuning this value for your hardware and base model can further improve performance.
Citation
```
@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}
```
Acknowledgements
This implementation is based on the EAGLE repository and influenced by projects like HASS and GRIFFIN.