Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

Group Tree Optimization (GTO) is a framework designed to address draft policy misalignment in speculative decoding. While standard methods optimize for a single greedy path, GTO aligns training with the actual tree-based decoding policy used during inference. This is achieved through a Draft Tree Reward objective and a stable Group-based Draft Policy Training scheme.
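To build intuition for the group-based training scheme, here is a minimal sketch of GRPO-style group-relative advantage normalization. This is an illustrative toy, not code from the GTO repository: the assumption is that several draft trees are sampled per prompt and each tree's reward is normalized against its group, which stabilizes policy updates.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-sample rewards within a sampled group (GRPO-style sketch).

    Each draft tree sampled for the same prompt receives an advantage
    relative to the group mean, scaled by the group's standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards could be the number of draft tokens accepted per tree.
advantages = group_relative_advantages([3.0, 5.0, 4.0, 4.0])
```

Because advantages are centered within each group, they sum to (approximately) zero, so trees are pushed toward or away from the policy only relative to their peers.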

Performance

GTO achieves state-of-the-art acceleration for LLM inference:

  • 5.6x faster than vanilla autoregressive decoding.
  • 7% faster than previous state-of-the-art methods like EAGLE-3.

Usage

To use this model for accelerated inference, please follow the setup instructions in the official GTO repository.

Inference via Web UI

The codebase provides a web interface for testing the acceleration. After cloning the repo and setting up the environment, you can run:


python -m application.webui --ea-model-path [path of GTO weight] \
    --base-model-path [path of the original model] \
    --model-type [vicuna|llama3|qwen] \
    --total-token [int]

The total-token parameter sets the number of draft tokens. Tuning it for your specific device and model can yield better acceleration.
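The trade-off behind total-token can be illustrated with the standard speculative-decoding analysis (not GTO's tree-specific estimate): with an i.i.d. per-token acceptance probability, longer drafts raise the expected number of tokens committed per verification step, with diminishing returns. The acceptance rate below is an assumed input, not a measured GTO number.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens committed per target-model verification step
    for a length-k draft chain with i.i.d. acceptance probability a.
    Standard speculative-decoding toy model, not GTO-specific.
    """
    a, k = acceptance_rate, draft_tokens
    if a == 1.0:
        return k + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - a ** (k + 1)) / (1.0 - a)

# Diminishing returns: past some draft length, a^k is negligible and
# the extra drafting cost outweighs the gain on a given device.
for k in (2, 4, 8):
    print(k, round(expected_tokens_per_step(0.8, k), 2))
```

This is why the best total-token value depends on the device and model: the per-step gain saturates while the drafting cost keeps growing.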

Citation

If you find this work useful, please cite:

@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}

Acknowledgements

The implementation is based on the open-source repository of EAGLE. This project has been influenced by many projects in the LLM community, such as HASS and GRIFFIN.
