GTO: Group Tree Optimization for Speculative Decoding

This repository contains the draft model weights for GTO (Group Tree Optimization), as introduced in the paper Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding.

GTO is a novel framework designed to address draft policy misalignment in speculative decoding. It aligns training with the decoding-time tree policy through two main components:

  1. Draft Tree Reward: A sampling-free objective equal to the expected acceptance length of the draft tree under the target model.
  2. Group-based Draft Policy Training: A stable optimization scheme that contrasts trees from the current and a frozen reference draft model.
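As a rough illustration of the Draft Tree Reward idea, the expected acceptance length of a draft tree has a closed form once each node carries the probability that the target model accepts its token given that all its ancestors were accepted: by linearity of expectation, it is the sum over nodes of the product of conditional acceptance probabilities along the node's path. The sketch below is a minimal illustration under that assumption; the `DraftNode` structure and function names are hypothetical and not taken from the GTO codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DraftNode:
    # Probability the target model accepts this node's token,
    # conditioned on all of its ancestors being accepted (assumed given).
    p: float
    children: List["DraftNode"] = field(default_factory=list)


def expected_acceptance_length(node: DraftNode, prefix_p: float = 1.0) -> float:
    """Expected number of accepted draft tokens in the subtree rooted here.

    By linearity of expectation this is the sum, over all nodes, of the
    probability that the node and every one of its ancestors are accepted.
    """
    accept_p = prefix_p * node.p
    return accept_p + sum(
        expected_acceptance_length(child, accept_p) for child in node.children
    )


# A depth-2 chain with per-step acceptance probability 0.5:
# expected length = 0.5 + 0.5 * 0.5 = 0.75
chain = DraftNode(0.5, [DraftNode(0.5)])
print(expected_acceptance_length(chain))  # 0.75
```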

Performance

GTO achieves significant speedups across dialogue (MT-Bench), code (HumanEval), and math (GSM8K) tasks:

  • 5.6x faster than vanilla autoregressive decoding.
  • 7.7% additional speedup over prior state-of-the-art methods like EAGLE-3.

Inference

To use these weights, run the inference code provided in the official repository. The implementation supports multi-GPU weight allocation.

You can launch the provided web interface by running:

python -m application.webui --ea-model-path [path of GTO weight] \
        --base-model-path [path of the original model] \
        --model-type [vicuna|llama3|qwen] \
        --total-token [int]

The total-token parameter sets the number of draft tokens. Tuning this value for the specific device and model can yield better performance.

Citation

@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}

Acknowledgements

This implementation is based on the open-source repository of EAGLE. This project has also been influenced by HASS, GRIFFIN, and other projects in the LLM community.
