GTO: Group Tree Optimization for Speculative Decoding

Group Tree Optimization (GTO) is a framework designed to bridge the gap between training objectives and decoding policies in speculative decoding. While standard speculative decoding uses a tree-based policy for token verification, typical training objectives only optimize for a single greedy path. GTO aligns these by introducing a Draft Tree Reward and Group-based Draft Policy Training.

Overview

GTO addresses draft policy misalignment through two primary components:

  1. Draft Tree Reward: A sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance.
  2. Group-based Draft Policy Training: A stable optimization scheme that contrasts trees from the current and a frozen reference draft model, applying a PPO-style surrogate for robust updates.
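The two components above can be illustrated with a minimal sketch. The exact GTO objective and training loop are defined in the paper and repository; the code below only models the core ideas under stated simplifying assumptions: per-node acceptance probabilities are given, sibling acceptance events are mutually exclusive (the target accepts at most one branch), so the expected acceptance length of a tree is the sum over nodes of the product of acceptance probabilities along each root path. The group-based update is shown as a generic PPO-style clipped surrogate with group-normalized advantages; all function and field names here are illustrative, not the repository's API.

```python
import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One draft-tree node: P(target accepts this token | all ancestors accepted)."""
    accept_prob: float
    children: List["Node"] = field(default_factory=list)

def expected_acceptance_length(root: Node, path_prob: float = 1.0) -> float:
    # Each node contributes the probability that its whole root path is accepted.
    # Sibling branches are disjoint events, so their contributions add.
    p = path_prob * root.accept_prob
    return p + sum(expected_acceptance_length(c, p) for c in root.children)

def grouped_clipped_surrogate(rewards, logp_cur, logp_ref, eps=0.2):
    """PPO-style surrogate over a group of draft trees.

    rewards  : draft-tree rewards (expected acceptance lengths) for the group
    logp_cur : log-prob of each tree under the current draft model
    logp_ref : log-prob of each tree under the frozen reference draft model
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    loss = 0.0
    for r, lc, lr in zip(rewards, logp_cur, logp_ref):
        adv = (r - mean) / std                       # group-normalized advantage
        ratio = math.exp(lc - lr)                    # current vs. reference policy
        clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
        loss += -min(ratio * adv, clipped * adv)     # pessimistic (clipped) term
    return loss / len(rewards)
```

For example, a two-token chain with acceptance probabilities 0.9 and 0.8 has expected acceptance length 0.9 + 0.9·0.8 = 1.62; the reward is computed from probabilities alone, with no sampling from the target model.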

Performance

Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K), GTO achieves significant acceleration:

  • Up to 5.6x speedup over vanilla autoregressive decoding.
  • An additional 7.7% speedup over prior state-of-the-art methods such as EAGLE-3.
  • A 7.4% increase in token acceptance length.

Inference

The inference code provided in the official repository automatically handles model weight allocation across multiple GPUs. You can launch a web interface using the following command:

python -m application.webui --ea-model-path [path of GTO weight] \
        --base-model-path [path of the original model] \
        --model-type [vicuna|llama3|qwen] \
        --total-token [int]

Note: total-token is the number of draft tokens. Tuning this value for your hardware and base model can further improve performance.

Citation

@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}

Acknowledgements

This implementation is based on the EAGLE repository and influenced by projects like HASS and GRIFFIN.
