GTO: Group Tree Optimization for Speculative Decoding
This repository contains a draft model for speculative decoding trained using Group Tree Optimization (GTO).
GTO is a framework designed to bridge the "draft policy misalignment" between training (which often focuses on single-token greedy paths) and inference (which uses tree-based re-ranking and verification). It introduces a Draft Tree Reward objective and a Group-based Draft Policy Training scheme to optimize acceptance lengths and inference speed.
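To make the training-time objective concrete, here is an illustrative sketch (not the paper's implementation; all function names are hypothetical): each branch of a draft tree is rewarded by the acceptance length it achieves under verification against the base model's tokens, and rewards are centered on the group mean to form a group-relative advantage, in the spirit of group-based policy training.

```python
# Illustrative sketch only -- not the GTO codebase. Branches and targets are
# token-id lists; the verifier here is greedy prefix matching.

def acceptance_length(branch, target):
    """Length of the longest prefix of `branch` matching the target tokens."""
    n = 0
    for d, t in zip(branch, target):
        if d != t:
            break
        n += 1
    return n

def group_advantages(branches, target):
    """Reward each branch by its acceptance length, centered on the group mean."""
    rewards = [acceptance_length(b, target) for b in branches]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Three candidate branches from one draft tree; target is the base model's output.
tree = [[5, 9, 2], [5, 9, 7], [5, 1, 1]]
target = [5, 9, 7, 3]
print(group_advantages(tree, target))  # branches matching longer prefixes get positive advantage
```

Branches whose tokens survive verification longer receive a positive advantage relative to their group, which is the alignment between training and tree-based decoding that the framework targets.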
Paper
Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
GitHub Repository
For implementation details, training scripts, and inference code, please visit the official repository: https://github.com/hsj576/GTO
Overview
GTO achieves significant performance improvements:
- 5.6x faster than vanilla autoregressive decoding.
- 7% faster than prior state-of-the-art EAGLE-3.
- Improves acceptance length by aligning training with the decoding-time tree policy.
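A back-of-the-envelope model (an assumption for intuition, not a formula from the paper) shows how acceptance length drives these speedups: if each verification step accepts `tau` tokens on average and costs `c` times a single autoregressive forward pass (draft overhead included), wall-clock speedup over vanilla decoding is roughly `tau / c`.

```python
# Rough speedup estimate for speculative decoding -- illustrative assumption,
# not a measurement from the GTO paper.

def estimated_speedup(mean_accepted: float, step_cost_ratio: float) -> float:
    """Tokens accepted per verification step, divided by the relative
    per-step cost versus one autoregressive forward pass."""
    return mean_accepted / step_cost_ratio

# e.g. accepting ~6.2 tokens per step at ~1.1x per-step cost
print(round(estimated_speedup(6.2, 1.1), 2))  # ~5.64x
```

Under this model, raising the mean acceptance length (as GTO's training objective does) translates almost directly into end-to-end throughput gains.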
Inference
The official implementation provides a web interface for inference. To use this draft model with a base model, you can run the following command from the GTO repository:
```shell
python -m application.webui --ea-model-path [path of GTO weight] \
    --base-model-path [path of the original model] \
    --model-type [vicuna/llama3/qwen] \
    --total-token [int]
```
The `total-token` parameter sets the number of draft tokens per decoding step. Tune it for your hardware and model size: larger values can increase acceptance length but also add per-step drafting cost.
Citation
If you find GTO useful in your research, please cite the following paper:
```
@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}
```
Acknowledgements
The implementation is based on the open-source repository of EAGLE and has been influenced by projects in the LLM community such as HASS and GRIFFIN.