# gpt-oss-120b-speculator.eagle3

## Model Overview
- Verifier: openai/gpt-oss-120b
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 03/10/2026
- Version: 2.0
- Model Developers: Red Hat
This is a speculator model designed for use with openai/gpt-oss-120b, based on the EAGLE-3 speculative decoding algorithm.
It was trained using the speculators library on the Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered dataset and the train_sft split of the HuggingFaceH4/ultrachat_200k dataset, with responses generated by the gpt-oss-120b model (reasoning).
This model should be used with the openai/gpt-oss-120b chat template, specifically through the `/chat/completions` endpoint.
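For illustration, a minimal request body for the OpenAI-compatible `/chat/completions` endpoint might look like the sketch below. The prompt and `max_tokens` value are assumptions for the example, not part of this model card:

```python
import json

# Hypothetical request body for the OpenAI-compatible /chat/completions
# endpoint exposed by `vllm serve`; prompt and max_tokens are examples.
payload = {
    "model": "openai/gpt-oss-120b",  # the verifier model being served
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    "max_tokens": 256,
}

# The body is sent as JSON. The speculator is applied server-side, so
# the request looks identical to one without speculative decoding.
body = json.dumps(payload)
```

Note that the speculator never appears in the request: it is configured once at serve time via `--speculative-config`.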
## vLLM Version
This draft model uses `norm_before_fc` (pre-FC RMSNorm for gpt-oss Eagle3). You need a vLLM build that includes PR #38111 (merged 2026-03-12). Use one of:

- vLLM >= 0.17.2, when available, or
- an install from main. Use `cu129` or `cu130` in the URL to match your CUDA version (check `nvcc --version`):

```shell
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 uv pip install -U -e . \
  --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly/<CUDA version>
```
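Since the required support landed after the 0.17.1 pre-releases, it can be worth checking the installed version programmatically. A minimal sketch using only the standard library (`release_tuple`, `is_prerelease`, and `meets_minimum` are illustrative helpers, not vLLM APIs; pre-release handling is deliberately simplified):

```python
import re

def release_tuple(version: str) -> tuple:
    """Extract the numeric release segment,
    e.g. '0.17.1rc1.dev531+g89f...' -> (0, 17, 1)."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    return tuple(int(part) for part in match.group(1).split("."))

def is_prerelease(version: str) -> bool:
    """True for rc/dev/alpha/beta builds, e.g. '0.17.1rc1...'."""
    return bool(re.search(r"(rc|dev|a|b)\d", version))

def meets_minimum(installed: str, required: str = "0.17.2") -> bool:
    """Check whether `installed` satisfies `required`; a pre-release of
    the required version itself does not count."""
    inst, req = release_tuple(installed), release_tuple(required)
    if inst != req:
        return inst > req
    return not is_prerelease(installed)
```

In practice, the version string can be obtained from `importlib.metadata.version("vllm")` on an installed build.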
## Use with vLLM

```shell
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --speculative-config '{
    "model": "RedHatAI/gpt-oss-120b-speculator.eagle3",
    "num_speculative_tokens": 5,
    "method": "eagle3"
  }' \
  --no-enable-prefix-caching \
  --max-num-seqs 64 \
  --enforce-eager
```
## Evaluations

- Model / run: gpt-oss-120b-from-self-ckpt5
- vLLM: 0.17.1rc1.dev531+g89f572dbc.d20260324
- Training data: Magpie + UltraChat; responses from the gpt-oss-120b model (reasoning).
### Use cases
| Use Case | Dataset | Number of Samples |
|---|---|---|
| Coding | HumanEval | 164 |
| Math Reasoning | math_reasoning | 80 |
| Question Answering | qa | 80 |
| MT_bench (Question) | question | 80 |
| RAG | rag | 80 |
| Summarization | summarization | 80 |
| Translation | translation | 80 |
### Acceptance lengths by draft length k (default temperature)
| Dataset | k=1 | k=2 | k=3 | k=4 | k=5 |
|---|---|---|---|---|---|
| HumanEval | 1.75 | 2.27 | 2.63 | 2.85 | 3.01 |
| math_reasoning | 1.79 | 2.37 | 2.80 | 3.09 | 3.29 |
| qa | 1.61 | 1.96 | 2.15 | 2.26 | 2.33 |
| question | 1.67 | 2.09 | 2.35 | 2.51 | 2.61 |
| rag | 1.63 | 2.01 | 2.23 | 2.35 | 2.43 |
| summarization | 1.63 | 1.99 | 2.19 | 2.31 | 2.36 |
| translation | 1.64 | 2.05 | 2.29 | 2.44 | 2.52 |
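For intuition, the acceptance length is the mean number of tokens emitted per verifier forward pass, so it bounds the achievable speedup. A simplified estimate follows; the 5% draft-to-verifier cost ratio is an assumption for illustration, not a measured number:

```python
def estimated_speedup(acceptance_length: float, k: int,
                      draft_cost_ratio: float = 0.05) -> float:
    """Rough speedup over plain autoregressive decoding: each verifier
    pass yields `acceptance_length` tokens but also pays for k draft
    passes, each costing `draft_cost_ratio` of a verifier pass."""
    return acceptance_length / (1.0 + k * draft_cost_ratio)

# k=5 acceptance lengths from the table above
for name, length in {"HumanEval": 3.01, "math_reasoning": 3.29, "qa": 2.33}.items():
    print(f"{name}: ~{estimated_speedup(length, k=5):.2f}x estimated")
```

Real-world speedups also depend on batch size and kernel overheads, so measured throughput from a benchmark such as guidellm is more reliable than this back-of-the-envelope figure.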
## Details

### Configuration
- temperature: default (vLLM default sampling)
- backend: vLLM chat_completions
- rate-type: throughput
- max-seconds per run: 300
- hardware: 8× A100 80GB GPUs (tensor parallelism 4)
- vLLM version: 0.17.1rc1.dev531+g89f572dbc.d20260324
- Benchmark data: RedHatAI/speculator_benchmarks
- vLLM serve: --no-enable-prefix-caching, --max-num-seqs 64, --enforce-eager
### Command

```shell
# Serve
vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --speculative-config '{
    "model": "RedHatAI/gpt-oss-120b-speculator.eagle3",
    "num_speculative_tokens": 5,
    "method": "eagle3"
  }' \
  --no-enable-prefix-caching \
  --max-num-seqs 64 \
  --enforce-eager

# Benchmark
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
GUIDELLM__MAX_CONCURRENCY=128 \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type throughput \
  --max-seconds 300
```