| | --- |
| | license: apache-2.0 |
| | library_name: transformers |
| | pipeline_tag: text-generation |
| | tags: |
| | - speculative-decoding |
| | - eagle3 |
| | - glm |
| | - draft-model |
| | - text-generation |
| | --- |
| | |
| | # EAGLE3 Draft Model for GLM-4.7-Flash |
| |
|
GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding with **GLM-4.7-Flash**. It enables faster inference by predicting multiple future tokens in parallel, which are then verified by the target model in a single forward pass.

**Version:** 1.0
**Release Date:** 2026-02-16
**Organization:** ThoughtWorks
**License:** apache-2.0

---
## Model Overview

This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) through speculative decoding. The draft model predicts multiple tokens ahead, achieving a **1.39× TPOT speedup** for single requests and a **1.70× throughput improvement** under concurrent load.

**Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a Mixture-of-Experts language model with 3B active parameters
**Draft Model Size**: 277.4 MB
**Architecture**: 1-layer transformer with a hidden size of 2048

### Key Features

- **FlashInfer Compatible**: head_dim=128
- **Acceptance Rate**: 40.0% (MT-Bench, B=1)
- **Speedup**: 1.39× TPOT (B=1), 1.70× throughput (B=32)
- **Hardware**: Optimized for single-GPU (TP=1) deployment

---
## Architecture Specifications

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Attention Heads | 16 |
| KV Heads (GQA) | 4 |
| Head Dimension | 128 |
| Intermediate Size | 8192 |
| Layers | 1 |
| Vocabulary Size | 154880 |
| Draft Vocab Size | 32000 |

**Note**: The hidden size matches the target model (GLM-4.7-Flash) to enable embedding weight sharing.
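
As a quick sanity check, the numbers in the table are internally consistent: the head dimension is the hidden size divided by the head count, and the query heads divide evenly across the KV heads for GQA.

```python
# Sanity-check the attention geometry from the table above.
hidden_size = 2048
num_attention_heads = 16
num_kv_heads = 4

head_dim = hidden_size // num_attention_heads
assert head_dim == 128  # the FlashInfer-compatible head dimension

# GQA: query heads are shared across KV heads in equal groups.
assert num_attention_heads % num_kv_heads == 0
gqa_group_size = num_attention_heads // num_kv_heads
assert gqa_group_size == 4  # 4 query heads per KV head
```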

---

## Training Details

### Dataset

**Mixed Diversity** — 54K samples

Composition:
- 45% ShareGPT
- 35% UltraChat
- 20% PerfectBlend

Average tokens per sample: 1300

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch Size | 1 |
| Learning Rate | 1e-4 |
| Warmup Ratio | 0.03 |
| Max Length | 1024 |

### Training Results

- **Training Acceptance Rate**: 79.2% at position k=0 (first draft token; the inference-time average across all 6 positions is ~40%)

---
## Benchmark Results

**Dataset**: MT-Bench (154 prompts, max_tokens=512, temperature=0.7)
**Hardware**: Single NVIDIA H100 (79 GB), TP=1
**Backend**: FlashInfer
**Spec Config**: num_steps=3, num_draft_tokens=6, eagle_topk=4

### Metric Definitions

- **Acceptance Rate**: Percentage of draft tokens accepted by the target model, averaged across all verification steps (not position-specific). Example: 40% means 2.4 of 6 drafted tokens are accepted on average.
- **Acceptance Length**: Average number of consecutive draft tokens accepted per verification step (directly determines speedup)
- **TTFT**: Time To First Token (prefill latency), in milliseconds
- **TPOT**: Time Per Output Token (decode latency), in milliseconds
- **Throughput**: Tokens generated per second
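
To make the definitions concrete, here is the arithmetic linking acceptance rate, acceptance length, and token yield per target forward pass, using the reported figures (num_draft_tokens=6, 40% acceptance). The overhead remark in the final comment is an interpretation, not a measured figure.

```python
# Link between the reported speculative-decoding metrics.
num_draft_tokens = 6
acceptance_rate = 0.40

# Acceptance length = average accepted draft tokens per verification step.
acceptance_length = acceptance_rate * num_draft_tokens
assert abs(acceptance_length - 2.4) < 1e-9

# Each verification step emits the accepted draft tokens plus one token
# sampled by the target model itself, giving an upper bound on speedup.
tokens_per_target_forward = 1 + acceptance_length
assert abs(tokens_per_target_forward - 3.4) < 1e-9

# The measured 1.39x TPOT speedup sits below this 3.4x ceiling because
# drafting and tree verification add per-step overhead.
```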

### Batch Size 1 (Single Request - Latency Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 76.1 | 74.74 | **1.02×** |
| TPOT (ms) | 8.18 | 5.89 | **1.39×** |
| Throughput (tok/s) | 120.3 | 167.75 | **1.39×** |
| Acceptance Rate (%) | — | **40.0%** | — |
| Acceptance Length | — | **2.4** | — |

### Batch Size 32 (Concurrent Load - Throughput Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 2988 | 3210 | 0.93× |
| TPOT (ms) | 22.57 | 17.33 | **1.30×** |
| Throughput (tok/s) | 258.61 | 440.15 | **1.70×** |
| Acceptance Rate (%) | — | **40.0%†** | — |
| Acceptance Length | — | **2.4†** | — |

†Same server session as B=1; the concurrent benchmark does not collect per-request acceptance stats.

**Key Insight**: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).

---
## Usage

### Installation

```bash
pip install sglang transformers
```

### Basic Usage

```bash
python -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-Flash \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 4 \
    --tp 1 \
    --trust-remote-code \
    --port 30000 \
    --enable-metrics
```

### Python API

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
        "temperature": 0.7,
    },
)
print(response.json())
```
### Performance Tips

1. **Backend Selection**: Use the FlashInfer backend (the default) for optimal performance
2. **Tuning**: Adjust `num_draft_tokens` based on your workload (3-6 recommended)
3. **Monitoring**: Pass the `--enable-metrics` flag and watch the `/metrics` endpoint for acceptance rates
4. **Validation**: Verify that the acceptance rate is above 0% after server startup to confirm the draft model loaded correctly
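
As a sketch of the monitoring tip, the `/metrics` endpoint serves the Prometheus text exposition format, and acceptance-related lines can be filtered out of it. The metric name in the sample below is hypothetical; the actual names exported depend on your SGLang version.

```python
def accept_metrics(metrics_text: str) -> list[str]:
    """Return non-comment Prometheus exposition lines mentioning 'accept'."""
    return [line for line in metrics_text.splitlines()
            if "accept" in line and not line.startswith("#")]

# Against a live server started with --enable-metrics, you would fetch:
#   urllib.request.urlopen("http://localhost:30000/metrics").read().decode()
# Illustrative sample (hypothetical metric name):
sample = """\
# HELP sglang:spec_accept_length Average accepted draft tokens per step
sglang:spec_accept_length 2.4
"""
print(accept_metrics(sample))  # ['sglang:spec_accept_length 2.4']
```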

---

## Limitations

- Requires an SGLang backend with EAGLE3 support
- Optimized for TP=1 inference (single-GPU deployment)
- FlashInfer backend recommended for optimal performance

---
## Citation

```bibtex
@misc{glm_4.7_flash_eagle3_2026,
  title={EAGLE3 Draft Model for GLM-4.7-Flash},
  author={ThoughtWorks},
  year={2026},
  howpublished={\url{https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3}},
}
```
### EAGLE3 Paper

```bibtex
@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}
```
| | --- |
| |
|
| | ## Additional Resources |
| |
|
| | - **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) |
| |
|
| | --- |
| |
|
| | ## License |
| |
|
| | apache-2.0 |
| |
|
| | --- |
| |
|
| | ## Contact |
| |
|
| | For questions or issues, open a discussion on the [model page](https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3/discussions). |