---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- speculative-decoding
- eagle3
- glm
- draft-model
- text-generation
---

# EAGLE3 Draft Model for GLM-4.7-Flash

GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding with **GLM-4.7-Flash**. It enables faster inference by predicting multiple future tokens, which are then verified by the target model in a single forward pass.

**Version:** 1.0
**Release Date:** 2026-02-16
**Organization:** ThoughtWorks
**License:** apache-2.0

---

## Model Overview

This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) through speculative decoding. The draft model predicts multiple tokens ahead, achieving a **1.39× TPOT speedup** for single requests and a **1.70× throughput improvement** under concurrent load.

**Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a Mixture-of-Experts language model with 3B active parameters
**Draft Model Size**: 277.4 MB
**Architecture**: 1-layer transformer with hidden size 2048

### Key Features

- **FlashInfer Compatible**: head_dim=128 ✓
- **Acceptance Rate**: 40.0% (MT-Bench, B=1)
- **Speedup**: 1.39× TPOT (B=1), 1.70× throughput (B=32)
- **Hardware**: Optimized for single-GPU (TP=1) deployment

---

## Architecture Specifications

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Attention Heads | 16 |
| KV Heads (GQA) | 4 |
| Head Dimension | 128 |
| Intermediate Size | 8192 |
| Layers | 1 |
| Vocabulary Size | 154880 |
| Draft Vocab Size | 32000 |

**Note**: The hidden size matches the target model (GLM-4.7-Flash) to allow embedding weight sharing.
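As a quick sanity check, the attention geometry in the table is internally consistent. The short sketch below (illustrative only, using the values from the table above) verifies that the head counts and dimensions line up:

```python
# Draft-model attention geometry, taken from the specification table.
hidden_size = 2048
num_attention_heads = 16
num_kv_heads = 4       # grouped-query attention (GQA)
head_dim = 128         # FlashInfer-compatible head dimension

# The query heads partition the hidden state exactly.
assert num_attention_heads * head_dim == hidden_size

# Each KV head serves a fixed-size group of query heads.
assert num_attention_heads % num_kv_heads == 0
group_size = num_attention_heads // num_kv_heads
print(group_size)  # → 4 (four query heads share each KV head)
```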
---

## Training Details

### Dataset

**Mixed Diversity** — 54K samples

Composition:
- 45% ShareGPT
- 35% UltraChat
- 20% PerfectBlend

Average tokens per sample: 1300

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch Size | 1 |
| Learning Rate | 1e-4 |
| Warmup Ratio | 0.03 |
| Max Length | 1024 |

### Training Results

- **Training Acceptance Rate**: 79.2% at position k=0 (first draft token; the inference-time average across all 6 positions is ~40%)

---

## Benchmark Results

**Dataset**: MT-Bench (154 prompts, max_tokens=512, temperature=0.7)
**Hardware**: Single NVIDIA H100 (79GB), TP=1
**Backend**: FlashInfer
**Spec Config**: num_steps=3, num_draft_tokens=6, eagle_topk=4

### Metric Definitions

- **Acceptance Rate**: Percentage of draft tokens accepted by the target model, averaged across all verification steps (NOT position-specific). Example: 40% means 2.4 of 6 drafted tokens are accepted on average.
- **Acceptance Length**: Average number of consecutive draft tokens accepted per verification step (directly determines speedup).
- **TTFT**: Time To First Token (prefill latency), in milliseconds
- **TPOT**: Time Per Output Token (decode latency), in milliseconds
- **Throughput**: Tokens generated per second

### Batch Size 1 (Single Request - Latency Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 76.1 | 74.74 | **1.02×** |
| TPOT (ms) | 8.18 | 5.89 | **1.39×** |
| Throughput (tok/s) | 120.3 | 167.75 | **1.39×** |
| Acceptance Rate (%) | — | **40.0%** | — |
| Acceptance Length | — | **2.4** | — |

### Batch Size 32 (Concurrent Load - Throughput Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 2988 | 3210 | 0.93× |
| TPOT (ms) | 22.57 | 17.33 | **1.30×** |
| Throughput (tok/s) | 258.61 | 440.15 | **1.70×** |
| Acceptance Rate (%) | — | **40.0%†** | — |
| Acceptance Length | — | **2.4†** | — |

†Same server session as B=1; the concurrent benchmark does not collect per-request acceptance statistics.

**Key Insight**: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).

---

## Usage

### Installation

```bash
pip install sglang transformers
```

### Basic Usage

```bash
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 1 \
  --trust-remote-code \
  --port 30000 \
  --enable-metrics
```

### Python API

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
        "temperature": 0.7,
    },
)
print(response.json())
```

### Performance Tips

1. **Backend Selection**: Use the FlashInfer backend (the default) for optimal performance.
2. **Tuning**: Adjust `num_draft_tokens` to match the workload (3-6 recommended).
3. **Monitoring**: Enable the `--enable-metrics` flag and monitor the `/metrics` endpoint for acceptance rates.
4. **Validation**: After server startup, verify that the acceptance rate is above 0% to confirm the draft model loaded correctly.

---

## Limitations

- Requires an SGLang backend with EAGLE3 support
- Optimized for TP=1 inference (single-GPU deployment)
- FlashInfer backend recommended for optimal performance

---

## Citation

```bibtex
@misc{glm_4.7_flash_eagle3_2026,
  title={EAGLE3 Draft Model for GLM-4.7-Flash},
  author={ThoughtWorks},
  year={2026},
  howpublished={\url{https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3}},
}
```

### EAGLE3 Paper

```bibtex
@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}
```

---

## Additional Resources

- **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)

---

## License

apache-2.0

---

## Contact

For questions or issues, open a discussion on the [model page](https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3/discussions).
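To act on the monitoring and validation tips, you can scrape the server's `/metrics` endpoint and look for acceptance statistics. The sketch below is a minimal parser for Prometheus text output; the metric name `sglang_spec_accept_rate` and the sample scrape fragment are illustrative assumptions only, since exact metric names vary across SGLang versions.

```python
import re


def parse_accept_rate(metrics_text):
    """Extract acceptance-rate-like gauges from a Prometheus text scrape.

    Matching on the 'accept' substring is a heuristic, not an official
    contract; inspect your server's /metrics output for the real names.
    """
    rates = {}
    for line in metrics_text.splitlines():
        if line.startswith("#") or "accept" not in line:
            continue  # skip comments and unrelated metrics
        m = re.match(r"(\S+?)(?:\{[^}]*\})?\s+([0-9.eE+-]+)$", line)
        if m:
            rates[m.group(1)] = float(m.group(2))
    return rates


# Illustrative scrape fragment (hypothetical metric name and value):
sample = """# HELP sglang_spec_accept_rate Draft-token acceptance rate
sglang_spec_accept_rate{model="default"} 0.40
"""
print(parse_accept_rate(sample))  # → {'sglang_spec_accept_rate': 0.4}
```

In practice you would fetch the text with `requests.get("http://localhost:30000/metrics").text` before parsing; a parsed rate stuck at 0 suggests the draft model did not load.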