# EAGLE3 Draft Model for GLM-4.7-Flash

GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding with **GLM-4.7-Flash**. It speeds up inference by predicting several future tokens in parallel, which the target model then verifies in a single forward pass.

**Version:** 1.0
**Release Date:** 2026-02-16
**Organization:** ThoughtWorks
**License:** Apache-2.0

---

## Model Overview

This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) through speculative decoding. The draft model predicts multiple tokens ahead, achieving a **1.39× TPOT speedup** for single requests and a **1.7× throughput improvement** under concurrent load.

**Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a Mixture-of-Experts language model with 3B active parameters
**Draft Model Size**: 277.4 MB
**Architecture**: 1-layer transformer with a hidden size of 2048

### Key Features

- **FlashInfer Compatible**: head_dim=128 ✓
- **Acceptance Rate**: 40.0% (MT-Bench, B=1)
- **Speedup**: 1.39× TPOT (B=1), 1.7× throughput (B=32)
- **Hardware**: Benchmarked on a single NVIDIA H100 (TP=1); trained with TP=4

---

## Architecture Specifications

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Attention Heads | 16 |
| KV Heads (GQA) | 4 |
| Head Dimension | 128 |
| Intermediate Size | 8192 |
| Layers | 1 |
| Vocabulary Size | 154880 |
| Draft Vocab Size | 32000 |

**Note**: The hidden size matches the target model (GLM-4.7-Flash), which allows embedding weight sharing.

---

## Training Details

### Dataset

**Mixed Diversity** — 54K samples

Composition:

- 45% ShareGPT
- 35% UltraChat
- 20% PerfectBlend

Average tokens per sample: 1300
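
The mix above works out to the following per-source sample counts and overall token budget (a quick arithmetic sanity check that treats 54K and 1300 as exact round figures):

```python
# Per-source sample counts for the 54K-sample "Mixed Diversity" set.
total_samples = 54_000
mix = {"ShareGPT": 0.45, "UltraChat": 0.35, "PerfectBlend": 0.20}

samples_per_source = {name: round(total_samples * frac) for name, frac in mix.items()}
total_tokens = total_samples * 1_300  # at ~1300 tokens per sample

print(samples_per_source)  # {'ShareGPT': 24300, 'UltraChat': 18900, 'PerfectBlend': 10800}
print(f"~{total_tokens / 1e6:.1f}M training tokens")  # ~70.2M training tokens
```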

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch Size | 1 |
| Learning Rate | 1e-4 |
| Warmup Ratio | 0.03 |
| Max Length | 1024 |
| TP Size | 4 |

### Training Results

- **Training Acceptance Rate**: 79.2% (at position k=0)
- **Best Checkpoint**: epoch_2_step_37323
- **Experiment ID**: exp-K

---

## Benchmark Results

**Dataset**: MT-Bench (154 prompts, max_tokens=512, temperature=0.7)
**Hardware**: Single NVIDIA H100 (79 GB), TP=1
**Backend**: FlashInfer
**Spec Config**: num_steps=3, num_draft_tokens=6, eagle_topk=4

### Metric Definitions

- **Acceptance Rate**: Percentage of draft tokens accepted by the target model, averaged across all verification steps (not position-specific). Example: 40% means that, on average, 2.4 of the 6 drafted tokens are accepted.
- **Acceptance Length**: Average number of consecutive draft tokens accepted per verification step; this directly determines the speedup.
- **TTFT**: Time To First Token (prefill latency), in milliseconds
- **TPOT**: Time Per Output Token (decode latency), in milliseconds
- **Throughput**: Tokens generated per second
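
The definitions above can be tied together with a little arithmetic, using the B=1 figures reported in this card (a sanity check on the numbers, not a model of speculative decoding):

```python
# Acceptance rate x draft budget = mean accepted tokens (the acceptance length).
num_draft_tokens = 6
acceptance_rate = 0.40
acceptance_length = round(acceptance_rate * num_draft_tokens, 1)  # 2.4

# TPOT speedup is simply the ratio of per-token decode latencies.
baseline_tpot_ms, eagle3_tpot_ms = 8.18, 5.89
tpot_speedup = round(baseline_tpot_ms / eagle3_tpot_ms, 2)  # 1.39

print(acceptance_length, tpot_speedup)  # 2.4 1.39
```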

### Batch Size 1 (Single Request - Latency Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 76.1 | 74.74 | **1.02×** |
| TPOT (ms) | 8.18 | 5.89 | **1.39×** |
| Throughput (tok/s) | 120.3 | 167.75 | **1.39×** |
| Acceptance Rate | -- | **40.0%** | -- |
| Acceptance Length | -- | **2.4** | -- |

### Batch Size 32 (Concurrent Load - Throughput Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 2988 | 3210 | **0.93×** |
| TPOT (ms) | 22.57 | 17.33 | **1.3×** |
| Throughput (tok/s) | 258.61 | 440.15 | **1.7×** |

**Key Insight**: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).
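
The batch-32 trade-off can be quantified directly from the table: EAGLE3 buys throughput at a small prefill cost (numbers copied from the table above):

```python
# Batch-size-32 server-side metrics from the table above.
baseline = {"ttft_ms": 2988, "tpot_ms": 22.57, "throughput_tok_s": 258.61}
eagle3 = {"ttft_ms": 3210, "tpot_ms": 17.33, "throughput_tok_s": 440.15}

throughput_gain = round(eagle3["throughput_tok_s"] / baseline["throughput_tok_s"], 2)  # 1.7
ttft_ratio = round(baseline["ttft_ms"] / eagle3["ttft_ms"], 2)  # 0.93: prefill gets slightly slower

print(throughput_gain, ttft_ratio)  # 1.7 0.93
```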

---

## Usage

### Installation

```bash
pip install sglang transformers
```

### Basic Usage

```bash
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-Flash \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 1 \
  --trust-remote-code \
  --port 30000 \
  --enable-metrics
```

### Python API

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
        "temperature": 0.7,
    },
)
print(response.json())
```
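
For interactive clients, the same endpoint also supports OpenAI-style streaming. A sketch, assuming the server launched above is listening on localhost:30000 and emits standard `data: ...` server-sent-event chunks:

```python
import json

import requests


def extract_delta(chunk: dict) -> str:
    """Return the incremental text from one OpenAI-style streaming chunk."""
    return chunk["choices"][0]["delta"].get("content") or ""


def stream_chat(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """Stream a chat completion and return the assembled reply."""
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "model": "default",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 100,
            "temperature": 0.7,
            "stream": True,
        },
        stream=True,
    )
    pieces = []
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len(b"data: "):]
        if payload.strip() == b"[DONE]":
            break
        pieces.append(extract_delta(json.loads(payload)))
    return "".join(pieces)


# Example (requires a running server): print(stream_chat("Hello!"))
```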

### Performance Tips

1. **Backend Selection**: Use the FlashInfer backend (the default) for best performance.
2. **Tuning**: Adjust `num_draft_tokens` to the workload (3-6 recommended).
3. **Monitoring**: Pass `--enable-metrics` and watch the `/metrics` endpoint for acceptance rates.
4. **Validation**: After server startup, verify that the acceptance rate is above 0% to confirm the draft model loaded correctly.
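
Tips 3 and 4 can be automated with a small check against the metrics endpoint. A sketch: the exact series names SGLang exports vary by version, so this filters for anything acceptance-related rather than assuming a specific metric name:

```python
import requests


def acceptance_metrics(metrics_text: str) -> dict:
    """Pick acceptance-related series out of a Prometheus /metrics dump."""
    found = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        if "accept" in line.lower():
            name, _, value = line.rpartition(" ")
            try:
                found[name] = float(value)
            except ValueError:
                pass  # not a simple "name value" sample line
    return found


def check_draft_model(base_url: str = "http://localhost:30000") -> dict:
    """Fetch /metrics; an empty result suggests the draft model did not load."""
    return acceptance_metrics(requests.get(f"{base_url}/metrics").text)


# Example (requires a running server):
# assert check_draft_model(), "no acceptance metrics found -- is the draft model loaded?"
```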

---

## Limitations

- Requires an SGLang backend with EAGLE3 support
- Optimized for TP=1 inference (single-GPU deployment)
- FlashInfer backend recommended for best performance
- Head dimension of 128 ensures FlashInfer compatibility

---

## Citation

```bibtex
@misc{glm_4.7_flash_eagle3_2026,
  title={EAGLE3 Draft Model for GLM-4.7-Flash},
  author={ThoughtWorks},
  year={2026},
  howpublished={\url{https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3}},
}
```

### EAGLE3 Paper

```bibtex
@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}
```

---

## Additional Resources

- **Benchmark Results**: [mtbench_results.md](https://github.com/thoughtworks/baby-shark/blob/main/benchmark/docs/mtbench_results.md)
- **Training Guide**: [EXPERIMENT_EVOLUTION.md](https://github.com/thoughtworks/baby-shark/blob/main/training/docs/EXPERIMENT_EVOLUTION.md)
- **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)

---

## License

Apache-2.0

---

## Contact

For questions or issues, please contact ThoughtWorks or open an issue in the [baby-shark repository](https://github.com/thoughtworks/baby-shark).