| | --- |
| | license: apache-2.0 |
| | library_name: transformers |
| | pipeline_tag: text-generation |
| | tags: |
| | - speculative-decoding |
| | - eagle3 |
| | - glm |
| | - draft-model |
| | - text-generation |
| | --- |
| | |
| | # EAGLE3 Draft Model for GLM-4.7-Flash |
| |
|
GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding with **GLM-4.7-Flash**. It enables faster inference by predicting multiple future tokens in parallel, which are then verified by the target model in a single forward pass.

**Version:** 1.0
**Release Date:** 2026-02-16
**Organization:** ThoughtWorks
**License:** apache-2.0

---
## Model Overview

This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) through speculative decoding. The draft model predicts multiple tokens ahead, achieving a **1.39× TPOT speedup** for single requests and a **1.70× throughput improvement** under concurrent load.

**Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a Mixture-of-Experts language model with 3B active parameters
**Draft Model Size**: 277.4 MB
**Architecture**: 1-layer transformer with a hidden size of 2048

### Key Features

- **FlashInfer Compatible**: head_dim=128
- **Acceptance Rate**: 40.0% (MT-Bench, B=1)
- **Speedup**: 1.39× TPOT (B=1), 1.70× throughput (B=32)
- **Hardware**: Optimized for single-GPU (TP=1) deployment

---
## Architecture Specifications

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2048 |
| Attention Heads | 16 |
| KV Heads (GQA) | 4 |
| Head Dimension | 128 |
| Intermediate Size | 8192 |
| Layers | 1 |
| Vocabulary Size | 154880 |
| Draft Vocab Size | 32000 |

**Note**: The hidden size matches the target model (GLM-4.7-Flash) to enable embedding weight sharing.
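
As a quick sanity check, the numbers in the table are internally consistent: the head dimension is the hidden size divided by the head count, and the query heads divide evenly across the KV heads for GQA.

```python
# Sanity-check the attention geometry from the table above.
hidden_size = 2048
num_attention_heads = 16
num_kv_heads = 4

head_dim = hidden_size // num_attention_heads
assert head_dim == 128  # the FlashInfer-compatible head dimension

# GQA: query heads are shared across KV heads in equal groups.
assert num_attention_heads % num_kv_heads == 0
gqa_group_size = num_attention_heads // num_kv_heads
assert gqa_group_size == 4  # 4 query heads per KV head
```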

---

## Training Details

### Dataset

**Mixed Diversity** — 54K samples

Composition:
- 45% ShareGPT
- 35% UltraChat
- 20% PerfectBlend

Average tokens per sample: 1300

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch Size | 1 |
| Learning Rate | 1e-4 |
| Warmup Ratio | 0.03 |
| Max Length | 1024 |

### Training Results

- **Training Acceptance Rate**: 79.2% at position k=0 (first draft token; the inference-time average across all 6 positions is ~40%)

---
## Benchmark Results

**Dataset**: MT-Bench (154 prompts, max_tokens=512, temperature=0.7)
**Hardware**: Single NVIDIA H100 (79 GB), TP=1
**Backend**: FlashInfer
**Spec Config**: num_steps=3, num_draft_tokens=6, eagle_topk=4

### Metric Definitions

- **Acceptance Rate**: Percentage of draft tokens accepted by the target model, averaged across all verification steps (not position-specific). Example: 40% means 2.4 of 6 drafted tokens are accepted on average.
- **Acceptance Length**: Average number of consecutive draft tokens accepted per verification step (directly determines speedup)
- **TTFT**: Time To First Token (prefill latency), in milliseconds
- **TPOT**: Time Per Output Token (decode latency), in milliseconds
- **Throughput**: Tokens generated per second
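
To make the definitions concrete, here is the arithmetic linking acceptance rate, acceptance length, and token yield per target forward pass, using the reported figures (num_draft_tokens=6, 40% acceptance). The overhead remark in the final comment is an interpretation, not a measured figure.

```python
# Link between the reported speculative-decoding metrics.
num_draft_tokens = 6
acceptance_rate = 0.40

# Acceptance length = average accepted draft tokens per verification step.
acceptance_length = acceptance_rate * num_draft_tokens
assert abs(acceptance_length - 2.4) < 1e-9

# Each verification step emits the accepted draft tokens plus one token
# sampled by the target model itself, giving an upper bound on speedup.
tokens_per_target_forward = 1 + acceptance_length
assert abs(tokens_per_target_forward - 3.4) < 1e-9

# The measured 1.39x TPOT speedup sits below this 3.4x ceiling because
# drafting and tree verification add per-step overhead.
```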

### Batch Size 1 (Single Request - Latency Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 76.1 | 74.74 | **1.02×** |
| TPOT (ms) | 8.18 | 5.89 | **1.39×** |
| Throughput (tok/s) | 120.3 | 167.75 | **1.39×** |
| Acceptance Rate (%) | — | **40.0%** | — |
| Acceptance Length | — | **2.4** | — |

### Batch Size 32 (Concurrent Load - Throughput Optimization)

#### Server-Side Metrics (Prometheus — Ground Truth)

| Metric | Baseline | EAGLE3 | Speedup |
|--------|----------|--------|---------|
| TTFT (ms) | 2988 | 3210 | 0.93× |
| TPOT (ms) | 22.57 | 17.33 | **1.30×** |
| Throughput (tok/s) | 258.61 | 440.15 | **1.70×** |
| Acceptance Rate (%) | — | **40.0%†** | — |
| Acceptance Length | — | **2.4†** | — |

†Same server session as B=1; the concurrent benchmark does not collect per-request acceptance stats.

**Key Insight**: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).

---
## Usage

### Installation

```bash
pip install sglang transformers
```

### Basic Usage

```bash
python -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-Flash \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thoughtworks/GLM-4.7-Flash-Eagle3 \
    --speculative-num-steps 3 \
    --speculative-num-draft-tokens 6 \
    --speculative-eagle-topk 4 \
    --tp 1 \
    --trust-remote-code \
    --port 30000 \
    --enable-metrics
```

### Python API

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
        "temperature": 0.7,
    },
)
print(response.json())
```
### Performance Tips

1. **Backend Selection**: Use the FlashInfer backend (the default) for optimal performance
2. **Tuning**: Adjust `num_draft_tokens` based on your workload (3-6 recommended)
3. **Monitoring**: Pass the `--enable-metrics` flag and watch the `/metrics` endpoint for acceptance rates
4. **Validation**: Verify that the acceptance rate is above 0% after server startup to confirm the draft model loaded correctly
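
As a sketch of the monitoring tip, the `/metrics` endpoint serves the Prometheus text exposition format, and acceptance-related lines can be filtered out of it. The metric name in the sample below is hypothetical; the actual names exported depend on your SGLang version.

```python
def accept_metrics(metrics_text: str) -> list[str]:
    """Return non-comment Prometheus exposition lines mentioning 'accept'."""
    return [line for line in metrics_text.splitlines()
            if "accept" in line and not line.startswith("#")]

# Against a live server started with --enable-metrics, you would fetch:
#   urllib.request.urlopen("http://localhost:30000/metrics").read().decode()
# Illustrative sample (hypothetical metric name):
sample = """\
# HELP sglang:spec_accept_length Average accepted draft tokens per step
sglang:spec_accept_length 2.4
"""
print(accept_metrics(sample))  # ['sglang:spec_accept_length 2.4']
```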

---

## Limitations

- Requires an SGLang backend with EAGLE3 support
- Optimized for TP=1 inference (single-GPU deployment)
- FlashInfer backend recommended for optimal performance

---
## Citation

```bibtex
@misc{glm_4.7_flash_eagle3_2026,
  title={EAGLE3 Draft Model for GLM-4.7-Flash},
  author={ThoughtWorks},
  year={2026},
  howpublished={\url{https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3}},
}
```
### EAGLE3 Paper

```bibtex
@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}
```
| | --- |
| |
|
| | ## Additional Resources |
| |
|
| | - **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) |
| |
|
| | --- |
| |
|
| | ## License |
| |
|
| | apache-2.0 |
| |
|
| | --- |
| |
|
| | ## Contact |
| |
|
| | For questions or issues, open a discussion on the [model page](https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3/discussions). |