thoughtworks/GLM-4.7-Flash-Eagle3

@@ -3,17 +3,11 @@ license: apache-2.0
 library_name: transformers
 pipeline_tag: text-generation
 tags:
 - speculative-decoding
 - eagle3
 - glm
 - draft-model
 - text-generation
 ---
 # EAGLE3 Draft Model for GLM-4.7-Flash
@@ -29,7 +23,7 @@ GLM-4.7-Flash-Eagle3 is an EAGLE3 draft model trained for speculative decoding w
 ## Model Overview
-This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) through speculative decoding. The draft model predicts multiple tokens ahead, achieving **1.39× TPOT speedup** for single requests and **1.7× throughput improvement** under concurrent load.
 **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) - Mixture-of-Experts language model with 3B active parameters
 **Draft Model Size**: 277.4 MB
@@ -39,8 +33,8 @@ This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https:
 - **FlashInfer Compatible**: head_dim=128 ✓
 - **Acceptance Rate**: 40.0% (MT-Bench, B=1)
-- **Speedup**: 1.39× TPOT (B=1), 1.7× throughput (B=32)
-- **Hardware**: Optimized for TP=4 deployment
 ---
@@ -68,14 +62,10 @@ This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https:
 **Mixed Diversity** — 54K samples
 Composition:
 - 45% ShareGPT
 - 35% UltraChat
 - 20% PerfectBlend
 Average tokens per sample: 1300
 ### Hyperparameters
@@ -87,15 +77,10 @@ Average tokens per sample: 1300
 | Learning Rate | 1e-4 |
 | Warmup Ratio | 0.03 |
 | Max Length | 1024 |
-| TP Size | 4 |
 ### Training Results
-- **Training Acceptance Rate**: 79.2% (at position k=0)
-- **Best Checkpoint**: epoch_2_step_37323
-- **Experiment ID**: exp-K
 ---
@@ -123,8 +108,8 @@ Average tokens per sample: 1300
 | TTFT (ms) | 76.1 | 74.74 | **1.02×** |
 | TPOT (ms) | 8.18 | 5.89 | **1.39×** |
 | Throughput (tok/s) | 120.3 | 167.75 | **1.39×** |
-| Acceptance Rate | -- | **40.0%** | -- |
-| Acceptance Length | -- | **2.4** | -- |
 ### Batch Size 32 (Concurrent Load - Throughput Optimization)
@@ -132,9 +117,13 @@ Average tokens per sample: 1300
 | Metric | Baseline | EAGLE3 | Speedup |
 |--------|----------|--------|---------|
-| TTFT (ms) | 2988 | 3210 | **0.93×** |
-| TPOT (ms) | 22.57 | 17.33 | **1.3×** |
-| Throughput (tok/s) | 258.61 | 440.15 | **1.7×** |
 **Key Insight**: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).
@@ -162,7 +151,6 @@ python -m sglang.launch_server \
   --trust-remote-code \
   --port 30000 \
   --enable-metrics
 ```
 ### Python API
@@ -193,15 +181,10 @@ print(response.json())
 ## Limitations
 - Requires SGLang backend with EAGLE3 support
 - Optimized for TP=1 inference (single GPU deployment)
 - FlashInfer backend recommended for optimal performance
-- Head dimension 128 ensures FlashInfer compatibility
 ---
@@ -219,11 +202,11 @@ print(response.json())
 ### EAGLE3 Paper
 ```bibtex
-@article{wang2024eagle3,
   title={EAGLE-3: Lossless Acceleration of LLM Decoding by Adaptive Draft Heads},
   author={Wang, Yuhui and others},
-  journal={arXiv preprint arXiv:2501.XXXXX},
-  year={2024}
 }
 ```
@@ -231,8 +214,6 @@ print(response.json())
 ## Additional Resources
-- **Benchmark Results**: [https://github.com/thoughtworks/baby-shark/blob/main/benchmark/docs/mtbench_results.md](https://github.com/thoughtworks/baby-shark/blob/main/benchmark/docs/mtbench_results.md)
-- **Training Guide**: [https://github.com/thoughtworks/baby-shark/blob/main/training/docs/EXPERIMENT_EVOLUTION.md](https://github.com/thoughtworks/baby-shark/blob/main/training/docs/EXPERIMENT_EVOLUTION.md)
 - **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
 ---
@@ -245,4 +226,4 @@ apache-2.0
 ## Contact
-For questions or issues, please contact ThoughtWorks or open an issue in the [baby-shark repository](https://github.com/thoughtworks/baby-shark).

 library_name: transformers
 pipeline_tag: text-generation
 tags:
 - speculative-decoding
 - eagle3
 - glm
 - draft-model
 - text-generation
 ---
 # EAGLE3 Draft Model for GLM-4.7-Flash
 ## Model Overview
+This EAGLE3 draft model accelerates inference for [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) through speculative decoding. The draft model predicts multiple tokens ahead, achieving **1.39× TPOT speedup** for single requests and **1.70× throughput improvement** under concurrent load.
 **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) - Mixture-of-Experts language model with 3B active parameters
 **Draft Model Size**: 277.4 MB
 - **FlashInfer Compatible**: head_dim=128 ✓
 - **Acceptance Rate**: 40.0% (MT-Bench, B=1)
+- **Speedup**: 1.39× TPOT (B=1), 1.70× throughput (B=32)
+- **Hardware**: Optimized for single GPU (TP=1) deployment
 ---
 **Mixed Diversity** — 54K samples
 Composition:
 - 45% ShareGPT
 - 35% UltraChat
 - 20% PerfectBlend
 Average tokens per sample: 1300
 ### Hyperparameters
 | Learning Rate | 1e-4 |
 | Warmup Ratio | 0.03 |
 | Max Length | 1024 |
 ### Training Results
+- **Training Acceptance Rate**: 79.2% at position k=0 (first draft token; inference average across all 6 positions is ~40%)
 ---
 | TTFT (ms) | 76.1 | 74.74 | **1.02×** |
 | TPOT (ms) | 8.18 | 5.89 | **1.39×** |
 | Throughput (tok/s) | 120.3 | 167.75 | **1.39×** |
+| Acceptance Rate (%) | — | **40.0%** | — |
+| Acceptance Length | — | **2.4** | — |
 ### Batch Size 32 (Concurrent Load - Throughput Optimization)
 | Metric | Baseline | EAGLE3 | Speedup |
 |--------|----------|--------|---------|
+| TTFT (ms) | 2988 | 3210 | 0.93× |
+| TPOT (ms) | 22.57 | 17.33 | **1.30×** |
+| Throughput (tok/s) | 258.61 | 440.15 | **1.70×** |
+| Acceptance Rate (%) | — | **40.0%†** | — |
+| Acceptance Length | — | **2.4†** | — |
+†Same server session as B=1; concurrent benchmark does not collect per-request accept stats.
 **Key Insight**: Batch size 1 optimizes for interactive latency (TPOT matters most), while batch size 32 optimizes for serving capacity (throughput matters most).
   --trust-remote-code \
   --port 30000 \
   --enable-metrics
 ```
 ### Python API
 ## Limitations
 - Requires SGLang backend with EAGLE3 support
 - Optimized for TP=1 inference (single GPU deployment)
 - FlashInfer backend recommended for optimal performance
 ---
 ### EAGLE3 Paper
 ```bibtex
+@article{wang2025eagle3,
   title={EAGLE-3: Lossless Acceleration of LLM Decoding by Adaptive Draft Heads},
   author={Wang, Yuhui and others},
+  journal={arXiv preprint arXiv:2503.01840},
+  year={2025}
 }
 ```
 ## Additional Resources
 - **Target Model**: [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
 ---
 ## Contact
+For questions or issues, open a discussion on the [model page](https://huggingface.co/thoughtworks/GLM-4.7-Flash-Eagle3/discussions).
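
Review note: the Speedup columns in the two benchmark tables can be rechecked from the raw Baseline and EAGLE3 values. The sketch below is not part of the model card. It assumes that speedup is `baseline / eagle3` for latency metrics (TTFT, TPOT, where lower is better) and `eagle3 / baseline` for throughput, and the names `BENCHMARKS`, `speedup`, and `speedup_table` are hypothetical helpers introduced only for this check.

```python
# Values copied from the benchmark tables in this diff: (baseline, eagle3).
# Assumption: latency metrics improve when they shrink, throughput when it grows.
LOWER_IS_BETTER = {"TTFT (ms)", "TPOT (ms)"}

BENCHMARKS = {
    "B=1": {
        "TTFT (ms)": (76.1, 74.74),
        "TPOT (ms)": (8.18, 5.89),
        "Throughput (tok/s)": (120.3, 167.75),
    },
    "B=32": {
        "TTFT (ms)": (2988, 3210),
        "TPOT (ms)": (22.57, 17.33),
        "Throughput (tok/s)": (258.61, 440.15),
    },
}


def speedup(metric, baseline, eagle3):
    """Return the speedup factor, oriented so that values above 1 mean EAGLE3 is better."""
    if metric in LOWER_IS_BETTER:
        return baseline / eagle3
    return eagle3 / baseline


def speedup_table(benchmarks=BENCHMARKS):
    """Compute speedups for every metric, rounded to two decimals as in the tables."""
    return {
        batch: {
            metric: round(speedup(metric, base, eagle), 2)
            for metric, (base, eagle) in rows.items()
        }
        for batch, rows in benchmarks.items()
    }


if __name__ == "__main__":
    for batch, rows in speedup_table().items():
        for metric, value in rows.items():
            print(f"{batch} {metric}: {value:.2f}x")
```

Under these assumptions the recomputed factors agree with the updated tables, including the two-decimal `1.30×` and `1.70×` values introduced on the new side of the diff and the TTFT regression of `0.93×` at batch size 32.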