Fix model card: correct base_model to zai-org/GLM-4.7-FP8, license to MIT, add blog link
README.md (CHANGED)
@@ -1,9 +1,9 @@
 ---
 library_name: transformers
-license:
+license: mit
 language:
 - en
-base_model:
+base_model: zai-org/GLM-4.7-FP8
 pipeline_tag: text-generation
 tags:
 - eagle3
@@ -19,11 +19,11 @@ tags:
 
 # EAGLE3 Draft Head — GLM-4.7-FP8
 
-A lightweight EAGLE3 draft head for [GLM-4.7](https://huggingface.co/
+A lightweight EAGLE3 draft head for [GLM-4.7-FP8](https://huggingface.co/zai-org/GLM-4.7-FP8) (~218B MoE, 160 experts, sigmoid top-8 routing, ~40B active parameters per token). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.
 
 GLM-4.7 uses sigmoid top-8 routing — activating 8 out of 160 experts per token rather than the typical 1-2 in most MoE models. This preserves high representational capacity at the cost of increased compute, making speculative decoding especially valuable: the draft head is tiny relative to the 218B target.
 
-**Blog post**: [
+**Blog post**: [1.7x Faster on a 218B Model: EAGLE3 Speculative Decoding for GLM-4.7](https://huggingface.co/blog/lujangusface/tw-eagle3-glm47-fp8)
 
 ## Usage
 
@@ -37,7 +37,7 @@ Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for GLM-4.7 Eagl
 pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'
 
 python -m sglang.launch_server \
-    --model-path
+    --model-path zai-org/GLM-4.7-FP8 \
     --speculative-algorithm EAGLE3 \
     --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
     --speculative-num-steps 3 \
@@ -52,7 +52,7 @@ python -m sglang.launch_server \
 
 ```bash
 python -m sglang.launch_server \
-    --model-path
+    --model-path zai-org/GLM-4.7-FP8 \
     --speculative-algorithm EAGLE3 \
     --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
     --speculative-num-steps 3 \
@@ -153,7 +153,7 @@ The final fine-tuning stage uses training data where the assistant responses wer
 
 ## License
 
-This draft head is released under
+This draft head is released under the [MIT License](https://opensource.org/licenses/MIT), matching the [GLM-4.7-FP8 license](https://huggingface.co/zai-org/GLM-4.7-FP8).
 
 ## Citation
 
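As background for the routing claim in the card text above ("activating 8 out of 160 experts per token"), here is a minimal NumPy sketch of sigmoid top-k expert routing. The function name, hidden size, and router weight shape are illustrative assumptions, not GLM-4.7's actual implementation; the point is only that each expert gets an independent sigmoid gate and the eight highest-scoring experts are selected and renormalized.

```python
import numpy as np

NUM_EXPERTS = 160  # from the model card
TOP_K = 8          # sigmoid top-8 routing

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def route(hidden, router_weight):
    """Illustrative router: score every expert with an independent sigmoid
    gate, keep the TOP_K highest, and renormalize their weights to sum to 1."""
    logits = hidden @ router_weight            # (NUM_EXPERTS,) raw scores
    scores = sigmoid(logits)                   # per-expert gates in (0, 1)
    top = np.argsort(scores)[-TOP_K:][::-1]    # indices of the 8 largest scores
    weights = scores[top] / scores[top].sum()  # mixture weights over the winners
    return top, weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)                       # toy hidden state
router_weight = rng.standard_normal((64, NUM_EXPERTS)) # toy router matrix
experts, weights = route(hidden, router_weight)
print(len(experts), float(weights.sum()))
```

Only the 8 selected experts run their feed-forward pass for that token, which is why the card can describe ~40B active parameters out of ~218B total.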
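The `--speculative-num-steps 3` flag in the launch commands means the draft head proposes a few tokens per step before the target model verifies them. A heavily simplified greedy sketch of that draft/verify loop is below; EAGLE-3 actually speculates over a token tree and verifies it in one batched target pass, and `draft`/`target` here are toy stand-in callables, not the real models.

```python
def speculative_step(prefix, draft, target, num_steps=3):
    """Draft num_steps tokens, keep the longest prefix the target agrees
    with (greedy acceptance), then let the target emit one more token."""
    # Draft phase: cheap model proposes a run of tokens.
    proposal, ctx = [], list(prefix)
    for _ in range(num_steps):
        tok = draft(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # Verify phase: target checks each proposed token in order.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        if target(ctx) == tok:   # in practice one batched target forward pass
            accepted.append(tok)
            ctx.append(tok)
        else:
            break                # first disagreement discards the rest
    # The target always contributes at least one token per step.
    accepted.append(target(ctx))
    return accepted

# Toy deterministic "models" that agree for short contexts, then diverge.
draft = lambda ctx: len(ctx) % 5
target = lambda ctx: len(ctx) % 5 if len(ctx) < 4 else 0
out = speculative_step([1, 2], draft, target, num_steps=3)
print(out)
```

Because the draft head is tiny relative to the 218B target, every accepted draft token is a target forward pass saved, which is where the reported 1.7x speedup comes from.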