docs: standardize model card for public release
README.md
A speculative decoding draft head for [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct), trained using the [EAGLE3](https://arxiv.org/abs/2503.01840) method on Google Cloud TPU with the [SpecJAX](https://github.com/tails-mpt/SpecJAX) framework.

EAGLE3 draft heads accelerate autoregressive generation by proposing multiple tokens per step that a target model then verifies in parallel, typically achieving 2-3x throughput gains with no change in output quality.
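The propose-then-verify loop can be sketched in a few lines of Python. Everything below is a toy illustration: the two "models" are plain counting functions, and `speculative_step` is our own name for the sketch, not a SpecJAX or SGLang API.

```python
# Toy sketch of the propose-then-verify loop behind speculative decoding.
# Both "models" are stand-in functions mapping a context tuple to the next
# token; `speculative_step` is an illustrative name, not a real API.

def speculative_step(target, draft, ctx, k):
    """Draft k tokens, verify with the target, return the emitted tokens."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposal, d_ctx = [], list(ctx)
    for _ in range(k):
        tok = draft(tuple(d_ctx))
        proposal.append(tok)
        d_ctx.append(tok)

    # Verify phase: the target checks each proposed position. In a real
    # engine this is one batched forward pass, not k sequential calls.
    emitted, v_ctx = [], list(ctx)
    for tok in proposal:
        expected = target(tuple(v_ctx))
        if tok != expected:
            # First mismatch: emit the target's own token and stop, so the
            # output is identical to plain autoregressive decoding.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        v_ctx.append(tok)

    # Every draft token was accepted: the target grants one bonus token.
    emitted.append(target(tuple(v_ctx)))
    return emitted

# Deterministic toy models: the target counts upward; the draft agrees,
# so all 3 drafts are accepted and a bonus token is emitted.
target = lambda ctx: ctx[-1] + 1
print(speculative_step(target, target, (0,), k=3))  # [1, 2, 3, 4]
```

Because rejected positions fall back to the target's own token, the accepted output is always exactly what plain decoding would have produced; only the number of target forward passes changes.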
## Usage

### SGLang (GPU)

> **Note**: Qwen2.5 EAGLE3 support requires a small patch to SGLang (adding `set_eagle3_layers_to_capture()` to the Qwen2 model). See the [SpecJAX inference guide](https://github.com/tails-mpt/SpecJAX/tree/main/inference) for details.
```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-14B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/Qwen2.5-14B-Instruct-Eagle3 \
  --speculative-eagle-topk 1 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 4 \
  --dtype bfloat16
```
### sglang-jax (TPU)

> **Note**: Requires the same Qwen2 EAGLE3 patch applied to sglang-jax. The sglang-jax EAGLE3 pipeline is functional but not yet performance-optimized.

```bash
python -m sgl_jax.launch_server \
  --model-path Qwen/Qwen2.5-14B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/Qwen2.5-14B-Instruct-Eagle3 \
  --speculative-eagle-topk 1 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 4 \
  --tp-size 4 --dtype bfloat16
```
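As a rough consistency check on the speculative flags above, assuming a linear draft chain (`--speculative-eagle-topk 1`): each draft step contributes `topk` tokens, and the verifier needs one extra position for the bonus token. The helper below is our own illustration, not an SGLang option or API.

```python
# With a linear chain (topk = 1), each of the num_steps draft steps adds
# one token, and the verifier reserves one extra position for the bonus
# token, so num_draft_tokens = num_steps * topk + 1.

def draft_tokens_needed(num_steps: int, topk: int) -> int:
    return num_steps * topk + 1

print(draft_tokens_needed(num_steps=3, topk=1))  # 4, matching --speculative-num-draft-tokens
```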
### Python (SGLang client)

```python
import sglang as sgl

# The argument list below mirrors the server flags above; the full example
# was elided in this diff view, so treat the exact keywords as illustrative.
llm = sgl.LLM(
    model_path="Qwen/Qwen2.5-14B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="thoughtworks/Qwen2.5-14B-Instruct-Eagle3",
    speculative_eagle_topk=1,
    speculative_num_steps=3,
    speculative_num_draft_tokens=4,
)
```
## Training

| Parameter | Value |
|-----------|-------|
| Framework | [SpecJAX](https://github.com/tails-mpt/SpecJAX) — pure JAX, no Flax/PyTorch |
| Hardware | Google Cloud TPU v4-32 (4 hosts x 4 chips, TP=4, DP=4) |
| Dataset | 54K mixed: ShareGPT (45%) + UltraChat-200K (35%) + Open-PerfectBlend (20%) |
| Epochs | 3 |
| Steps | 4,983 per epoch |
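The mixture percentages in the table translate into approximate per-source sample counts as follows (the counts are our own arithmetic from the stated fractions, not figures from the card):

```python
# Breakdown of the 54K-sample training mixture from the table above
# (counts rounded; exact per-source counts are not stated in the card).

TOTAL = 54_000
mix = {"ShareGPT": 0.45, "UltraChat-200K": 0.35, "Open-PerfectBlend": 0.20}
counts = {name: round(TOTAL * frac) for name, frac in mix.items()}
print(counts)
# {'ShareGPT': 24300, 'UltraChat-200K': 18900, 'Open-PerfectBlend': 10800}
```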
Token acceptance rates on generic instruction-following data (ShareGPT-style prompts):

| Metric | Acceptance rate |
|--------|-----------------|
| acc_5  | 52.1% |
| acc_6  | 50.7% |

*Measured on held-out evaluation data. Actual throughput gains depend on hardware, prompt distribution, and runtime version.*
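Per-depth acceptance rates translate into expected emitted tokens per target forward pass: a step always emits one guaranteed token, plus draft token i whenever every shallower draft token was also accepted. A minimal sketch of that calculation, using hypothetical placeholder rates (the shallower acc values are not shown in this diff):

```python
# E[tokens per step] = 1 + sum_i prod_{j <= i} acc_j: one guaranteed token,
# plus draft token i only if the whole prefix up to depth i was accepted.
# The rates below are HYPOTHETICAL placeholders, not this model's numbers.

def expected_tokens_per_step(acc):
    expected, alive = 1.0, 1.0
    for p in acc:
        alive *= p          # probability the accepted prefix reaches depth i
        expected += alive
    return expected

print(round(expected_tokens_per_step([0.8, 0.7, 0.6]), 3))  # 2.696
```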
## Model Architecture

The draft head is a single-layer transformer that operates on the target model's hidden states.
## License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base model's license.

## References