---
base_model: Qwen/Qwen3-Coder-Next-FP8
pipeline_tag: text-generation
---

# Aurora-Spec-Qwen3-Coder-Next-FP8

<div align="center">

[Project Page](https://aurora-spec-ai.github.io) · [OnlineSD Dataset](https://huggingface.co/datasets/zelc/onlinesd)

</div>

## Model Description

Measured on a holdout dataset from the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onlinesd/viewer/code) using the final Aurora checkpoint.

**Qwen3-Coder-Next: end-to-end throughput under varying batch size and lookahead**

We report tokens-per-second (TPS) statistics and speedup relative to the no-speculation baseline. Lookahead *N* denotes *N* speculative draft steps per verification pass (the examples below use `speculative_num_steps=5`); Acc Len is the average number of draft tokens accepted per verification step.

| BS | Config | Mean TPS | P50 TPS | P05 TPS | P95 TPS | Speedup (Mean) | Acc Len |
|:---:|:---------|:--------:|:-------:|:-------:|:-------:|:--------------:|:-------:|
| **1** | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
| | lookahead 3 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
| | lookahead 4 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
| | **lookahead 5** | **265.7** | **264.8** | **208.7** | **320.5** | **1.51×** | **3.06** |
| **8** | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
| | lookahead 3 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
| | lookahead 4 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
| | **lookahead 5** | **146.3** | **143.5** | **109.6** | **189.5** | **1.23×** | **3.07** |
| **16** | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
| | lookahead 3 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
| | lookahead 4 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
| | **lookahead 5** | **107.6** | **103.7** | **75.7** | **156.6** | **1.09×** | **3.06** |
| **32** | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
| | lookahead 3 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
| | lookahead 4 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
| | lookahead 5 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |
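
For reference, the per-column statistics can be aggregated from per-request TPS samples along the following lines. This is a minimal sketch, not the actual benchmark harness; the sample values below are placeholders:

```python
import statistics

def tps_summary(samples: list[float]) -> dict[str, float]:
    """Aggregate per-request TPS samples into the columns used above."""
    cuts = statistics.quantiles(samples, n=20)  # 19 cut points at 5% steps
    return {
        "mean": statistics.mean(samples),
        "p50": statistics.median(samples),
        "p05": cuts[0],    # 5th percentile
        "p95": cuts[18],   # 95th percentile
    }

# Placeholder samples -- a real run would collect one TPS value per request
baseline = tps_summary([176.4, 178.0, 172.3, 178.4, 175.2])
spec = tps_summary([265.7, 264.8, 208.7, 320.5, 270.1])
print(f"Speedup (Mean): {spec['mean'] / baseline['mean']:.2f}x")
```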

### Performance Across Different Batch Sizes

Aurora provides the **largest gains at small-to-moderate batch sizes**, with up to **1.51× speedup at batch size 1**, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases (a rough cost model after this list sketches why):

- **Batch Size 1** (Best Case): Up to **1.51× speedup** with the lookahead 5 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by the reduction in target-model forward passes.

- **Batch Size 8** (Moderate): **1.23× speedup** with the lookahead 5 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements under moderate batching.

- **Batch Size 16** (Diminishing Returns): **1.09× speedup** with the lookahead 5 configuration (3.06 average accept length). Benefits become marginal as verification overhead grows relative to baseline throughput.

- **Batch Size 32** (Negative Returns): At large batch sizes, **verification overhead dominates** and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The target model's batch-processing efficiency outweighs the benefit of skipping forward passes.
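
A back-of-the-envelope model makes the trend concrete. The sketch below is an idealization under simplifying assumptions (uniform per-pass costs, acceptance length independent of batch size), not the benchmark methodology: each verification round costs one target pass plus `num_steps` draft passes and yields `acc_len` tokens on average, so gains vanish once the relative cost of draft work gets too high.

```python
def estimated_speedup(acc_len: float, num_steps: int, draft_cost: float) -> float:
    """Idealized speedup of speculative decoding over plain decoding.

    Baseline cost per token: one target forward pass (normalized to 1.0).
    Speculative round: one target verification pass (1.0) plus `num_steps`
    draft passes (`draft_cost` each), producing `acc_len` tokens on average.
    """
    return acc_len / (1.0 + num_steps * draft_cost)


# With acc_len ~= 3.06 at lookahead 5, this model breaks even when the
# per-step draft cost reaches ~0.41 of a target pass -- qualitatively
# matching the slowdown at batch size 32, where batching makes the
# target's per-token cost much cheaper relative to the fixed draft overhead.
for rel_cost in (0.10, 0.25, 0.41, 0.60):
    print(f"relative draft cost {rel_cost:.2f} -> est. speedup "
          f"{estimated_speedup(3.06, 5, rel_cost):.2f}x")
```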

This model is designed to be used as a draft model in EAGLE3 speculative decoding pipelines with Qwen3-Coder as the target model.

### Example 1: Python API (Offline Batch Inference)

```python
import sglang as sgl


def main():
    # Sample prompts
    prompts = [
        "Write a Python function to compute fibonacci numbers:",
        "Implement a binary search algorithm in Python:",
        "Create a class for a binary tree in Python:",
    ]

    # Create sampling params
    sampling_params = {"temperature": 0.7, "max_new_tokens": 256}

    # Initialize engine with speculative decoding
    llm = sgl.Engine(
        model_path="Qwen/Qwen3-Coder-Next-FP8",
        speculative_draft_model_path="togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8",
        speculative_algorithm="EAGLE",
        speculative_num_steps=5,
        speculative_eagle_topk=1,
        speculative_num_draft_tokens=6,
        trust_remote_code=True,
    )

    # Generate with speculative decoding
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs
    for prompt, output in zip(prompts, outputs):
        print("=" * 50)
        print(f"Prompt: {prompt}")
        print(f"Generated: {output['text']}")


# The __main__ guard is necessary when subprocesses are started via spawn
if __name__ == "__main__":
    main()
```
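
To sanity-check throughput on your own prompts, generation can be timed directly. This is a minimal sketch that reuses the `llm`, `prompts`, and `sampling_params` objects defined inside `main()` above; reading token counts from `meta_info["completion_tokens"]` is an assumption and may differ across SGLang versions:

```python
import time

# Drop-in replacement for the generate call inside main() above.
# NOTE: the "meta_info"/"completion_tokens" field names are an assumption;
# adjust to the output schema of your installed SGLang release.
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(out["meta_info"]["completion_tokens"] for out in outputs)
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} TPS")
```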

### Example 2: Launch Server (Production Use)

**Step 1: Start the SGLang server with speculative decoding**

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-Coder-Next-FP8 \
    --speculative-draft-model-path togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 6 \
    --trust-remote-code \
    --port 30000 \
    --host 0.0.0.0
```
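
Model loading can take a while, so it helps to wait until the server is up before sending traffic. A minimal polling sketch (the `/health` route is an assumption; check the endpoints exposed by your SGLang version):

```python
import time

import requests

SERVER = "http://localhost:30000"

# Poll until the server answers; /health is assumed here --
# verify the health endpoint name for your SGLang version.
while True:
    try:
        if requests.get(f"{SERVER}/health", timeout=2).ok:
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)
print("server ready")
```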

**Step 2: Send requests to the server**

```python
import requests

# Server endpoint (OpenAI-compatible completions route)
url = "http://localhost:30000/v1/completions"

# Request payload
payload = {
    "model": "Qwen/Qwen3-Coder-Next-FP8",
    "prompt": "Write a Python function to compute fibonacci numbers:",
    "max_tokens": 256,
    "temperature": 0.7,
}

# Send request
response = requests.post(url, json=payload)
result = response.json()

print(result["choices"][0]["text"])
```

Or using the OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

response = client.completions.create(
    model="Qwen/Qwen3-Coder-Next-FP8",
    prompt="Write a Python function to compute fibonacci numbers:",
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].text)
```
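
Since the server speaks the OpenAI protocol, chat-style requests work through the same client, with the model's chat template applied server-side to the messages. A sketch using the same endpoint and model name as above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Chat endpoint: the server formats the messages with the model's chat template
response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next-FP8",
    messages=[
        {"role": "user", "content": "Write a Python function to compute fibonacci numbers."},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)
```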

### Local Model Paths

If you have downloaded the models locally, replace the HuggingFace model paths with local paths:

```bash
python -m sglang.launch_server \
    --model-path /path/to/Qwen3-Coder-Next-FP8 \
    --speculative-draft-model-path /path/to/Aurora-Spec-Qwen3-Coder-Next-FP8 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 6 \
    --trust-remote-code \
    --port 30000
```
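
If you still need to fetch the weights, `huggingface_hub` can mirror both repos to local directories. A short sketch; the `/path/to/...` destinations are placeholders:

```python
from huggingface_hub import snapshot_download

# Mirror the target and draft models into local directories
# (the /path/to/... destinations are placeholders)
snapshot_download("Qwen/Qwen3-Coder-Next-FP8", local_dir="/path/to/Qwen3-Coder-Next-FP8")
snapshot_download(
    "togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8",
    local_dir="/path/to/Aurora-Spec-Qwen3-Coder-Next-FP8",
)
```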

## Limitations