jisenli committed on

Commit fb6ef3d · verified · 1 Parent(s): 4962383

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +136 -36
README.md CHANGED
@@ -12,7 +12,16 @@ base_model: Qwen/Qwen3-Coder-Next-FP8
  pipeline_tag: text-generation
  ---

- # Qwen3-Coder-Next-FP8-EAGLE3

  ## Model Description

@@ -94,42 +103,38 @@ Trained on the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onli

  Measured on a holdout dataset from the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onlinesd/viewer/code) using the final Aurora checkpoint.

- #### Configuration Explanation

- The **Config** column denotes the speculative decoding hyperparameters in a compact notation:
- - **Format**: `spec_steps | eagle_topk | num_draft_tokens`
- - **516**: 5 speculative steps | top-1 selection | 6 draft tokens

- This configuration controls the trade-off between draft quality (more steps = better quality) and verification overhead (more tokens = more computation).
-
- | Batch | Config | Mean TPS | Median TPS | P05 TPS | P95 TPS | Speedup | Acc Len |
- |-------|--------|----------|------------|---------|---------|---------|---------|
  | **1** | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
- | | 314 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
- | | 415 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
- | | **516** | **265.7** | **264.8** | **208.7** | **320.5** | **1.51×** | **3.06** |
  | **8** | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
- | | 314 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
- | | 415 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
- | | **516** | **146.3** | **143.5** | **109.6** | **189.5** | **1.23×** | **3.07** |
  | **16** | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
- | | 314 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
- | | 415 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
- | | **516** | **107.6** | **103.7** | **75.7** | **156.6** | **1.09×** | **3.06** |
  | **32** | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
- | | 314 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
- | | 415 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
- | | 516 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |

  ### Performance Across Different Batch Sizes

  Aurora provides the **largest gains at small-to-moderate batch sizes**, with up to **1.51× speedup at batch size 1**, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:

- - **Batch Size 1** (Best Case): Up to **1.51× speedup** with 516 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by reduced target model forward passes.

- - **Batch Size 8** (Moderate): **1.23× speedup** with 516 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.

- - **Batch Size 16** (Diminishing Returns): **1.09× speedup** with 516 configuration (3.06 average accept length). Benefits become marginal as verification overhead increases relative to baseline throughput.

  - **Batch Size 32** (Negative Returns): At large batch sizes, **verification overhead dominates** and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The target model's batch processing efficiency outweighs the benefits of skipping forward passes.

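The steps-vs-overhead trade-off noted above can be made concrete with a first-order cost model. The sketch below is illustrative only: `draft_cost`, the per-pass cost of the draft model relative to the target, is an assumed parameter, not a measured quantity, and real serving overheads are ignored.

```python
def estimated_speedup(acc_len: float, spec_steps: int, draft_cost: float) -> float:
    """Idealized speedup over plain decoding (first-order model).

    acc_len:    average accepted tokens per verification cycle
    spec_steps: number of draft forward passes per cycle
    draft_cost: cost of one draft pass relative to one target pass (assumed)

    Each cycle costs one target forward pass plus spec_steps draft passes,
    and yields acc_len tokens; plain decoding yields one token per target pass.
    """
    return acc_len / (1.0 + spec_steps * draft_cost)

# With a free draft model the speedup equals the accept length; real
# draft and verification overheads pull it down toward the measured values.
```

This simple model explains why a higher accept length helps but cannot translate into speedup one-for-one.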
@@ -149,26 +154,121 @@ While Aurora demonstrates strong performance at small-to-moderate batch sizes, t

  This model is designed to be used as a draft model in EAGLE3 speculative decoding pipelines with Qwen3-Coder as the target model.

- ### Example with SGLang

- **Note: This is a placeholder example. TODO - Verify and update with tested code.**

  ```python
- from sglang import Engine
-
- # Initialize engine with speculative decoding
- engine = Engine(
-     model_path="Qwen/Qwen3-Coder-Next-FP8",
-     speculative_draft_model_path="togethercomputer/Qwen3-Coder-Next-FP8-EAGLE3",
-     speculative_algorithm="EAGLE",
-     speculative_num_steps=5,
  )

- # Generate with speculative decoding
- output = engine.generate(
      prompt="Write a Python function to compute fibonacci numbers:",
      max_tokens=256,
  )
  ```

  ## Limitations

  pipeline_tag: text-generation
  ---

+ # Aurora-Spec-Qwen3-Coder-Next-FP8
+
+ <div align="center">
+
+ [![Website](https://img.shields.io/badge/🌐-Website-blue)](https://aurora-spec-ai.github.io)
+ [![Code](https://img.shields.io/badge/💻-Code-green)](#)
+ [![Dataset](https://img.shields.io/badge/📊-Dataset-orange)](https://huggingface.co/datasets/zelc/onlinesd)
+ [![Paper](https://img.shields.io/badge/📄-Paper-red)](#)
+
+ </div>

  ## Model Description

  Measured on a holdout dataset from the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onlinesd/viewer/code) using the final Aurora checkpoint.

+ **Qwen3-Coder-Next: end-to-end throughput under varying batch size and lookahead**
+
+ We report tokens-per-second (TPS) statistics and speedup relative to the no-speculation baseline.

+ | BS | Config | Mean TPS | P50 TPS | P05 TPS | P95 TPS | Speedup (Mean) | Acc Len |
+ |:---:|:---------|:--------:|:-------:|:-------:|:-------:|:--------------:|:-------:|
  | **1** | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
+ | | lookahead 3 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
+ | | lookahead 4 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
+ | | **lookahead 5** | **265.7** | **264.8** | **208.7** | **320.5** | **1.51×** | **3.06** |
  | **8** | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
+ | | lookahead 3 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
+ | | lookahead 4 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
+ | | **lookahead 5** | **146.3** | **143.5** | **109.6** | **189.5** | **1.23×** | **3.07** |
  | **16** | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
+ | | lookahead 3 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
+ | | lookahead 4 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
+ | | **lookahead 5** | **107.6** | **103.7** | **75.7** | **156.6** | **1.09×** | **3.06** |
  | **32** | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
+ | | lookahead 3 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
+ | | lookahead 4 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
+ | | lookahead 5 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |

  ### Performance Across Different Batch Sizes

  Aurora provides the **largest gains at small-to-moderate batch sizes**, with up to **1.51× speedup at batch size 1**, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:

+ - **Batch Size 1** (Best Case): Up to **1.51× speedup** with the lookahead 5 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by reduced target model forward passes.

+ - **Batch Size 8** (Moderate): **1.23× speedup** with the lookahead 5 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.

+ - **Batch Size 16** (Diminishing Returns): **1.09× speedup** with the lookahead 5 configuration (3.06 average accept length). Benefits become marginal as verification overhead increases relative to baseline throughput.

  - **Batch Size 32** (Negative Returns): At large batch sizes, **verification overhead dominates** and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The target model's batch processing efficiency outweighs the benefits of skipping forward passes.

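To build intuition for the **Acc Len** column: per verification cycle, the target model keeps the longest drafted prefix it agrees with, plus the one token it supplies itself. The toy greedy-matching sketch below illustrates that bookkeeping only; the actual EAGLE3 tree verification is more involved.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count the matched prefix between drafted tokens and the target's
    own predictions, plus the one token the target always contributes,
    so every verification cycle yields at least 1 token."""
    matched = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        matched += 1
    return matched + 1

# Example: 2 of 3 drafted tokens match -> 3 tokens from one target pass,
# which is the regime the ~3.0 accept lengths in the table correspond to.
```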
  This model is designed to be used as a draft model in EAGLE3 speculative decoding pipelines with Qwen3-Coder as the target model.

+ ### Example 1: Python API (Offline Batch Inference)
+
+ ```python
+ import sglang as sgl
+
+ def main():
+     # Sample prompts
+     prompts = [
+         "Write a Python function to compute fibonacci numbers:",
+         "Implement a binary search algorithm in Python:",
+         "Create a class for a binary tree in Python:",
+     ]
+
+     # Create sampling params
+     sampling_params = {"temperature": 0.7, "max_new_tokens": 256}
+
+     # Initialize engine with speculative decoding
+     llm = sgl.Engine(
+         model_path="Qwen/Qwen3-Coder-Next-FP8",
+         speculative_draft_model_path="togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8",
+         speculative_algorithm="EAGLE",
+         speculative_num_steps=5,
+         speculative_eagle_topk=1,
+         speculative_num_draft_tokens=6,
+         trust_remote_code=True,
+     )
+
+     # Generate with speculative decoding
+     outputs = llm.generate(prompts, sampling_params)
+
+     # Print the outputs
+     for prompt, output in zip(prompts, outputs):
+         print("=" * 50)
+         print(f"Prompt: {prompt}")
+         print(f"Generated: {output['text']}")
+
+ # The __main__ condition is necessary when using spawn to create subprocesses
+ if __name__ == "__main__":
+     main()
+ ```
+
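The three `speculative_*` values in the example correspond to the lookahead 5 row of the benchmark table (5 steps, top-1 selection, 6 draft tokens). A hypothetical convenience helper (not part of SGLang) that expands a lookahead value into those kwargs, following the `num_draft_tokens = lookahead + 1` pattern of the benchmarked 314/415/516 configurations:

```python
def spec_kwargs(lookahead: int, eagle_topk: int = 1) -> dict:
    """Expand a 'lookahead' value into the speculative-decoding kwargs
    used in the examples. Assumes draft tokens = lookahead + 1, matching
    the 314/415/516 configurations reported in the benchmark table."""
    return {
        "speculative_algorithm": "EAGLE",
        "speculative_num_steps": lookahead,
        "speculative_eagle_topk": eagle_topk,
        "speculative_num_draft_tokens": lookahead + 1,
    }

# e.g. sgl.Engine(model_path=..., speculative_draft_model_path=..., **spec_kwargs(5))
```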
+ ### Example 2: Launch Server (Production Use)
+
+ **Step 1: Start the SGLang server with speculative decoding**
+
+ ```bash
+ python -m sglang.launch_server \
+     --model-path Qwen/Qwen3-Coder-Next-FP8 \
+     --speculative-draft-model-path togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8 \
+     --speculative-algorithm EAGLE \
+     --speculative-num-steps 5 \
+     --speculative-eagle-topk 1 \
+     --speculative-num-draft-tokens 6 \
+     --trust-remote-code \
+     --port 30000 \
+     --host 0.0.0.0
+ ```
+
+ **Step 2: Send requests to the server**
+
+ ```python
+ import requests
+
+ # Server endpoint
+ url = "http://localhost:30000/v1/completions"
+
+ # Request payload
+ payload = {
+     "prompt": "Write a Python function to compute fibonacci numbers:",
+     "max_tokens": 256,
+     "temperature": 0.7,
+ }
+
+ # Send request
+ response = requests.post(url, json=payload)
+ result = response.json()
+
+ print(result["choices"][0]["text"])
+ ```
 
+ Or using an OpenAI-compatible client:

  ```python
+ from openai import OpenAI
+
+ client = OpenAI(
+     base_url="http://localhost:30000/v1",
+     api_key="EMPTY",
  )

+ response = client.completions.create(
+     model="Qwen/Qwen3-Coder-Next-FP8",
      prompt="Write a Python function to compute fibonacci numbers:",
      max_tokens=256,
+     temperature=0.7,
  )
+
+ print(response.choices[0].text)
+ ```
+
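If you benchmark the server yourself, per-request tokens-per-second samples can be summarized with the same statistics reported in the table above (Mean, P50, P05, P95). A minimal sketch; the percentile method behind the published numbers is not stated, so nearest-rank is assumed:

```python
from statistics import mean, median

def tps_stats(samples):
    """Summarize per-request tokens/sec samples with the same columns
    as the benchmark table: mean, P50 (median), P05, P95."""
    s = sorted(samples)

    def pct(p):
        # Nearest-rank (floor) percentile; an assumption, not the published method.
        return s[int(p / 100 * (len(s) - 1))]

    return {"mean": mean(s), "p50": median(s), "p05": pct(5), "p95": pct(95)}
```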
+ ### Local Model Paths
+
+ If you have downloaded the models locally, replace the HuggingFace model paths with local paths:
+
+ ```bash
+ python -m sglang.launch_server \
+     --model-path /path/to/Qwen3-Coder-Next-FP8 \
+     --speculative-draft-model-path /path/to/Aurora-Spec-Qwen3-Coder-Next-FP8 \
+     --speculative-algorithm EAGLE \
+     --speculative-num-steps 5 \
+     --speculative-eagle-topk 1 \
+     --speculative-num-draft-tokens 6 \
+     --trust-remote-code \
+     --port 30000
  ```

  ## Limitations