jisenli committed on

Commit fb6ef3d · verified · 1 Parent(s): 4962383

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +136 -36
README.md CHANGED
@@ -12,7 +12,16 @@ base_model: Qwen/Qwen3-Coder-Next-FP8
  pipeline_tag: text-generation
  ---

- # Qwen3-Coder-Next-FP8-EAGLE3

  ## Model Description

@@ -94,42 +103,38 @@ Trained on the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onli

  Measured on a holdout dataset from the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onlinesd/viewer/code) using the final Aurora checkpoint.

- #### Configuration Explanation

- The **Config** column denotes the speculative decoding hyperparameters in a compact notation:
- - **Format**: `spec_steps | eagle_topk | num_draft_tokens`
- - **516**: 5 speculative steps | top-1 selection | 6 draft tokens

- This configuration controls the trade-off between draft quality (more steps = better quality) and verification overhead (more tokens = more computation).
-
- | Batch | Config | Mean TPS | Median TPS | P05 TPS | P95 TPS | Speedup | Acc Len |
- |-------|--------|----------|------------|---------|---------|---------|---------|
  | **1** | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
- | | 314 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
- | | 415 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
- | | **516** | **265.7** | **264.8** | **208.7** | **320.5** | **1.51×** | **3.06** |
  | **8** | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
- | | 314 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
- | | 415 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
- | | **516** | **146.3** | **143.5** | **109.6** | **189.5** | **1.23×** | **3.07** |
  | **16** | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
- | | 314 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
- | | 415 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
- | | **516** | **107.6** | **103.7** | **75.7** | **156.6** | **1.09×** | **3.06** |
  | **32** | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
- | | 314 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
- | | 415 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
- | | 516 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |

  ### Performance Across Different Batch Sizes

  Aurora provides the **largest gains at small-to-moderate batch sizes**, with up to **1.51× speedup at batch size 1**, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:

- - **Batch Size 1** (Best Case): Up to **1.51× speedup** with 516 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by reduced target model forward passes.

- - **Batch Size 8** (Moderate): **1.23× speedup** with 516 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.

- - **Batch Size 16** (Diminishing Returns): **1.09× speedup** with 516 configuration (3.06 average accept length). Benefits become marginal as verification overhead increases relative to baseline throughput.

  - **Batch Size 32** (Negative Returns): At large batch sizes, **verification overhead dominates** and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The target model's batch processing efficiency outweighs the benefits of skipping forward passes.

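The steps-vs-overhead trade-off noted above can be made concrete with a first-order cost model. The sketch below is illustrative only: `draft_cost`, the per-pass cost of the draft model relative to the target, is an assumed parameter, not a measured quantity, and real serving overheads are ignored.

```python
def estimated_speedup(acc_len: float, spec_steps: int, draft_cost: float) -> float:
    """Idealized speedup over plain decoding (first-order model).

    acc_len:    average accepted tokens per verification cycle
    spec_steps: number of draft forward passes per cycle
    draft_cost: cost of one draft pass relative to one target pass (assumed)

    Each cycle costs one target forward pass plus spec_steps draft passes,
    and yields acc_len tokens; plain decoding yields one token per target pass.
    """
    return acc_len / (1.0 + spec_steps * draft_cost)

# With a free draft model the speedup equals the accept length; real
# draft and verification overheads pull it down toward the measured values.
```

This simple model explains why a higher accept length helps but cannot translate into speedup one-for-one.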
@@ -149,26 +154,121 @@ While Aurora demonstrates strong performance at small-to-moderate batch sizes, t

  This model is designed to be used as a draft model in EAGLE3 speculative decoding pipelines with Qwen3-Coder as the target model.

- ### Example with SGLang

- **Note: This is a placeholder example. TODO - Verify and update with tested code.**

  ```python
- from sglang import Engine
-
- # Initialize engine with speculative decoding
- engine = Engine(
-     model_path="Qwen/Qwen3-Coder-Next-FP8",
-     speculative_draft_model_path="togethercomputer/Qwen3-Coder-Next-FP8-EAGLE3",
-     speculative_algorithm="EAGLE",
-     speculative_num_steps=5,
  )

- # Generate with speculative decoding
- output = engine.generate(
      prompt="Write a Python function to compute fibonacci numbers:",
      max_tokens=256,
  )
  ```

  ## Limitations

  pipeline_tag: text-generation
  ---

+ # Aurora-Spec-Qwen3-Coder-Next-FP8
+
+ <div align="center">
+
+ [![Website](https://img.shields.io/badge/🌐-Website-blue)](https://aurora-spec-ai.github.io)
+ [![Code](https://img.shields.io/badge/💻-Code-green)](#)
+ [![Dataset](https://img.shields.io/badge/📊-Dataset-orange)](https://huggingface.co/datasets/zelc/onlinesd)
+ [![Paper](https://img.shields.io/badge/📄-Paper-red)](#)
+
+ </div>

  ## Model Description

  Measured on a holdout dataset from the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onlinesd/viewer/code) using the final Aurora checkpoint.

+ **Qwen3-Coder-Next: end-to-end throughput under varying batch size and lookahead**
+
+ We report tokens-per-second (TPS) statistics and speedup relative to the no-speculation baseline.

+ | BS | Config | Mean TPS | P50 TPS | P05 TPS | P95 TPS | Speedup (Mean) | Acc Len |
+ |:---:|:---------|:--------:|:-------:|:-------:|:-------:|:--------------:|:-------:|
  | **1** | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
+ | | lookahead 3 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
+ | | lookahead 4 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
+ | | **lookahead 5** | **265.7** | **264.8** | **208.7** | **320.5** | **1.51×** | **3.06** |
  | **8** | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
+ | | lookahead 3 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
+ | | lookahead 4 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
+ | | **lookahead 5** | **146.3** | **143.5** | **109.6** | **189.5** | **1.23×** | **3.07** |
  | **16** | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
+ | | lookahead 3 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
+ | | lookahead 4 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
+ | | **lookahead 5** | **107.6** | **103.7** | **75.7** | **156.6** | **1.09×** | **3.06** |
  | **32** | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
+ | | lookahead 3 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
+ | | lookahead 4 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
+ | | lookahead 5 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |

  ### Performance Across Different Batch Sizes

  Aurora provides the **largest gains at small-to-moderate batch sizes**, with up to **1.51× speedup at batch size 1**, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:

+ - **Batch Size 1** (Best Case): Up to **1.51× speedup** with the lookahead 5 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by reduced target model forward passes.

+ - **Batch Size 8** (Moderate): **1.23× speedup** with the lookahead 5 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.

+ - **Batch Size 16** (Diminishing Returns): **1.09× speedup** with the lookahead 5 configuration (3.06 average accept length). Benefits become marginal as verification overhead increases relative to baseline throughput.

  - **Batch Size 32** (Negative Returns): At large batch sizes, **verification overhead dominates** and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The target model's batch processing efficiency outweighs the benefits of skipping forward passes.

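To build intuition for the **Acc Len** column: per verification cycle, the target model keeps the longest drafted prefix it agrees with, plus the one token it supplies itself. The toy greedy-matching sketch below illustrates that bookkeeping only; the actual EAGLE3 tree verification is more involved.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count the matched prefix between drafted tokens and the target's
    own predictions, plus the one token the target always contributes,
    so every verification cycle yields at least 1 token."""
    matched = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        matched += 1
    return matched + 1

# Example: 2 of 3 drafted tokens match -> 3 tokens from one target pass,
# which is the regime the ~3.0 accept lengths in the table correspond to.
```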
  This model is designed to be used as a draft model in EAGLE3 speculative decoding pipelines with Qwen3-Coder as the target model.

+ ### Example 1: Python API (Offline Batch Inference)
+
+ ```python
+ import sglang as sgl
+
+ def main():
+     # Sample prompts
+     prompts = [
+         "Write a Python function to compute fibonacci numbers:",
+         "Implement a binary search algorithm in Python:",
+         "Create a class for a binary tree in Python:",
+     ]
+
+     # Create sampling params
+     sampling_params = {"temperature": 0.7, "max_new_tokens": 256}
+
+     # Initialize engine with speculative decoding
+     llm = sgl.Engine(
+         model_path="Qwen/Qwen3-Coder-Next-FP8",
+         speculative_draft_model_path="togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8",
+         speculative_algorithm="EAGLE",
+         speculative_num_steps=5,
+         speculative_eagle_topk=1,
+         speculative_num_draft_tokens=6,
+         trust_remote_code=True,
+     )
+
+     # Generate with speculative decoding
+     outputs = llm.generate(prompts, sampling_params)
+
+     # Print the outputs
+     for prompt, output in zip(prompts, outputs):
+         print("=" * 50)
+         print(f"Prompt: {prompt}")
+         print(f"Generated: {output['text']}")
+
+ # The __main__ condition is necessary when using spawn to create subprocesses
+ if __name__ == "__main__":
+     main()
+ ```
+
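The three `speculative_*` values in the example correspond to the lookahead 5 row of the benchmark table (5 steps, top-1 selection, 6 draft tokens). A hypothetical convenience helper (not part of SGLang) that expands a lookahead value into those kwargs, following the `num_draft_tokens = lookahead + 1` pattern of the benchmarked 314/415/516 configurations:

```python
def spec_kwargs(lookahead: int, eagle_topk: int = 1) -> dict:
    """Expand a 'lookahead' value into the speculative-decoding kwargs
    used in the examples. Assumes draft tokens = lookahead + 1, matching
    the 314/415/516 configurations reported in the benchmark table."""
    return {
        "speculative_algorithm": "EAGLE",
        "speculative_num_steps": lookahead,
        "speculative_eagle_topk": eagle_topk,
        "speculative_num_draft_tokens": lookahead + 1,
    }

# e.g. sgl.Engine(model_path=..., speculative_draft_model_path=..., **spec_kwargs(5))
```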
+ ### Example 2: Launch Server (Production Use)
+
+ **Step 1: Start the SGLang server with speculative decoding**
+
+ ```bash
+ python -m sglang.launch_server \
+     --model-path Qwen/Qwen3-Coder-Next-FP8 \
+     --speculative-draft-model-path togethercomputer/Aurora-Spec-Qwen3-Coder-Next-FP8 \
+     --speculative-algorithm EAGLE \
+     --speculative-num-steps 5 \
+     --speculative-eagle-topk 1 \
+     --speculative-num-draft-tokens 6 \
+     --trust-remote-code \
+     --port 30000 \
+     --host 0.0.0.0
+ ```
+
+ **Step 2: Send requests to the server**
+
+ ```python
+ import requests
+
+ # Server endpoint
+ url = "http://localhost:30000/v1/completions"
+
+ # Request payload
+ payload = {
+     "prompt": "Write a Python function to compute fibonacci numbers:",
+     "max_tokens": 256,
+     "temperature": 0.7,
+ }
+
+ # Send request
+ response = requests.post(url, json=payload)
+ result = response.json()
+
+ print(result["choices"][0]["text"])
+ ```
 
+ Or using an OpenAI-compatible client:

  ```python
+ from openai import OpenAI
+
+ client = OpenAI(
+     base_url="http://localhost:30000/v1",
+     api_key="EMPTY",
  )

+ response = client.completions.create(
+     model="Qwen/Qwen3-Coder-Next-FP8",
      prompt="Write a Python function to compute fibonacci numbers:",
      max_tokens=256,
+     temperature=0.7,
  )
+
+ print(response.choices[0].text)
+ ```
+
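If you benchmark the server yourself, per-request tokens-per-second samples can be summarized with the same statistics reported in the table above (Mean, P50, P05, P95). A minimal sketch; the percentile method behind the published numbers is not stated, so nearest-rank is assumed:

```python
from statistics import mean, median

def tps_stats(samples):
    """Summarize per-request tokens/sec samples with the same columns
    as the benchmark table: mean, P50 (median), P05, P95."""
    s = sorted(samples)

    def pct(p):
        # Nearest-rank (floor) percentile; an assumption, not the published method.
        return s[int(p / 100 * (len(s) - 1))]

    return {"mean": mean(s), "p50": median(s), "p05": pct(5), "p95": pct(95)}
```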
+ ### Local Model Paths
+
+ If you have downloaded the models locally, replace the HuggingFace model paths with local paths:
+
+ ```bash
+ python -m sglang.launch_server \
+     --model-path /path/to/Qwen3-Coder-Next-FP8 \
+     --speculative-draft-model-path /path/to/Aurora-Spec-Qwen3-Coder-Next-FP8 \
+     --speculative-algorithm EAGLE \
+     --speculative-num-steps 5 \
+     --speculative-eagle-topk 1 \
+     --speculative-num-draft-tokens 6 \
+     --trust-remote-code \
+     --port 30000
  ```

  ## Limitations