Upload README.md with huggingface_hub
## Benchmarks

### End-to-End Throughput Performance

Measured on a holdout dataset from the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onlinesd/viewer/code) using the final Aurora checkpoint.
#### Configuration Explanation

The **Config** column denotes the speculative decoding hyperparameters in a compact notation:

- **Format**: `spec_steps | eagle_topk | num_draft_tokens`
- **516**: 5 speculative steps | top-1 selection | 6 draft tokens

This configuration controls the trade-off between draft quality (more speculative steps produce better drafts) and verification overhead (more draft tokens mean more computation per verification pass).
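For readers scripting around these results, the compact notation can be unpacked with a tiny helper. This is an illustrative sketch, not part of the released code; the field names come from the Format line above, and it assumes each hyperparameter is a single digit (true for the 314/415/516 configurations benchmarked here):

```python
def parse_spec_config(code: str) -> dict:
    """Decode a compact spec-decoding config such as "516" into its three
    hyperparameters, in the order spec_steps | eagle_topk | num_draft_tokens.

    Each field is one digit in this notation, so "516" means 5 speculative
    steps, top-1 selection, and 6 draft tokens.
    """
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"expected three digits, got {code!r}")
    steps, topk, draft = (int(c) for c in code)
    return {"spec_steps": steps, "eagle_topk": topk, "num_draft_tokens": draft}

# The three configurations benchmarked below:
for cfg in ("314", "415", "516"):
    print(cfg, parse_spec_config(cfg))
```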
| Batch | Config | Mean TPS | Median TPS | P05 TPS | P95 TPS | Speedup | Acc Len |
|-------|--------|----------|------------|---------|---------|---------|---------|
| **1** | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
| | 314 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
| | 415 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
| | **516** | **265.7** | **264.8** | **208.7** | **320.5** | **1.51×** | **3.06** |
| **8** | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
| | 314 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
| | 415 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
| | **516** | **146.3** | **143.5** | **109.6** | **189.5** | **1.23×** | **3.07** |
| **16** | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
| | 314 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
| | 415 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
| | **516** | **107.6** | **103.7** | **75.7** | **156.6** | **1.09×** | **3.06** |
| **32** | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
| | 314 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
| | 415 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
| | 516 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |
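The Speedup column is consistent with the ratio of mean TPS against the w/o-spec baseline at the same batch size (small rounding differences aside, since the table reports rounded means). A quick sanity check over values copied from the table, comparing the 516 configuration to the baseline:

```python
# Mean TPS per batch size: baseline (w/o spec) vs. the 516 configuration,
# values copied from the benchmark table above.
mean_tps = {
    1:  {"baseline": 176.4, "516": 265.7},
    8:  {"baseline": 119.8, "516": 146.3},
    16: {"baseline": 99.6,  "516": 107.6},
    32: {"baseline": 85.0,  "516": 80.3},
}

for batch, tps in mean_tps.items():
    speedup = tps["516"] / tps["baseline"]
    # Speedup declines monotonically with batch size, dropping below 1.0 at 32.
    print(f"batch {batch:>2}: {speedup:.2f}x")
```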
### Performance Across Different Batch Sizes

Aurora provides the **largest gains at small-to-moderate batch sizes**, with up to **1.51× speedup at batch size 1**, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:

- **Batch Size 1** (Best Case): up to **1.51× speedup** with the 516 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by the reduced number of target-model forward passes.

- **Batch Size 8** (Moderate): **1.23× speedup** with the 516 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.

- **Batch Size 16** (Diminishing Returns): **1.09× speedup** with the 516 configuration (3.06 average accept length). Benefits become marginal as verification overhead grows relative to baseline throughput.

- **Batch Size 32** (Negative Returns): at large batch sizes, **verification overhead dominates** and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The target model's batch-processing efficiency outweighs the benefit of the skipped forward passes.
**Metrics Explained**:

- **TPS**: tokens per second (throughput)
- **Acc Len**: average accept length (the number of draft tokens accepted per verification step)
- **Speedup**: throughput relative to the no-speculation baseline
- **P05/P95**: 5th and 95th percentile throughput values
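To make the P05/P95 metrics concrete, here is one way to compute all four throughput statistics from per-request TPS samples using only the standard library. The data below is synthetic, purely for illustration; the benchmark's actual measurement code is not shown here:

```python
import random
import statistics

# Hypothetical per-request throughput samples (tokens/s) standing in for
# real benchmark measurements.
random.seed(0)
samples = [random.gauss(150, 20) for _ in range(1000)]

mean_tps = statistics.fmean(samples)
median_tps = statistics.median(samples)
# statistics.quantiles with n=20 returns the 5th, 10th, ..., 95th percentile
# cut points, so the first and last entries are P05 and P95.
cuts = statistics.quantiles(samples, n=20)
p05, p95 = cuts[0], cuts[-1]

print(f"mean={mean_tps:.1f} median={median_tps:.1f} p05={p05:.1f} p95={p95:.1f}")
```

A wide P05-P95 spread (as in the speculative rows of the table) reflects the variance speculative decoding introduces: throughput jumps when long draft runs are accepted and dips when they are rejected.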
### Future Work

While Aurora demonstrates strong performance at small-to-moderate batch sizes, there is room for improvement in high-batch scenarios. Future work includes making speculative decoding faster for hybrid models, potentially by using **multistep kernels** that handle verification overhead more efficiently in batched settings.
**Notably**, this performance is achieved with a model trained **from scratch**: it learns entirely through Aurora's online training process, demonstrating the effectiveness of inference-time training without expensive pre-training.