Upload README.md with huggingface_hub
## Benchmarks

### End-to-End Throughput Performance

Measured on a holdout dataset from the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onlinesd/viewer/code) using the final Aurora checkpoint.
#### Configuration Explanation

The **Config** column denotes the speculative decoding hyperparameters in a compact notation:

- **Format**: `spec_steps | eagle_topk | num_draft_tokens`
- **516**: 5 speculative steps | top-1 selection | 6 draft tokens

This configuration controls the trade-off between draft quality (more speculative steps produce better drafts) and verification overhead (more draft tokens mean more computation per verification pass).
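For readers scripting around these results, the compact notation can be unpacked with a tiny helper. This is an illustrative sketch, not part of the released code; the field names come from the Format line above, and it assumes each hyperparameter is a single digit (true for the 314/415/516 configurations benchmarked here):

```python
def parse_spec_config(code: str) -> dict:
    """Decode a compact spec-decoding config such as "516" into its three
    hyperparameters, in the order spec_steps | eagle_topk | num_draft_tokens.

    Each field is one digit in this notation, so "516" means 5 speculative
    steps, top-1 selection, and 6 draft tokens.
    """
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"expected three digits, got {code!r}")
    steps, topk, draft = (int(c) for c in code)
    return {"spec_steps": steps, "eagle_topk": topk, "num_draft_tokens": draft}

# The three configurations benchmarked below:
for cfg in ("314", "415", "516"):
    print(cfg, parse_spec_config(cfg))
```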
| Batch | Config | Mean TPS | Median TPS | P05 TPS | P95 TPS | Speedup | Acc Len |
|-------|--------|----------|------------|---------|---------|---------|---------|
| **1** | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
| | 314 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
| | 415 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
| | **516** | **265.7** | **264.8** | **208.7** | **320.5** | **1.51×** | **3.06** |
| **8** | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
| | 314 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
| | 415 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
| | **516** | **146.3** | **143.5** | **109.6** | **189.5** | **1.23×** | **3.07** |
| **16** | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
| | 314 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
| | 415 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
| | **516** | **107.6** | **103.7** | **75.7** | **156.6** | **1.09×** | **3.06** |
| **32** | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
| | 314 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
| | 415 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
| | 516 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |
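The Speedup column is consistent with the ratio of mean TPS against the w/o-spec baseline at the same batch size (small rounding differences aside, since the table reports rounded means). A quick sanity check over values copied from the table, comparing the 516 configuration to the baseline:

```python
# Mean TPS per batch size: baseline (w/o spec) vs. the 516 configuration,
# values copied from the benchmark table above.
mean_tps = {
    1:  {"baseline": 176.4, "516": 265.7},
    8:  {"baseline": 119.8, "516": 146.3},
    16: {"baseline": 99.6,  "516": 107.6},
    32: {"baseline": 85.0,  "516": 80.3},
}

for batch, tps in mean_tps.items():
    speedup = tps["516"] / tps["baseline"]
    # Speedup declines monotonically with batch size, dropping below 1.0 at 32.
    print(f"batch {batch:>2}: {speedup:.2f}x")
```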
### Performance Across Different Batch Sizes

Aurora provides the **largest gains at small-to-moderate batch sizes**, with up to **1.51× speedup at batch size 1**, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:

- **Batch Size 1** (Best Case): up to **1.51× speedup** with the 516 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by the reduced number of target-model forward passes.

- **Batch Size 8** (Moderate): **1.23× speedup** with the 516 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.

- **Batch Size 16** (Diminishing Returns): **1.09× speedup** with the 516 configuration (3.06 average accept length). Benefits become marginal as verification overhead grows relative to baseline throughput.

- **Batch Size 32** (Negative Returns): at large batch sizes, **verification overhead dominates** and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The target model's batch-processing efficiency outweighs the benefit of the skipped forward passes.
**Metrics Explained**:

- **TPS**: tokens per second (throughput)
- **Acc Len**: average accept length (the number of draft tokens accepted per verification step)
- **Speedup**: throughput relative to the no-speculation baseline
- **P05/P95**: 5th and 95th percentile throughput values
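To make the P05/P95 metrics concrete, here is one way to compute all four throughput statistics from per-request TPS samples using only the standard library. The data below is synthetic, purely for illustration; the benchmark's actual measurement code is not shown here:

```python
import random
import statistics

# Hypothetical per-request throughput samples (tokens/s) standing in for
# real benchmark measurements.
random.seed(0)
samples = [random.gauss(150, 20) for _ in range(1000)]

mean_tps = statistics.fmean(samples)
median_tps = statistics.median(samples)
# statistics.quantiles with n=20 returns the 5th, 10th, ..., 95th percentile
# cut points, so the first and last entries are P05 and P95.
cuts = statistics.quantiles(samples, n=20)
p05, p95 = cuts[0], cuts[-1]

print(f"mean={mean_tps:.1f} median={median_tps:.1f} p05={p05:.1f} p95={p95:.1f}")
```

A wide P05-P95 spread (as in the speculative rows of the table) reflects the variance speculative decoding introduces: throughput jumps when long draft runs are accepted and dips when they are rejected.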
### Future Work

While Aurora demonstrates strong performance at small-to-moderate batch sizes, there is room for improvement in high-batch scenarios. Future work includes making speculative decoding faster for hybrid models, potentially by using **multistep kernels** that handle verification overhead more efficiently in batched settings.
**Notably**, this performance is achieved with a model trained **from scratch**: it learns entirely through Aurora's online training process, demonstrating the effectiveness of inference-time training without expensive pre-training.