jisenli committed on
Commit 4962383 · verified · 1 Parent(s): ae73168

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +49 -8
README.md CHANGED
@@ -90,17 +90,58 @@ Trained on the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onli

 ## Benchmarks

- ### Speculative Decoding Performance

- | Metric | Baseline (No Speculator) | This Model | Improvement |
- |--------|--------------------------|------------|-------------|
- | **Average Accept Length** | - | 3.1 | - |
- | **Throughput** | ~50 tokens/s | ~150 tokens/s | ~3x |
- | **Training Steps** | - | 10,000 (80k requests) | - |

- **Baseline**: Target model without speculative decoding (no draft model).

- The model achieves a consistent **3.1 average speculative accept length**, meaning on average 3.1 draft tokens are accepted per verification step, significantly reducing the number of target-model forward passes required.
+ ### End-to-End Throughput Performance
+
+ Measured on a holdout dataset from the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onlinesd/viewer/code) using the final Aurora checkpoint.
+
+ #### Configuration Explanation
+
+ The **Config** column denotes the speculative decoding hyperparameters in a compact notation:
+ - **Format**: `spec_steps | eagle_topk | num_draft_tokens`
+ - **516**: 5 speculative steps | top-1 selection | 6 draft tokens
+
+ This configuration controls the trade-off between draft quality (more steps = better quality) and verification overhead (more tokens = more computation).
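As an illustrative aside, the compact notation can be decoded mechanically. This sketch assumes each field is a single digit, as in every config shown here; `parse_spec_config` is a hypothetical helper, not part of this repository:

```python
def parse_spec_config(code: str) -> dict:
    """Decode the compact spec-decoding notation, e.g. "516" ->
    5 speculative steps, top-1 selection, 6 draft tokens.
    Assumes single-digit fields, as in the configs above."""
    steps, topk, draft = (int(c) for c in code)
    return {"spec_steps": steps, "eagle_topk": topk, "num_draft_tokens": draft}

print(parse_spec_config("516"))
# {'spec_steps': 5, 'eagle_topk': 1, 'num_draft_tokens': 6}
```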
+
+ | Batch | Config | Mean TPS | Median TPS | P05 TPS | P95 TPS | Speedup | Acc Len |
+ |-------|--------|----------|------------|---------|---------|---------|---------|
+ | **1** | w/o spec | 176.4 | 178.0 | 172.3 | 178.4 | -- | -- |
+ | | 314 | 252.1 | 254.8 | 208.8 | 291.6 | 1.43× | 2.67 |
+ | | 415 | 263.1 | 264.0 | 211.8 | 312.7 | 1.49× | 2.91 |
+ | | **516** | **265.7** | **264.8** | **208.7** | **320.5** | **1.51×** | **3.06** |
+ | **8** | w/o spec | 119.8 | 121.5 | 104.8 | 134.6 | -- | -- |
+ | | 314 | 141.0 | 138.9 | 110.4 | 178.5 | 1.18× | 2.67 |
+ | | 415 | 142.5 | 141.2 | 110.3 | 181.6 | 1.19× | 2.91 |
+ | | **516** | **146.3** | **143.5** | **109.6** | **189.5** | **1.23×** | **3.07** |
+ | **16** | w/o spec | 99.6 | 102.1 | 74.5 | 119.2 | -- | -- |
+ | | 314 | 104.0 | 100.5 | 75.6 | 151.9 | 1.04× | 2.67 |
+ | | 415 | 105.6 | 101.1 | 77.5 | 149.7 | 1.06× | 2.92 |
+ | | **516** | **107.6** | **103.7** | **75.7** | **156.6** | **1.09×** | **3.06** |
+ | **32** | w/o spec | 85.0 | 88.7 | 54.5 | 104.5 | -- | -- |
+ | | 314 | 78.9 | 72.8 | 53.0 | 122.3 | 0.93× | 2.68 |
+ | | 415 | 79.5 | 73.7 | 52.9 | 124.7 | 0.94× | 2.91 |
+ | | 516 | 80.3 | 72.6 | 52.8 | 130.7 | 0.94× | 3.06 |
+
+ ### Performance Across Different Batch Sizes
+
+ Aurora provides the **largest gains at small-to-moderate batch sizes**, with up to **1.51× speedup at batch size 1**, demonstrating the effectiveness of speculative decoding for latency-critical scenarios. The benefits diminish as batch size increases:
+
+ - **Batch Size 1** (Best Case): Up to **1.51× speedup** with the 516 configuration (3.06 average accept length). At low batch sizes, the cost of draft generation and verification is well amortized by reduced target model forward passes.
+
+ - **Batch Size 8** (Moderate): **1.23× speedup** with the 516 configuration (3.07 average accept length). Speculative decoding still provides meaningful throughput improvements for moderate batching.
+
+ - **Batch Size 16** (Diminishing Returns): **1.09× speedup** with the 516 configuration (3.06 average accept length). Benefits become marginal as verification overhead increases relative to baseline throughput.
+
+ - **Batch Size 32** (Negative Returns): At large batch sizes, **verification overhead dominates** and speculative decoding becomes slightly slower than the baseline (0.93-0.94×). The target model's batch processing efficiency outweighs the benefits of skipping forward passes.
+
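A toy cost model, an assumption for intuition rather than the benchmark's actual methodology, shows why the gains shrink: each verification cycle pays draft-plus-verify time for roughly `accept_len` tokens, and the verify pass over the draft positions grows costlier relative to a single baseline forward pass as batches fill the GPU. All timings below are hypothetical:

```python
def spec_speedup(accept_len: float, draft_ms: float, verify_ms: float,
                 baseline_ms: float) -> float:
    """Toy model: speculative decoding emits ~accept_len tokens per
    (draft + verify) cycle; the baseline emits 1 token per forward pass."""
    time_per_token = (draft_ms + verify_ms) / accept_len
    return baseline_ms / time_per_token

# Hypothetical per-cycle timings, not measured values. At large batch
# sizes, drafting and verification cost more relative to one baseline
# pass, so the modeled speedup can drop below 1.0.
print(round(spec_speedup(3.06, draft_ms=3.0, verify_ms=8.0, baseline_ms=5.5), 2))   # small batch
print(round(spec_speedup(3.06, draft_ms=6.0, verify_ms=20.0, baseline_ms=7.5), 2))  # large batch
```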
+ **Metrics Explained**:
+ - **TPS**: Tokens per second (throughput)
+ - **Acc Len**: Average accept length (number of draft tokens accepted per verification step)
+ - **Speedup**: Relative to the no-speculation baseline
+ - **P05/P95**: 5th and 95th percentile throughput values
+
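A minimal sketch of how such summary columns can be derived from raw per-request throughput samples, using nearest-rank percentiles (the harness's exact convention may differ, and the sample values here are made up):

```python
import statistics

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over sorted samples (one common convention)."""
    ordered = sorted(samples)
    idx = round(p / 100 * (len(ordered) - 1))
    return ordered[min(max(idx, 0), len(ordered) - 1)]

# Made-up per-request TPS samples, purely to show the arithmetic.
baseline_tps = [172.3, 176.0, 178.0, 178.2, 178.4]
spec_tps = [208.7, 250.0, 264.8, 300.0, 320.5]

speedup = statistics.mean(spec_tps) / statistics.mean(baseline_tps)
print(f"speedup={speedup:.2f}x  P05={percentile(spec_tps, 5)}  P95={percentile(spec_tps, 95)}")
```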
+ ### Future Work
+
+ While Aurora demonstrates strong performance at small-to-moderate batch sizes, there is room for improvement in high-batch scenarios. Future work includes making speculative decoding faster for hybrid models, potentially by using **multistep kernels** that can more efficiently handle the verification overhead in batched settings.

 **Notably**, this performance is achieved with a model trained **from scratch** - it learns entirely through Aurora's online training process, demonstrating the effectiveness of inference-time training without expensive pre-training.