Upload README.md with huggingface_hub
README.md CHANGED

```diff
@@ -73,6 +73,7 @@ This model was trained **from scratch** using Aurora, an inference-time training
 
 ### Training Configuration
 
+- **Hardware**: NVIDIA H200 GPU
 - **Training Steps**: 10,000 steps over 80,000 inference requests
 - **Learning Rate**: 1e-4
 - **TTT Length**: 5 tokens
```
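The hyperparameters added above can be collected into a small config object; a minimal sketch, assuming nothing about Aurora's actual API (all names here are illustrative, not part of the released code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DraftTrainingConfig:
    """Mirrors the Training Configuration bullets; field names are illustrative."""
    training_steps: int = 10_000   # over 80,000 inference requests
    learning_rate: float = 1e-4
    ttt_length: int = 5            # tokens per inference-time training update
    hardware: str = "NVIDIA H200"  # single GPU, per the bullet above

cfg = DraftTrainingConfig()
# 80,000 requests over 10,000 steps works out to 8 requests per step.
requests_per_step = 80_000 / cfg.training_steps
```

This is only a convenience for readers wiring the numbers into their own scripts.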
```diff
@@ -91,12 +92,14 @@ Trained on the [OnlineSD Code Dataset](https://huggingface.co/datasets/zelc/onli
 
 ### Speculative Decoding Performance
 
-| Metric | Baseline | This Model | Improvement |
-|--------|----------|------------|-------------|
+| Metric | Baseline (No Speculator) | This Model | Improvement |
+|--------|--------------------------|------------|-------------|
 | **Average Accept Length** | - | 3.1 | - |
 | **Throughput** | ~50 tokens/s | ~150 tokens/s | ~3x |
 | **Training Steps** | - | 10,000 (80k requests) | - |
 
+**Baseline**: Target model without speculative decoding (no draft model).
+
 The model achieves a consistent **3.1 average speculative accept length**, meaning on average 3.1 draft tokens are accepted per verification step, significantly reducing the number of target model forward passes required.
 
 **Notably**, this performance is achieved with a model trained **from scratch** - it learns entirely through Aurora's online training process, demonstrating the effectiveness of inference-time training without expensive pre-training.
```
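As a back-of-envelope check on the table above: without a speculator the target model emits one token per forward pass, while an average accept length of 3.1 means ~3.1 tokens per verification step, so the idealized speedup is roughly the accept length itself (this ignores draft-model and verification overhead; it is an estimate, not a benchmark):

```python
def ideal_speedup(avg_accept_length: float, baseline_tokens_per_pass: float = 1.0) -> float:
    # Idealized model: throughput scales with tokens produced per target
    # forward pass; draft-model cost and verification overhead are ignored.
    return avg_accept_length / baseline_tokens_per_pass

baseline_tps = 50.0                          # ~50 tokens/s without speculation (table above)
est_tps = baseline_tps * ideal_speedup(3.1)  # ~155 tokens/s
```

The estimate of ~155 tokens/s is consistent with the ~150 tokens/s (~3x) reported, with the small gap attributable to the overheads the idealized model ignores.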
````diff
@@ -107,14 +110,16 @@ This model is designed to be used as a draft model in EAGLE3 speculative decoding
 
 ### Example with SGLang
 
+**Note: This is a placeholder example. TODO - Verify and update with tested code.**
+
 ```python
 from sglang import Engine
 
 # Initialize engine with speculative decoding
 engine = Engine(
     model_path="Qwen/Qwen3-Coder-Next-FP8",
-
-    speculative_algorithm="
+    speculative_draft_model_path="togethercomputer/Qwen3-Coder-Next-FP8-EAGLE3",
+    speculative_algorithm="EAGLE",
     speculative_num_steps=5,
 )
 
````
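Since the SGLang snippet above is flagged as a placeholder, here is a library-free toy of what one speculative verification step does with a batch of draft tokens (greedy acceptance only; real EAGLE3 uses tree drafting and probabilistic acceptance, so this is a conceptual sketch, not SGLang's implementation):

```python
def verify_step(draft_tokens, target_tokens):
    """Toy greedy verification: accept the longest prefix of the draft that
    matches the target model's own greedy choices; on the first mismatch,
    emit the target's correction instead and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction replaces the bad draft
            return accepted
    return accepted  # every draft token was accepted

# Draft proposes 5 tokens (cf. speculative_num_steps=5); target agrees on 3,
# then corrects the 4th, so 4 tokens come out of a single verification pass.
out = verify_step([1, 2, 3, 9, 9], [1, 2, 3, 4, 5])
```

Averaged over many steps, the length of `accepted` per pass is exactly the "Average Accept Length" metric reported above.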
````diff
@@ -134,17 +139,14 @@ output = engine.generate(
 
 ## Citation
 
-**Note: This is a placeholder citation. The paper is currently under review.**
-
 If you use this model, please cite:
 
 ```bibtex
 @article{aurora2026,
   title={When RL Meets Adaptive Speculative Training: A Unified Training-Serving System},
   author={Wang, Junxiong and Bie, Fengxiang and Li, Jisen and Zhou, Zhongzhu and Shao, Zelei and Wang, Yubo and Liu, Yinghui and Wu, Qingyang and May, Avner and Zhang, Yineng and Song, Shuaiwen Leon and Zhang, Ce and Athiwaratkun, Ben and Xu, Chenfeng and Wu, Xiaoxia},
-  journal={
+  journal={Preprint, forthcoming},
   year={2026},
-  note={Under review},
   url={https://aurora-spec.github.io}
 }
 ```
````