Spaces:
Running on CPU Upgrade
Commit 7c8644c · Parent(s): f838d6f
added reference to diffusion model in next steps
app/src/content/chapters/conclusions.mdx
CHANGED
@@ -4,7 +4,7 @@ TODO: Table with answers to the questions (ablation sections)
 
 ### Next Steps
 
-The main bottleneck to scaling synthetic data experiments for pretraining is the compute cost of generating the data itself. For reference, producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down. **Diffusion language models** are promising: their parallel generation capabilities yield reported 2–10x inference speedups over autoregressive approaches
+The main bottleneck to scaling synthetic data experiments for pretraining is the compute cost of generating the data itself. For reference, producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down. **Diffusion language models** are promising: their parallel generation capabilities yield reported 2–10x inference speedups over autoregressive approaches. Recent models like [LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) show that diffusion LMs can match autoregressive models on standard benchmarks while generating tokens in parallel, and SGLang already supports serving them, but broader ecosystem support (e.g., vLLM) is still missing. DFlash [@dflash] could further speed up generation, though it is currently cumbersome to use and has limited model support. On the autoregressive side, speculative decoding support in vLLM remains limited (e.g., draft models are not well supported), leaving significant inference speedups on the table.
 
 While we answered several questions about best practices for synthetic data generation in this work, many remain open:
 
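As a back-of-the-envelope check of the cost figures quoted in the added paragraph (10B tokens per ablation, roughly 3,800 H100 GPU hours), the implied per-GPU throughput and the savings a reported 2–10x speedup would buy can be sketched as follows. The dollar rate and the exact speedup factors applied are illustrative assumptions, not figures from the commit itself.

```python
# Implied generation throughput for one ablation run, from the figures
# quoted in the paragraph above: 10B tokens in ~3,800 H100 GPU hours.
TOKENS = 10e9
GPU_HOURS = 3_800

tokens_per_gpu_hour = TOKENS / GPU_HOURS      # ~2.6M tokens per H100-hour
tokens_per_sec = tokens_per_gpu_hour / 3600   # ~730 tokens/s per H100

print(f"~{tokens_per_sec:,.0f} tokens/s per H100")

# What the reported 2-10x diffusion-LM speedups would mean for one ablation
# (assumes the speedup applies uniformly to end-to-end generation).
for speedup in (2, 10):
    print(f"{speedup}x speedup -> ~{GPU_HOURS / speedup:,.0f} GPU hours per ablation")
```

At the low end of the reported range, a 2x speedup already halves the dominant cost of each ablation, which is why serving support for diffusion LMs (or better speculative decoding in vLLM) matters for scaling these experiments.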