Spaces:
Running on CPU Upgrade
Commit 7c8644c · Parent(s): f838d6f
added reference to diffusion model in next steps
app/src/content/chapters/conclusions.mdx
CHANGED
@@ -4,7 +4,7 @@ TODO: Table with answers to the questions (ablation sections)
 
 ### Next Steps
 
-The main bottleneck to scaling synthetic data experiments for pretraining is the compute cost of generating the data itself. For reference, producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down. **Diffusion language models** are promising: their parallel generation capabilities yield reported 2–10x inference speedups over autoregressive approaches
+The main bottleneck to scaling synthetic data experiments for pretraining is the compute cost of generating the data itself. For reference, producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down. **Diffusion language models** are promising: their parallel generation capabilities yield reported 2–10x inference speedups over autoregressive approaches. Recent models like [LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) show that diffusion LMs can match autoregressive models on standard benchmarks while generating tokens in parallel, and SGLang already supports serving them, but broader ecosystem support (e.g., vLLM) is still missing. DFlash [@dflash] could further speed up generation, though it is currently cumbersome to use and has limited model support. On the autoregressive side, speculative decoding support in vLLM remains limited (e.g., draft models are not well supported), leaving significant inference speedups on the table.
 
 While we answered several questions about best practices for synthetic data generation in this work, many remain open:
 
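As a back-of-the-envelope check of the cost figures quoted in the added paragraph (10B tokens per ablation, roughly 3,800 H100 GPU hours), the implied per-GPU throughput and the savings a reported 2–10x speedup would buy can be sketched as follows. The dollar rate and the exact speedup factors applied are illustrative assumptions, not figures from the commit itself.

```python
# Implied generation throughput for one ablation run, from the figures
# quoted in the paragraph above: 10B tokens in ~3,800 H100 GPU hours.
TOKENS = 10e9
GPU_HOURS = 3_800

tokens_per_gpu_hour = TOKENS / GPU_HOURS      # ~2.6M tokens per H100-hour
tokens_per_sec = tokens_per_gpu_hour / 3600   # ~730 tokens/s per H100

print(f"~{tokens_per_sec:,.0f} tokens/s per H100")

# What the reported 2-10x diffusion-LM speedups would mean for one ablation
# (assumes the speedup applies uniformly to end-to-end generation).
for speedup in (2, 10):
    print(f"{speedup}x speedup -> ~{GPU_HOURS / speedup:,.0f} GPU hours per ablation")
```

At the low end of the reported range, a 2x speedup already halves the dominant cost of each ablation, which is why serving support for diffusion LMs (or better speculative decoding in vLLM) matters for scaling these experiments.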