Commit 62145f9 · Parent(s): 4fad4ec

add mercury to conclusions
app/src/content/bibliography.bib
CHANGED
@@ -370,6 +370,14 @@
 url = {https://arxiv.org/abs/2602.06036}
 }
 
+@misc{mercury2,
+title = {Introducing Mercury 2},
+author = {Inception Labs},
+year = {2026},
+note = {Blog post},
+url = {https://www.inceptionlabs.ai/blog/introducing-mercury-2}
+}
+
 % Training
 @inproceedings{adamw,
 title = {Decoupled Weight Decay Regularization},
app/src/content/chapters/5-infrastructure.mdx
CHANGED
@@ -447,6 +447,12 @@ With a trillion-parameter model you won't be generating billions of tokens per h
 
 #### Visualizing Throughput
 
+{/*
+Further improvement ideas:
+- add a second model below so we can compare. Suggest something cool for the numbers below.
+- Also add some animations (page turning, flapping books, bookshelves' books coming in and out)
+*/}
+
 To get an intuition for what these throughput numbers feel like, <FigRef target="inference-throughput" /> lets you pick a model and scale up the number of GPUs. Each page represents roughly 500 tokens of generated text. At high enough throughput, pages roll up into books (250 pages each), and books into bookshelves (250 books each).
 
 <Wide>
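The page/book/bookshelf mapping described in the infrastructure diff above is simple integer arithmetic. A minimal sketch, using the constants stated in the prose (500 tokens per page, 250 pages per book, 250 books per bookshelf); the function and dictionary keys here are our own naming, not the app's actual code:

```python
# Assumed constants, taken from the prose of 5-infrastructure.mdx.
TOKENS_PER_PAGE = 500
PAGES_PER_BOOK = 250
BOOKS_PER_SHELF = 250

def throughput_to_shelves(tokens_per_second: float, hours: float = 1.0) -> dict:
    """Convert a sustained token throughput into pages, books, and
    bookshelves of generated text, rolling up units as in the figure."""
    total_tokens = tokens_per_second * hours * 3600
    pages = int(total_tokens // TOKENS_PER_PAGE)
    books, pages = divmod(pages, PAGES_PER_BOOK)      # leftover pages stay loose
    shelves, books = divmod(books, BOOKS_PER_SHELF)   # leftover books stay loose
    return {"bookshelves": shelves, "books": books, "pages": pages}

# e.g. 10,000 tokens/s sustained for one hour:
# 36M tokens -> 72,000 pages -> 288 books -> 1 shelf + 38 books
print(throughput_to_shelves(10_000))
```

This is why the visualization only becomes interesting at high throughput: a single shelf already represents 250 × 250 × 500 ≈ 31M tokens.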
app/src/content/chapters/6-conclusions.mdx
CHANGED
@@ -4,7 +4,7 @@ We ran 65 experiments, generated over 750 billion tokens, and spent more than 74
 
 ### Next Steps
 
-The main bottleneck to scaling synthetic data experiments for pretraining is the compute cost of generating the data itself. For reference, producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down. **Diffusion language models** are promising: their parallel generation capabilities yield reported 2–10x inference speedups over autoregressive approaches. Recent models like [LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) show that diffusion LMs can match autoregressive models on standard benchmarks while generating tokens in parallel, and SGLang already supports serving them, but broader ecosystem support (e.g., vLLM) is still missing. DFlash [@dflash] could further speed up generation, though it is currently cumbersome to use and has limited model support. On the autoregressive side, speculative decoding support in vLLM remains limited (e.g., draft models are not well supported), leaving significant inference speedups on the table.
+The main bottleneck to scaling synthetic data experiments for pretraining is the compute cost of generating the data itself. For reference, producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down. **Diffusion language models** are promising: their parallel generation capabilities yield reported 2–10x inference speedups over autoregressive approaches. Recent models like [LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) show that diffusion LMs can match autoregressive models on standard benchmarks while generating tokens in parallel, and SGLang already supports serving them, but broader ecosystem support (e.g., vLLM) is still missing. DFlash [@dflash] could further speed up generation, though it is currently cumbersome to use and has limited model support. [Mercury 2](https://www.inceptionlabs.ai/blog/introducing-mercury-2) [@mercury2] pushes this further, reaching over 1,000 tokens per second on NVIDIA Blackwell GPUs through parallel refinement rather than sequential decoding, with 5x+ speedups over autoregressive baselines. On the autoregressive side, speculative decoding support in vLLM remains limited (e.g., draft models are not well supported), leaving significant inference speedups on the table.
 
 While we answered several questions about best practices for synthetic data generation in this work, many remain open:
 
@@ -16,4 +16,3 @@ While we answered several questions about best practices for synthetic data gene
 - **Scaling to larger models**: REWIRE [@rewire] reports larger gains for bigger models trained on their data. Can we reproduce this?
 - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
 - **Scaling to more data**: Our ablations trained for 21B tokens. It remains unclear how these findings transfer to larger scales, both in model parameters and data.
-
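The cost figure in the conclusions paragraph (10B tokens in roughly 3,800 H100 GPU hours) implies a per-GPU throughput of about 730 tokens/s. A back-of-the-envelope sketch of how the quoted speedups would translate into GPU hours; the function name and the 5x factor used below are illustrative assumptions, not measurements from this work:

```python
def gpu_hours(target_tokens: float, tokens_per_gpu_second: float,
              speedup: float = 1.0) -> float:
    """GPU-hours needed to generate target_tokens at a given per-GPU
    throughput, optionally scaled by an inference speedup factor
    (e.g. from a diffusion LM's parallel decoding)."""
    return target_tokens / (tokens_per_gpu_second * speedup) / 3600

# 10B tokens at ~730 tok/s per H100 reproduces the quoted ~3,800 GPU-hours.
baseline = gpu_hours(10e9, 730)
# A hypothetical 5x speedup (within the 2-10x range reported for
# diffusion LMs) would cut that to under 800 GPU-hours per ablation.
faster = gpu_hours(10e9, 730, speedup=5.0)
print(round(baseline), round(faster))  # -> 3805 761
```

Even a modest constant-factor speedup compounds across the 65 ablations mentioned in the hunk context, which is why the conclusions single out generation cost as the bottleneck.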