joelniklaus HF Staff committed on
Commit
77b87fe
1 Parent(s): e46c12a

made the writing style more engaging and improved transitions based on Lewis' feedback

app/src/content/chapters/1-introduction.mdx CHANGED
@@ -33,20 +33,16 @@ During SmolLM2 [@smollm2] training, the model was decent at coding and math but
33
 
34
 However, doing synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
35
 
36
- In this blog post we take a journey to answer all these questions systematically. We ran 90 experiments, generated over 1.1 trillion tokens and spent {'>'}74,000 GPU hours (~8.5 GPU years) for rephrasing alone to find the ideal settings for synthetic data.
37
 
38
  Here's the plan:
39
  <Sidenote>
40
  The sections are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
41
  </Sidenote>
42
 
43
- We start with the [Infrastructure](#infrastructure) needed for synthetic data generation at scale. This includes some extensions we made to the datatrove library and crucially detailed throughput benchmarking of popular models you might want to use for synthetic data generation. This is super important to get the most data for your bucks.
44
 
45
- We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.
46
-
47
- Finally we present the suite of 90 [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.
48
-
49
- Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
50
 
51
  <HtmlEmbed
52
  id="finephrase-vs-baselines"
 
33
 
34
 However, doing synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
35
 
36
+ In this blog post, we take a journey to answer all these questions systematically. We ran 90 experiments, generated over 1 trillion tokens, and spent {'>'}111,000 GPU hours (~12.7 GPU years) on rephrasing alone to find the ideal settings for synthetic data.
37
 
38
  Here's the plan:
39
  <Sidenote>
40
  The sections are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
41
  </Sidenote>
42
 
43
+ We start by [setting up the problem](#rephrasing-the-web): what rephrasing is, which approaches exist, and what we want to test. Then we dive into the 90 [Experiments](#experiments) we ran to figure out which prompts, models, and datasets actually work. The [Analyses](#analyses) section zooms out to ask *why* things work the way they do. Next comes the [Infrastructure](#infrastructure) that made all of this possible, including detailed throughput benchmarking of popular models (super important for getting the most data for your buck). Finally, we [put it all together](#applying-the-recipe-at-scale) into FinePhrase, our best configuration.
44
 
45
+ Here's a preview of where we end up: FinePhrase clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains the journey to get there.
 
 
 
 
46
 
47
  <HtmlEmbed
48
  id="finephrase-vs-baselines"
app/src/content/chapters/2-setup.mdx CHANGED
@@ -6,11 +6,11 @@ import ReadingTime from "../../components/ReadingTime.astro";
6
 
7
  <ReadingTime words={1243} visuals={0} />
8
 
9
- Recent work like WRAP [@wrap], Nemotron-CC [@nemotroncc], REWIRE [@rewire], and BeyondWeb [@beyondweb] has shown that rephrasing web content into higher-quality formats can outperform training on raw data alone. But the field still lacks a clear framework for what "rephrasing" actually means and a systematic investigation of which factors make it work. That's what we discuss in this section.
10
 
11
  ### What is Rephrasing?
12
 
13
- At its core, **rephrasing** means running existing documents through a language model to produce variants that keep the meaning but change the presentation. That sounds simple, but the design space is huge. A document could be reformatted as a tutorial with worked examples, restructured as FAQ pairs, expanded with explanatory commentary, condensed into knowledge lists, or rewritten in Wikipedia style. Each transformation targets different capabilities: tutorials may help step-by-step reasoning, FAQs might boost question-answering, and math reformulations could strengthen quantitative skills. Which transformations actually work, and when? That's what we set out to answer.
14
 
15
  ### Three Axes of Synthetic Data
16
 
@@ -22,24 +22,24 @@ We think about synthetic data generation along three axes:
22
 
23
  Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best rephrasing strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? When does aggressive transformation help versus hurt?
24
 
25
- ### Research Questions
26
 
27
- FinePhrase tackles these questions through systematic experimentation across all three axes:
28
 
29
  1. **Which rephrasing strategies work best?** We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations) to find which transformations consistently improve downstream performance.
30
  2. **How do generator model properties affect quality?** We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
31
  3. **When does source data quality matter?** We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
32
  4. **How do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
33
 
34
- ### Rephrasing Setup
35
 
36
- We rephrase documents using instruction-tuned models ranging from 270M to 27B parameters (primarily Gemma-3 [@gemma3] variants) on filtered web corpora including FineWeb-Edu [@fineweb] and DCLM [@datacomp], processing roughly 20B input tokens per quality tier. Our pipeline runs documents through customizable prompt templates that transform raw web text into structured formats (articles, tutorials, FAQs, discussions, commentaries) as well as distillation and continuation tasks inspired by prior work, producing between ~XB and XXB output tokens depending on the strategy.
37
 
38
  For inference we use vLLM [@vllm] with tensor parallelism, chunked prefill, and speculative decoding [@speculativedecoding] (n-gram prompt lookup with ~7 draft tokens, acceptance rates around 0.7). Every rephrased document gets scored by both the FineWeb-Edu classifier and the DCLM quality scorer, and we track token counts, quality score deltas, and metadata including thinking traces when available. The whole thing runs distributed across 100 parallel tasks on a SLURM cluster with checkpointing, targeting 10B tokens of synthetic data for downstream ablations.
39
 
40
  ### Source Datasets
41
 
42
- We compare against several baseline datasets for pretraining and data rephrasing. We use "source data" and "seed data" interchangeably throughout.
43
 
44
  <Accordion title="DCLM" open>
45
  A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM (DataComp-LM) enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens [@datacomp].
@@ -63,9 +63,9 @@ We compare against several baseline datasets for pretraining and data rephrasing
63
  A method for recycling the web with guided rewriting that enriches low-quality documents discarded by filtering pipelines. Mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales across 22 tasks [@rewire].
64
  </Accordion>
65
 
66
- ### Ablation Setup
67
 
68
- Following the ablation methodology from FineWeb [@fineweb], we train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
69
  <Sidenote>
70
  Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
71
  </Sidenote>
@@ -78,10 +78,12 @@ Since our model is small and trained on only 20B tokens, we use the **cloze form
78
  - **Table Understanding**: WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
79
 
80
 
81
- ### A Note on Model Collapse
82
 
83
- A common misconception is that any use of synthetic data inevitably degrades model performance. This stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
84
 
85
- In practice, nobody trains models this way. Real-world synthetic data pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically for specific purposes rather than replacing entire training corpora. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data.
86
 
87
  The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
 
 
 
6
 
7
  <ReadingTime words={1243} visuals={0} />
8
 
9
+ Several teams have already shown that rephrasing web content into cleaner formats can beat training on raw data: WRAP [@wrap] rewrites text in different styles, Nemotron-CC [@nemotroncc] extracts QA pairs and knowledge lists, REWIRE [@rewire] does guided rewriting, and BeyondWeb [@beyondweb] tries continuation and summarization. But nobody has done a systematic comparison across all these approaches, and the field still lacks a clear framework for what "rephrasing" even means. So let's fix that.
10
 
11
  ### What is Rephrasing?
12
 
13
+ **Rephrasing** means running existing documents through a language model to produce variants that keep the meaning but change the presentation. That sounds simple, but the design space is huge. A document could be reformatted as a tutorial with worked examples, restructured as FAQ pairs, expanded with explanatory commentary, condensed into knowledge lists, or rewritten in Wikipedia style. Each transformation targets different capabilities: tutorials may help step-by-step reasoning, FAQs might boost question-answering, and math reformulations could strengthen quantitative skills. Which transformations actually work, and when? That's what we set out to answer.
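
To make the mechanism concrete, here is a minimal sketch of what a rephrasing prompt template could look like. The wording is hypothetical, not one of the actual FinePhrase prompts:

```python
# Hypothetical rephrasing template; the real prompts differ in wording.
TUTORIAL_TEMPLATE = (
    "Rewrite the following web document as a step-by-step tutorial "
    "with worked examples. Preserve the facts; change only the presentation.\n"
    "\n"
    "Document:\n"
    "{document}"
)

def build_prompt(document: str, template: str = TUTORIAL_TEMPLATE) -> str:
    """Fill a rephrasing template with one source document."""
    return template.format(document=document)

prompt = build_prompt("Photosynthesis converts light energy into chemical energy.")
print(prompt)
```

Each transformation (FAQ, knowledge list, Wikipedia style, ...) is then just a different template applied to the same source documents.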
14
 
15
  ### Three Axes of Synthetic Data
16
 
 
22
 
23
  Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best rephrasing strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? When does aggressive transformation help versus hurt?
24
 
25
+ ### What We Want to Find Out
26
 
27
+ Here are the concrete questions we're trying to answer:
28
 
29
  1. **Which rephrasing strategies work best?** We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations) to find which transformations consistently improve downstream performance.
30
  2. **How do generator model properties affect quality?** We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
31
  3. **When does source data quality matter?** We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
32
  4. **How do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
33
 
34
+ ### How We Run Rephrasing
35
 
36
+ In practice, we rephrase documents using instruction-tuned models ranging from 270M to 27B parameters (primarily Gemma-3 [@gemma3] variants) on filtered web corpora including FineWeb-Edu [@fineweb] and DCLM [@datacomp], processing roughly 20B input tokens per quality tier. Our pipeline runs documents through customizable prompt templates that transform raw web text into structured formats (articles, tutorials, FAQs, discussions, commentaries) as well as distillation and continuation tasks inspired by prior work.
37
 
38
  For inference we use vLLM [@vllm] with tensor parallelism, chunked prefill, and speculative decoding [@speculativedecoding] (n-gram prompt lookup with ~7 draft tokens, acceptance rates around 0.7). Every rephrased document gets scored by both the FineWeb-Edu classifier and the DCLM quality scorer, and we track token counts, quality score deltas, and metadata including thinking traces when available. The whole thing runs distributed across 100 parallel tasks on a SLURM cluster with checkpointing, targeting 10B tokens of synthetic data for downstream ablations.
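
N-gram prompt lookup works well here because rephrased output frequently copies spans of the source document verbatim, so draft tokens can be pulled from the prompt itself instead of a separate draft model. A toy sketch of the drafting step (illustrative, not vLLM's actual implementation):

```python
def ngram_lookup_draft(prompt_tokens, generated_tokens, ngram_size=2, num_draft=7):
    """Propose up to `num_draft` speculative tokens by matching the last
    n-gram of the generation against the prompt. The target model still
    verifies the drafts; only accepted tokens are kept."""
    if len(generated_tokens) < ngram_size:
        return []
    tail = generated_tokens[-ngram_size:]
    # Scan right-to-left so the most recent occurrence in the prompt wins.
    for start in range(len(prompt_tokens) - ngram_size, -1, -1):
        if prompt_tokens[start:start + ngram_size] == tail:
            cont = start + ngram_size
            return prompt_tokens[cont:cont + num_draft]
    return []

prompt = "the cat sat on the mat because the cat was tired".split()
generated = "we saw that the cat".split()
print(ngram_lookup_draft(prompt, generated))  # → ['was', 'tired']
```

When the generation reuses prompt phrasing often, many drafts are accepted, which is consistent with the ~0.7 acceptance rates reported above.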
39
 
40
  ### Source Datasets
41
 
42
+ Before diving into experiments, here's a quick overview of the datasets we compare against. We use "source data" and "seed data" interchangeably throughout.
43
 
44
  <Accordion title="DCLM" open>
45
  A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM (DataComp-LM) enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens [@datacomp].
 
63
  A method for recycling the web with guided rewriting that enriches low-quality documents discarded by filtering pipelines. Mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales across 22 tasks [@rewire].
64
  </Accordion>
65
 
66
+ ### How We Measure Success
67
 
68
+ To evaluate each configuration, we follow the ablation methodology from FineWeb [@fineweb]: train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) on 20B tokens and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
69
  <Sidenote>
70
  Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
71
  </Sidenote>
 
78
  - **Table Understanding**: WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
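
A rough sketch of what cloze-format scoring boils down to, with a toy stand-in for the model's log-probability function (the heuristic is purely illustrative):

```python
def cloze_pick(context, choices, logprob_fn):
    """Score each answer choice as a continuation of the context and
    return the highest-scoring one (next-token-prediction style)."""
    return max(choices, key=lambda c: logprob_fn(context, c))

def toy_logprob(context, choice):
    # Stand-in for summing token log-probs under a real model:
    # reward word overlap with the context, lightly penalise length.
    overlap = len(set(context.lower().split()) & set(choice.lower().split()))
    return overlap - 0.1 * len(choice.split())

ctx = "The capital of France is"
choices = ["Paris, the capital of France", "a large banana"]
print(cloze_pick(ctx, choices, toy_logprob))  # → 'Paris, the capital of France'
```

With a real model, `logprob_fn` would sum the log-probabilities of the choice's tokens conditioned on the context, which avoids relying on the model's multiple-choice formatting abilities.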
79
 
80
 
81
+ ### But Wait, What About Model Collapse?
82
 
83
+ You might be wondering: doesn't training on synthetic data inevitably lead to model collapse? This is a common misconception that stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
84
 
85
+ In practice, nobody trains models this way. Real-world pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically rather than replacing entire training corpora. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data.
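
A back-of-the-envelope sketch of that kind of mixing, with made-up fractions for illustration:

```python
def mix_corpus(synthetic_docs, human_docs, synth_fraction, total):
    """Build a training mix with a fixed synthetic fraction, rather than
    training on model outputs alone. The fraction here is illustrative."""
    n_synth = int(total * synth_fraction)
    n_human = total - n_synth
    if n_synth > len(synthetic_docs) or n_human > len(human_docs):
        raise ValueError("not enough documents for the requested mix")
    return synthetic_docs[:n_synth] + human_docs[:n_human]

mix = mix_corpus(["s"] * 100, ["h"] * 100, synth_fraction=0.5, total=20)
print(mix.count("s"), mix.count("h"))  # → 10 10
```

The human documents keep injecting fresh signal, which is exactly what the closed-loop collapse setting lacks.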
86
 
87
  The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
88
+
89
+ With all that context out of the way, let's get to the fun part: the experiments.
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -30,7 +30,7 @@ Notes:
30
 
31
  <ReadingTime words={2063} visuals={14} />
32
 
33
- With the infrastructure and setup in place, we now systematically work through our research questions. <FigRef target="experiment-overview" /> shows the full landscape of our experiments as a flow from source datasets through prompt strategies to model families. We start by benchmarking existing datasets and dissecting what makes their prompts tick. Then we test our own prompt designs, explore how the rephrasing model (size, family, generation) affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
34
 
35
  <HtmlEmbed
36
  id="experiment-overview"
@@ -41,7 +41,7 @@ With the infrastructure and setup in place, we now systematically work through o
41
 
42
  ### How Do Existing Datasets Compare?
43
 
44
- We train on eight datasets under identical conditions and compare their final evaluation performance. DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.
45
 
46
  <HtmlEmbed
47
  id="baselines-comparison"
@@ -62,11 +62,11 @@ We train on eight datasets under identical conditions and compare their final ev
62
  }}
63
  />
64
 
65
- The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
66
 
67
  #### Which Individual Prompts Match DCLM?
68
 
69
- We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source. Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.
70
 
71
  <Sidenote>
72
  The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
@@ -93,11 +93,11 @@ The BeyondWeb dataset was never released and the paper omits key details, yet cl
93
  }}
94
  />
95
 
96
- Can we design prompts that consistently beat DCLM?
97
 
98
  ### Can New Prompts Beat DCLM?
99
 
100
- Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), [faq](#faq), [math](#math), [narrative](#narrative), [table](#table), [tutorial](#tutorial)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), and [narrative](#narrative) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
101
 
102
  <HtmlEmbed
103
  id="new-prompts"
@@ -119,11 +119,11 @@ Since most existing prompts fail to beat DCLM, we designed nine novel prompt for
119
  }}
120
  />
121
 
122
- We used Gemma-3-1B for all experiments so far. Can we do even better by changing the rephrasing model?
123
 
124
  ### Impact of the Rephrasing Model
125
 
126
- We want to know whether using a stronger model leads to better synthetic data. We look at this dimension from three angles: model size, model family, and model generation.
127
 
128
  #### Does the model size matter?
129
 
@@ -132,7 +132,7 @@ For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1
132
  SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
133
  The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
134
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
135
- The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), larger models do not improve synthetic data quality.
136
 
137
  <Sidenote>
138
  It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
@@ -186,11 +186,11 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
186
  }}
187
  />
188
 
189
- On high-quality source data, we see no evidence that larger models help. But REWIRE claims large models are needed specifically for low-quality data. Does that claim hold?
190
 
191
  #### Do we need better models for rephrasing low-quality data?
192
 
193
- The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
194
 
195
  <HtmlEmbed
196
  id="size-quality"
@@ -238,11 +238,11 @@ The REWIRE [@rewire] paper claims that upcycling low-quality data requires large
238
  }}
239
  />
240
 
241
- Since model size barely matters, does the model family make a difference?
242
 
243
  #### Does the model family matter?
244
 
245
- We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all eight prompts (see <FigRef target="model-family" />).
246
 
247
  <Sidenote>
248
  We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
@@ -346,11 +346,14 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
346
  }}
347
  />
348
 
349
- SmolLM2 is already a year old. Are newer model generations better?
 
 
 
350
 
351
  #### Does the model generation matter?
352
 
353
- We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt. While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see <FigRef target="model-generation" />), with the gains accumulating from version 1.5 to 3.
354
 
355
  <HtmlEmbed
356
  id="model-generation"
@@ -374,15 +377,15 @@ We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and
374
  **Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
375
  </Note>
376
 
377
- We've explored the model dimension thoroughly. Now, what difference do the dataset choices make?
378
 
379
  ### Impact of the Dataset Choices
380
 
381
- So far we've always mixed synthetic data with a <Glossary term="source dataset" definition="The original dataset that gets rephrased by the language model to produce synthetic data." /> and a <Glossary term="mix-in dataset" definition="The non-synthetic dataset mixed with the rephrased data during training. This can be the same as or different from the source dataset." />. But what role do these different datasets play?
382
 
383
  #### Is synthetic data enough?
384
 
385
- We compare synthetic-only training vs mixed training (synthetic + source) for [faq](#faq) and [tutorial](#tutorial) prompts on DCLM and FineWeb-Edu-HQ sources. Synthetic-only training falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines.
386
 
387
  <HtmlEmbed
388
  id="synthetic-only"
@@ -412,11 +415,11 @@ We compare synthetic-only training vs mixed training (synthetic + source) for [f
412
  }}
413
  />
414
 
415
- So synthetic data alone does not seem to be enough. But how much does the specific choice of mix-in dataset affect performance?
416
 
417
  #### Does the mix-in dataset matter?
418
 
419
- We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data. DCLM outperforms other mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
420
 
421
  <HtmlEmbed
422
  id="mixin-dataset"
@@ -452,11 +455,11 @@ We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, th
452
  }}
453
  />
454
 
455
- The mix-in dataset matters enormously. But what about the source dataset we feed to the rephrasing model?
456
 
457
  #### Does the source dataset matter?
458
 
459
- We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [faq](#faq) and [tutorial](#tutorial) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.
460
 
461
  <HtmlEmbed
462
  id="source-dataset-mixin-source"
@@ -514,11 +517,11 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
514
  }}
515
  />
516
 
517
- This is exciting because it shows the potential of upcycling low-quality data through rephrasing with format prompts. Can we squeeze out more performance by increasing diversity in the synthetic portion?
518
 
519
  #### Does increased diversity help?
520
 
521
- We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies. None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds (see <FigRef target="diversity" />). However, our ablations train on only 20B tokens, so it is possible that diversity benefits only emerge at larger scales where the model can better exploit the varied signal.
522
 
523
  <Sidenote>
524
  Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
@@ -574,11 +577,11 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
574
  **Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
575
  </Note>
576
 
577
- We've covered prompts, models, and datasets. One last question: how sensitive is all of this to small details in the prompt itself?
578
 
579
  ### Do Typos in the Prompt Hurt?
580
 
581
- We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) against an [improved version](#guided_rewrite_improved), at both 1B and 12B scale. Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see <FigRef target="typos-effect" />).
582
 
583
  <HtmlEmbed
584
  id="typos-effect"
@@ -597,7 +600,7 @@ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) aga
597
 
598
  ### Takeaways
599
 
600
- Here are the key takeaways from our experiments:
601
 
602
  - **Q: How do existing datasets compare?**<br/>
603
  A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
@@ -624,4 +627,6 @@ Here are the key takeaways from our experiments:
624
  - **Q: Do typos in the prompt hurt?**<br/>
625
  A: No. Typos have no negative effect on downstream performance.
626
 
627
- So what actually matters? Prompt design, above all else. Structured formats like FAQ, Math, Table, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving. A 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
 
 
 
  <ReadingTime words={2063} visuals={14} />
32
 
33
+ Time to put all of this to the test. We ran 90 experiments to systematically answer our questions, and the journey took some unexpected turns. <FigRef target="experiment-overview" /> shows the full landscape: source datasets flowing through prompt strategies to model families. We start by seeing how existing datasets stack up, then dissect what makes their prompts tick. From there we design our own prompts, explore how the rephrasing model affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
34
 
35
  <HtmlEmbed
36
  id="experiment-overview"
 
41
 
42
  ### How Do Existing Datasets Compare?
43
 
44
+ First things first: where does the bar sit? We train on eight datasets under identical conditions and compare their final evaluation performance. DCLM, Nemotron-HQ-Synth, and REWIRE come out on top by a clear margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, SYNTH, and EssentialWeb, fall notably behind. DCLM is the strongest baseline and becomes our target to beat for everything that follows.
45
 
46
  <HtmlEmbed
47
  id="baselines-comparison"
 
62
  }}
63
  />
64
 
65
+ Nemotron-HQ-Synth and REWIRE are both mixes of several prompts. So what's actually doing the heavy lifting inside them?
66
 
67
  #### Which Individual Prompts Match DCLM?
68
 
69
+ We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source. Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline.
70
 
71
  <Sidenote>
72
  The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
 
93
  }}
94
  />
95
 
96
+ That's a pretty underwhelming hit rate. Can we do better with our own prompts?
97
 
98
  ### Can New Prompts Beat DCLM?
99
 
100
+ Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), [faq](#faq), [math](#math), [narrative](#narrative), [table](#table), [tutorial](#tutorial)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four of them ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) clearly outperform DCLM, while the other five sit at or below DCLM level (see <FigRef target="new-prompts" />). The winning prompts share a common trait: they all restructure the source content into pedagogically rich formats rather than just paraphrasing it.
101
 
102
  <HtmlEmbed
103
  id="new-prompts"
 
119
  }}
120
  />
121
 
122
+ So far we've been using Gemma-3-1B for everything. A natural question is: can we squeeze out more performance by throwing a bigger or better model at the problem?
123
 
124
  ### Impact of the Rephrasing Model
125
 
126
+ We look at this from three angles: model size, model family, and model generation.
127
 
128
  #### Does the model size matter?
129
 
 
132
  SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
133
  The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
134
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
135
+ The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
136
 
137
  <Sidenote>
138
  It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
 
186
  }}
187
  />
188
 
189
+ That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
190
 
191
  #### Do we need better models for rephrasing low-quality data?
192
 
193
+ REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data requires large models. We put this to the test by comparing 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
194
 
195
  <HtmlEmbed
196
  id="size-quality"
 
238
  }}
239
  />
240
 
241
+ So model size doesn't matter much. But what if you're using the wrong model family entirely?
242
 
243
  #### Does the model family matter?
244
 
245
+ We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts. The result here is striking: SmolLM2 consistently and clearly outperforms all others across every single prompt (see <FigRef target="model-family" />).
246
 
247
  <Sidenote>
248
  We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
 
346
  }}
347
  />
348
 
349
+ SmolLM2 is already over a year old at this point. If model quality matters, should we just wait for the next generation?
350
+ <Sidenote>
351
+ [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) was released during our experiments, but it is not compatible with the vLLM version we used for inference, and dependency hell prevented us from upgrading vLLM.
352
+ </Sidenote>
353
 
354
  #### Does the model generation matter?
355
 
356
+ We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt. The differences are small, but there is a consistent upward trend: newer versions lead to slightly higher evaluation performance (see <FigRef target="model-generation" />), with the gains accumulating from version 1.5 to 3.
357
 
358
  <HtmlEmbed
359
  id="model-generation"
 
377
  **Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
378
  </Note>
379
 
380
+ We've thoroughly explored the model dimension. The next obvious question: how much do the dataset choices matter?
381
 
382
  ### Impact of the Dataset Choices
383
 
384
+ So far we've always mixed synthetic data with a <Glossary term="source dataset" definition="The original dataset that gets rephrased by the language model to produce synthetic data." /> and a <Glossary term="mix-in dataset" definition="The non-synthetic dataset mixed with the rephrased data during training. This can be the same as or different from the source dataset." />. But do we even need the original data? And if so, which dataset should we mix in?
385
 
386
  #### Is synthetic data enough?
387
 
388
+ The dream scenario would be generating all your training data synthetically, no curation needed. We test this by comparing synthetic-only training vs mixed training (synthetic + source) for [faq](#faq) and [tutorial](#tutorial) prompts on DCLM and FineWeb-Edu-HQ sources. Unfortunately, synthetic-only training falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixing consistently improves over both the synthetic-only and original-data-only baselines.
389
 
390
  <HtmlEmbed
391
  id="synthetic-only"
 
415
  }}
416
  />
417
 
418
+ OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?
419
 
420
  #### Does the mix-in dataset matter?
421
 
422
+ We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data. DCLM outperforms other mix-in datasets across the board. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). This was one of our bigger surprises: the mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
423
 
424
  <HtmlEmbed
425
  id="mixin-dataset"
 
455
  }}
456
  />
457
 
458
+ If the mix-in dataset matters so much, what about the source dataset we're actually rephrasing?
459
 
460
  #### Does the source dataset matter?
461
 
462
+ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [faq](#faq) and [tutorial](#tutorial) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). This is exciting: it means you can rephrase even low-quality data and still get competitive results, as long as you pair it with a strong mix-in dataset.
463
 
464
  <HtmlEmbed
465
  id="source-dataset-mixin-source"
 
517
  }}
518
  />
519
 
520
+ That opens up a much larger pool of source data to draw from. But can we squeeze out even more performance by increasing diversity in the synthetic portion?
521
 
522
  #### Does increased diversity help?
523
 
524
+ We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies. None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds (see <FigRef target="diversity" />). This was a bit disappointing. That said, our ablations train on only 20B tokens, so diversity benefits may emerge at larger scales where the model can better exploit the varied signal.
525
 
526
  <Sidenote>
527
  Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
 
577
  **Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
578
  </Note>
579
 
580
+ We've covered prompts, models, and datasets. One last fun question: how sensitive is all of this to tiny details in the prompt itself?
581
 
582
  ### Do Typos in the Prompt Hurt?
583
 
584
+ While implementing the REWIRE prompt, we noticed it contained several typos and grammatical errors. So we cleaned it up and ran both versions. The result? Typos don't hurt at all. For the 1B model, the typo-laden [original](#guided_rewrite_original) actually performs slightly better than the [improved version](#guided_rewrite_improved) (see <FigRef target="typos-effect" />). So much for prompt polish.
585
 
586
  <HtmlEmbed
587
  id="typos-effect"
 
600
 
601
  ### Takeaways
602
 
603
+ Let's step back and summarize what we learned:
604
 
605
  - **Q: How do existing datasets compare?**<br/>
606
  A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
 
627
  - **Q: Do typos in the prompt hurt?**<br/>
628
  A: No. Typos have no negative effect on downstream performance.
629
 
630
+ So what actually matters? Prompt design, above all else. Structured formats like FAQ, Math, Table, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving: a 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
631
+
632
+ Now let's look more closely at *why* these things work the way they do.
app/src/content/chapters/4-analyses.mdx CHANGED
@@ -8,15 +8,15 @@ import ReadingTime from "../../components/ReadingTime.astro";
8
 
9
  <ReadingTime words={1433} visuals={6} />
10
 
11
- The experiments above tell us *what* works. Now we zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, what the rephrased outputs actually look like, and why a messier model sometimes wins.
12
 
13
  ### Is More Compute Worth It?
14
 
15
- GPU time across our 90 experiments varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. <FigRef target="cost-efficiency" /> plots each experiment's downstream performance against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations.
16
 
17
- **The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time decreasing performance.
18
 
19
- **For practitioners, the message is clear: invest in prompt design, not model size.** A well-chosen prompt on a 1B model will outperform a generic prompt on a 27B model at a tiny fraction of the cost. The only scenario where larger models might be justified is for complex prompts (like Guided Rewrite) that require more capable instruction following, but even there the gains are marginal.
20
 
21
  <Wide>
22
  <HtmlEmbed
@@ -27,11 +27,11 @@ GPU time across our 90 experiments varies by two orders of magnitude: the cheape
27
  />
28
  </Wide>
29
 
30
- The cheapest configurations still take over a week of GPU time, and we only know which ones work *after* rephrasing 10B tokens and then training the model. A cheap proxy metric that predicts downstream performance would let us fail fast and iterate on prompts without running the full pipeline each time. Can existing quality scores fill that role?
31
 
32
  ### Can Quality Scores Predict Performance?
33
 
34
- The FineWeb-Edu-score and DCLM-score are effective quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and skip the train-then-evaluate loop entirely. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores] <FigRef target="score-correlation" /> shows the full correlation matrix.
35
 
36
  [^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.
37
 
@@ -43,7 +43,7 @@ The FineWeb-Edu-score and DCLM-score are effective quality filters for human-wri
43
  **The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
44
  */}
45
 
46
- **Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **For synthetic data, there is no shortcut: you have to train models and evaluate them.**
47
 
48
  {/*
49
  Seven early runs have incorrect input quality scores due to a scoring pipeline bug and
@@ -75,7 +75,7 @@ The correlation matrix tells us that quality scores are weak predictors, but not
75
  />
76
  </Wide>
77
 
78
- If quality scores designed for filtering web data can't predict synthetic data performance, maybe looking at the outputs more directly can. Does the verbosity of the rephrasing model predict downstream performance?
79
 
80
  ### Do Chatty Models Make Better Data?
81
 
@@ -103,11 +103,11 @@ But does this variation actually affect downstream performance? Our prompts prod
103
  />
104
  </Wide>
105
 
106
- So output length doesn't predict quality. But output *diversity* might. We found a surprising case where a model that follows instructions poorly actually produces better training data.
107
 
108
  ### Math Rephrasing: When "Worse" Outputs Win
109
 
110
- We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
111
 
112
  **Qwen3 produced beautiful, structured outputs:**
113
 
@@ -156,7 +156,7 @@ SmolLM2's quality distribution was actually reasonable:
156
  | Partial | 30+ tokens but missing structure | 25% |
157
  | Poor | {'<'}30 tokens | 8% |
158
 
159
- For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
160
 
161
  <Note title="Summary: Analyses" variant="info">
162
  **Cost**: Small models with simple prompts dominate the Pareto frontier. Invest in prompt design, not model size.<br/>
@@ -164,3 +164,5 @@ For pretraining data, diversity beats consistency. Models that don't follow inst
164
  **Verbosity**: Output length has no meaningful relationship with performance. What matters is content, not compression ratio.<br/>
165
  **Diversity**: Template collapse hurts more than noisy outputs. A messier model that produces varied text can outperform a polished one that repeats the same template.
166
  </Note>
 
 
 
  <ReadingTime words={1433} visuals={6} />
10
 
11
+ The experiments tell us *what* works. Now let's zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, what the rephrased outputs actually look like, and why a messier model sometimes wins.
12
 
13
  ### Is More Compute Worth It?
14
 
15
+ Running 90 experiments is not cheap. GPU time varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. <FigRef target="cost-efficiency" /> plots each experiment's downstream performance against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations.
16
 
17
+ **The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time *decreasing* performance.
18
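The frontier itself is cheap to compute: sort runs by cost and keep each one that beats every cheaper run. A minimal sketch (the run data below is illustrative, not our actual results):

```python
def pareto_frontier(runs):
    """Keep runs not dominated by a cheaper-and-at-least-as-good run.

    runs: list of (gpu_hours, score) pairs; lower cost and higher score are better.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost; at equal cost, consider the higher score first.
    for cost, score in sorted(runs, key=lambda r: (r[0], -r[1])):
        if score > best_score:
            frontier.append((cost, score))
            best_score = score
    return frontier

# Illustrative (GPU-days, agg score) points only:
print(pareto_frontier([(8, 0.17), (50, 0.18), (100, 0.15), (450, 0.16)]))
# → [(8, 0.17), (50, 0.18)]
```

The expensive runs at (100, 0.15) and (450, 0.16) drop out because a cheaper run already scores higher, which is exactly the pattern we see with 12B and 27B rephrasers.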
 
19
+ **The message is clear: invest in prompt design, not model size.** A well-chosen prompt on a 1B model will outperform a generic prompt on a 27B model at a tiny fraction of the cost. The only scenario where larger models might be justified is for complex prompts (like Guided Rewrite) that require more capable instruction following, but even there the gains are marginal.
20
 
21
  <Wide>
22
  <HtmlEmbed
 
27
  />
28
  </Wide>
29
 
30
+ Even the cheapest configurations still take over a week of GPU time, and we only know which ones work *after* rephrasing 10B tokens and then training a model. Wouldn't it be nice if we could just score the rephrased outputs directly and skip the expensive train-then-evaluate loop?
31
 
32
  ### Can Quality Scores Predict Performance?
33
 
34
+ FineWeb-Edu-score and DCLM-score are great quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and iterate on prompts without running the full pipeline each time. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores] <FigRef target="score-correlation" /> shows the full correlation matrix.
35
 
36
  [^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.
37
 
 
43
  **The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
44
  */}
45
 
46
+ **Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **The bottom line: for synthetic data, there is no shortcut. You have to train models and evaluate them.**
47
 
48
  {/*
49
  Seven early runs have incorrect input quality scores due to a scoring pipeline bug and
 
75
  />
76
  </Wide>
77
 
78
+ So quality scores designed for filtering web data don't transfer to synthetic data. Maybe looking at the outputs more directly helps. For instance, does the length of the rephrased output tell us anything?
79
 
80
  ### Do Chatty Models Make Better Data?
81
 
 
103
  />
104
  </Wide>
105
 
106
+ So output length doesn't predict quality either. But we stumbled onto something more interesting while looking at output *diversity*: a case where a model that follows instructions poorly actually produces better training data.
107
 
108
  ### Math Rephrasing: When "Worse" Outputs Win
109
 
110
+ This was one of our most surprising findings. We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
111
 
112
  **Qwen3 produced beautiful, structured outputs:**
113
 
 
156
  | Partial | 30+ tokens but missing structure | 25% |
157
  | Poor | {'<'}30 tokens | 8% |
158
 
159
+ The lesson: for pretraining data, diversity beats consistency. A model that doesn't follow instructions perfectly can actually produce better training data than one that does. This also helps explain why SmolLM2 dominates the model family comparison: it produces more varied outputs, which may matter more than precise instruction following.
160
 
161
  <Note title="Summary: Analyses" variant="info">
162
  **Cost**: Small models with simple prompts dominate the Pareto frontier. Invest in prompt design, not model size.<br/>
 
164
  **Verbosity**: Output length has no meaningful relationship with performance. What matters is content, not compression ratio.<br/>
165
  **Diversity**: Template collapse hurts more than noisy outputs. A messier model that produces varied text can outperform a polished one that repeats the same template.
166
  </Note>
167
+
168
+ With the experiments and analyses behind us, let's talk about the infrastructure that made all of this possible.
app/src/content/chapters/5-infrastructure.mdx CHANGED
@@ -9,13 +9,13 @@ import ReadingTime from "../../components/ReadingTime.astro";
9
 
10
  <ReadingTime words={4780} visuals={9} />
11
 
12
- Each of our 90 experiments requires rephrasing around 10 billion tokens of web text. Even with KV caching, every output token still needs its own forward pass, and every web document has a few thousand tokens. With the wrong serving configuration, a single experiment can take weeks instead of days. Multiply that by 90 and the difference between a good and bad setup is months of GPU time.
13
 
14
- Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
15
 
16
  We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to handle this. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue. We used it for every experiment in this blog post, from 10k-example test runs to the full FinePhrase production pipeline.
17
 
18
- <FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's dive in!
19
 
20
  <HtmlEmbed
21
  id="datatrove-pipeline"
@@ -202,9 +202,9 @@ Need multiple samples per document? Set `rollouts_per_document` in your `Inferen
202
 
203
  ### Throughput Benchmarking
204
 
205
- For synthetic data generation, we may run language model inference for millions of GPU hours. Finding a configuration that maximizes throughput is critical: it can accelerate generation by days and save thousands of dollars. In this section, we describe our experiments to identify optimal parameters for a selection of popular models.
206
 
207
- The entire benchmarking code (experiment launcher, analysis scripts, and sample configs) is available as a [DataTrove inference benchmark example](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark).
208
 
209
  #### Benchmarking setup
210
 
@@ -260,7 +260,7 @@ Failure modes are automatically classified:
260
  - **timeout**: SLURM time limit exceeded (configuration too slow)
261
  - **server_fail**: vLLM server failed to start (e.g., engine core initialization failure, insufficient GPU memory for the model at the given tp)
262
 
263
- #### Scale of the experiment
264
 
265
  The benchmark config defines **801 unique configurations** across 8 experiment groups (18 models with ~23 configurations each via the tiered approach):
266
 
@@ -287,7 +287,7 @@ The benchmark config defines **801 unique configurations** across 8 experiment g
287
 
288
  #### What these numbers mean in practice
289
 
290
- Let's make this concrete. Each of our ablation experiments rephrases roughly 10 billion tokens. Consider [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), a single 10B-token experiment takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**, a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Over 90 experiments, that difference adds up to tens of thousands of GPU-hours and well over 100,000 USD.
291
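The arithmetic behind these figures is worth making explicit: GPU-hours scale linearly with token count and inversely with per-GPU throughput.

```python
def rephrase_cost(total_tokens, tokens_per_sec_per_gpu, usd_per_gpu_hour=3.0):
    """GPU-hours and USD to generate total_tokens at a given per-GPU throughput."""
    gpu_hours = total_tokens / tokens_per_sec_per_gpu / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

# gpt-oss-120b, 10B-token experiment (throughputs from our sweep):
print(rephrase_cost(10e9, 3_138))  # baseline tp=1: ~885 GPU-hours, ~2,656 USD
print(rephrase_cost(10e9, 6_117))  # optimized tp=2: ~454 GPU-hours, ~1,362 USD

# GPUs needed to sustain 1B tokens/hour at the optimized throughput:
print(1e9 / (6_117 * 3600))  # ~45 GPUs
```

Plugging in your own throughput numbers before launching a large generation run takes seconds and can flag a misconfigured setup before it burns weeks of GPU time.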
 
292
  These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
293
 
@@ -314,7 +314,7 @@ We also experimented with non-standard block sizes (not 16), fp8 kv-cache quanti
314
  </Sidenote>
315
 
316
 
317
- Before we look into the analysis in more detail, here is some background on memory-bound vs compute-bound inference and speculative decoding.
318
 
319
  <Accordion title="Background: Memory-bound vs compute-bound inference">
320
 
@@ -397,7 +397,7 @@ The fundamental insight is that **optimization gains depend on identifying the b
397
 
398
  #### Scaling to larger models
399
 
400
- The benchmarks above focus on maximizing tokens per second per GPU, which is exactly what you want when generating trillions of tokens for pretraining data. But for post-training, the picture looks different: you probably want bigger models to generate data for hard problems (reasoning, math, code), and you care less about the total number of tokens generated. Quality per token matters more than volume.
401
 
402
  For these use cases, DataTrove scales to models with hundreds of billions (or even a trillion) parameters via multi-node Slurm execution. Here's an example running [Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) [@kimik2] (1T total parameters, 32B active) on the [s1K dataset](https://huggingface.co/datasets/simplescaling/s1K-1.1) [@s1k] to generate solutions to math and reasoning problems:
403
 
@@ -432,7 +432,9 @@ Further improvement ideas:
432
  - Clean it up a bit to make it less cluttered
433
  */}
434
 
435
- To get an intuition for what these throughput numbers feel like, <FigRef target="inference-throughput" /> lets you pick a model and scale up the number of GPUs. Each page represents roughly 500 tokens of generated text. At high enough throughput, pages roll up into books (250 pages each), and books into bookshelves (250 books each).
 
 
436
 
437
  <Wide>
438
  <HtmlEmbed
 
9
 
10
  <ReadingTime words={4780} visuals={9} />
 
+ Each of our 90 experiments requires rephrasing around 10 billion tokens of web text. Even with KV caching, every output token still needs its own forward pass, and every web document has a few thousand tokens. With the wrong serving configuration, a single experiment takes weeks instead of days. Multiply that by 90, and the difference between a good and a bad setup is months of GPU time.
 
+ Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], raw generation speed is no longer the bottleneck. The hard part is the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
 
  We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to handle this. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue. We used it for every experiment in this blog post, from 10k-example test runs to the full FinePhrase production pipeline.
 
+ <FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's walk through it.
 
  <HtmlEmbed
  id="datatrove-pipeline"
 
  ### Throughput Benchmarking
 
+ With the pipeline in place, we turned to a question that can save (or waste) enormous amounts of money: how do you squeeze the most tokens per second out of each model? At the scale we're operating, even a 20% throughput improvement saves days of GPU time per experiment.
 
+ We ran a systematic benchmarking sweep across 18 models and open-sourced the entire setup (experiment launcher, analysis scripts, and sample configs) as a [DataTrove inference benchmark example](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark).
 
  #### Benchmarking setup
 
  - **timeout**: SLURM time limit exceeded (configuration too slow)
  - **server_fail**: vLLM server failed to start (e.g., engine core initialization failure, insufficient GPU memory for the model at the given tp)
 
+ #### Scale of the sweep
 
  The benchmark config defines **801 unique configurations** across 8 experiment groups (18 models with ~23 configurations each via the tiered approach):
 
  #### What these numbers mean in practice
 
+ Let's make this concrete with some back-of-the-envelope math. Each of our ablation experiments rephrases roughly 10 billion tokens. Consider [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), a single 10B-token experiment takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**. That's a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Over 90 experiments, that difference adds up to tens of thousands of GPU-hours and well over 100,000 USD.
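To make the arithmetic easy to reuse, here's a minimal sketch of the same back-of-the-envelope calculation (throughput numbers come from our sweep; the 3 USD/H100-hour rate is an assumed cloud price):

```python
# Back-of-the-envelope cost of one 10B-token rephrasing experiment with
# gpt-oss-120b. Throughputs come from the sweep above; the price is an assumption.
TOKENS = 10e9
USD_PER_H100_HOUR = 3.0  # assumed cloud rate

def experiment_cost(tps_per_gpu: float) -> tuple[float, float]:
    """Return (GPU-hours, USD) to generate TOKENS at a given per-GPU throughput."""
    gpu_hours = TOKENS / tps_per_gpu / 3600
    return gpu_hours, gpu_hours * USD_PER_H100_HOUR

baseline_h, baseline_usd = experiment_cost(3_138)    # tp=1: ~885 GPU-h, ~2,656 USD
optimized_h, optimized_usd = experiment_cost(6_117)  # tp=2: ~454 GPU-h, ~1,362 USD
print(f"saved {baseline_h - optimized_h:.0f} GPU-hours and "
      f"{baseline_usd - optimized_usd:.0f} USD")
```

Swap in your own throughput and price to see what a serving-configuration change is worth on your cluster.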
 
  These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
 
 
  </Sidenote>
 
 
+ To understand why some models benefit more than others, let's briefly review the concepts of memory-bound vs compute-bound inference and speculative decoding.
 
  <Accordion title="Background: Memory-bound vs compute-bound inference">
 
 
  #### Scaling to larger models
 
+ Everything above focuses on maximizing tokens per second per GPU, which is exactly what you want when generating trillions of tokens for pretraining data. But for post-training, the picture is different: you probably want bigger models to generate data for hard problems (reasoning, math, code), and you care less about total volume. Quality per token matters more than throughput.
 
  For these use cases, DataTrove scales to models with hundreds of billions (or even a trillion) parameters via multi-node Slurm execution. Here's an example running [Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) [@kimik2] (1T total parameters, 32B active) on the [s1K dataset](https://huggingface.co/datasets/simplescaling/s1K-1.1) [@s1k] to generate solutions to math and reasoning problems:
 
 
  - Clean it up a bit to make it less cluttered
  */}
 
+ To get a feel for what these throughput numbers actually mean, <FigRef target="inference-throughput" /> lets you pick a model and scale up the number of GPUs. Each page represents roughly 500 tokens of generated text. At high enough throughput, pages roll up into books (250 pages each), and books into bookshelves (250 books each).
+
+ With all these infrastructure pieces in place, we have everything we need to build FinePhrase: the right prompts, the right model, and the machinery to run it all at scale.
 
  <Wide>
  <HtmlEmbed
app/src/content/chapters/6-finephrase.mdx CHANGED
@@ -11,15 +11,15 @@ import finephraseProgressImg from "../assets/image/finephrase-progress.png";
 
  <ReadingTime words={1693} visuals={10} />
 
- We ran 90 experiments to figure out what works. Now we apply those findings to build [FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase), a large-scale synthetic dataset that rephrases all XXX million documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-350BT) into four structured formats, producing XXX billion tokens of synthetic pretraining data.
 
- The recipe is simple: take the best model (SmolLM2-1.7B-Instruct), the best prompts (FAQ, Math, Table, Tutorial), the optimized inference settings from our throughput benchmarks, and the battle-tested DataTrove infrastructure. Launch 100 parallel Slurm workers, each running on a single H100 GPU with suffix-32 speculative decoding. Let it run for about two weeks.
 
  To get a sense of the scale: our infrastructure benchmarks showed that SmolLM2-1.7B-Instruct achieves ~9,200 tokens per second per GPU with suffix-32 speculative decoding. With 100 GPUs running in parallel, that is ~920,000 tokens per second, or about 3.3 billion tokens per hour. Rephrasing ~339 million documents four times (once per prompt) at an average of ~XXX tokens per document means roughly XXX trillion tokens of total generation. At our throughput rate, that takes approximately XXX GPU-days, or about XXX wall-clock days with 100 GPUs.
 
  ### The Recipe
 
- Every configuration choice traces back to a finding from our experiments or infrastructure benchmarks:
 
  - **Model**: [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), which dominated all other model families across every prompt in our [model family comparison](#does-the-model-family-matter)
  - **Prompts**: [FAQ](#faq), [Math](#math), [Table](#table), and [Tutorial](#tutorial), the four prompts that [consistently beat DCLM](#can-new-prompts-beat-dclm) in our experiments
@@ -106,7 +106,7 @@ datacard_pipeline = [
 
  ### Improvements to DataTrove
 
- Building FinePhrase was not just about running inference at scale. It required hardening DataTrove's inference pipeline to handle the realities of processing 339 million documents across 100 parallel workers over two weeks. Every failure mode you can imagine showed up: documents that crash the model, workers racing to commit to the same repo, Slurm jobs dying on startup, and caches corrupting under contention. We merged over a dozen PRs to make this work. Here are the most impactful ones.
 
  #### Graceful error handling for bad documents
 
@@ -126,7 +126,7 @@ The first version of `skip_bad_requests` had a subtle problem: skipped documents
 
  #### Hardening Hub uploads against transient failures
 
- With 100 workers writing to the same Hugging Face Hub repository, transient failures are not rare, they are guaranteed. We encountered three distinct failure modes and fixed each one:
 
  - **Commit races** ([PR #448](https://github.com/huggingface/datatrove/pull/448)): Two workers commit simultaneously and one gets `412 Precondition Failed` with "A commit has happened since." The fix adds retry logic with exponential backoff to the `DiskWriter`, which all Hub-writing paths go through.
  - **Transient server errors** ([PR #463](https://github.com/huggingface/datatrove/pull/463)): `503 Service Unavailable` and other transient API errors were not retried consistently. This PR normalizes retry logic across `DiskWriter` and `HuggingFaceDatasetWriter` so all transient errors are handled uniformly.
 
 
  <ReadingTime words={1693} visuals={10} />
 
+ With the experiments done and the infrastructure battle-tested, it's time to put everything together. We take our findings and build [FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase), a large-scale synthetic dataset that rephrases 340 million documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-350BT) into four structured formats, producing XXX billion tokens of synthetic pretraining data.
 
+ The recipe writes itself from the experiments: take the best model (SmolLM2-1.7B-Instruct), the best prompts (FAQ, Math, Table, Tutorial), the optimized inference settings from our throughput benchmarks, and the DataTrove infrastructure. Launch 100 parallel Slurm workers, each running on a single H100 GPU with suffix-32 speculative decoding. Let it run for about two weeks on spare compute on our cluster.
 
  To get a sense of the scale: our infrastructure benchmarks showed that SmolLM2-1.7B-Instruct achieves ~9,200 tokens per second per GPU with suffix-32 speculative decoding. With 100 GPUs running in parallel, that is ~920,000 tokens per second, or about 3.3 billion tokens per hour. Rephrasing ~339 million documents four times (once per prompt) at an average of ~XXX tokens per document means roughly XXX trillion tokens of total generation. At our throughput rate, that takes approximately XXX GPU-days, or about XXX wall-clock days with 100 GPUs.
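As a quick sanity check on those aggregate numbers, the arithmetic is plain multiplication (a sketch using the benchmark figures above; the per-document token counts remain placeholders in the text):

```python
# Aggregate generation rate for the FinePhrase production run.
tps_per_gpu = 9_200  # SmolLM2-1.7B-Instruct with suffix-32 speculative decoding
n_gpus = 100

total_tps = tps_per_gpu * n_gpus      # ~920,000 tokens/second across the fleet
tokens_per_hour = total_tps * 3600    # ~3.3 billion tokens/hour
print(f"{tokens_per_hour / 1e9:.2f}B tokens/hour")  # prints "3.31B tokens/hour"
```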
 
  ### The Recipe
 
+ Every configuration choice traces directly back to a finding from our experiments or infrastructure benchmarks:
 
  - **Model**: [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), which dominated all other model families across every prompt in our [model family comparison](#does-the-model-family-matter)
  - **Prompts**: [FAQ](#faq), [Math](#math), [Table](#table), and [Tutorial](#tutorial), the four prompts that [consistently beat DCLM](#can-new-prompts-beat-dclm) in our experiments
 
 
  ### Improvements to DataTrove
 
+ Building FinePhrase wasn't just about running inference at scale. Processing 339 million documents across 100 parallel workers for two weeks stress-tests infrastructure in ways that small experiments never do. Every failure mode you can imagine showed up: documents that crash the model, workers racing to commit to the same repo, Slurm jobs dying on startup, and caches corrupting under contention. We merged over a dozen PRs to make this work. Here are the most impactful ones.
 
  #### Graceful error handling for bad documents
 
 
  #### Hardening Hub uploads against transient failures
 
+ With 100 workers writing to the same Hugging Face Hub repository, transient failures aren't rare; they're guaranteed. We encountered three distinct failure modes and fixed each one:
 
  - **Commit races** ([PR #448](https://github.com/huggingface/datatrove/pull/448)): Two workers commit simultaneously and one gets `412 Precondition Failed` with "A commit has happened since." The fix adds retry logic with exponential backoff to the `DiskWriter`, which all Hub-writing paths go through.
  - **Transient server errors** ([PR #463](https://github.com/huggingface/datatrove/pull/463)): `503 Service Unavailable` and other transient API errors were not retried consistently. This PR normalizes retry logic across `DiskWriter` and `HuggingFaceDatasetWriter` so all transient errors are handled uniformly.
app/src/content/chapters/7-conclusions.mdx CHANGED
@@ -4,13 +4,13 @@ import ReadingTime from "../../components/ReadingTime.astro";
 
  <ReadingTime words={624} visuals={0} />
 
- We ran 90 experiments, generated over 1.1 trillion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there. A 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish: a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.
 
- ### Next Steps
 
- The main bottleneck to scaling synthetic data experiments for pretraining is the compute cost of generating the data itself. For reference, producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down. **Diffusion language models** are promising: their parallel generation capabilities yield reported 210x inference speedups over autoregressive approaches. Recent models like [LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) show that diffusion LMs can match autoregressive models on standard benchmarks while generating tokens in parallel, and SGLang already supports serving them, but broader ecosystem support (e.g., vLLM) is still missing. DFlash [@dflash] could further speed up generation, though it is currently cumbersome to use and has limited model support. [Mercury 2](https://www.inceptionlabs.ai/blog/introducing-mercury-2) [@mercury2] pushes this further, reaching over 1,000 tokens per second on NVIDIA Blackwell GPUs through parallel refinement rather than sequential decoding, with 5x+ speedups over autoregressive baselines. On the autoregressive side, speculative decoding support in vLLM remains limited (e.g., draft models are not well supported), leaving significant inference speedups on the table.
 
- While we answered several questions about best practices for synthetic data generation in this work, many remain open:
 
  - **Data repetition**: Can you repeat data more often without performance loss if the repetitions are rephrased?
  - **Mixing ratio**: We mixed unrephrased source data with synthetic data at equal proportions. How little synthetic data can you get away with: 50%, 20%, 5%? What are the best data mixes for pretraining at scale?
@@ -21,4 +21,6 @@ While we answered several questions about best practices for synthetic data gene
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
  - **Longer pretraining**: Our ablations trained for 21B tokens. Do the same findings hold at 100B+ token scales, and do prompt rankings shift with longer training?
  - **Source filtering**: Should we filter documents before or after rephrasing? For instance, applying a math prompt to non-mathematical documents likely wastes compute and adds noise.
- - **Larger ablations and mixtures**: We want to run more extensive mixture experiments, exploring how synthetic data interacts with source data at scale, in line with the recent [smol-data](https://huggingface.co/spaces/HuggingFaceTB/smol-data) effort.
 
 
 
 
  <ReadingTime words={624} visuals={0} />
 
+ We ran 90 experiments, generated over 1 trillion tokens, and spent more than 111,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there: a 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. Template diversity matters more than template polish, and a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. There is no reliable proxy metric that can replace training and evaluating a model, so there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so you can build on these findings without reinventing the plumbing.
 
+ ### What's Next?
 
+ The biggest bottleneck to scaling synthetic data experiments is the compute cost of generation itself. Producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down significantly. **Diffusion language models** are promising: their parallel generation capabilities yield reported 2-10x inference speedups over autoregressive approaches. Models like [LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) show that diffusion LMs can match autoregressive models on standard benchmarks while generating tokens in parallel, and SGLang already supports serving them, but broader ecosystem support (e.g., vLLM) is still missing. DFlash [@dflash] could further speed up generation, though it is currently cumbersome to use and has limited model support. [Mercury 2](https://www.inceptionlabs.ai/blog/introducing-mercury-2) [@mercury2] pushes this further, reaching over 1,000 tokens per second on NVIDIA Blackwell GPUs through parallel refinement rather than sequential decoding, with 5x+ speedups over autoregressive baselines. On the autoregressive side, speculative decoding support in vLLM remains limited (e.g., draft models are not well supported), leaving significant inference speedups on the table.
 
+ Beyond faster generation, we answered several questions about best practices, but many remain wide open:
 
  - **Data repetition**: Can you repeat data more often without performance loss if the repetitions are rephrased?
  - **Mixing ratio**: We mixed unrephrased source data with synthetic data at equal proportions. How little synthetic data can you get away with: 50%, 20%, 5%? What are the best data mixes for pretraining at scale?
 
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
  - **Longer pretraining**: Our ablations trained for 21B tokens. Do the same findings hold at 100B+ token scales, and do prompt rankings shift with longer training?
  - **Source filtering**: Should we filter documents before or after rephrasing? For instance, applying a math prompt to non-mathematical documents likely wastes compute and adds noise.
+ - **Larger ablations and mixtures**: We want to run more extensive mixture experiments, exploring how synthetic data interacts with source data at scale, in line with the recent [smol-data](https://huggingface.co/spaces/HuggingFaceTB/smol-data) effort.
+
+ The playbook is open. Build on it.