joelniklaus (HF Staff) committed
Commit 1ef404f · 1 Parent(s): ae68e5f

updated texts based on new results

app/src/content/article.mdx CHANGED
@@ -1,9 +1,9 @@
  ---
- title: 'The Synthetic Data Playbook:<br/> Generating Billions of the Finest Tokens'
  subtitle: >-
    How to turn noisy web text into state-of-the-art pretraining data with the
    right prompts, models, and infrastructure
- description: 'The Synthetic Data Playbook: Generating Billions of the Finest Tokens'
  authors:
    - name: Joel Niklaus
 
  ---
+ title: 'The Synthetic Data Playbook:<br/> Generating Trillions of the Finest Tokens'
  subtitle: >-
    How to turn noisy web text into state-of-the-art pretraining data with the
    right prompts, models, and infrastructure
+ description: 'The Synthetic Data Playbook: Generating Trillions of the Finest Tokens'
  authors:
    - name: Joel Niklaus
app/src/content/chapters/1-introduction.mdx CHANGED
@@ -27,7 +27,7 @@ Synthetic data also plays a central role in post-training via *distillation*, wh

  However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?

- In this blog post we take a journey to answer all these questions systematically. We ran 65 experiments, generated over 750 billion tokens and spent {'>'}74,000 GPU hours (~8.5 GPU years) for rephrasing alone to find the ideal settings for synthetic data.

  Here's the plan:
  <Sidenote>
@@ -38,7 +38,7 @@ We start with the [Infrastructure](#infrastructure) needed for synthetic data ge

  We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.

- Finally we present the suite of 65 [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.

  Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
  However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?

+ In this blog post we take a journey to answer all these questions systematically. We ran 90 experiments, generated over 1.1 trillion tokens, and spent {'>'}74,000 GPU hours (~8.5 GPU years) for rephrasing alone to find the ideal settings for synthetic data.

  Here's the plan:
  <Sidenote>

  We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.

+ Finally, we present the suite of 90 [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.

  Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -4,28 +4,20 @@ import Sidenote from "../../components/Sidenote.astro";
  import Glossary from "../../components/Glossary.astro";
  import FigRef from "../../components/FigRef.astro";

- {/* TODO: mention the currently running finephrase rephrasing with smollm2 */}
  {/* TODO: read through entire blog post and make improvements */}
- {/* TODO: potentially make a widget for data exploration: look at the same few samples generated by different models or transformed with different prompts */}
  {/* TODO: Integrate decay experiment as another analysis for proxy */}
  {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
  {/* TODO: brainstorm better banner, be artsy */}
- {/* TODO: add essential web as baseline (raw) */}
  {/* TODO: run variance experiments with pretraining from scratch */}
- {/* TODO: run scaling experiments with longer pretraining phase */}
- {/* TODO: filter docs before/after rephrasing (non-mathematical document for math prompt) */}
- {/* TODO: try multiple rollouts and scoring */}
  {/* TODO: banner idea: 1T tokens = 8M books
  5 cm per book = 400 km

  Then you could stack the books on top of each other and show the distance on a map, for example. Or compare it with something.
  Or make a dot for each book.
  */}
- {/* TODO: improve the diagram for the infrastructure at the start of the section */}
  {/* TODO: final configuration for finephrase at the end of infra section: visualization of how many pages (500 tokens) (use page emojis flying from left to right) we can generate (real time), user can configure with a slider the number of GPUs */}
- {/* TODO: only explain datatrove additions when we need them (for generating the final finephrase) */}
- {/* TODO: move infrastructure section after analyses as precursor and explanation for finephrase */}
- {/* TODO: future work say we want to run larger ablations and mixture experiments in line with the recent smol-data release */}
  {/* TODO: baselines mixed with fw-edu-hq usually improve upon just baselines, but not sure if/how to present this */}

  {/*
@@ -62,6 +54,7 @@ We train on eight datasets under identical conditions and compare their final ev
  nemotron_hq_synth: "Nemotron-HQ-Synth",
  rewire: "REWIRE",
  synth_query_reasoning_answer: "SYNTH",
  "ultra-fineweb": "Ultra-FineWeb"
  }
  }}
 
  import Glossary from "../../components/Glossary.astro";
  import FigRef from "../../components/FigRef.astro";

  {/* TODO: read through entire blog post and make improvements */}
  {/* TODO: Integrate decay experiment as another analysis for proxy */}
  {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
  {/* TODO: brainstorm better banner, be artsy */}
+ {/* TODO: expected reading time in total and per chapter */}
  {/* TODO: run variance experiments with pretraining from scratch */}
+ {/* TODO: go through the blog post and update the scale numbers for finephrase dataset */}
  {/* TODO: banner idea: 1T tokens = 8M books
  5 cm per book = 400 km

  Then you could stack the books on top of each other and show the distance on a map, for example. Or compare it with something.
  Or make a dot for each book.
  */}
  {/* TODO: final configuration for finephrase at the end of infra section: visualization of how many pages (500 tokens) (use page emojis flying from left to right) we can generate (real time), user can configure with a slider the number of GPUs */}
  {/* TODO: baselines mixed with fw-edu-hq usually improve upon just baselines, but not sure if/how to present this */}

  {/*
  nemotron_hq_synth: "Nemotron-HQ-Synth",
  rewire: "REWIRE",
  synth_query_reasoning_answer: "SYNTH",
+ essentialweb_raw: "EssentialWeb",
  "ultra-fineweb": "Ultra-FineWeb"
  }
  }}
app/src/content/chapters/4-analyses.mdx CHANGED
@@ -9,7 +9,7 @@ The experiments above tell us *what* works. Now we zoom out and ask *why*. We lo

  ### Is More Compute Worth It?

- GPU time across our 65 experiments varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. <FigRef target="cost-efficiency" /> plots each experiment's downstream performance against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations.

  **The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time decreasing performance.

@@ -20,7 +20,7 @@ GPU time across our 65 experiments varies by two orders of magnitude: the cheape
  id="cost-efficiency"
  src="cost-efficiency.html"
  data="rephrasing_metadata.json"
- desc="GPU time (log scale) vs downstream performance for all 65 experiments. The dashed line shows the Pareto frontier of most efficient configurations. Hover over points for details."
  />
  </Wide>

@@ -28,19 +28,19 @@ The cheapest configurations still take over a week of GPU time, and we only know

  ### Can Quality Scores Predict Performance?

- The FineWeb-Edu-score and DCLM-score are effective quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and skip the train-then-evaluate loop entirely. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 65 experiments.[^broken-scores] <FigRef target="score-correlation" /> shows the full correlation matrix.

  [^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.

- **DCLM-score is a moderate predictor of aggregate performance.** The DCLM-score difference (output minus input) shows the strongest correlation with `agg_score_macro` (ρ = 0.60, p {'<'} 0.001), followed by the output DCLM-score (ρ = 0.55). These are moderate correlations at best. The DCLM-score variants are particularly predictive for table understanding (ρ = 0.51–0.58) and reading comprehension (ρ = 0.47–0.51).

- **Edu-score tells a more nuanced story.** The input edu-score (the score of the original data before rephrasing) correlates with aggregate performance (ρ = 0.35, p {'<'} 0.01), but the output edu-score (the score of the rephrased data) shows essentially no correlation (ρ = 0.04, not significant). Starting with higher-quality source data matters, but the edu-score of the synthetic output is not a reliable proxy at all.

  {/*
- **The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 65 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
  */}

- **Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.55–0.60) are still only moderate, explaining roughly 30% of the variance at best. **For synthetic data, there is no shortcut: you have to train models and evaluate them.**

  {/*
  Seven early runs have incorrect input quality scores due to a scoring pipeline bug and
@@ -53,7 +53,7 @@ article/commentary/discussion/tutorial-1b-hq, tutorial-12b-hq, faq-1b-lq, faq-12
  id="score-correlation"
  src="score-correlation.html"
  data="rephrasing_metadata.json"
- desc="Spearman rank correlations between quality score metrics and downstream benchmark performance across 65 rephrasing experiments. Blue cells indicate positive correlations, red cells negative. Significance: *** p<0.001, ** p<0.01, * p<0.05."
  />
  </Wide>

 
  ### Is More Compute Worth It?

+ GPU time across our 90 experiments varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. <FigRef target="cost-efficiency" /> plots each experiment's downstream performance against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations.

  **The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time decreasing performance.
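To make the Pareto-frontier idea in the figure concrete, here is a minimal sketch, with invented `(gpu_days, score)` points rather than the actual experiment data: after sorting runs by GPU cost, a run sits on the frontier only if it beats the best score among all cheaper runs.

```python
# Sketch of a Pareto frontier over (GPU cost, downstream score) points.
# A run is on the frontier if no other run is both cheaper and better.
def pareto_frontier(runs):
    """runs: list of (gpu_days, score) tuples. Returns non-dominated runs, cheapest first."""
    frontier = []
    for cost, score in sorted(runs):  # ascending GPU cost
        # Keep a run only if it improves on the best score seen among cheaper runs.
        if not frontier or score > frontier[-1][1]:
            frontier.append((cost, score))
    return frontier

# Hypothetical (gpu_days, agg_score_macro) points, for illustration only:
runs = [(8, 0.15), (40, 0.12), (15, 0.17), (450, 0.16)]
print(pareto_frontier(runs))  # -> [(8, 0.15), (15, 0.17)]
```

Here the expensive 450-GPU-day run is dominated: a 15-GPU-day run already scores higher, which mirrors the finding that scaling to larger rephrasing models buys cost without performance.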

  id="cost-efficiency"
  src="cost-efficiency.html"
  data="rephrasing_metadata.json"
+ desc="GPU time (log scale) vs downstream performance for all 90 experiments. The dashed line shows the Pareto frontier of most efficient configurations. Hover over points for details."
  />
  </Wide>

  ### Can Quality Scores Predict Performance?

+ The FineWeb-Edu-score and DCLM-score are effective quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and skip the train-then-evaluate loop entirely. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores] <FigRef target="score-correlation" /> shows the full correlation matrix.
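As a concrete illustration of the correlation analysis, here is a pure-Python sketch of Spearman's ρ for tie-free data (a real analysis would typically call `scipy.stats.spearmanr`; the per-experiment values below are invented, not the actual scores):

```python
# Spearman rank correlation for tie-free data:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the rank difference.
def spearman_rho(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank  # rank 1 = smallest value
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up per-experiment values: DCLM-score difference vs. aggregate benchmark score.
dclm_diff = [0.4, 0.1, 0.9, 0.3, 0.7]
agg_score = [0.12, 0.10, 0.17, 0.13, 0.15]
print(f"rho = {spearman_rho(dclm_diff, agg_score):.2f}")  # -> rho = 0.90
```

Because ρ only uses ranks, it captures monotone relationships without assuming linearity, which is why it is the natural choice for comparing quality scores against benchmark results.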

  [^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.

+ **DCLM-score is a moderate predictor of aggregate performance.** The DCLM-score difference (output minus input) shows the strongest correlation with `agg_score_macro` (ρ = 0.61, p {'<'} 0.001), followed by the output DCLM-score (ρ = 0.56). These are moderate correlations at best. The DCLM-score variants are particularly predictive for table understanding (ρ = 0.47–0.54) and reading comprehension (ρ = 0.49–0.52).

+ **Edu-score tells a more nuanced story.** The input edu-score (the score of the original data before rephrasing) correlates with aggregate performance (ρ = 0.27, p {'<'} 0.05), but the output edu-score (the score of the rephrased data) shows essentially no correlation (ρ = 0.08, not significant). Starting with higher-quality source data matters, but the edu-score of the synthetic output is not a reliable proxy at all.

  {/*
+ **The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
  */}

+ **Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **For synthetic data, there is no shortcut: you have to train models and evaluate them.**

  {/*
  Seven early runs have incorrect input quality scores due to a scoring pipeline bug and

  id="score-correlation"
  src="score-correlation.html"
  data="rephrasing_metadata.json"
+ desc="Spearman rank correlations between quality score metrics and downstream benchmark performance across 83 rephrasing experiments. Blue cells indicate positive correlations, red cells negative. Significance: *** p<0.001, ** p<0.01, * p<0.05."
  />
  </Wide>

app/src/content/chapters/7-conclusions.mdx CHANGED
@@ -1,6 +1,6 @@
  ## Conclusions

- We ran 65 experiments, generated over 750 billion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there. A 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish: a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.

  ### Next Steps
 
  ## Conclusions

+ We ran 90 experiments, generated over 1.1 trillion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there. A 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish: a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.

  ### Next Steps