Commit 1ef404f · Parent(s): ae68e5f · updated texts based on new results
app/src/content/article.mdx
CHANGED

@@ -1,9 +1,9 @@
 ---
-title: 'The Synthetic Data Playbook:<br/> Generating
+title: 'The Synthetic Data Playbook:<br/> Generating Trillions of the Finest Tokens'
 subtitle: >-
   How to turn noisy web text into state-of-the-art pretraining data with the
   right prompts, models, and infrastructure
-description: 'The Synthetic Data Playbook: Generating
+description: 'The Synthetic Data Playbook: Generating Trillions of the Finest Tokens'
 authors:
   - name: Joel Niklaus
app/src/content/chapters/1-introduction.mdx
CHANGED

@@ -27,7 +27,7 @@ Synthetic data also plays a central role in post-training via *distillation*, wh
 
 However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
 
-In this blog post we take a journey to answer all these questions systematically. We ran
+In this blog post we take a journey to answer all these questions systematically. We ran 90 experiments, generated over 1.1 trillion tokens and spent {'>'}74,000 GPU hours (~8.5 GPU years) for rephrasing alone to find the ideal settings for synthetic data.
 
 Here's the plan:
 <Sidenote>

@@ -38,7 +38,7 @@ We start with the [Infrastructure](#infrastructure) needed for synthetic data ge
 
 We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.
 
-Finally we present the suite of
+Finally we present the suite of 90 [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.
 
 Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
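The updated introduction equates more than 74,000 GPU hours with roughly 8.5 GPU years. A minimal sanity-check sketch of that conversion (plain Python; the hours-per-GPU-year constant is our assumption of one GPU running around the clock, 365 days a year):

```python
# Sanity check for the figures quoted in the updated introduction:
# >74,000 GPU hours for rephrasing alone, stated as ~8.5 GPU years.

HOURS_PER_GPU_YEAR = 24 * 365  # one GPU running 24/7 for a year (assumed)


def gpu_hours_to_years(gpu_hours: float) -> float:
    """Convert GPU hours into equivalent single-GPU years."""
    return gpu_hours / HOURS_PER_GPU_YEAR


years = gpu_hours_to_years(74_000)
print(f"74,000 GPU hours ≈ {years:.1f} GPU years")
```

This comes out to about 8.4 GPU years, consistent with the ~8.5 quoted in the text up to rounding.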
app/src/content/chapters/3-experiments.mdx
CHANGED

@@ -4,28 +4,20 @@ import Sidenote from "../../components/Sidenote.astro";
 import Glossary from "../../components/Glossary.astro";
 import FigRef from "../../components/FigRef.astro";
 
-{/* TODO: mention the currently running finephrase rephrasing with smollm2 */}
 {/* TODO: read through entire blog post and make improvements */}
-{/* TODO: potentially make a widget for data exploration: look at the same few samples generated by different models or transformed with different prompts */}
 {/* TODO: Integrate decay experiment as another analysis for proxy */}
 {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
 {/* TODO: brainstorm better banner, be artsy */}
-{/* TODO:
+{/* TODO: expected reading time in total and per chapter */}
 {/* TODO: run variance experiments with pretraining from scratch */}
-{/* TODO:
-{/* TODO: filter docs before/after rephrasing (non-mathematical document for math prompt) */}
-{/* TODO: try multiple rollouts and scoring */}
+{/* TODO: go through the blog post and update the scale numbers for finephrase dataset */}
 {/* TODO: banner idea: 1T tokens = 8M books
 5 cm per book = 400 km
 
 Then you could stack the books on top of each other and show the distance on a map, for example. Or compare it with something.
 Or make a dot for each book
 */}
-{/* TODO: improve the diagram for the infrastructure at the start of the section */}
 {/* TODO: final configuration for finephrase at the end of infra section: visualization of how many pages (500 tokens) (use page emojis flying from left to right) we can generate (real time), user can configure with a slider the number of GPUs */}
-{/* TODO: only explain datatrove additions when we need them (for generating the final finephrase) */}
-{/* TODO: move infrastructure section after analyses as precursor and explanation for finephrase */}
-{/* TODO: future work say we want to run larger ablations and mixture experiments in line with the recent smol-data release */}
 {/* TODO: baselines mixed with fw-edu-hq usually improve upon just baselines, but not sure if/how to present this */}
 
 {/*

@@ -62,6 +54,7 @@ We train on eight datasets under identical conditions and compare their final ev
       nemotron_hq_synth: "Nemotron-HQ-Synth",
       rewire: "REWIRE",
       synth_query_reasoning_answer: "SYNTH",
+      essentialweb_raw: "EssentialWeb",
       "ultra-fineweb": "Ultra-FineWeb"
     }
   }}
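The banner-idea TODO kept in this hunk rests on quick arithmetic (1T tokens ≈ 8M books; at 5 cm per book, a ~400 km stack). A back-of-the-envelope sketch; the 125,000 tokens-per-book figure is our assumption, chosen as the value that makes the two stated numbers consistent:

```python
# Back-of-the-envelope check for the banner idea in the TODO above.
# Assumption (not stated in the diff): ~125,000 tokens per book,
# which is what makes 1T tokens come out to 8M books.

TOKENS_TOTAL = 1_000_000_000_000  # 1T tokens
TOKENS_PER_BOOK = 125_000         # assumed average book length
BOOK_THICKNESS_M = 0.05           # 5 cm per book

books = TOKENS_TOTAL / TOKENS_PER_BOOK
stack_km = books * BOOK_THICKNESS_M / 1000

print(f"{books:,.0f} books, stacked ≈ {stack_km:,.0f} km")
```

With those assumptions the stack indeed works out to 8 million books and about 400 km.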
app/src/content/chapters/4-analyses.mdx
CHANGED

@@ -9,7 +9,7 @@ The experiments above tell us *what* works. Now we zoom out and ask *why*. We lo
 
 ### Is More Compute Worth It?
 
-GPU time across our
+GPU time across our 90 experiments varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. <FigRef target="cost-efficiency" /> plots each experiment's downstream performance against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations.
 
 **The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time decreasing performance.
 

@@ -20,7 +20,7 @@ GPU time across our 65 experiments varies by two orders of magnitude: the cheape
   id="cost-efficiency"
   src="cost-efficiency.html"
   data="rephrasing_metadata.json"
-  desc="GPU time (log scale) vs downstream performance for all
+  desc="GPU time (log scale) vs downstream performance for all 90 experiments. The dashed line shows the Pareto frontier of most efficient configurations. Hover over points for details."
 />
 </Wide>
 

@@ -28,19 +28,19 @@ The cheapest configurations still take over a week of GPU time, and we only know
 
 ### Can Quality Scores Predict Performance?
 
-The FineWeb-Edu-score and DCLM-score are effective quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and skip the train-then-evaluate loop entirely. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our
+The FineWeb-Edu-score and DCLM-score are effective quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and skip the train-then-evaluate loop entirely. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores] <FigRef target="score-correlation" /> shows the full correlation matrix.
 
 [^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.
 
-**DCLM-score is a moderate predictor of aggregate performance.** The DCLM-score difference (output minus input) shows the strongest correlation with `agg_score_macro` (ρ = 0.
+**DCLM-score is a moderate predictor of aggregate performance.** The DCLM-score difference (output minus input) shows the strongest correlation with `agg_score_macro` (ρ = 0.61, p {'<'} 0.001), followed by the output DCLM-score (ρ = 0.56). These are moderate correlations at best. The DCLM-score variants are particularly predictive for table understanding (ρ = 0.47–0.54) and reading comprehension (ρ = 0.49–0.52).
 
-**Edu-score tells a more nuanced story.** The input edu-score (the score of the original data before rephrasing) correlates with aggregate performance (ρ = 0.
+**Edu-score tells a more nuanced story.** The input edu-score (the score of the original data before rephrasing) correlates with aggregate performance (ρ = 0.27, p {'<'} 0.05), but the output edu-score (the score of the rephrased data) shows essentially no correlation (ρ = −0.08, not significant). Starting with higher-quality source data matters, but the edu-score of the synthetic output is not a reliable proxy at all.
 
 {/*
-**The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all
+**The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
 */}
 
-**Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.
+**Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **For synthetic data, there is no shortcut: you have to train models and evaluate them.**
 
 {/*
 Seven early runs have incorrect input quality scores due to a scoring pipeline bug and

@@ -53,7 +53,7 @@ article/commentary/discussion/tutorial-1b-hq, tutorial-12b-hq, faq-1b-lq, faq-12
   id="score-correlation"
   src="score-correlation.html"
   data="rephrasing_metadata.json"
-  desc="Spearman rank correlations between quality score metrics and downstream benchmark performance across
+  desc="Spearman rank correlations between quality score metrics and downstream benchmark performance across 83 rephrasing experiments. Blue cells indicate positive correlations, red cells negative. Significance: *** p<0.001, ** p<0.01, * p<0.05."
 />
 </Wide>
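The analyses in this file lean on Spearman rank correlations between quality-score metrics and downstream benchmarks. As a minimal from-scratch illustration of what that statistic measures (the actual analysis presumably uses a library implementation such as `scipy.stats.spearmanr`; this sketch is only illustrative):

```python
def rank(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation computed on ranks.

    It is sensitive only to monotonic association, which is why it
    suits noisy quality scores vs. benchmark results.
    """
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Perfectly monotonic even though nonlinear -> rho ≈ 1.0
print(spearman([1, 2, 3, 4], [1, 10, 100, 1000]))
```

Because rho only sees ranks, a score that merely orders datasets correctly would be a perfect predictor here; the moderate rho values reported in this chapter mean even the ordering is unreliable.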
app/src/content/chapters/7-conclusions.mdx
CHANGED

@@ -1,6 +1,6 @@
 ## Conclusions
 
-We ran
+We ran 90 experiments, generated over 1.1 trillion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there. A 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish: a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.
 
 ### Next Steps
 