joelniklaus (HF Staff) committed
Commit 1ef404f · 1 Parent(s): ae68e5f

updated texts based on new results

app/src/content/article.mdx CHANGED
@@ -1,9 +1,9 @@
  ---
- title: 'The Synthetic Data Playbook:<br/> Generating Billions of the Finest Tokens'
  subtitle: >-
    How to turn noisy web text into state-of-the-art pretraining data with the
    right prompts, models, and infrastructure
- description: 'The Synthetic Data Playbook: Generating Billions of the Finest Tokens'
  authors:
    - name: Joel Niklaus
 
  ---
+ title: 'The Synthetic Data Playbook:<br/> Generating Trillions of the Finest Tokens'
  subtitle: >-
    How to turn noisy web text into state-of-the-art pretraining data with the
    right prompts, models, and infrastructure
+ description: 'The Synthetic Data Playbook: Generating Trillions of the Finest Tokens'
  authors:
    - name: Joel Niklaus
app/src/content/chapters/1-introduction.mdx CHANGED
@@ -27,7 +27,7 @@ Synthetic data also plays a central role in post-training via *distillation*, wh

  However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?

- In this blog post we take a journey to answer all these questions systematically. We ran 65 experiments, generated over 750 billion tokens and spent {'>'}74,000 GPU hours (~8.5 GPU years) for rephrasing alone to find the ideal settings for synthetic data.

  Here's the plan:
  <Sidenote>
@@ -38,7 +38,7 @@ We start with the [Infrastructure](#infrastructure) needed for synthetic data ge

  We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.

- Finally we present the suite of 65 [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.

  Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
  However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?

+ In this blog post we take a journey to answer all these questions systematically. We ran 90 experiments, generated over 1.1 trillion tokens, and spent {'>'}74,000 GPU hours (~8.5 GPU years) for rephrasing alone to find the ideal settings for synthetic data.

  Here's the plan:
  <Sidenote>

  We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.

+ Finally, we present the suite of 90 [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.

  Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -4,28 +4,20 @@ import Sidenote from "../../components/Sidenote.astro";
  import Glossary from "../../components/Glossary.astro";
  import FigRef from "../../components/FigRef.astro";

- {/* TODO: mention the currently running finephrase rephrasing with smollm2 */}
  {/* TODO: read through entire blog post and make improvements */}
- {/* TODO: potentially make a widget for data exploration: look at the same few samples generated by different models or transformed with different prompts */}
  {/* TODO: Integrate decay experiment as another analysis for proxy */}
  {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
  {/* TODO: brainstorm better banner, be artsy */}
- {/* TODO: add essential web as baseline (raw) */}
  {/* TODO: run variance experiments with pretraining from scratch */}
- {/* TODO: run scaling experiments with longer pretraining phase */}
- {/* TODO: filter docs before/after rephrasing (non-mathematical document for math prompt) */}
- {/* TODO: try multiple rollouts and scoring */}
  {/* TODO: banner idea: 1T tokens = 8M books
  5 cm per book = 400 km

  Then you could stack the books on top of each other and show the distance on a map, for example. Or compare it with something.
  Or make a dot for each book.
  */}
- {/* TODO: improve the diagram for the infrastructure at the start of the section */}
  {/* TODO: final configuration for finephrase at the end of infra section: visualization of how many pages (500 tokens) (use page emojis flying from left to right) we can generate (real time), user can configure with a slider the number of GPUs */}
- {/* TODO: only explain datatrove additions when we need them (for generating the final finephrase) */}
- {/* TODO: move infrastructure section after analyses as precursor and explanation for finephrase */}
- {/* TODO: future work say we want to run larger ablations and mixture experiments in line with the recent smol-data release */}
  {/* TODO: baselines mixed with fw-edu-hq usually improve upon just baselines, but not sure if/how to present this */}

  {/*
@@ -62,6 +54,7 @@ We train on eight datasets under identical conditions and compare their final ev
  nemotron_hq_synth: "Nemotron-HQ-Synth",
  rewire: "REWIRE",
  synth_query_reasoning_answer: "SYNTH",
  "ultra-fineweb": "Ultra-FineWeb"
  }
  }}
 
  import Glossary from "../../components/Glossary.astro";
  import FigRef from "../../components/FigRef.astro";

  {/* TODO: read through entire blog post and make improvements */}
  {/* TODO: Integrate decay experiment as another analysis for proxy */}
  {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
  {/* TODO: brainstorm better banner, be artsy */}
+ {/* TODO: expected reading time in total and per chapter */}
  {/* TODO: run variance experiments with pretraining from scratch */}
+ {/* TODO: go through the blog post and update the scale numbers for finephrase dataset */}
  {/* TODO: banner idea: 1T tokens = 8M books
  5 cm per book = 400 km

  Then you could stack the books on top of each other and show the distance on a map, for example. Or compare it with something.
  Or make a dot for each book.
  */}
  {/* TODO: final configuration for finephrase at the end of infra section: visualization of how many pages (500 tokens) (use page emojis flying from left to right) we can generate (real time), user can configure with a slider the number of GPUs */}
  {/* TODO: baselines mixed with fw-edu-hq usually improve upon just baselines, but not sure if/how to present this */}

  {/*
  nemotron_hq_synth: "Nemotron-HQ-Synth",
  rewire: "REWIRE",
  synth_query_reasoning_answer: "SYNTH",
+ essentialweb_raw: "EssentialWeb",
  "ultra-fineweb": "Ultra-FineWeb"
  }
  }}
app/src/content/chapters/4-analyses.mdx CHANGED
@@ -9,7 +9,7 @@ The experiments above tell us *what* works. Now we zoom out and ask *why*. We lo

  ### Is More Compute Worth It?

- GPU time across our 65 experiments varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. <FigRef target="cost-efficiency" /> plots each experiment's downstream performance against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations.

  **The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time decreasing performance.

@@ -20,7 +20,7 @@ GPU time across our 65 experiments varies by two orders of magnitude: the cheape
  id="cost-efficiency"
  src="cost-efficiency.html"
  data="rephrasing_metadata.json"
- desc="GPU time (log scale) vs downstream performance for all 65 experiments. The dashed line shows the Pareto frontier of most efficient configurations. Hover over points for details."
  />
  </Wide>

@@ -28,19 +28,19 @@ The cheapest configurations still take over a week of GPU time, and we only know

  ### Can Quality Scores Predict Performance?

- The FineWeb-Edu-score and DCLM-score are effective quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and skip the train-then-evaluate loop entirely. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 65 experiments.[^broken-scores] <FigRef target="score-correlation" /> shows the full correlation matrix.

  [^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.

- **DCLM-score is a moderate predictor of aggregate performance.** The DCLM-score difference (output minus input) shows the strongest correlation with `agg_score_macro` (ρ = 0.60, p {'<'} 0.001), followed by the output DCLM-score (ρ = 0.55). These are moderate correlations at best. The DCLM-score variants are particularly predictive for table understanding (ρ = 0.51–0.58) and reading comprehension (ρ = 0.47–0.51).

- **Edu-score tells a more nuanced story.** The input edu-score (the score of the original data before rephrasing) correlates with aggregate performance (ρ = 0.35, p {'<'} 0.01), but the output edu-score (the score of the rephrased data) shows essentially no correlation (ρ = 0.04, not significant). Starting with higher-quality source data matters, but the edu-score of the synthetic output is not a reliable proxy at all.

  {/*
- **The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 65 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
  */}

- **Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.55–0.60) are still only moderate, explaining roughly 30% of the variance at best. **For synthetic data, there is no shortcut: you have to train models and evaluate them.**

  {/*
  Seven early runs have incorrect input quality scores due to a scoring pipeline bug and
@@ -53,7 +53,7 @@ article/commentary/discussion/tutorial-1b-hq, tutorial-12b-hq, faq-1b-lq, faq-12
  id="score-correlation"
  src="score-correlation.html"
  data="rephrasing_metadata.json"
- desc="Spearman rank correlations between quality score metrics and downstream benchmark performance across 65 rephrasing experiments. Blue cells indicate positive correlations, red cells negative. Significance: *** p<0.001, ** p<0.01, * p<0.05."
  />
  </Wide>

 
  ### Is More Compute Worth It?

+ GPU time across our 90 experiments varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. <FigRef target="cost-efficiency" /> plots each experiment's downstream performance against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations.

  **The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time decreasing performance.
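To make the Pareto-frontier idea in the figure concrete, here is a minimal sketch, with invented `(gpu_days, score)` points rather than the actual experiment data: after sorting runs by GPU cost, a run sits on the frontier only if it beats the best score among all cheaper runs.

```python
# Sketch of a Pareto frontier over (GPU cost, downstream score) points.
# A run is on the frontier if no other run is both cheaper and better.
def pareto_frontier(runs):
    """runs: list of (gpu_days, score) tuples. Returns non-dominated runs, cheapest first."""
    frontier = []
    for cost, score in sorted(runs):  # ascending GPU cost
        # Keep a run only if it improves on the best score seen among cheaper runs.
        if not frontier or score > frontier[-1][1]:
            frontier.append((cost, score))
    return frontier

# Hypothetical (gpu_days, agg_score_macro) points, for illustration only:
runs = [(8, 0.15), (40, 0.12), (15, 0.17), (450, 0.16)]
print(pareto_frontier(runs))  # -> [(8, 0.15), (15, 0.17)]
```

Here the expensive 450-GPU-day run is dominated: a 15-GPU-day run already scores higher, which mirrors the finding that scaling to larger rephrasing models buys cost without performance.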

  id="cost-efficiency"
  src="cost-efficiency.html"
  data="rephrasing_metadata.json"
+ desc="GPU time (log scale) vs downstream performance for all 90 experiments. The dashed line shows the Pareto frontier of most efficient configurations. Hover over points for details."
  />
  </Wide>

  ### Can Quality Scores Predict Performance?

+ The FineWeb-Edu-score and DCLM-score are effective quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and skip the train-then-evaluate loop entirely. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores] <FigRef target="score-correlation" /> shows the full correlation matrix.
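As a concrete illustration of the correlation analysis, here is a pure-Python sketch of Spearman's ρ for tie-free data (a real analysis would typically call `scipy.stats.spearmanr`; the per-experiment values below are invented, not the actual scores):

```python
# Spearman rank correlation for tie-free data:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the rank difference.
def spearman_rho(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank  # rank 1 = smallest value
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up per-experiment values: DCLM-score difference vs. aggregate benchmark score.
dclm_diff = [0.4, 0.1, 0.9, 0.3, 0.7]
agg_score = [0.12, 0.10, 0.17, 0.13, 0.15]
print(f"rho = {spearman_rho(dclm_diff, agg_score):.2f}")  # -> rho = 0.90
```

Because ρ only uses ranks, it captures monotone relationships without assuming linearity, which is why it is the natural choice for comparing quality scores against benchmark results.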

  [^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.

+ **DCLM-score is a moderate predictor of aggregate performance.** The DCLM-score difference (output minus input) shows the strongest correlation with `agg_score_macro` (ρ = 0.61, p {'<'} 0.001), followed by the output DCLM-score (ρ = 0.56). These are moderate correlations at best. The DCLM-score variants are particularly predictive for table understanding (ρ = 0.47–0.54) and reading comprehension (ρ = 0.49–0.52).

+ **Edu-score tells a more nuanced story.** The input edu-score (the score of the original data before rephrasing) correlates with aggregate performance (ρ = 0.27, p {'<'} 0.05), but the output edu-score (the score of the rephrased data) shows essentially no correlation (ρ = 0.08, not significant). Starting with higher-quality source data matters, but the edu-score of the synthetic output is not a reliable proxy at all.

  {/*
+ **The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
  */}

+ **Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **For synthetic data, there is no shortcut: you have to train models and evaluate them.**

  {/*
  Seven early runs have incorrect input quality scores due to a scoring pipeline bug and

  id="score-correlation"
  src="score-correlation.html"
  data="rephrasing_metadata.json"
+ desc="Spearman rank correlations between quality score metrics and downstream benchmark performance across 83 rephrasing experiments. Blue cells indicate positive correlations, red cells negative. Significance: *** p<0.001, ** p<0.01, * p<0.05."
  />
  </Wide>

app/src/content/chapters/7-conclusions.mdx CHANGED
@@ -1,6 +1,6 @@
  ## Conclusions

- We ran 65 experiments, generated over 750 billion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there. A 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish: a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.

  ### Next Steps
 
  ## Conclusions

+ We ran 90 experiments, generated over 1.1 trillion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there. A 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish: a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.

  ### Next Steps