joelniklaus HF Staff committed on
Commit
77b87fe
1 Parent(s): e46c12a

made the writing style more engaging and improved transitions based on Lewis' feedback

app/src/content/chapters/1-introduction.mdx CHANGED
@@ -33,20 +33,16 @@ During SmolLM2 [@smollm2] training, the model was decent at coding and math but
33
 
34
 However, doing synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
35
 
36
- In this blog post we take a journey to answer all these questions systematically. We ran 90 experiments, generated over 1.1 trillion tokens and spent {'>'}74,000 GPU hours (~8.5 GPU years) for rephrasing alone to find the ideal settings for synthetic data.
37
 
38
  Here's the plan:
39
  <Sidenote>
40
  The sections are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
41
  </Sidenote>
42
 
43
- We start with the [Infrastructure](#infrastructure) needed for synthetic data generation at scale. This includes some extensions we made to the datatrove library and crucially detailed throughput benchmarking of popular models you might want to use for synthetic data generation. This is super important to get the most data for your bucks.
44
 
45
- We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.
46
-
47
- Finally we present the suite of 90 [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.
48
-
49
- Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
50
 
51
  <HtmlEmbed
52
  id="finephrase-vs-baselines"
 
33
 
34
 However, doing synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
35
 
36
+ In this blog post, we take a journey to answer all these questions systematically. We ran 90 experiments, generated over 1 trillion tokens, and spent {'>'}111,000 GPU hours (~12.7 GPU years) on rephrasing alone to find the ideal settings for synthetic data.
37
 
38
  Here's the plan:
39
  <Sidenote>
40
  The sections are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
41
  </Sidenote>
42
 
43
+ We start by [setting up the problem](#rephrasing-the-web): what rephrasing is, which approaches exist, and what we want to test. Then we dive into the 90 [Experiments](#experiments) we ran to figure out which prompts, models, and datasets actually work. The [Analyses](#analyses) section zooms out to ask *why* things work the way they do. Next comes the [Infrastructure](#infrastructure) that made all of this possible, including detailed throughput benchmarking of popular models (super important for getting the most data for your buck). Finally, we [put it all together](#applying-the-recipe-at-scale) into FinePhrase, our best configuration.
44
 
45
+ Here's a preview of where we end up: FinePhrase clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains the journey to get there.
 
 
 
 
46
 
47
  <HtmlEmbed
48
  id="finephrase-vs-baselines"
app/src/content/chapters/2-setup.mdx CHANGED
@@ -6,11 +6,11 @@ import ReadingTime from "../../components/ReadingTime.astro";
6
 
7
  <ReadingTime words={1243} visuals={0} />
8
 
9
- Recent work like WRAP [@wrap], Nemotron-CC [@nemotroncc], REWIRE [@rewire], and BeyondWeb [@beyondweb] has shown that rephrasing web content into higher-quality formats can outperform training on raw data alone. But the field still lacks a clear framework for what "rephrasing" actually means and a systematic investigation of which factors make it work. That's what we discuss in this section.
10
 
11
  ### What is Rephrasing?
12
 
13
- At its core, **rephrasing** means running existing documents through a language model to produce variants that keep the meaning but change the presentation. That sounds simple, but the design space is huge. A document could be reformatted as a tutorial with worked examples, restructured as FAQ pairs, expanded with explanatory commentary, condensed into knowledge lists, or rewritten in Wikipedia style. Each transformation targets different capabilities: tutorials may help step-by-step reasoning, FAQs might boost question-answering, and math reformulations could strengthen quantitative skills. Which transformations actually work, and when? That's what we set out to answer.
14
 
15
  ### Three Axes of Synthetic Data
16
 
@@ -22,24 +22,24 @@ We think about synthetic data generation along three axes:
22
 
23
  Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best rephrasing strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? When does aggressive transformation help versus hurt?
24
 
25
- ### Research Questions
26
 
27
- FinePhrase tackles these questions through systematic experimentation across all three axes:
28
 
29
  1. **Which rephrasing strategies work best?** We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations) to find which transformations consistently improve downstream performance.
30
  2. **How do generator model properties affect quality?** We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
31
  3. **When does source data quality matter?** We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
32
  4. **How do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
33
 
34
- ### Rephrasing Setup
35
 
36
- We rephrase documents using instruction-tuned models ranging from 270M to 27B parameters (primarily Gemma-3 [@gemma3] variants) on filtered web corpora including FineWeb-Edu [@fineweb] and DCLM [@datacomp], processing roughly 20B input tokens per quality tier. Our pipeline runs documents through customizable prompt templates that transform raw web text into structured formats (articles, tutorials, FAQs, discussions, commentaries) as well as distillation and continuation tasks inspired by prior work, producing between ~XB and XXB output tokens depending on the strategy.
37
 
38
  For inference we use vLLM [@vllm] with tensor parallelism, chunked prefill, and speculative decoding [@speculativedecoding] (n-gram prompt lookup with ~7 draft tokens, acceptance rates around 0.7). Every rephrased document gets scored by both the FineWeb-Edu classifier and the DCLM quality scorer, and we track token counts, quality score deltas, and metadata including thinking traces when available. The whole thing runs distributed across 100 parallel tasks on a SLURM cluster with checkpointing, targeting 10B tokens of synthetic data for downstream ablations.
39
 
40
  ### Source Datasets
41
 
42
- We compare against several baseline datasets for pretraining and data rephrasing. We use "source data" and "seed data" interchangeably throughout.
43
 
44
  <Accordion title="DCLM" open>
45
  A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM (DataComp-LM) enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens [@datacomp].
@@ -63,9 +63,9 @@ We compare against several baseline datasets for pretraining and data rephrasing
63
  A method for recycling the web with guided rewriting that enriches low-quality documents discarded by filtering pipelines. Mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales across 22 tasks [@rewire].
64
  </Accordion>
65
 
66
- ### Ablation Setup
67
 
68
- Following the ablation methodology from FineWeb [@fineweb], we train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
69
  <Sidenote>
70
  Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
71
  </Sidenote>
@@ -78,10 +78,12 @@ Since our model is small and trained on only 20B tokens, we use the **cloze form
78
  - **Table Understanding**: WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
79
 
80
 
81
- ### A Note on Model Collapse
82
 
83
- A common misconception is that any use of synthetic data inevitably degrades model performance. This stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
84
 
85
- In practice, nobody trains models this way. Real-world synthetic data pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically for specific purposes rather than replacing entire training corpora. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data.
86
 
87
  The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
 
 
 
6
 
7
  <ReadingTime words={1243} visuals={0} />
8
 
9
+ Several teams have already shown that rephrasing web content into cleaner formats can beat training on raw data: WRAP [@wrap] rewrites text in different styles, Nemotron-CC [@nemotroncc] extracts QA pairs and knowledge lists, REWIRE [@rewire] does guided rewriting, and BeyondWeb [@beyondweb] tries continuation and summarization. But nobody has done a systematic comparison across all these approaches, and the field still lacks a clear framework for what "rephrasing" even means. So let's fix that.
10
 
11
  ### What is Rephrasing?
12
 
13
+ **Rephrasing** means running existing documents through a language model to produce variants that keep the meaning but change the presentation. That sounds simple, but the design space is huge. A document could be reformatted as a tutorial with worked examples, restructured as FAQ pairs, expanded with explanatory commentary, condensed into knowledge lists, or rewritten in Wikipedia style. Each transformation targets different capabilities: tutorials may help step-by-step reasoning, FAQs might boost question-answering, and math reformulations could strengthen quantitative skills. Which transformations actually work, and when? That's what we set out to answer.
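
To make the mechanism concrete, here is a minimal sketch of what a rephrasing prompt template could look like. The wording is hypothetical, not one of the actual FinePhrase prompts:

```python
# Hypothetical rephrasing template; the real prompts differ in wording.
TUTORIAL_TEMPLATE = (
    "Rewrite the following web document as a step-by-step tutorial "
    "with worked examples. Preserve the facts; change only the presentation.\n"
    "\n"
    "Document:\n"
    "{document}"
)

def build_prompt(document: str, template: str = TUTORIAL_TEMPLATE) -> str:
    """Fill a rephrasing template with one source document."""
    return template.format(document=document)

prompt = build_prompt("Photosynthesis converts light energy into chemical energy.")
print(prompt)
```

Each transformation (FAQ, knowledge list, Wikipedia style, ...) is then just a different template applied to the same source documents.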
14
 
15
  ### Three Axes of Synthetic Data
16
 
 
22
 
23
  Prior work has explored these dimensions mostly in isolation. But their interactions are where the interesting questions live. Does the best rephrasing strategy depend on source quality? Can small models rephrase high-quality data effectively, or do you need bigger models to salvage noisy documents? When does aggressive transformation help versus hurt?
24
 
25
+ ### What We Want to Find Out
26
 
27
+ Here are the concrete questions we're trying to answer:
28
 
29
  1. **Which rephrasing strategies work best?** We compare prompts from prior work (REWIRE's guided rewriting, Nemotron's QA pairs and knowledge extraction) against novel formats (tutorials, FAQs, tables, math reformulations) to find which transformations consistently improve downstream performance.
30
  2. **How do generator model properties affect quality?** We test across model families (Gemma [@gemma3], Llama [@llama3], Qwen [@qwen3], Granite [@granite3], Falcon [@falcon3], SmolLM [@smollm2]), model generations (Qwen 1.5 [@qwen] through Qwen 3 [@qwen3]), and scales (270M to 27B parameters).
31
  3. **When does source data quality matter?** We rephrase both high-quality (FineWeb-Edu-HQ [@fineweb], DCLM [@datacomp]) and low-quality (FineWeb-Edu-LQ, Cosmopedia [@cosmopedia]) sources to test whether rephrasing recovers value from noisy documents or just amplifies existing quality differences.
32
  4. **How do synthetic and original data interact?** We compare synthetic-only training against mixing synthetic with original data, vary the choice of mix-in dataset, and test whether combining multiple prompts or model families increases diversity enough to replace original data entirely.
33
 
34
+ ### How We Run Rephrasing
35
 
36
+ In practice, we rephrase documents using instruction-tuned models ranging from 270M to 27B parameters (primarily Gemma-3 [@gemma3] variants) on filtered web corpora including FineWeb-Edu [@fineweb] and DCLM [@datacomp], processing roughly 20B input tokens per quality tier. Our pipeline runs documents through customizable prompt templates that transform raw web text into structured formats (articles, tutorials, FAQs, discussions, commentaries) as well as distillation and continuation tasks inspired by prior work.
37
 
38
  For inference we use vLLM [@vllm] with tensor parallelism, chunked prefill, and speculative decoding [@speculativedecoding] (n-gram prompt lookup with ~7 draft tokens, acceptance rates around 0.7). Every rephrased document gets scored by both the FineWeb-Edu classifier and the DCLM quality scorer, and we track token counts, quality score deltas, and metadata including thinking traces when available. The whole thing runs distributed across 100 parallel tasks on a SLURM cluster with checkpointing, targeting 10B tokens of synthetic data for downstream ablations.
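
N-gram prompt lookup works well here because rephrased output frequently copies spans of the source document verbatim, so draft tokens can be pulled from the prompt itself instead of a separate draft model. A toy sketch of the drafting step (illustrative, not vLLM's actual implementation):

```python
def ngram_lookup_draft(prompt_tokens, generated_tokens, ngram_size=2, num_draft=7):
    """Propose up to `num_draft` speculative tokens by matching the last
    n-gram of the generation against the prompt. The target model still
    verifies the drafts; only accepted tokens are kept."""
    if len(generated_tokens) < ngram_size:
        return []
    tail = generated_tokens[-ngram_size:]
    # Scan right-to-left so the most recent occurrence in the prompt wins.
    for start in range(len(prompt_tokens) - ngram_size, -1, -1):
        if prompt_tokens[start:start + ngram_size] == tail:
            cont = start + ngram_size
            return prompt_tokens[cont:cont + num_draft]
    return []

prompt = "the cat sat on the mat because the cat was tired".split()
generated = "we saw that the cat".split()
print(ngram_lookup_draft(prompt, generated))  # → ['was', 'tired']
```

When the generation reuses prompt phrasing often, many drafts are accepted, which is consistent with the ~0.7 acceptance rates reported above.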
39
 
40
  ### Source Datasets
41
 
42
+ Before diving into experiments, here's a quick overview of the datasets we compare against. We use "source data" and "seed data" interchangeably throughout.
43
 
44
  <Accordion title="DCLM" open>
45
  A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM (DataComp-LM) enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens [@datacomp].
 
63
  A method for recycling the web with guided rewriting that enriches low-quality documents discarded by filtering pipelines. Mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales across 22 tasks [@rewire].
64
  </Accordion>
65
 
66
+ ### How We Measure Success
67
 
68
+ To evaluate each configuration, we follow the ablation methodology from FineWeb [@fineweb]: train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) on 20B tokens and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
69
  <Sidenote>
70
  Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
71
  </Sidenote>
 
78
  - **Table Understanding**: WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
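
A rough sketch of what cloze-format scoring boils down to, with a toy stand-in for the model's log-probability function (the heuristic is purely illustrative):

```python
def cloze_pick(context, choices, logprob_fn):
    """Score each answer choice as a continuation of the context and
    return the highest-scoring one (next-token-prediction style)."""
    return max(choices, key=lambda c: logprob_fn(context, c))

def toy_logprob(context, choice):
    # Stand-in for summing token log-probs under a real model:
    # reward word overlap with the context, lightly penalise length.
    overlap = len(set(context.lower().split()) & set(choice.lower().split()))
    return overlap - 0.1 * len(choice.split())

ctx = "The capital of France is"
choices = ["Paris, the capital of France", "a large banana"]
print(cloze_pick(ctx, choices, toy_logprob))  # → 'Paris, the capital of France'
```

With a real model, `logprob_fn` would sum the log-probabilities of the choice's tokens conditioned on the context, which avoids relying on the model's multiple-choice formatting abilities.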
79
 
80
 
81
+ ### But Wait, What About Model Collapse?
82
 
83
+ You might be wondering: doesn't training on synthetic data inevitably lead to model collapse? This is a common misconception that stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
84
 
85
+ In practice, nobody trains models this way. Real-world pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically rather than replacing entire training corpora. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data.
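
A back-of-the-envelope sketch of that kind of mixing, with made-up fractions for illustration:

```python
def mix_corpus(synthetic_docs, human_docs, synth_fraction, total):
    """Build a training mix with a fixed synthetic fraction, rather than
    training on model outputs alone. The fraction here is illustrative."""
    n_synth = int(total * synth_fraction)
    n_human = total - n_synth
    if n_synth > len(synthetic_docs) or n_human > len(human_docs):
        raise ValueError("not enough documents for the requested mix")
    return synthetic_docs[:n_synth] + human_docs[:n_human]

mix = mix_corpus(["s"] * 100, ["h"] * 100, synth_fraction=0.5, total=20)
print(mix.count("s"), mix.count("h"))  # → 10 10
```

The human documents keep injecting fresh signal, which is exactly what the closed-loop collapse setting lacks.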
86
 
87
  The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
88
+
89
+ With all that context out of the way, let's get to the fun part: the experiments.
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -30,7 +30,7 @@ Notes:
30
 
31
  <ReadingTime words={2063} visuals={14} />
32
 
33
- With the infrastructure and setup in place, we now systematically work through our research questions. <FigRef target="experiment-overview" /> shows the full landscape of our experiments as a flow from source datasets through prompt strategies to model families. We start by benchmarking existing datasets and dissecting what makes their prompts tick. Then we test our own prompt designs, explore how the rephrasing model (size, family, generation) affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
34
 
35
  <HtmlEmbed
36
  id="experiment-overview"
@@ -41,7 +41,7 @@ With the infrastructure and setup in place, we now systematically work through o
41
 
42
  ### How Do Existing Datasets Compare?
43
 
44
- We train on eight datasets under identical conditions and compare their final evaluation performance. DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.
45
 
46
  <HtmlEmbed
47
  id="baselines-comparison"
@@ -62,11 +62,11 @@ We train on eight datasets under identical conditions and compare their final ev
62
  }}
63
  />
64
 
65
- The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
66
 
67
  #### Which Individual Prompts Match DCLM?
68
 
69
- We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source. Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.
70
 
71
  <Sidenote>
72
  The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
@@ -93,11 +93,11 @@ The BeyondWeb dataset was never released and the paper omits key details, yet cl
93
  }}
94
  />
95
 
96
- Can we design prompts that consistently beat DCLM?
97
 
98
  ### Can New Prompts Beat DCLM?
99
 
100
- Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), [faq](#faq), [math](#math), [narrative](#narrative), [table](#table), [tutorial](#tutorial)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), and [narrative](#narrative) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
101
 
102
  <HtmlEmbed
103
  id="new-prompts"
@@ -119,11 +119,11 @@ Since most existing prompts fail to beat DCLM, we designed nine novel prompt for
119
  }}
120
  />
121
 
122
- We used Gemma-3-1B for all experiments so far. Can we do even better by changing the rephrasing model?
123
 
124
  ### Impact of the Rephrasing Model
125
 
126
- We want to know whether using a stronger model leads to better synthetic data. We look at this dimension from three angles: model size, model family, and model generation.
127
 
128
  #### Does the model size matter?
129
 
@@ -132,7 +132,7 @@ For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1
132
  SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
133
  The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
134
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
135
- The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), larger models do not improve synthetic data quality.
136
 
137
  <Sidenote>
138
  It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
@@ -186,11 +186,11 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
186
  }}
187
  />
188
 
189
- On high-quality source data, we see no evidence that larger models help. But REWIRE claims large models are needed specifically for low-quality data. Does that claim hold?
190
 
191
  #### Do we need better models for rephrasing low-quality data?
192
 
193
- The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
194
 
195
  <HtmlEmbed
196
  id="size-quality"
@@ -238,11 +238,11 @@ The REWIRE [@rewire] paper claims that upcycling low-quality data requires large
238
  }}
239
  />
240
 
241
- Since model size barely matters, does the model family make a difference?
242
 
243
  #### Does the model family matter?
244
 
245
- We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all eight prompts (see <FigRef target="model-family" />).
246
 
247
  <Sidenote>
248
  We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
@@ -346,11 +346,14 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
346
  }}
347
  />
348
 
349
- SmolLM2 is already a year old. Are newer model generations better?
 
 
 
350
 
351
  #### Does the model generation matter?
352
 
353
- We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt. While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see <FigRef target="model-generation" />), with the gains accumulating from version 1.5 to 3.
354
 
355
  <HtmlEmbed
356
  id="model-generation"
@@ -374,15 +377,15 @@ We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and
374
  **Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
375
  </Note>
376
 
377
- We've explored the model dimension thoroughly. Now, what difference do the dataset choices make?
378
 
379
  ### Impact of the Dataset Choices
380
 
381
- So far we've always mixed synthetic data with a <Glossary term="source dataset" definition="The original dataset that gets rephrased by the language model to produce synthetic data." /> and a <Glossary term="mix-in dataset" definition="The non-synthetic dataset mixed with the rephrased data during training. This can be the same as or different from the source dataset." />. But what role do these different datasets play?
382
 
383
  #### Is synthetic data enough?
384
 
385
- We compare synthetic-only training vs mixed training (synthetic + source) for [faq](#faq) and [tutorial](#tutorial) prompts on DCLM and FineWeb-Edu-HQ sources. Synthetic-only training falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines.
386
 
387
  <HtmlEmbed
388
  id="synthetic-only"
@@ -412,11 +415,11 @@ We compare synthetic-only training vs mixed training (synthetic + source) for [f
412
  }}
413
  />
414
 
415
- So synthetic data alone does not seem to be enough. But how much does the specific choice of mix-in dataset affect performance?
416
 
417
  #### Does the mix-in dataset matter?
418
 
419
- We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data. DCLM outperforms other mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
420
 
421
  <HtmlEmbed
422
  id="mixin-dataset"
@@ -452,11 +455,11 @@ We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, th
452
  }}
453
  />
454
 
455
- The mix-in dataset matters enormously. But what about the source dataset we feed to the rephrasing model?
456
 
457
  #### Does the source dataset matter?
458
 
459
- We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [faq](#faq) and [tutorial](#tutorial) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.
460
 
461
  <HtmlEmbed
462
  id="source-dataset-mixin-source"
@@ -514,11 +517,11 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
514
  }}
515
  />
516
 
517
- This is exciting because it shows the potential of upcycling low-quality data through rephrasing with format prompts. Can we squeeze out more performance by increasing diversity in the synthetic portion?
518
 
519
  #### Does increased diversity help?
520
 
521
- We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies. None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds (see <FigRef target="diversity" />). However, our ablations train on only 20B tokens, so it is possible that diversity benefits only emerge at larger scales where the model can better exploit the varied signal.
522
 
523
  <Sidenote>
524
  Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
@@ -574,11 +577,11 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
574
  **Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
575
  </Note>
576
 
577
- We've covered prompts, models, and datasets. One last question: how sensitive is all of this to small details in the prompt itself?
578
 
579
  ### Do Typos in the Prompt Hurt?
580
 
581
- We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) against an [improved version](#guided_rewrite_improved), at both 1B and 12B scale. Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see <FigRef target="typos-effect" />).
582
 
583
  <HtmlEmbed
584
  id="typos-effect"
@@ -597,7 +600,7 @@ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) aga
597
 
598
  ### Takeaways
599
 
600
- Here are the key takeaways from our experiments:
601
 
602
  - **Q: How do existing datasets compare?**<br/>
603
  A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
@@ -624,4 +627,6 @@ Here are the key takeaways from our experiments:
624
  - **Q: Do typos in the prompt hurt?**<br/>
625
  A: No. Typos have no negative effect on downstream performance.
626
 
627
- So what actually matters? Prompt design, above all else. Structured formats like FAQ, Math, Table, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving. A 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
 
 
 
  <ReadingTime words={2063} visuals={14} />
32
 
33
+ Time to put all of this to the test. We ran 90 experiments to systematically answer our questions, and the journey took some unexpected turns. <FigRef target="experiment-overview" /> shows the full landscape: source datasets flowing through prompt strategies to model families. We start by seeing how existing datasets stack up, then dissect what makes their prompts tick. From there we design our own prompts, explore how the rephrasing model affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
34
 
35
  <HtmlEmbed
36
  id="experiment-overview"
 
41
 
42
  ### How Do Existing Datasets Compare?
43
 
44
+ First things first: where does the bar sit? We train on eight datasets under identical conditions and compare their final evaluation performance. DCLM, Nemotron-HQ-Synth, and REWIRE come out on top by a clear margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, SYNTH, and EssentialWeb, fall notably behind. DCLM is the strongest baseline and becomes our target to beat for everything that follows.
45
 
46
  <HtmlEmbed
47
  id="baselines-comparison"
 
62
  }}
63
  />
64
 
65
+ Nemotron-HQ-Synth and REWIRE are both mixes of several prompts. So what's actually doing the heavy lifting inside them?
66
 
67
  #### Which Individual Prompts Match DCLM?
68
 
69
+ We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source. Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline.
70
 
71
  <Sidenote>
72
  The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
 
93
  }}
94
  />
95
 
96
+ That's a pretty underwhelming hit rate. Can we do better with our own prompts?
97
 
98
  ### Can New Prompts Beat DCLM?
99
 
100
+ Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), [faq](#faq), [math](#math), [narrative](#narrative), [table](#table), [tutorial](#tutorial)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four of them ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) clearly outperform DCLM, while the other five sit at or below DCLM level (see <FigRef target="new-prompts" />). The winning prompts share a common trait: they all restructure the source content into pedagogically rich formats rather than just paraphrasing it.
101
 
102
  <HtmlEmbed
103
  id="new-prompts"
 
119
  }}
120
  />
121
 
122
+ So far we've been using Gemma-3-1B for everything. A natural question is: can we squeeze out more performance by throwing a bigger or better model at the problem?
123
 
124
  ### Impact of the Rephrasing Model
125
 
126
+ We look at this from three angles: model size, model family, and model generation.
127
 
128
  #### Does the model size matter?
129
 
 
132
  SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
133
  The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
134
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
135
+ The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
136
 
137
  <Sidenote>
138
  It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
 
186
  }}
187
  />
188
 
189
+ That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
190
 
191
  #### Do we need better models for rephrasing low-quality data?
192
 
193
+ REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data requires large models. We put this to the test by comparing 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
194
 
195
  <HtmlEmbed
196
  id="size-quality"
 
238
  }}
239
  />
240
 
241
+ So model size doesn't matter much. But what if you're using the wrong model family entirely?
242
 
243
  #### Does the model family matter?
244
 
245
+ We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts. The result here is striking: SmolLM2 consistently and clearly outperforms all others across every single prompt (see <FigRef target="model-family" />).
246
 
247
  <Sidenote>
248
  We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
 
346
  }}
347
  />
348
 
349
+ SmolLM2 is already over a year old at this point. If model quality matters, should we just wait for the next generation?
350
+ <Sidenote>
351
+ [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) was released during our experiments, but it is not compatible with the vLLM version we used for inference, and dependency hell prevented us from upgrading vLLM.
352
+ </Sidenote>
353
 
354
  #### Does the model generation matter?
355
 
356
+ We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt. The differences are small, but there is a consistent upward trend: newer versions lead to slightly higher evaluation performance (see <FigRef target="model-generation" />), with the gains accumulating from version 1.5 to 3.
357
 
358
  <HtmlEmbed
359
  id="model-generation"
 
377
  **Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
378
  </Note>
379
 
380
+ We've thoroughly explored the model dimension. The next obvious question: how much do the dataset choices matter?
381
 
382
  ### Impact of the Dataset Choices
383
 
384
+ So far we've always mixed synthetic data with a <Glossary term="source dataset" definition="The original dataset that gets rephrased by the language model to produce synthetic data." /> and a <Glossary term="mix-in dataset" definition="The non-synthetic dataset mixed with the rephrased data during training. This can be the same as or different from the source dataset." />. But do we even need the original data? And if so, which dataset should we mix in?
385
 
386
  #### Is synthetic data enough?
387
 
388
+ The dream scenario would be generating all your training data synthetically, no curation needed. We test this by comparing synthetic-only training vs mixed training (synthetic + source) for [faq](#faq) and [tutorial](#tutorial) prompts on DCLM and FineWeb-Edu-HQ sources. Unfortunately, synthetic-only training falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixing consistently improves over both the synthetic-only and original-data-only baselines.
389
 
390
  <HtmlEmbed
391
  id="synthetic-only"
 
415
  }}
416
  />
417
 
418
+ OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?
419
 
420
  #### Does the mix-in dataset matter?
421
 
422
+ We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data. DCLM outperforms other mix-in datasets across the board. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). This was one of our bigger surprises: the mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
423
 
424
  <HtmlEmbed
425
  id="mixin-dataset"
 
455
  }}
456
  />
457
 
458
+ If the mix-in dataset matters so much, what about the source dataset we're actually rephrasing?
459
 
460
  #### Does the source dataset matter?
461
 
462
+ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [faq](#faq) and [tutorial](#tutorial) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). This is exciting: it means you can rephrase even low-quality data and still get competitive results, as long as you pair it with a strong mix-in dataset.
463
 
464
  <HtmlEmbed
465
  id="source-dataset-mixin-source"
 
517
  }}
518
  />
519
 
520
+ That opens up a much larger pool of source data to draw from. But can we squeeze out even more performance by increasing diversity in the synthetic portion?
521
 
522
  #### Does increased diversity help?
523
 
524
+ We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies. None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds (see <FigRef target="diversity" />). This was a bit disappointing. That said, our ablations train on only 20B tokens, so diversity benefits may emerge at larger scales where the model can better exploit the varied signal.
525
 
526
  <Sidenote>
527
  Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
 
577
  **Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
578
  </Note>
579
 
580
+ We've covered prompts, models, and datasets. One last fun question: how sensitive is all of this to tiny details in the prompt itself?
581
 
582
  ### Do Typos in the Prompt Hurt?
583
 
584
+ While implementing the REWIRE prompt, we noticed it contained several typos and grammatical errors. So we cleaned it up and ran both versions. The result? Typos don't hurt at all. For the 1B model, the typo-laden [original](#guided_rewrite_original) actually performs slightly better than the [improved version](#guided_rewrite_improved) (see <FigRef target="typos-effect" />). So much for prompt polish.
585
 
586
  <HtmlEmbed
587
  id="typos-effect"
 
600
 
601
  ### Takeaways
602
 
603
+ Let's step back and summarize what we learned:
604
 
605
  - **Q: How do existing datasets compare?**<br/>
606
  A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
 
627
  - **Q: Do typos in the prompt hurt?**<br/>
628
  A: No. Typos have no negative effect on downstream performance.
629
 
630
+ So what actually matters? Prompt design, above all else. Structured formats like FAQ, Math, Table, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving: a 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
631
+
632
+ Now let's look more closely at *why* these things work the way they do.
app/src/content/chapters/4-analyses.mdx CHANGED
@@ -8,15 +8,15 @@ import ReadingTime from "../../components/ReadingTime.astro";
8
 
9
  <ReadingTime words={1433} visuals={6} />
10
 
11
- The experiments above tell us *what* works. Now we zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, what the rephrased outputs actually look like, and why a messier model sometimes wins.
12
 
13
  ### Is More Compute Worth It?
14
 
15
- GPU time across our 90 experiments varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. <FigRef target="cost-efficiency" /> plots each experiment's downstream performance against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations.
16
 
17
- **The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time decreasing performance.
18
 
19
- **For practitioners, the message is clear: invest in prompt design, not model size.** A well-chosen prompt on a 1B model will outperform a generic prompt on a 27B model at a tiny fraction of the cost. The only scenario where larger models might be justified is for complex prompts (like Guided Rewrite) that require more capable instruction following, but even there the gains are marginal.
20
 
21
  <Wide>
22
  <HtmlEmbed
@@ -27,11 +27,11 @@ GPU time across our 90 experiments varies by two orders of magnitude: the cheape
27
  />
28
  </Wide>
29
 
30
- The cheapest configurations still take over a week of GPU time, and we only know which ones work *after* rephrasing 10B tokens and then training the model. A cheap proxy metric that predicts downstream performance would let us fail fast and iterate on prompts without running the full pipeline each time. Can existing quality scores fill that role?
31
 
32
  ### Can Quality Scores Predict Performance?
33
 
34
- The FineWeb-Edu-score and DCLM-score are effective quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and skip the train-then-evaluate loop entirely. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores] <FigRef target="score-correlation" /> shows the full correlation matrix.
35
 
36
  [^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.
37
 
@@ -43,7 +43,7 @@ The FineWeb-Edu-score and DCLM-score are effective quality filters for human-wri
43
  **The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
44
  */}
45
 
46
- **Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **For synthetic data, there is no shortcut: you have to train models and evaluate them.**
47
 
48
  {/*
49
  Seven early runs have incorrect input quality scores due to a scoring pipeline bug and
@@ -75,7 +75,7 @@ The correlation matrix tells us that quality scores are weak predictors, but not
75
  />
76
  </Wide>
77
 
78
- If quality scores designed for filtering web data can't predict synthetic data performance, maybe looking at the outputs more directly can. Does the verbosity of the rephrasing model predict downstream performance?
79
 
80
  ### Do Chatty Models Make Better Data?
81
 
@@ -103,11 +103,11 @@ But does this variation actually affect downstream performance? Our prompts prod
103
  />
104
  </Wide>
105
 
106
- So output length doesn't predict quality. But output *diversity* might. We found a surprising case where a model that follows instructions poorly actually produces better training data.
107
 
108
  ### Math Rephrasing: When "Worse" Outputs Win
109
 
110
- We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
111
 
112
  **Qwen3 produced beautiful, structured outputs:**
113
 
@@ -156,7 +156,7 @@ SmolLM2's quality distribution was actually reasonable:
156
  | Partial | 30+ tokens but missing structure | 25% |
157
  | Poor | {'<'}30 tokens | 8% |
158
 
159
- For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
160
 
161
  <Note title="Summary: Analyses" variant="info">
162
  **Cost**: Small models with simple prompts dominate the Pareto frontier. Invest in prompt design, not model size.<br/>
@@ -164,3 +164,5 @@ For pretraining data, diversity beats consistency. Models that don't follow inst
164
  **Verbosity**: Output length has no meaningful relationship with performance. What matters is content, not compression ratio.<br/>
165
  **Diversity**: Template collapse hurts more than noisy outputs. A messier model that produces varied text can outperform a polished one that repeats the same template.
166
  </Note>
 
 
 
  <ReadingTime words={1433} visuals={6} />
10
 
11
+ The experiments tell us *what* works. Now let's zoom out and ask *why*. We look at the cost of running these experiments, whether cheap proxy metrics can replace expensive training runs, what the rephrased outputs actually look like, and why a messier model sometimes wins.
12
 
13
  ### Is More Compute Worth It?
14
 
15
+ Running 90 experiments is not cheap. GPU time varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. <FigRef target="cost-efficiency" /> plots each experiment's downstream performance against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations.
16
 
17
+ **The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time *decreasing* performance.
18
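The frontier itself is cheap to compute: sort runs by cost and keep each one that beats every cheaper run. A minimal sketch (the run data below is illustrative, not our actual results):

```python
def pareto_frontier(runs):
    """Keep runs not dominated by a cheaper-and-at-least-as-good run.

    runs: list of (gpu_hours, score) pairs; lower cost and higher score are better.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost; at equal cost, consider the higher score first.
    for cost, score in sorted(runs, key=lambda r: (r[0], -r[1])):
        if score > best_score:
            frontier.append((cost, score))
            best_score = score
    return frontier

# Illustrative (GPU-days, agg score) points only:
print(pareto_frontier([(8, 0.17), (50, 0.18), (100, 0.15), (450, 0.16)]))
# → [(8, 0.17), (50, 0.18)]
```

The expensive runs at (100, 0.15) and (450, 0.16) drop out because a cheaper run already scores higher, which is exactly the pattern we see with 12B and 27B rephrasers.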
 
19
+ **The message is clear: invest in prompt design, not model size.** A well-chosen prompt on a 1B model will outperform a generic prompt on a 27B model at a tiny fraction of the cost. The only scenario where larger models might be justified is for complex prompts (like Guided Rewrite) that require more capable instruction following, but even there the gains are marginal.
20
 
21
  <Wide>
22
  <HtmlEmbed
 
27
  />
28
  </Wide>
29
 
30
+ Even the cheapest configurations still take over a week of GPU time, and we only know which ones work *after* rephrasing 10B tokens and then training a model. Wouldn't it be nice if we could just score the rephrased outputs directly and skip the expensive train-then-evaluate loop?
31
 
32
  ### Can Quality Scores Predict Performance?
33
 
34
+ FineWeb-Edu-score and DCLM-score are great quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and iterate on prompts without running the full pipeline each time. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores] <FigRef target="score-correlation" /> shows the full correlation matrix.
35
 
36
  [^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.
37
 
 
43
  **The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
44
  */}
45
 
46
+ **Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **The bottom line: for synthetic data, there is no shortcut. You have to train models and evaluate them.**
47
 
48
  {/*
49
  Seven early runs have incorrect input quality scores due to a scoring pipeline bug and
 
75
  />
76
  </Wide>
77
 
78
+ So quality scores designed for filtering web data don't transfer to synthetic data. Maybe looking at the outputs more directly helps. For instance, does the length of the rephrased output tell us anything?
79
 
80
  ### Do Chatty Models Make Better Data?
81
 
 
103
  />
104
  </Wide>
105
 
106
+ So output length doesn't predict quality either. But we stumbled onto something more interesting while looking at output *diversity*: a case where a model that follows instructions poorly actually produces better training data.
107
 
108
  ### Math Rephrasing: When "Worse" Outputs Win
109
 
110
+ This was one of our most surprising findings. We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
111
 
112
  **Qwen3 produced beautiful, structured outputs:**
113
 
 
156
  | Partial | 30+ tokens but missing structure | 25% |
157
  | Poor | {'<'}30 tokens | 8% |
158
 
159
+ The lesson: for pretraining data, diversity beats consistency. A model that doesn't follow instructions perfectly can actually produce better training data than one that does. This also helps explain why SmolLM2 dominates the model family comparison: it produces more varied outputs, which may matter more than precise instruction following.
160
 
161
  <Note title="Summary: Analyses" variant="info">
162
  **Cost**: Small models with simple prompts dominate the Pareto frontier. Invest in prompt design, not model size.<br/>
 
164
  **Verbosity**: Output length has no meaningful relationship with performance. What matters is content, not compression ratio.<br/>
165
  **Diversity**: Template collapse hurts more than noisy outputs. A messier model that produces varied text can outperform a polished one that repeats the same template.
166
  </Note>
167
+
168
+ With the experiments and analyses behind us, let's talk about the infrastructure that made all of this possible.
app/src/content/chapters/5-infrastructure.mdx CHANGED
@@ -9,13 +9,13 @@ import ReadingTime from "../../components/ReadingTime.astro";
9
 
10
  <ReadingTime words={4780} visuals={9} />
11
 
12
- Each of our 90 experiments requires rephrasing around 10 billion tokens of web text. Even with KV caching, every output token still needs its own forward pass, and every web document has a few thousand tokens. With the wrong serving configuration, a single experiment can take weeks instead of days. Multiply that by 90 and the difference between a good and bad setup is months of GPU time.
13
 
14
- Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], the bottleneck isn't the generation itself but the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
15
 
16
  We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to handle this. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue. We used it for every experiment in this blog post, from 10k-example test runs to the full FinePhrase production pipeline.
17
 
18
- <FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's dive in!
19
 
20
  <HtmlEmbed
21
  id="datatrove-pipeline"
@@ -202,9 +202,9 @@ Need multiple samples per document? Set `rollouts_per_document` in your `Inferen
202
 
203
  ### Throughput Benchmarking
204
 
205
- For synthetic data generation, we may run language model inference for millions of GPU hours. Finding a configuration that maximizes throughput is critical: it can accelerate generation by days and save thousands of dollars. In this section, we describe our experiments to identify optimal parameters for a selection of popular models.
206
 
207
- The entire benchmarking code (experiment launcher, analysis scripts, and sample configs) is available as a [DataTrove inference benchmark example](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark).
208
 
209
  #### Benchmarking setup
210
 
@@ -260,7 +260,7 @@ Failure modes are automatically classified:
260
  - **timeout**: SLURM time limit exceeded (configuration too slow)
261
  - **server_fail**: vLLM server failed to start (e.g., engine core initialization failure, insufficient GPU memory for the model at the given tp)
262
 
263
- #### Scale of the experiment
264
 
265
  The benchmark config defines **801 unique configurations** across 8 experiment groups (18 models with ~23 configurations each via the tiered approach):
266
 
@@ -287,7 +287,7 @@ The benchmark config defines **801 unique configurations** across 8 experiment g
287
 
288
  #### What these numbers mean in practice
289
 
290
- Let's make this concrete. Each of our ablation experiments rephrases roughly 10 billion tokens. Consider [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), a single 10B-token experiment takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**, a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Over 90 experiments, that difference adds up to tens of thousands of GPU-hours and well over 100,000 USD.
291
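The arithmetic behind these figures is worth making explicit: GPU-hours scale linearly with token count and inversely with per-GPU throughput.

```python
def rephrase_cost(total_tokens, tokens_per_sec_per_gpu, usd_per_gpu_hour=3.0):
    """GPU-hours and USD to generate total_tokens at a given per-GPU throughput."""
    gpu_hours = total_tokens / tokens_per_sec_per_gpu / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

# gpt-oss-120b, 10B-token experiment (throughputs from our sweep):
print(rephrase_cost(10e9, 3_138))  # baseline tp=1: ~885 GPU-hours, ~2,656 USD
print(rephrase_cost(10e9, 6_117))  # optimized tp=2: ~454 GPU-hours, ~1,362 USD

# GPUs needed to sustain 1B tokens/hour at the optimized throughput:
print(1e9 / (6_117 * 3600))  # ~45 GPUs
```

Plugging in your own throughput numbers before launching a large generation run takes seconds and can flag a misconfigured setup before it burns weeks of GPU time.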
 
292
  These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
293
 
@@ -314,7 +314,7 @@ We also experimented with non-standard block sizes (not 16), fp8 kv-cache quanti
314
  </Sidenote>
315
 
316
 
317
- Before we look into the analysis in more detail, here is some background on memory-bound vs compute-bound inference and speculative decoding.
318
 
319
  <Accordion title="Background: Memory-bound vs compute-bound inference">
320
 
@@ -397,7 +397,7 @@ The fundamental insight is that **optimization gains depend on identifying the b
397
 
398
  #### Scaling to larger models
399
 
400
- The benchmarks above focus on maximizing tokens per second per GPU, which is exactly what you want when generating trillions of tokens for pretraining data. But for post-training, the picture looks different: you probably want bigger models to generate data for hard problems (reasoning, math, code), and you care less about the total number of tokens generated. Quality per token matters more than volume.
401
 
402
  For these use cases, DataTrove scales to models with hundreds of billions (or even a trillion) parameters via multi-node Slurm execution. Here's an example running [Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) [@kimik2] (1T total parameters, 32B active) on the [s1K dataset](https://huggingface.co/datasets/simplescaling/s1K-1.1) [@s1k] to generate solutions to math and reasoning problems:
403
 
@@ -432,7 +432,9 @@ Further improvement ideas:
432
  - Clean it up a bit to make it less cluttered
433
  */}
434
 
435
- To get an intuition for what these throughput numbers feel like, <FigRef target="inference-throughput" /> lets you pick a model and scale up the number of GPUs. Each page represents roughly 500 tokens of generated text. At high enough throughput, pages roll up into books (250 pages each), and books into bookshelves (250 books each).
 
 
436
 
437
  <Wide>
438
  <HtmlEmbed
 
9
 
10
  <ReadingTime words={4780} visuals={9} />
 
+ Each of our 90 experiments requires rephrasing around 10 billion tokens of web text. Even with KV caching, every output token still needs its own forward pass, and every web document has a few thousand tokens. With the wrong serving configuration, a single experiment takes weeks instead of days. Multiply that by 90, and the difference between a good and a bad setup is months of GPU time.
 
+ Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vllm) [@vllm] and [SGLang](https://github.com/sgl-project/sglang) [@sglang], raw generation speed is no longer the bottleneck. The hard part is the *infrastructure* around it: orchestrating thousands of prompts, keeping GPUs saturated, checkpointing outputs, and pushing everything to storage without losing progress when a worker crashes.
 
  We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to handle this. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue. We used it for every experiment in this blog post, from 10k-example test runs to the full FinePhrase production pipeline.
 
+ <FigRef target="datatrove-pipeline" /> gives an overview of the pipeline. Let's walk through it.
 
  <HtmlEmbed
  id="datatrove-pipeline"
 
  ### Throughput Benchmarking
 
+ With the pipeline in place, we turned to a question that can save (or waste) enormous amounts of money: how do you squeeze the most tokens per second out of each model? At the scale we're operating, even a 20% throughput improvement saves days of GPU time per experiment.
 
+ We ran a systematic benchmarking sweep across 18 models and open-sourced the entire setup (experiment launcher, analysis scripts, and sample configs) as a [DataTrove inference benchmark example](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark).
 
  #### Benchmarking setup
 
  - **timeout**: SLURM time limit exceeded (configuration too slow)
  - **server_fail**: vLLM server failed to start (e.g., engine core initialization failure, insufficient GPU memory for the model at the given tp)
 
+ #### Scale of the sweep
 
  The benchmark config defines **801 unique configurations** across 8 experiment groups (18 models with ~23 configurations each via the tiered approach):
 
  #### What these numbers mean in practice
 
+ Let's make this concrete with some back-of-the-envelope math. Each of our ablation experiments rephrases roughly 10 billion tokens. Consider [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), a strong MoE model that balances quality and throughput well. With the baseline vLLM configuration (tp=1, 3,138 tps/gpu), a single 10B-token experiment takes **885 GPU-hours** and costs roughly **2,656 USD** at 3 USD/H100-hour. With the optimized configuration (tp=2, 6,117 tps/gpu), it drops to **454 GPU-hours** and **1,362 USD**. That's a saving of **431 GPU-hours and ~1,300 USD** (49%) from nothing more than picking the right serving parameters. Over 90 experiments, that difference adds up to tens of thousands of GPU-hours and well over 100,000 USD.
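To make the arithmetic easy to reuse, here's a minimal sketch of the same back-of-the-envelope calculation (throughput numbers come from our sweep; the 3 USD/H100-hour rate is an assumed cloud price):

```python
# Back-of-the-envelope cost of one 10B-token rephrasing experiment with
# gpt-oss-120b. Throughputs come from the sweep above; the price is an assumption.
TOKENS = 10e9
USD_PER_H100_HOUR = 3.0  # assumed cloud rate

def experiment_cost(tps_per_gpu: float) -> tuple[float, float]:
    """Return (GPU-hours, USD) to generate TOKENS at a given per-GPU throughput."""
    gpu_hours = TOKENS / tps_per_gpu / 3600
    return gpu_hours, gpu_hours * USD_PER_H100_HOUR

baseline_h, baseline_usd = experiment_cost(3_138)    # tp=1: ~885 GPU-h, ~2,656 USD
optimized_h, optimized_usd = experiment_cost(6_117)  # tp=2: ~454 GPU-h, ~1,362 USD
print(f"saved {baseline_h - optimized_h:.0f} GPU-hours and "
      f"{baseline_usd - optimized_usd:.0f} USD")
```

Swap in your own throughput and price to see what a serving-configuration change is worth on your cluster.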
 
  These per-GPU numbers also answer a natural question: how many GPUs does it take to generate **a billion tokens per hour**? With the optimized configurations from our sweep:
 
 
  </Sidenote>
 
 
+ To understand why some models benefit more than others, let's briefly review the concepts of memory-bound vs compute-bound inference and speculative decoding.
 
  <Accordion title="Background: Memory-bound vs compute-bound inference">
 
 
  #### Scaling to larger models
 
+ Everything above focuses on maximizing tokens per second per GPU, which is exactly what you want when generating trillions of tokens for pretraining data. But for post-training, the picture is different: you probably want bigger models to generate data for hard problems (reasoning, math, code), and you care less about total volume. Quality per token matters more than throughput.
 
  For these use cases, DataTrove scales to models with hundreds of billions (or even a trillion) parameters via multi-node Slurm execution. Here's an example running [Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) [@kimik2] (1T total parameters, 32B active) on the [s1K dataset](https://huggingface.co/datasets/simplescaling/s1K-1.1) [@s1k] to generate solutions to math and reasoning problems:
 
 
  - Clean it up a bit to make it less cluttered
  */}
 
+ To get a feel for what these throughput numbers actually mean, <FigRef target="inference-throughput" /> lets you pick a model and scale up the number of GPUs. Each page represents roughly 500 tokens of generated text. At high enough throughput, pages roll up into books (250 pages each), and books into bookshelves (250 books each).
+
+ With all these infrastructure pieces in place, we have everything we need to build FinePhrase: the right prompts, the right model, and the machinery to run it all at scale.
 
  <Wide>
  <HtmlEmbed
app/src/content/chapters/6-finephrase.mdx CHANGED
@@ -11,15 +11,15 @@ import finephraseProgressImg from "../assets/image/finephrase-progress.png";
 
  <ReadingTime words={1693} visuals={10} />
 
- We ran 90 experiments to figure out what works. Now we apply those findings to build [FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase), a large-scale synthetic dataset that rephrases all XXX million documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-350BT) into four structured formats, producing XXX billion tokens of synthetic pretraining data.
 
- The recipe is simple: take the best model (SmolLM2-1.7B-Instruct), the best prompts (FAQ, Math, Table, Tutorial), the optimized inference settings from our throughput benchmarks, and the battle-tested DataTrove infrastructure. Launch 100 parallel Slurm workers, each running on a single H100 GPU with suffix-32 speculative decoding. Let it run for about two weeks.
 
  To get a sense of the scale: our infrastructure benchmarks showed that SmolLM2-1.7B-Instruct achieves ~9,200 tokens per second per GPU with suffix-32 speculative decoding. With 100 GPUs running in parallel, that is ~920,000 tokens per second, or about 3.3 billion tokens per hour. Rephrasing ~339 million documents four times (once per prompt) at an average of ~XXX tokens per document means roughly XXX trillion tokens of total generation. At our throughput rate, that takes approximately XXX GPU-days, or about XXX wall-clock days with 100 GPUs.
 
  ### The Recipe
 
- Every configuration choice traces back to a finding from our experiments or infrastructure benchmarks:
 
  - **Model**: [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), which dominated all other model families across every prompt in our [model family comparison](#does-the-model-family-matter)
  - **Prompts**: [FAQ](#faq), [Math](#math), [Table](#table), and [Tutorial](#tutorial), the four prompts that [consistently beat DCLM](#can-new-prompts-beat-dclm) in our experiments
@@ -106,7 +106,7 @@ datacard_pipeline = [
 
  ### Improvements to DataTrove
 
- Building FinePhrase was not just about running inference at scale. It required hardening DataTrove's inference pipeline to handle the realities of processing 339 million documents across 100 parallel workers over two weeks. Every failure mode you can imagine showed up: documents that crash the model, workers racing to commit to the same repo, Slurm jobs dying on startup, and caches corrupting under contention. We merged over a dozen PRs to make this work. Here are the most impactful ones.
 
  #### Graceful error handling for bad documents
 
@@ -126,7 +126,7 @@ The first version of `skip_bad_requests` had a subtle problem: skipped documents
 
  #### Hardening Hub uploads against transient failures
 
- With 100 workers writing to the same Hugging Face Hub repository, transient failures are not rare, they are guaranteed. We encountered three distinct failure modes and fixed each one:
 
  - **Commit races** ([PR #448](https://github.com/huggingface/datatrove/pull/448)): Two workers commit simultaneously and one gets `412 Precondition Failed` with "A commit has happened since." The fix adds retry logic with exponential backoff to the `DiskWriter`, which all Hub-writing paths go through.
  - **Transient server errors** ([PR #463](https://github.com/huggingface/datatrove/pull/463)): `503 Service Unavailable` and other transient API errors were not retried consistently. This PR normalizes retry logic across `DiskWriter` and `HuggingFaceDatasetWriter` so all transient errors are handled uniformly.
 
 
  <ReadingTime words={1693} visuals={10} />
 
+ With the experiments done and the infrastructure battle-tested, it's time to put everything together. We take our findings and build [FinePhrase](https://huggingface.co/datasets/HuggingFaceFW/finephrase), a large-scale synthetic dataset that rephrases 340 million documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-350BT) into four structured formats, producing XXX billion tokens of synthetic pretraining data.
 
+ The recipe writes itself from the experiments: take the best model (SmolLM2-1.7B-Instruct), the best prompts (FAQ, Math, Table, Tutorial), the optimized inference settings from our throughput benchmarks, and the DataTrove infrastructure. Launch 100 parallel Slurm workers, each running on a single H100 GPU with suffix-32 speculative decoding. Let it run for about two weeks on spare compute on our cluster.
 
  To get a sense of the scale: our infrastructure benchmarks showed that SmolLM2-1.7B-Instruct achieves ~9,200 tokens per second per GPU with suffix-32 speculative decoding. With 100 GPUs running in parallel, that is ~920,000 tokens per second, or about 3.3 billion tokens per hour. Rephrasing ~339 million documents four times (once per prompt) at an average of ~XXX tokens per document means roughly XXX trillion tokens of total generation. At our throughput rate, that takes approximately XXX GPU-days, or about XXX wall-clock days with 100 GPUs.
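As a quick sanity check on those aggregate numbers, the arithmetic is plain multiplication (a sketch using the benchmark figures above; the per-document token counts remain placeholders in the text):

```python
# Aggregate generation rate for the FinePhrase production run.
tps_per_gpu = 9_200  # SmolLM2-1.7B-Instruct with suffix-32 speculative decoding
n_gpus = 100

total_tps = tps_per_gpu * n_gpus      # ~920,000 tokens/second across the fleet
tokens_per_hour = total_tps * 3600    # ~3.3 billion tokens/hour
print(f"{tokens_per_hour / 1e9:.2f}B tokens/hour")  # prints "3.31B tokens/hour"
```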
 
  ### The Recipe
 
+ Every configuration choice traces directly back to a finding from our experiments or infrastructure benchmarks:
 
  - **Model**: [SmolLM2-1.7B-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), which dominated all other model families across every prompt in our [model family comparison](#does-the-model-family-matter)
  - **Prompts**: [FAQ](#faq), [Math](#math), [Table](#table), and [Tutorial](#tutorial), the four prompts that [consistently beat DCLM](#can-new-prompts-beat-dclm) in our experiments
 
 
  ### Improvements to DataTrove
 
+ Building FinePhrase wasn't just about running inference at scale. Processing 339 million documents across 100 parallel workers for two weeks stress-tests infrastructure in ways that small experiments never do. Every failure mode you can imagine showed up: documents that crash the model, workers racing to commit to the same repo, Slurm jobs dying on startup, and caches corrupting under contention. We merged over a dozen PRs to make this work. Here are the most impactful ones.
 
  #### Graceful error handling for bad documents
 
 
  #### Hardening Hub uploads against transient failures
 
+ With 100 workers writing to the same Hugging Face Hub repository, transient failures aren't rare; they're guaranteed. We encountered three distinct failure modes and fixed each one:
 
  - **Commit races** ([PR #448](https://github.com/huggingface/datatrove/pull/448)): Two workers commit simultaneously and one gets `412 Precondition Failed` with "A commit has happened since." The fix adds retry logic with exponential backoff to the `DiskWriter`, which all Hub-writing paths go through.
  - **Transient server errors** ([PR #463](https://github.com/huggingface/datatrove/pull/463)): `503 Service Unavailable` and other transient API errors were not retried consistently. This PR normalizes retry logic across `DiskWriter` and `HuggingFaceDatasetWriter` so all transient errors are handled uniformly.
app/src/content/chapters/7-conclusions.mdx CHANGED
@@ -4,13 +4,13 @@ import ReadingTime from "../../components/ReadingTime.astro";
 
  <ReadingTime words={624} visuals={0} />
 
- We ran 90 experiments, generated over 1.1 trillion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there. A 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish: a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.
 
- ### Next Steps
 
- The main bottleneck to scaling synthetic data experiments for pretraining is the compute cost of generating the data itself. For reference, producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down. **Diffusion language models** are promising: their parallel generation capabilities yield reported 210x inference speedups over autoregressive approaches. Recent models like [LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) show that diffusion LMs can match autoregressive models on standard benchmarks while generating tokens in parallel, and SGLang already supports serving them, but broader ecosystem support (e.g., vLLM) is still missing. DFlash [@dflash] could further speed up generation, though it is currently cumbersome to use and has limited model support. [Mercury 2](https://www.inceptionlabs.ai/blog/introducing-mercury-2) [@mercury2] pushes this further, reaching over 1,000 tokens per second on NVIDIA Blackwell GPUs through parallel refinement rather than sequential decoding, with 5x+ speedups over autoregressive baselines. On the autoregressive side, speculative decoding support in vLLM remains limited (e.g., draft models are not well supported), leaving significant inference speedups on the table.
 
- While we answered several questions about best practices for synthetic data generation in this work, many remain open:
 
  - **Data repetition**: Can you repeat data more often without performance loss if the repetitions are rephrased?
  - **Mixing ratio**: We mixed unrephrased source data with synthetic data at equal proportions. How little synthetic data can you get away with: 50%, 20%, 5%? What are the best data mixes for pretraining at scale?
@@ -21,4 +21,6 @@ While we answered several questions about best practices for synthetic data gene
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
  - **Longer pretraining**: Our ablations trained for 21B tokens. Do the same findings hold at 100B+ token scales, and do prompt rankings shift with longer training?
  - **Source filtering**: Should we filter documents before or after rephrasing? For instance, applying a math prompt to non-mathematical documents likely wastes compute and adds noise.
- - **Larger ablations and mixtures**: We want to run more extensive mixture experiments, exploring how synthetic data interacts with source data at scale, in line with the recent [smol-data](https://huggingface.co/spaces/HuggingFaceTB/smol-data) effort.
 
 
 
 
  <ReadingTime words={624} visuals={0} />
 
+ We ran 90 experiments, generated over 1 trillion tokens, and spent more than 111,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there: a 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. Template diversity matters more than template polish, and a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. There is no reliable proxy metric that can replace training and evaluating a model, so there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so you can build on these findings without reinventing the plumbing.
 
+ ### What's Next?
 
+ The biggest bottleneck to scaling synthetic data experiments is the compute cost of generation itself. Producing the 10B tokens with `Gemma-3-1B-IT` needed for a single ablation takes roughly 3,800 H100 GPU hours. Several avenues could bring this cost down significantly. **Diffusion language models** are promising: their parallel generation capabilities yield reported 2-10x inference speedups over autoregressive approaches. Models like [LLaDA2.1-flash](https://huggingface.co/inclusionAI/LLaDA2.1-flash) show that diffusion LMs can match autoregressive models on standard benchmarks while generating tokens in parallel, and SGLang already supports serving them, but broader ecosystem support (e.g., vLLM) is still missing. DFlash [@dflash] could further speed up generation, though it is currently cumbersome to use and has limited model support. [Mercury 2](https://www.inceptionlabs.ai/blog/introducing-mercury-2) [@mercury2] pushes this further, reaching over 1,000 tokens per second on NVIDIA Blackwell GPUs through parallel refinement rather than sequential decoding, with 5x+ speedups over autoregressive baselines. On the autoregressive side, speculative decoding support in vLLM remains limited (e.g., draft models are not well supported), leaving significant inference speedups on the table.
 
+ Beyond faster generation, we answered several questions about best practices, but many remain wide open:
 
  - **Data repetition**: Can you repeat data more often without performance loss if the repetitions are rephrased?
  - **Mixing ratio**: We mixed unrephrased source data with synthetic data at equal proportions. How little synthetic data can you get away with: 50%, 20%, 5%? What are the best data mixes for pretraining at scale?
 
  - **Automatic prompt optimization**: Does prompt optimization with tools like DSPy [@dspy] improve rephrasing performance?
  - **Longer pretraining**: Our ablations trained for 21B tokens. Do the same findings hold at 100B+ token scales, and do prompt rankings shift with longer training?
  - **Source filtering**: Should we filter documents before or after rephrasing? For instance, applying a math prompt to non-mathematical documents likely wastes compute and adds noise.
+ - **Larger ablations and mixtures**: We want to run more extensive mixture experiments, exploring how synthetic data interacts with source data at scale, in line with the recent [smol-data](https://huggingface.co/spaces/HuggingFaceTB/smol-data) effort.
+
+ The playbook is open. Build on it.