joelniklaus HF Staff committed on
Commit 7aaccb8 · Parent(s): d1d52e1

move transition sentences after the charts

app/src/content/chapters/experiments.mdx CHANGED
@@ -26,8 +26,6 @@ We train on eight datasets under identical conditions and compare their final ev
26
 
27
  DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see [Baseline Comparison](#baselines-comparison)). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. <mark>TLDR: DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.</mark>
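The ranking above comes from averaging per-benchmark scores for each dataset. A minimal sketch of that comparison, with invented scores (the dataset names mirror the baselines, but the numbers are illustrative, not the actual results):

```python
# Illustrative sketch: rank baseline datasets by macro-averaged benchmark
# score. All scores below are made up for illustration.

def macro_average(scores: dict[str, float]) -> float:
    """Mean accuracy across benchmarks (equal weight per benchmark)."""
    return sum(scores.values()) / len(scores)

baselines = {
    "dclm":       {"mmlu": 0.32, "arc": 0.48, "squad": 0.41},
    "cosmopedia": {"mmlu": 0.29, "arc": 0.44, "squad": 0.35},
    "fw_edu_hq":  {"mmlu": 0.30, "arc": 0.46, "squad": 0.37},
}

ranking = sorted(baselines, key=lambda d: macro_average(baselines[d]), reverse=True)
print(ranking[0])  # the strongest baseline becomes the comparison target
```

With these toy numbers, `dclm` comes out on top, which is the role it plays for the rest of the experiments.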
28
 
29
- The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
30
-
31
  <HtmlEmbed
32
  id="baselines-comparison"
33
  src="d3-benchmark-comparison.html"
@@ -48,6 +46,8 @@ The synthetic baselines use different prompts internally. Which individual promp
48
  }}
49
  />
50
 
 
 
51
  #### Dissecting the Synthetic Baselines
52
 
53
  Prior synthetic datasets bundle multiple prompts together. We want to understand what makes them tick.
@@ -62,8 +62,6 @@ We don't have access to the final BeyondWeb dataset, so we reimplemented their [
62
 
63
  Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see [Dissecting Synthetic Baselines](#dissecting-baselines)). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. <mark>TLDR: Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.</mark>
64
 
65
- Can we design prompts that consistently beat DCLM?
66
-
67
  <HtmlEmbed
68
  id="dissecting-baselines"
69
  src="d3-benchmark-comparison.html"
@@ -87,6 +85,8 @@ Can we design prompts that consistently beat DCLM?
87
  }}
88
  />
89
 
 
 
90
  ### Which New Prompts Work Well?
91
 
92
  Since most existing prompts fail to beat DCLM, we designed new prompt formats targeting different skills. <mark>Can any of them outperform the baseline?</mark>
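Each of these "format prompts" wraps a source document in instructions asking the rephrasing model for a specific output structure. A minimal sketch of the idea, with hypothetical template wording (not the actual prompt text):

```python
# Sketch of a format prompt: wrap the source document in instructions that
# ask the rephrasing model for a specific output format. The template
# wording here is hypothetical, not the paper's actual tutorial prompt.

TUTORIAL_TEMPLATE = (
    "Rewrite the following web document as a step-by-step tutorial. "
    "Keep all factual content, add structure and worked examples.\n\n"
    "Document:\n{document}\n\nTutorial:"
)

def build_prompt(template: str, document: str) -> str:
    return template.format(document=document.strip())

prompt = build_prompt(TUTORIAL_TEMPLATE, "Photosynthesis converts light into ...")
# `prompt` is then sent to the rephrasing model (e.g. Gemma-3-1B).
```

Swapping the template is all it takes to move between the math, table, FAQ, and tutorial variants.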
@@ -95,8 +95,6 @@ We test seven novel prompts ([math](#math), [table](#table), [faq](#faq), [tutor
95
 
96
  Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) fall short (see [New Prompt Performance](#new-prompts)). The best-performing prompts all restructure the source content into pedagogically rich formats. <mark>TLDR: Math, table, FAQ, and tutorial prompts beat the DCLM baseline, while article, commentary, and discussion are at or below DCLM level.</mark>
97
 
98
- We used Gemma-3-1B for all experiments so far. Can we do even better by changing the rephrasing model?
99
-
100
  <HtmlEmbed
101
  id="new-prompts"
102
  src="d3-benchmark-comparison.html"
@@ -117,6 +115,8 @@ We used Gemma-3-1B for all experiments so far. Can we do even better by changing
117
  }}
118
  />
119
 
 
 
120
  ### Impact of the Rephrasing Model
121
 
122
  We want to know whether using a stronger model leads to better synthetic data. We look at this dimension from three angles: model size, model family, and model generation.
@@ -129,8 +129,6 @@ We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutoria
129
 
130
  The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see [Model Size](#model-size)). Even for the harder [math](#math) prompt, larger models do not help. <mark>TLDR: Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.</mark>
131
 
132
- On high-quality source data, we see no evidence that larger models help. But REWIRE claims large models are needed specifically for low-quality data. Does that claim hold?
133
-
134
  <HtmlEmbed
135
  id="model-size"
136
  src="d3-benchmark-comparison.html"
@@ -164,6 +162,8 @@ On high-quality source data, we see no evidence that larger models help. But REW
164
  }}
165
  />
166
 
 
 
167
  #### Do we need better models for rephrasing low-quality data?
168
 
169
  The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). <mark>Does this claim hold?</mark>
@@ -172,8 +172,6 @@ We compare 1B vs 12B models on HQ vs LQ source data across three prompts ([conti
172
 
173
  The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see [Model Size vs Data Quality](#size-quality)). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>
174
 
175
- Since model size barely matters, does the model family make a difference?
176
-
177
  <HtmlEmbed
178
  id="size-quality"
179
  src="d3-benchmark-comparison.html"
@@ -215,6 +213,8 @@ Since model size barely matters, does the model family make a difference?
215
  }}
216
  />
217
 
 
 
218
  #### Does the model family matter?
219
 
220
  Different model families may be better suited for rephrasing based on their training data. <mark>Do some families produce better synthetic data than others?</mark>
@@ -227,8 +227,6 @@ SmolLM2 consistently and clearly outperforms all others across all four prompts
227
  We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit rewrite tasks in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
228
  </Sidenote>
229
 
230
- SmolLM2 is already a year old. Are newer model generations better?
231
-
232
  <HtmlEmbed
233
  id="model-family"
234
  src="d3-benchmark-comparison.html"
@@ -288,6 +286,8 @@ SmolLM2 is already a year old. Are newer model generations better?
288
  }}
289
  />
290
 
 
 
291
  #### Does the model generation matter?
292
 
293
  We've seen that model family matters. But within a family, <mark>do newer versions produce better synthetic data?</mark>
@@ -295,6 +295,7 @@ We've seen that model family matters. But within a family, <mark>do newer versio
295
  We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt.
296
 
297
  While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see [Model Generation](#model-generation)). <mark>TLDR: Newer model generations tend to produce slightly better synthetic data.</mark>
 
298
  <HtmlEmbed
299
  id="model-generation"
300
  src="d3-benchmark-comparison.html"
@@ -311,6 +312,7 @@ While the differences are small, we find a consistent trend: newer versions lead
311
  }
312
  }}
313
  />
 
314
  <Note title="Summary: Impact of the Rephrasing Model" variant="info">
315
  **Model size**: 1B is sufficient. Larger models do not help.
316
  **Model family**: SmolLM2 dominates across all prompts.
@@ -332,8 +334,6 @@ We compare synthetic-only training vs mixed training (synthetic + source) for [t
332
 
333
  Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see [Is Synthetic Data Enough?](#synthetic-only)). Mixed training consistently improves over both the synthetic-only and original-data-only baselines. <mark>TLDR: Synthetic data alone is not enough. Mixing with original data consistently improves performance.</mark>
334
 
335
- So the mix-in dataset clearly matters. But how much does the specific choice of mix-in dataset affect performance?
336
-
337
  <HtmlEmbed
338
  id="synthetic-only"
339
  src="d3-benchmark-comparison.html"
@@ -365,15 +365,15 @@ So the mix-in dataset clearly matters. But how much does the specific choice of
365
  }}
366
  />
367
 
 
 
368
  #### Does the mix-in dataset matter?
369
 
370
  We just saw that mixing in original data is essential. <mark>How much does the choice of mix-in dataset affect performance?</mark>
371
 
372
  We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data.
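The mixing step itself can be sketched as filling a token budget partly from the synthetic dataset and partly from the mix-in dataset. The 50/50 split and the tiny budget below are illustrative, not the actual mixture weights:

```python
# Sketch of mixing: fill half the token budget from the synthetic dataset
# and half from the mix-in dataset. Ratio and budget are illustrative.

def take_tokens(docs: list[tuple[str, int]], budget: int) -> list[str]:
    """Greedily take (doc_id, n_tokens) pairs until the budget is spent."""
    out, used = [], 0
    for doc_id, n_tokens in docs:
        if used + n_tokens > budget:
            break
        out.append(doc_id)
        used += n_tokens
    return out

synthetic = [("syn_0", 400), ("syn_1", 300), ("syn_2", 500)]
mix_in    = [("dclm_0", 600), ("dclm_1", 350), ("dclm_2", 200)]

budget = 1_000  # tokens per side, i.e. a 2_000-token mixed corpus at 50/50
mixed = take_tokens(synthetic, budget) + take_tokens(mix_in, budget)
```

Changing only the `mix_in` list is what the four mix-in conditions in this experiment vary.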
373
 
374
- The differences are huge. DCLM and FineWeb-Edu-HQ vastly outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see [Mix-in Dataset Effect](#mixin-dataset)). <mark>TLDR: The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.</mark>
375
-
376
- The mix-in dataset matters enormously. But what about the source dataset we feed to the rephrasing model?
377
 
378
  <HtmlEmbed
379
  id="mixin-dataset"
@@ -411,6 +411,8 @@ The mix-in dataset matters enormously. But what about the source dataset we feed
411
  }}
412
  />
413
 
 
 
414
  #### Does the source dataset matter?
415
 
416
  We know the mix-in dataset is critical. <mark>But does the quality of the source documents we feed to the rephrasing model also matter?</mark>
@@ -419,8 +421,6 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
419
 
420
  When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see [Source Dataset: Mix-in = Source](#source-dataset-mixin-source)). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see [Source Dataset: Fixed Mix-in](#source-dataset-fixed-mixin)). This corroborates our finding that the mix-in matters much more than the source. <mark>TLDR: Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.</mark>
421
 
422
- We've seen that mixing matters and the mix-in dataset is key. Can we squeeze out more performance by increasing diversity in the synthetic portion?
423
-
424
  <HtmlEmbed
425
  id="source-dataset-mixin-source"
426
  src="d3-benchmark-comparison.html"
@@ -483,6 +483,8 @@ We've seen that mixing matters and the mix-in dataset is key. Can we squeeze out
483
  }}
484
  />
485
 
 
 
486
  #### Does increased diversity help?
487
 
488
  Given that mixing matters, a natural next step is to maximize diversity in the synthetic portion. <mark>Does combining multiple prompts or model families increase performance?</mark>
@@ -497,8 +499,6 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
497
 
498
  <mark>TLDR: At our 20B token scale, diversity does not compound. Mixing datasets averages rather than improves performance, though larger-scale experiments may tell a different story.</mark>
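"Averages rather than improves" means the mixture's score lands near the mean of its components instead of beating the best single prompt. A toy illustration with invented scores:

```python
# Toy illustration of "mixing averages rather than improves": a uniform
# mixture of prompt datasets tends to score near the mean of its
# components. The scores below are invented for illustration.

component_scores = {"tutorial": 0.42, "faq": 0.41, "table": 0.40, "math": 0.43}

expected_mixture = sum(component_scores.values()) / len(component_scores)
best_single = max(component_scores.values())

# Diversity would "compound" if the mixture beat the best single prompt;
# at this scale, the observation is that it does not.
print(f"mixture ~= {expected_mixture:.3f}, best single = {best_single:.2f}")
```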
499
 
500
- Let's turn to some unexpected findings from our experiments.
501
-
502
  <HtmlEmbed
503
  id="diversity"
504
  src="d3-benchmark-comparison.html"
@@ -509,11 +509,11 @@ Let's turn to some unexpected findings from our experiments.
509
  "Mixing Prompts": {
510
  datasetNames: {
511
  "mix-fw_edu_hq-tutorial_1b_hq-fw_edu_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts + FineWeb-Edu-HQ",
512
- "mix-fw_edu_hq-math_1b_hq": "Math",
513
  "mix-tutorial_1b_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts (No Source)",
514
- "mix-fw_edu_hq-table_1b_hq": "Table",
515
- "mix-fw_edu_hq-faq_1b_hq": "FAQ",
516
- "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
517
  dclm: "DCLM",
518
  fw_edu_hq: "FineWeb-Edu-HQ"
519
  }
@@ -545,6 +545,8 @@ Let's turn to some unexpected findings from our experiments.
545
  }}
546
  />
547
 
 
 
548
  ### Do Typos in the Prompt Hurt?
549
 
550
  The original REWIRE prompt contains many typos and grammar errors. <mark>Do these imperfections degrade the quality of the synthetic data?</mark>
@@ -553,8 +555,6 @@ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) aga
553
 
554
  Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see [Effect of Typos](#typos-effect)). <mark>TLDR: Typos in prompts do not hurt downstream performance.</mark>
555
 
556
- Our final experiment explores an even more counterintuitive finding.
557
-
558
  <HtmlEmbed
559
  id="typos-effect"
560
  src="d3-benchmark-comparison.html"
@@ -572,6 +572,8 @@ Our final experiment explores an even more counterintuitive finding.
572
  }}
573
  />
574
 
 
 
575
  {/*
576
 
577
  ### Does edu-score or DCLM-score predict model performance?
 
26
 
27
  DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see [Baseline Comparison](#baselines-comparison)). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. <mark>TLDR: DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.</mark>
28
 
 
 
29
  <HtmlEmbed
30
  id="baselines-comparison"
31
  src="d3-benchmark-comparison.html"
 
46
  }}
47
  />
48
 
49
+ The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
50
+
51
  #### Dissecting the Synthetic Baselines
52
 
53
  Prior synthetic datasets bundle multiple prompts together. We want to understand what makes them tick.
 
62
 
63
  Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see [Dissecting Synthetic Baselines](#dissecting-baselines)). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. <mark>TLDR: Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.</mark>
64
 
 
 
65
  <HtmlEmbed
66
  id="dissecting-baselines"
67
  src="d3-benchmark-comparison.html"
 
85
  }}
86
  />
87
 
88
+ Can we design prompts that consistently beat DCLM?
89
+
90
  ### Which New Prompts Work Well?
91
 
92
  Since most existing prompts fail to beat DCLM, we designed new prompt formats targeting different skills. <mark>Can any of them outperform the baseline?</mark>
 
95
 
96
  Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) fall short (see [New Prompt Performance](#new-prompts)). The best-performing prompts all restructure the source content into pedagogically rich formats. <mark>TLDR: Math, table, FAQ, and tutorial prompts beat the DCLM baseline, while article, commentary, and discussion are at or below DCLM level.</mark>
97
 
 
 
98
  <HtmlEmbed
99
  id="new-prompts"
100
  src="d3-benchmark-comparison.html"
 
115
  }}
116
  />
117
 
118
+ We used Gemma-3-1B for all experiments so far. Can we do even better by changing the rephrasing model?
119
+
120
  ### Impact of the Rephrasing Model
121
 
122
  We want to know whether using a stronger model leads to better synthetic data. We look at this dimension from three angles: model size, model family, and model generation.
 
129
 
130
  The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see [Model Size](#model-size)). Even for the harder [math](#math) prompt, larger models do not help. <mark>TLDR: Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.</mark>
131
 
 
 
132
  <HtmlEmbed
133
  id="model-size"
134
  src="d3-benchmark-comparison.html"
 
162
  }}
163
  />
164
 
165
+ On high-quality source data, we see no evidence that larger models help. But REWIRE claims large models are needed specifically for low-quality data. Does that claim hold?
166
+
167
  #### Do we need better models for rephrasing low-quality data?
168
 
169
  The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). <mark>Does this claim hold?</mark>
 
172
 
173
  The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see [Model Size vs Data Quality](#size-quality)). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>
174
 
 
 
175
  <HtmlEmbed
176
  id="size-quality"
177
  src="d3-benchmark-comparison.html"
 
213
  }}
214
  />
215
 
216
+ Since model size barely matters, does the model family make a difference?
217
+
218
  #### Does the model family matter?
219
 
220
  Different model families may be better suited for rephrasing based on their training data. <mark>Do some families produce better synthetic data than others?</mark>
 
227
  We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit rewrite tasks in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
228
  </Sidenote>
229
 
 
 
230
  <HtmlEmbed
231
  id="model-family"
232
  src="d3-benchmark-comparison.html"
 
286
  }}
287
  />
288
 
289
+ SmolLM2 is already a year old. Are newer model generations better?
290
+
291
  #### Does the model generation matter?
292
 
293
  We've seen that model family matters. But within a family, <mark>do newer versions produce better synthetic data?</mark>
 
295
  We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt.
296
 
297
  While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see [Model Generation](#model-generation)). <mark>TLDR: Newer model generations tend to produce slightly better synthetic data.</mark>
298
+
299
  <HtmlEmbed
300
  id="model-generation"
301
  src="d3-benchmark-comparison.html"
 
312
  }
313
  }}
314
  />
315
+
316
  <Note title="Summary: Impact of the Rephrasing Model" variant="info">
317
  **Model size**: 1B is sufficient. Larger models do not help.
318
  **Model family**: SmolLM2 dominates across all prompts.
 
334
 
335
  Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see [Is Synthetic Data Enough?](#synthetic-only)). Mixed training consistently improves over both the synthetic-only and original-data-only baselines. <mark>TLDR: Synthetic data alone is not enough. Mixing with original data consistently improves performance.</mark>
336
 
 
 
337
  <HtmlEmbed
338
  id="synthetic-only"
339
  src="d3-benchmark-comparison.html"
 
365
  }}
366
  />
367
 
368
+ So the mix-in dataset clearly matters. But how much does the specific choice of mix-in dataset affect performance?
369
+
370
  #### Does the mix-in dataset matter?
371
 
372
  We just saw that mixing in original data is essential. <mark>How much does the choice of mix-in dataset affect performance?</mark>
373
 
374
  We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data.
375
 
376
+ DCLM and FineWeb-Edu-HQ outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see [Mix-in Dataset Effect](#mixin-dataset)). <mark>TLDR: The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.</mark>
 
 
377
 
378
  <HtmlEmbed
379
  id="mixin-dataset"
 
411
  }}
412
  />
413
 
414
+ The mix-in dataset matters enormously. But what about the source dataset we feed to the rephrasing model?
415
+
416
  #### Does the source dataset matter?
417
 
418
  We know the mix-in dataset is critical. <mark>But does the quality of the source documents we feed to the rephrasing model also matter?</mark>
 
421
 
422
  When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see [Source Dataset: Mix-in = Source](#source-dataset-mixin-source)). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see [Source Dataset: Fixed Mix-in](#source-dataset-fixed-mixin)). This corroborates our finding that the mix-in matters much more than the source. <mark>TLDR: Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.</mark>
423
 
 
 
424
  <HtmlEmbed
425
  id="source-dataset-mixin-source"
426
  src="d3-benchmark-comparison.html"
 
483
  }}
484
  />
485
 
486
+ This is exciting because it shows the potential of upcycling low-quality data through rephrasing with format prompts. Can we squeeze out more performance by increasing diversity in the synthetic portion?
487
+
488
  #### Does increased diversity help?
489
 
490
  Given that mixing matters, a natural next step is to maximize diversity in the synthetic portion. <mark>Does combining multiple prompts or model families increase performance?</mark>
 
499
 
500
  <mark>TLDR: At our 20B token scale, diversity does not compound. Mixing datasets averages rather than improves performance, though larger-scale experiments may tell a different story.</mark>
501
 
 
 
502
  <HtmlEmbed
503
  id="diversity"
504
  src="d3-benchmark-comparison.html"
 
509
  "Mixing Prompts": {
510
  datasetNames: {
511
  "mix-fw_edu_hq-tutorial_1b_hq-fw_edu_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts + FineWeb-Edu-HQ",
512
+ "mix-fw_edu_hq-math_1b_hq": "Math + FineWeb-Edu-HQ",
513
  "mix-tutorial_1b_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts (No Source)",
514
+ "mix-fw_edu_hq-table_1b_hq": "Table + FineWeb-Edu-HQ",
515
+ "mix-fw_edu_hq-faq_1b_hq": "FAQ + FineWeb-Edu-HQ",
516
+ "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial + FineWeb-Edu-HQ",
517
  dclm: "DCLM",
518
  fw_edu_hq: "FineWeb-Edu-HQ"
519
  }
 
545
  }}
546
  />
547
 
548
+ Let's turn to some unexpected findings from our experiments.
549
+
550
  ### Do Typos in the Prompt Hurt?
551
 
552
  The original REWIRE prompt contains many typos and grammar errors. <mark>Do these imperfections degrade the quality of the synthetic data?</mark>
 
555
 
556
  Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see [Effect of Typos](#typos-effect)). <mark>TLDR: Typos in prompts do not hurt downstream performance.</mark>
557
 
 
 
558
  <HtmlEmbed
559
  id="typos-effect"
560
  src="d3-benchmark-comparison.html"
 
572
  }}
573
  />
574
 
575
+ Our final experiment explores an even more counterintuitive finding.
576
+
577
  {/*
578
 
579
  ### Does edu-score or DCLM-score predict model performance?