joelniklaus HF Staff committed on
Commit 7aaccb8 · Parent(s): d1d52e1

move transition sentences after the charts

app/src/content/chapters/experiments.mdx CHANGED
@@ -26,8 +26,6 @@ We train on eight datasets under identical conditions and compare their final ev
26
 
27
  DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see [Baseline Comparison](#baselines-comparison)). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. <mark>TLDR: DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.</mark>
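The ranking above comes from averaging per-benchmark scores for each dataset. A minimal sketch of that comparison, with invented scores (the dataset names mirror the baselines, but the numbers are illustrative, not the actual results):

```python
# Illustrative sketch: rank baseline datasets by macro-averaged benchmark
# score. All scores below are made up for illustration.

def macro_average(scores: dict[str, float]) -> float:
    """Mean accuracy across benchmarks (equal weight per benchmark)."""
    return sum(scores.values()) / len(scores)

baselines = {
    "dclm":       {"mmlu": 0.32, "arc": 0.48, "squad": 0.41},
    "cosmopedia": {"mmlu": 0.29, "arc": 0.44, "squad": 0.35},
    "fw_edu_hq":  {"mmlu": 0.30, "arc": 0.46, "squad": 0.37},
}

ranking = sorted(baselines, key=lambda d: macro_average(baselines[d]), reverse=True)
print(ranking[0])  # the strongest baseline becomes the comparison target
```

With these toy numbers, `dclm` comes out on top, which is the role it plays for the rest of the experiments.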
28
 
29
- The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
30
-
31
  <HtmlEmbed
32
  id="baselines-comparison"
33
  src="d3-benchmark-comparison.html"
@@ -48,6 +46,8 @@ The synthetic baselines use different prompts internally. Which individual promp
48
  }}
49
  />
50
 
 
 
51
  #### Dissecting the Synthetic Baselines
52
 
53
  Prior synthetic datasets bundle multiple prompts together. We want to understand what makes them tick.
@@ -62,8 +62,6 @@ We don't have access to the final BeyondWeb dataset, so we reimplemented their [
62
 
63
  Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see [Dissecting Synthetic Baselines](#dissecting-baselines)). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. <mark>TLDR: Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.</mark>
64
 
65
- Can we design prompts that consistently beat DCLM?
66
-
67
  <HtmlEmbed
68
  id="dissecting-baselines"
69
  src="d3-benchmark-comparison.html"
@@ -87,6 +85,8 @@ Can we design prompts that consistently beat DCLM?
87
  }}
88
  />
89
 
 
 
90
  ### Which New Prompts Work Well?
91
 
92
  Since most existing prompts fail to beat DCLM, we designed new prompt formats targeting different skills. <mark>Can any of them outperform the baseline?</mark>
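Each of these "format prompts" wraps a source document in instructions asking the rephrasing model for a specific output structure. A minimal sketch of the idea, with hypothetical template wording (not the actual prompt text):

```python
# Sketch of a format prompt: wrap the source document in instructions that
# ask the rephrasing model for a specific output format. The template
# wording here is hypothetical, not the paper's actual tutorial prompt.

TUTORIAL_TEMPLATE = (
    "Rewrite the following web document as a step-by-step tutorial. "
    "Keep all factual content, add structure and worked examples.\n\n"
    "Document:\n{document}\n\nTutorial:"
)

def build_prompt(template: str, document: str) -> str:
    return template.format(document=document.strip())

prompt = build_prompt(TUTORIAL_TEMPLATE, "Photosynthesis converts light into ...")
# `prompt` is then sent to the rephrasing model (e.g. Gemma-3-1B).
```

Swapping the template is all it takes to move between the math, table, FAQ, and tutorial variants.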
@@ -95,8 +95,6 @@ We test seven novel prompts ([math](#math), [table](#table), [faq](#faq), [tutor
95
 
96
  Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) fall short (see [New Prompt Performance](#new-prompts)). The best-performing prompts all restructure the source content into pedagogically rich formats. <mark>TLDR: Math, table, FAQ, and tutorial prompts beat the DCLM baseline, while article, commentary, and discussion are at or below DCLM level.</mark>
97
 
98
- We used Gemma-3-1B for all experiments so far. Can we do even better by changing the rephrasing model?
99
-
100
  <HtmlEmbed
101
  id="new-prompts"
102
  src="d3-benchmark-comparison.html"
@@ -117,6 +115,8 @@ We used Gemma-3-1B for all experiments so far. Can we do even better by changing
117
  }}
118
  />
119
 
 
 
120
  ### Impact of the Rephrasing Model
121
 
122
  We want to know whether using a stronger model leads to better synthetic data. We look at this dimension from three angles: model size, model family, and model generation.
@@ -129,8 +129,6 @@ We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutoria
129
 
130
  The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see [Model Size](#model-size)). Even for the harder [math](#math) prompt, larger models do not help. <mark>TLDR: Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.</mark>
131
 
132
- On high-quality source data, we see no evidence that larger models help. But REWIRE claims large models are needed specifically for low-quality data. Does that claim hold?
133
-
134
  <HtmlEmbed
135
  id="model-size"
136
  src="d3-benchmark-comparison.html"
@@ -164,6 +162,8 @@ On high-quality source data, we see no evidence that larger models help. But REW
164
  }}
165
  />
166
 
 
 
167
  #### Do we need better models for rephrasing low-quality data?
168
 
169
  The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). <mark>Does this claim hold?</mark>
@@ -172,8 +172,6 @@ We compare 1B vs 12B models on HQ vs LQ source data across three prompts ([conti
172
 
173
  The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see [Model Size vs Data Quality](#size-quality)). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>
174
 
175
- Since model size barely matters, does the model family make a difference?
176
-
177
  <HtmlEmbed
178
  id="size-quality"
179
  src="d3-benchmark-comparison.html"
@@ -215,6 +213,8 @@ Since model size barely matters, does the model family make a difference?
215
  }}
216
  />
217
 
 
 
218
  #### Does the model family matter?
219
 
220
  Different model families may be better suited for rephrasing based on their training data. <mark>Do some families produce better synthetic data than others?</mark>
@@ -227,8 +227,6 @@ SmolLM2 consistently and clearly outperforms all others across all four prompts
227
  We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit rewrite tasks in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
228
  </Sidenote>
229
 
230
- SmolLM2 is already a year old. Are newer model generations better?
231
-
232
  <HtmlEmbed
233
  id="model-family"
234
  src="d3-benchmark-comparison.html"
@@ -288,6 +286,8 @@ SmolLM2 is already a year old. Are newer model generations better?
288
  }}
289
  />
290
 
 
 
291
  #### Does the model generation matter?
292
 
293
  We've seen that model family matters. But within a family, <mark>do newer versions produce better synthetic data?</mark>
@@ -295,6 +295,7 @@ We've seen that model family matters. But within a family, <mark>do newer versio
295
  We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt.
296
 
297
  While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see [Model Generation](#model-generation)). <mark>TLDR: Newer model generations tend to produce slightly better synthetic data.</mark>
 
298
  <HtmlEmbed
299
  id="model-generation"
300
  src="d3-benchmark-comparison.html"
@@ -311,6 +312,7 @@ While the differences are small, we find a consistent trend: newer versions lead
311
  }
312
  }}
313
  />
 
314
  <Note title="Summary: Impact of the Rephrasing Model" variant="info">
315
  **Model size**: 1B is sufficient. Larger models do not help.
316
  **Model family**: SmolLM2 dominates across all prompts.
@@ -332,8 +334,6 @@ We compare synthetic-only training vs mixed training (synthetic + source) for [t
332
 
333
  Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see [Is Synthetic Data Enough?](#synthetic-only)). Mixed training consistently improves over both the synthetic-only and original-data-only baselines. <mark>TLDR: Synthetic data alone is not enough. Mixing with original data consistently improves performance.</mark>
334
 
335
- So the mix-in dataset clearly matters. But how much does the specific choice of mix-in dataset affect performance?
336
-
337
  <HtmlEmbed
338
  id="synthetic-only"
339
  src="d3-benchmark-comparison.html"
@@ -365,15 +365,15 @@ So the mix-in dataset clearly matters. But how much does the specific choice of
365
  }}
366
  />
367
 
 
 
368
  #### Does the mix-in dataset matter?
369
 
370
  We just saw that mixing in original data is essential. <mark>How much does the choice of mix-in dataset affect performance?</mark>
371
 
372
  We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data.
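The mixing step itself can be sketched as filling a token budget partly from the synthetic dataset and partly from the mix-in dataset. The 50/50 split and the tiny budget below are illustrative, not the actual mixture weights:

```python
# Sketch of mixing: fill half the token budget from the synthetic dataset
# and half from the mix-in dataset. Ratio and budget are illustrative.

def take_tokens(docs: list[tuple[str, int]], budget: int) -> list[str]:
    """Greedily take (doc_id, n_tokens) pairs until the budget is spent."""
    out, used = [], 0
    for doc_id, n_tokens in docs:
        if used + n_tokens > budget:
            break
        out.append(doc_id)
        used += n_tokens
    return out

synthetic = [("syn_0", 400), ("syn_1", 300), ("syn_2", 500)]
mix_in    = [("dclm_0", 600), ("dclm_1", 350), ("dclm_2", 200)]

budget = 1_000  # tokens per side, i.e. a 2_000-token mixed corpus at 50/50
mixed = take_tokens(synthetic, budget) + take_tokens(mix_in, budget)
```

Changing only the `mix_in` list is what the four mix-in conditions in this experiment vary.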
373
 
374
- The differences are huge. DCLM and FineWeb-Edu-HQ vastly outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see [Mix-in Dataset Effect](#mixin-dataset)). <mark>TLDR: The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.</mark>
375
-
376
- The mix-in dataset matters enormously. But what about the source dataset we feed to the rephrasing model?
377
 
378
  <HtmlEmbed
379
  id="mixin-dataset"
@@ -411,6 +411,8 @@ The mix-in dataset matters enormously. But what about the source dataset we feed
411
  }}
412
  />
413
 
 
 
414
  #### Does the source dataset matter?
415
 
416
  We know the mix-in dataset is critical. <mark>But does the quality of the source documents we feed to the rephrasing model also matter?</mark>
@@ -419,8 +421,6 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
419
 
420
  When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see [Source Dataset: Mix-in = Source](#source-dataset-mixin-source)). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see [Source Dataset: Fixed Mix-in](#source-dataset-fixed-mixin)). This corroborates our finding that the mix-in matters much more than the source. <mark>TLDR: Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.</mark>
421
 
422
- We've seen that mixing matters and the mix-in dataset is key. Can we squeeze out more performance by increasing diversity in the synthetic portion?
423
-
424
  <HtmlEmbed
425
  id="source-dataset-mixin-source"
426
  src="d3-benchmark-comparison.html"
@@ -483,6 +483,8 @@ We've seen that mixing matters and the mix-in dataset is key. Can we squeeze out
483
  }}
484
  />
485
 
 
 
486
  #### Does increased diversity help?
487
 
488
  Given that mixing matters, a natural next step is to maximize diversity in the synthetic portion. <mark>Does combining multiple prompts or model families increase performance?</mark>
@@ -497,8 +499,6 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
497
 
498
  <mark>TLDR: At our 20B token scale, diversity does not compound. Mixing datasets averages rather than improves performance, though larger-scale experiments may tell a different story.</mark>
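"Averages rather than improves" means the mixture's score lands near the mean of its components instead of beating the best single prompt. A toy illustration with invented scores:

```python
# Toy illustration of "mixing averages rather than improves": a uniform
# mixture of prompt datasets tends to score near the mean of its
# components. The scores below are invented for illustration.

component_scores = {"tutorial": 0.42, "faq": 0.41, "table": 0.40, "math": 0.43}

expected_mixture = sum(component_scores.values()) / len(component_scores)
best_single = max(component_scores.values())

# Diversity would "compound" if the mixture beat the best single prompt;
# at this scale, the observation is that it does not.
print(f"mixture ~= {expected_mixture:.3f}, best single = {best_single:.2f}")
```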
499
 
500
- Let's turn to some unexpected findings from our experiments.
501
-
502
  <HtmlEmbed
503
  id="diversity"
504
  src="d3-benchmark-comparison.html"
@@ -509,11 +509,11 @@ Let's turn to some unexpected findings from our experiments.
509
  "Mixing Prompts": {
510
  datasetNames: {
511
  "mix-fw_edu_hq-tutorial_1b_hq-fw_edu_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts + FineWeb-Edu-HQ",
512
- "mix-fw_edu_hq-math_1b_hq": "Math",
513
  "mix-tutorial_1b_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts (No Source)",
514
- "mix-fw_edu_hq-table_1b_hq": "Table",
515
- "mix-fw_edu_hq-faq_1b_hq": "FAQ",
516
- "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
517
  dclm: "DCLM",
518
  fw_edu_hq: "FineWeb-Edu-HQ"
519
  }
@@ -545,6 +545,8 @@ Let's turn to some unexpected findings from our experiments.
545
  }}
546
  />
547
 
 
 
548
  ### Do Typos in the Prompt Hurt?
549
 
550
  The original REWIRE prompt contains many typos and grammar errors. <mark>Do these imperfections degrade the quality of the synthetic data?</mark>
@@ -553,8 +555,6 @@ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) aga
553
 
554
  Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see [Effect of Typos](#typos-effect)). <mark>TLDR: Typos in prompts do not hurt downstream performance.</mark>
555
 
556
- Our final experiment explores an even more counterintuitive finding.
557
-
558
  <HtmlEmbed
559
  id="typos-effect"
560
  src="d3-benchmark-comparison.html"
@@ -572,6 +572,8 @@ Our final experiment explores an even more counterintuitive finding.
572
  }}
573
  />
574
 
 
 
575
  {/*
576
 
577
  ### Does edu-score or DCLM-score predict model performance?
 
26
 
27
  DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see [Baseline Comparison](#baselines-comparison)). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. <mark>TLDR: DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.</mark>
28
 
 
 
29
  <HtmlEmbed
30
  id="baselines-comparison"
31
  src="d3-benchmark-comparison.html"
 
46
  }}
47
  />
48
 
49
+ The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
50
+
51
  #### Dissecting the Synthetic Baselines
52
 
53
  Prior synthetic datasets bundle multiple prompts together. We want to understand what makes them tick.
 
62
 
63
  Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see [Dissecting Synthetic Baselines](#dissecting-baselines)). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. <mark>TLDR: Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.</mark>
64
 
 
 
65
  <HtmlEmbed
66
  id="dissecting-baselines"
67
  src="d3-benchmark-comparison.html"
 
85
  }}
86
  />
87
 
88
+ Can we design prompts that consistently beat DCLM?
89
+
90
  ### Which New Prompts Work Well?
91
 
92
  Since most existing prompts fail to beat DCLM, we designed new prompt formats targeting different skills. <mark>Can any of them outperform the baseline?</mark>
 
95
 
96
  Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) fall short (see [New Prompt Performance](#new-prompts)). The best-performing prompts all restructure the source content into pedagogically rich formats. <mark>TLDR: Math, table, FAQ, and tutorial prompts beat the DCLM baseline, while article, commentary, and discussion are at or below DCLM level.</mark>
97
 
 
 
98
  <HtmlEmbed
99
  id="new-prompts"
100
  src="d3-benchmark-comparison.html"
 
115
  }}
116
  />
117
 
118
+ We used Gemma-3-1B for all experiments so far. Can we do even better by changing the rephrasing model?
119
+
120
  ### Impact of the Rephrasing Model
121
 
122
  We want to know whether using a stronger model leads to better synthetic data. We look at this dimension from three angles: model size, model family, and model generation.
 
129
 
130
  The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see [Model Size](#model-size)). Even for the harder [math](#math) prompt, larger models do not help. <mark>TLDR: Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.</mark>
131
 
 
 
132
  <HtmlEmbed
133
  id="model-size"
134
  src="d3-benchmark-comparison.html"
 
162
  }}
163
  />
164
 
165
+ On high-quality source data, we see no evidence that larger models help. But REWIRE claims large models are needed specifically for low-quality data. Does that claim hold?
166
+
167
  #### Do we need better models for rephrasing low-quality data?
168
 
169
  The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). <mark>Does this claim hold?</mark>
 
172
 
173
  The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see [Model Size vs Data Quality](#size-quality)). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>
174
 
 
 
175
  <HtmlEmbed
176
  id="size-quality"
177
  src="d3-benchmark-comparison.html"
 
213
  }}
214
  />
215
 
216
+ Since model size barely matters, does the model family make a difference?
217
+
218
  #### Does the model family matter?
219
 
220
  Different model families may be better suited for rephrasing based on their training data. <mark>Do some families produce better synthetic data than others?</mark>
 
227
  We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit rewrite tasks in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
228
  </Sidenote>
229
 
 
 
230
  <HtmlEmbed
231
  id="model-family"
232
  src="d3-benchmark-comparison.html"
 
286
  }}
287
  />
288
 
289
+ SmolLM2 is already a year old. Are newer model generations better?
290
+
291
  #### Does the model generation matter?
292
 
293
  We've seen that model family matters. But within a family, <mark>do newer versions produce better synthetic data?</mark>
 
295
  We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt.
296
 
297
  While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see [Model Generation](#model-generation)). <mark>TLDR: Newer model generations tend to produce slightly better synthetic data.</mark>
298
+
299
  <HtmlEmbed
300
  id="model-generation"
301
  src="d3-benchmark-comparison.html"
 
312
  }
313
  }}
314
  />
315
+
316
  <Note title="Summary: Impact of the Rephrasing Model" variant="info">
317
  **Model size**: 1B is sufficient. Larger models do not help.
318
  **Model family**: SmolLM2 dominates across all prompts.
 
334
 
335
  Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see [Is Synthetic Data Enough?](#synthetic-only)). Mixed training consistently improves over both the synthetic-only and original-data-only baselines. <mark>TLDR: Synthetic data alone is not enough. Mixing with original data consistently improves performance.</mark>
336
 
 
 
337
  <HtmlEmbed
338
  id="synthetic-only"
339
  src="d3-benchmark-comparison.html"
 
365
  }}
366
  />
367
 
368
+ So the mix-in dataset clearly matters. But how much does the specific choice of mix-in dataset affect performance?
369
+
370
  #### Does the mix-in dataset matter?
371
 
372
  We just saw that mixing in original data is essential. <mark>How much does the choice of mix-in dataset affect performance?</mark>
373
 
374
  We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data.
375
 
376
+ DCLM and FineWeb-Edu-HQ outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see [Mix-in Dataset Effect](#mixin-dataset)). <mark>TLDR: The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.</mark>
 
 
377
 
378
  <HtmlEmbed
379
  id="mixin-dataset"
 
411
  }}
412
  />
413
 
414
+ The mix-in dataset matters enormously. But what about the source dataset we feed to the rephrasing model?
415
+
416
  #### Does the source dataset matter?
417
 
418
  We know the mix-in dataset is critical. <mark>But does the quality of the source documents we feed to the rephrasing model also matter?</mark>
 
421
 
422
  When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see [Source Dataset: Mix-in = Source](#source-dataset-mixin-source)). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see [Source Dataset: Fixed Mix-in](#source-dataset-fixed-mixin)). This corroborates our finding that the mix-in matters much more than the source. <mark>TLDR: Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.</mark>
423
 
 
 
424
  <HtmlEmbed
425
  id="source-dataset-mixin-source"
426
  src="d3-benchmark-comparison.html"
 
483
  }}
484
  />
485
 
486
+ This is exciting because it shows the potential of upcycling low-quality data through rephrasing with format prompts. Can we squeeze out more performance by increasing diversity in the synthetic portion?
487
+
488
  #### Does increased diversity help?
489
 
490
  Given that mixing matters, a natural next step is to maximize diversity in the synthetic portion. <mark>Does combining multiple prompts or model families increase performance?</mark>
 
499
 
500
  <mark>TLDR: At our 20B token scale, diversity does not compound. Mixing datasets averages rather than improves performance, though larger-scale experiments may tell a different story.</mark>
501
 
 
 
502
  <HtmlEmbed
503
  id="diversity"
504
  src="d3-benchmark-comparison.html"
 
509
  "Mixing Prompts": {
510
  datasetNames: {
511
  "mix-fw_edu_hq-tutorial_1b_hq-fw_edu_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts + FineWeb-Edu-HQ",
512
+ "mix-fw_edu_hq-math_1b_hq": "Math + FineWeb-Edu-HQ",
513
  "mix-tutorial_1b_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts (No Source)",
514
+ "mix-fw_edu_hq-table_1b_hq": "Table + FineWeb-Edu-HQ",
515
+ "mix-fw_edu_hq-faq_1b_hq": "FAQ + FineWeb-Edu-HQ",
516
+ "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial + FineWeb-Edu-HQ",
517
  dclm: "DCLM",
518
  fw_edu_hq: "FineWeb-Edu-HQ"
519
  }
 
545
  }}
546
  />
547
 
548
+ Let's turn to some unexpected findings from our experiments.
549
+
550
  ### Do Typos in the Prompt Hurt?
551
 
552
  The original REWIRE prompt contains many typos and grammar errors. <mark>Do these imperfections degrade the quality of the synthetic data?</mark>
 
555
 
556
  Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see [Effect of Typos](#typos-effect)). <mark>TLDR: Typos in prompts do not hurt downstream performance.</mark>
557
 
 
 
558
  <HtmlEmbed
559
  id="typos-effect"
560
  src="d3-benchmark-comparison.html"
 
572
  }}
573
  />
574
 
575
+ Our final experiment explores an even more counterintuitive finding.
576
+
577
  {/*
578
 
579
  ### Does edu-score or DCLM-score predict model performance?