joelniklaus HF Staff committed on
Commit
09c855a
·
1 Parent(s): 7adb03a

remove titles from charts, add summary note, remove highlighting and rephrase experiment paragraphs

app/src/components/HtmlEmbed.astro CHANGED
@@ -1,7 +1,6 @@
1
  ---
2
  interface Props {
3
  src: string;
4
- title?: string;
5
  desc?: string;
6
  caption?: string;
7
  frameless?: boolean;
@@ -13,7 +12,6 @@ interface Props {
13
  }
14
  const {
15
  src,
16
- title,
17
  desc,
18
  caption,
19
  frameless = false,
@@ -69,11 +67,6 @@ const htmlWithId =
69
  {
70
  html ? (
71
  <figure class={`html-embed${wide ? " html-embed--wide" : ""}`} id={id}>
72
- {title && (
73
- <figcaption class="html-embed__title" style={`text-align:${align}`}>
74
- {title}
75
- </figcaption>
76
- )}
77
  <div class={`html-embed__card${frameless ? " is-frameless" : ""}`}>
78
  <div
79
  id={mountId}
@@ -272,20 +265,6 @@ const htmlWithId =
272
  }
273
  }
274
 
275
- .html-embed__title {
276
- text-align: left;
277
- font-weight: 600;
278
- font-size: 0.95rem;
279
- color: var(--text-color);
280
- margin: 0;
281
- padding: 0;
282
- padding-bottom: var(--spacing-1);
283
- position: relative;
284
- display: block;
285
- width: 100%;
286
- background: var(--page-bg);
287
- z-index: var(--z-elevated);
288
- }
289
  .html-embed__card {
290
  background-color: var(--surface-bg);
291
  border: 1px solid var(--border-color);
 
1
  ---
2
  interface Props {
3
  src: string;
 
4
  desc?: string;
5
  caption?: string;
6
  frameless?: boolean;
 
12
  }
13
  const {
14
  src,
 
15
  desc,
16
  caption,
17
  frameless = false,
 
67
  {
68
  html ? (
69
  <figure class={`html-embed${wide ? " html-embed--wide" : ""}`} id={id}>
 
 
 
 
 
70
  <div class={`html-embed__card${frameless ? " is-frameless" : ""}`}>
71
  <div
72
  id={mountId}
 
265
  }
266
  }
267
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
268
  .html-embed__card {
269
  background-color: var(--surface-bg);
270
  border: 1px solid var(--border-color);
app/src/content/chapters/experiments.mdx CHANGED
@@ -6,6 +6,7 @@ import FigRef from "../../components/FigRef.astro";
6
 
7
  {/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
8
  {/* TODO: shorten the vllm inference benchmark or put stuff into the appendix */}
 
9
  {/* TODO: add a plot for the table with the benchmark results */}
10
  {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
11
  {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
@@ -18,18 +19,13 @@ import FigRef from "../../components/FigRef.astro";
18
 
19
  With the infrastructure and setup in place, we now systematically work through our research questions. We start by benchmarking existing datasets and dissecting what makes their prompts tick. Then we test our own prompt designs, explore how the rephrasing model (size, family, generation) affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
20
 
21
- ### Baselines
22
 
23
- We start by surveying the landscape. <mark>How do existing pretraining datasets compare when used to train our 1.2B model?</mark>
24
-
25
- We train on eight datasets under identical conditions and compare their final evaluation performance.
26
-
27
- DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. <mark>TLDR: DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.</mark>
28
 
29
  <HtmlEmbed
30
  id="baselines-comparison"
31
  src="d3-benchmark-comparison.html"
32
- title="Baseline Comparison"
33
  desc="Comparison of baseline datasets across different evaluation metrics. Use the dropdown to switch metrics."
34
  config={{
35
  baselines: [],
@@ -48,24 +44,17 @@ DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef ta
48
 
49
  The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
50
 
51
- #### Dissecting the Synthetic Baselines
52
-
53
- Prior synthetic datasets bundle multiple prompts together. We want to understand what makes them tick.
54
-
55
- <mark>Which individual prompts from existing synthetic methods actually match DCLM?</mark>
56
 
57
- We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source.
58
 
59
  <Sidenote>
60
  The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
61
  </Sidenote>
62
 
63
- Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. <mark>TLDR: Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.</mark>
64
-
65
  <HtmlEmbed
66
  id="dissecting-baselines"
67
  src="d3-benchmark-comparison.html"
68
- title="Dissecting Synthetic Baselines"
69
  desc="Individual prompt performance from existing synthetic datasets compared to DCLM and FineWeb-Edu-HQ."
70
  config={{
71
  baselines: ["dclm", "nemotron_hq_synth", "rewire"],
@@ -99,18 +88,13 @@ Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performa
99
 
100
  Can we design prompts that consistently beat DCLM?
101
 
102
- ### Which New Prompts Work Well?
103
-
104
- Since most existing prompts fail to beat DCLM, we designed new prompt formats targeting different skills. <mark>Can any of them outperform the baseline?</mark>
105
 
106
- We test seven novel prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion)) using Gemma-3-1B on FineWeb-Edu-HQ.
107
-
108
- Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) fall short (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats. <mark>TLDR: Math, table, FAQ, and tutorial prompts beat the DCLM baseline, while article, commentary, and discussion are at or below DCLM level.</mark>
109
 
110
  <HtmlEmbed
111
  id="new-prompts"
112
  src="d3-benchmark-comparison.html"
113
- title="New Prompt Performance"
114
  desc="Seven new prompts compared against DCLM and FineWeb-Edu-HQ."
115
  config={{
116
  datasetNames: {
@@ -135,11 +119,7 @@ We want to know whether using a stronger model leads to better synthetic data. W
135
 
136
  #### Does the model size matter?
137
 
138
- A natural assumption is that bigger models produce higher-quality rephrasings. <mark>Do they?</mark>
139
-
140
- We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts.
141
-
142
- The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. <mark>TLDR: Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.</mark>
143
 
144
  <Sidenote>
145
  It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
@@ -148,7 +128,6 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
148
  <HtmlEmbed
149
  id="model-size"
150
  src="d3-benchmark-comparison.html"
151
- title="Model Size"
152
  desc="Gemma-3 model sizes (270M to 27B). Use the Setup dropdown to compare across prompts."
153
  config={{
154
  setups: {
@@ -182,16 +161,11 @@ On high-quality source data, we see no evidence that larger models help. But REW
182
 
183
  #### Do we need better models for rephrasing low-quality data?
184
 
185
- The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). <mark>Does this claim hold?</mark>
186
-
187
- We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts.
188
-
189
- The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>
190
 
191
  <HtmlEmbed
192
  id="size-quality"
193
  src="d3-benchmark-comparison.html"
194
- title="Model Size vs Data Quality"
195
  desc="1B vs 12B model on HQ vs LQ data. Use the Setup dropdown to compare across prompts."
196
  config={{
197
  setups: {
@@ -243,20 +217,15 @@ Since model size barely matters, does the model family make a difference?
243
 
244
  #### Does the model family matter?
245
 
246
- Different model families may be better suited for rephrasing based on their training data. <mark>Do some families produce better synthetic data than others?</mark>
247
-
248
- We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on four prompts. Use the Setup dropdown to compare across prompts.
249
-
250
- SmolLM2 consistently and clearly outperforms all others across all four prompts (see <FigRef target="model-family" />). <mark>TLDR: Model family matters a lot. SmolLM2 dominates, likely due to [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its training data.</mark>
251
 
252
  <Sidenote>
253
- We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit rewrite tasks in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
254
  </Sidenote>
255
 
256
  <HtmlEmbed
257
  id="model-family"
258
  src="d3-benchmark-comparison.html"
259
- title="Model Family"
260
  desc="Model families compared at ~1B scale. Use the Setup dropdown to compare across prompts."
261
  config={{
262
  setups: {
@@ -316,16 +285,11 @@ SmolLM2 is already a year old. Are newer model generations better?
316
 
317
  #### Does the model generation matter?
318
 
319
- We've seen that model family matters. But within a family, <mark>do newer versions produce better synthetic data?</mark>
320
-
321
- We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt.
322
-
323
- While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see <FigRef target="model-generation" />). <mark>TLDR: Newer model generations tend to produce slightly better synthetic data.</mark>
324
 
325
  <HtmlEmbed
326
  id="model-generation"
327
  src="d3-benchmark-comparison.html"
328
- title="Model Generation: Qwen Tutorial"
329
  desc="Qwen model generations (1.5 to 3) on the tutorial prompt."
330
  config={{
331
  datasetNames: {
@@ -340,9 +304,9 @@ While the differences are small, we find a consistent trend: newer versions lead
340
  />
341
 
342
  <Note title="Summary: Impact of the Rephrasing Model" variant="info">
343
- **Model size**: 1B is sufficient. Larger models do not help.
344
- **Model family**: SmolLM2 dominates across all prompts.
345
- **Model generation**: Newer is slightly better.
346
  **Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
347
  </Note>
348
 
@@ -354,16 +318,11 @@ So far we've always mixed synthetic data with a <Glossary term="source dataset"
354
 
355
  #### Is synthetic data enough?
356
 
357
- We start with the most fundamental question: <mark>can we train on synthetic data alone, or do we need to mix it with original data?</mark>
358
-
359
- We compare synthetic-only training vs mixed training (synthetic + source) for [tutorial](#tutorial) and [faq](#faq) prompts on DCLM and FineWeb-Edu-HQ sources.
360
-
361
- Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines. <mark>TLDR: Synthetic data alone is not enough. Mixing with original data consistently improves performance.</mark>
362
 
363
  <HtmlEmbed
364
  id="synthetic-only"
365
  src="d3-benchmark-comparison.html"
366
- title="Is Synthetic Data Enough?"
367
  desc="Synthetic-only vs mixed training. Use the Setup dropdown to compare across source datasets."
368
  config={{
369
  setups: {
@@ -391,20 +350,15 @@ Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mi
391
  }}
392
  />
393
 
394
- So the mix-in dataset clearly matters. But how much does the specific choice of mix-in dataset affect performance?
395
 
396
  #### Does the mix-in dataset matter?
397
 
398
- We just saw that mixing in original data is essential. <mark>How much does the choice of mix-in dataset affect performance?</mark>
399
-
400
- We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data.
401
-
402
- DCLM and FineWeb-Edu-HQ outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). <mark>TLDR: The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.</mark>
403
 
404
  <HtmlEmbed
405
  id="mixin-dataset"
406
  src="d3-benchmark-comparison.html"
407
- title="Mix-in Dataset Effect"
408
  desc="Effect of different mix-in datasets. Use the Setup dropdown to compare HQ vs LQ source data."
409
  config={{
410
  setups: {
@@ -441,16 +395,11 @@ The mix-in dataset matters enormously. But what about the source dataset we feed
441
 
442
  #### Does the source dataset matter?
443
 
444
- We know the mix-in dataset is critical. <mark>But does the quality of the source documents we feed to the rephrasing model also matter?</mark>
445
-
446
- We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [tutorial](#tutorial) and [faq](#faq) prompts. We test two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ).
447
-
448
- When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). This corroborates our finding that the mix-in matters much more than the source. <mark>TLDR: Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.</mark>
449
 
450
  <HtmlEmbed
451
  id="source-dataset-mixin-source"
452
  src="d3-benchmark-comparison.html"
453
- title="Source Dataset (Mix-in = Source)"
454
  desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
455
  config={{
456
  setups: {
@@ -481,7 +430,6 @@ When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ
481
  <HtmlEmbed
482
  id="source-dataset-fixed-mixin"
483
  src="d3-benchmark-comparison.html"
484
- title="Source Dataset (Fixed Mix-in: FineWeb-Edu-HQ)"
485
  desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
486
  config={{
487
  setups: {
@@ -513,22 +461,15 @@ This is exciting because it shows the potential of upcycling low-quality data th
513
 
514
  #### Does increased diversity help?
515
 
516
- Given that mixing matters, a natural next step is to maximize diversity in the synthetic portion. <mark>Does combining multiple prompts or model families increase performance?</mark>
517
-
518
- We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies.
519
-
520
- No significant improvement from any diversity strategy. Performance averages rather than compounds (see <FigRef target="diversity" />). However, our ablations train on only 20B tokens, so it is possible that diversity benefits only emerge at larger scales where the model can better exploit the varied signal.
521
 
522
  <Sidenote>
523
  Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
524
  </Sidenote>
525
 
526
- <mark>TLDR: At our 20B token scale, diversity does not compound. Mixing datasets averages rather than improves performance, though larger-scale experiments may tell a different story.</mark>
527
-
528
  <HtmlEmbed
529
  id="diversity"
530
  src="d3-benchmark-comparison.html"
531
- title="Diversity"
532
  desc="Different diversity strategies. Use the Setup dropdown to compare approaches."
533
  config={{
534
  setups: {
@@ -571,20 +512,23 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
571
  }}
572
  />
573
 
 
 
 
 
 
 
 
 
574
  Let's turn to some unexpected findings from our experiments.
575
 
576
  ### Do Typos in the Prompt Hurt?
577
 
578
- The original REWIRE prompt contains many typos and grammar errors. <mark>Do these imperfections degrade the quality of the synthetic data?</mark>
579
-
580
- We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) against an [improved version](#guided_rewrite_improved), at both 1B and 12B scale.
581
-
582
- Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see <FigRef target="typos-effect" />). <mark>TLDR: Typos in prompts do not hurt downstream performance.</mark>
583
 
584
  <HtmlEmbed
585
  id="typos-effect"
586
  src="d3-benchmark-comparison.html"
587
- title="Effect of Typos in Prompt"
588
  desc="REWIRE prompt with original typos vs improved version at 1B and 12B scale."
589
  config={{
590
  datasetNames: {
@@ -612,9 +556,7 @@ TODO: Run this analysis and add a small report
612
 
613
  ### Math Rephrasing: When "Worse" Outputs Win
614
 
615
- We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better. <mark>Does higher-quality output actually lead to better training data?</mark>
616
-
617
- We compare SmolLM2 (messy, variable outputs) vs Qwen3 (clean, structured outputs) for [math](#math) rephrasing.
618
 
619
  **Qwen3 produced beautiful, structured outputs:**
620
 
@@ -663,4 +605,4 @@ SmolLM2's quality distribution was actually reasonable:
663
  | Partial | 30+ tokens but missing structure | 25% |
664
  | Poor | {'<'}30 tokens | 8% |
665
 
666
- <mark>TLDR: For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.</mark>
 
6
 
7
  {/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
8
  {/* TODO: shorten the vllm inference benchmark or put stuff into the appendix */}
9
+ {/* TODO: potentially make a widget for data exploration: look at the same few samples generated by different models or transformed with different prompts */}
10
  {/* TODO: add a plot for the table with the benchmark results */}
11
  {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
12
  {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
 
19
 
20
  With the infrastructure and setup in place, we now systematically work through our research questions. We start by benchmarking existing datasets and dissecting what makes their prompts tick. Then we test our own prompt designs, explore how the rephrasing model (size, family, generation) affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
21
 
22
+ ### How Do Existing Datasets Compare?
23
 
24
+ We train on eight datasets under identical conditions and compare their final evaluation performance. DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.
 
 
 
 
25
 
26
  <HtmlEmbed
27
  id="baselines-comparison"
28
  src="d3-benchmark-comparison.html"
 
29
  desc="Comparison of baseline datasets across different evaluation metrics. Use the dropdown to switch metrics."
30
  config={{
31
  baselines: [],
 
44
 
45
  The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
46
 
47
+ #### Which Individual Prompts Match DCLM?
 
 
 
 
48
 
49
+ We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source. Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.
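To make the prompt-isolation setup concrete, here is a minimal sketch of how each prompt can be applied on its own: wrap every source document in a chat-style request for the rephrasing model. The template wording and function name below are illustrative placeholders, not the exact prompts linked above.

```python
# Hypothetical template standing in for one isolated prompt
# (e.g. guided_rewrite); the wording here is illustrative only.
GUIDED_REWRITE_TEMPLATE = (
    "Rewrite the following document so that it is clear, well-structured, "
    "and factually faithful to the original.\n\nDocument:\n{document}"
)

def build_rephrasing_requests(documents, template=GUIDED_REWRITE_TEMPLATE):
    """Wrap each source document in a chat-style request for a small
    instruction-tuned rephrasing model (Gemma-3-1B in these experiments)."""
    return [
        [{"role": "user", "content": template.format(document=doc)}]
        for doc in documents
    ]

requests = build_rephrasing_requests(
    ["The mitochondria is the powerhouse of the cell."]
)
```

Each request list can then be fed to whatever inference backend serves the rephrasing model; swapping the template is all it takes to isolate a different prompt.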
50
 
51
  <Sidenote>
52
  The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
53
  </Sidenote>
54
 
 
 
55
  <HtmlEmbed
56
  id="dissecting-baselines"
57
  src="d3-benchmark-comparison.html"
 
58
  desc="Individual prompt performance from existing synthetic datasets compared to DCLM and FineWeb-Edu-HQ."
59
  config={{
60
  baselines: ["dclm", "nemotron_hq_synth", "rewire"],
 
88
 
89
  Can we design prompts that consistently beat DCLM?
90
 
91
+ ### Can New Prompts Beat DCLM?
 
 
92
 
93
+ Since most existing prompts fail to beat DCLM, we designed seven novel prompt formats targeting different skills ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
 
 
94
 
95
  <HtmlEmbed
96
  id="new-prompts"
97
  src="d3-benchmark-comparison.html"
 
98
  desc="Seven new prompts compared against DCLM and FineWeb-Edu-HQ."
99
  config={{
100
  datasetNames: {
 
119
 
120
  #### Does the model size matter?
121
 
122
+ We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts. The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.
 
 
 
 
123
 
124
  <Sidenote>
125
  It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
 
128
  <HtmlEmbed
129
  id="model-size"
130
  src="d3-benchmark-comparison.html"
 
131
  desc="Gemma-3 model sizes (270M to 27B). Use the Setup dropdown to compare across prompts."
132
  config={{
133
  setups: {
 
161
 
162
  #### Do we need better models for rephrasing low-quality data?
163
 
164
+ The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
 
 
 
 
165
 
166
  <HtmlEmbed
167
  id="size-quality"
168
  src="d3-benchmark-comparison.html"
 
169
  desc="1B vs 12B model on HQ vs LQ data. Use the Setup dropdown to compare across prompts."
170
  config={{
171
  setups: {
 
217
 
218
  #### Does the model family matter?
219
 
220
+ We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on four prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all four prompts (see <FigRef target="model-family" />).
 
 
 
 
221
 
222
  <Sidenote>
223
+ We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
224
  </Sidenote>
225
 
226
  <HtmlEmbed
227
  id="model-family"
228
  src="d3-benchmark-comparison.html"
 
229
  desc="Model families compared at ~1B scale. Use the Setup dropdown to compare across prompts."
230
  config={{
231
  setups: {
 
285
 
286
  #### Does the model generation matter?
287
 
288
+ We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt. While the differences between adjacent versions are small, we find a consistent trend: newer versions lead to higher evaluation performance (see <FigRef target="model-generation" />), with the gains accumulating from version 1.5 to 3.
 
 
 
 
289
 
290
  <HtmlEmbed
291
  id="model-generation"
292
  src="d3-benchmark-comparison.html"
 
293
  desc="Qwen model generations (1.5 to 3) on the tutorial prompt."
294
  config={{
295
  datasetNames: {
 
304
  />
305
 
306
  <Note title="Summary: Impact of the Rephrasing Model" variant="info">
307
+ **Model size**: 1B is sufficient. Larger models do not help.<br/>
308
+ **Model family**: SmolLM2 dominates across all prompts.<br/>
309
+ **Model generation**: Newer is slightly better.<br/>
310
  **Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
311
  </Note>
312
 
 
318
 
319
  #### Is synthetic data enough?
320
 
321
+ We compare synthetic-only training vs mixed training (synthetic + source) for [tutorial](#tutorial) and [faq](#faq) prompts on DCLM and FineWeb-Edu-HQ sources. Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines.
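The mixed-training condition can be illustrated with a toy sketch of combining synthetic and original documents at a target ratio. The names and the 50/50 default below are illustrative assumptions, not the exact recipe used in these ablations.

```python
import random

def mix_datasets(synthetic, original, synthetic_fraction=0.5, seed=0):
    """Draw a shuffled training mix with the requested fraction of
    synthetic documents, the remainder coming from the original corpus.
    Assumes both pools are large enough for the requested split."""
    rng = random.Random(seed)
    total = min(len(synthetic), len(original)) * 2
    n_synthetic = int(total * synthetic_fraction)
    mix = rng.sample(synthetic, n_synthetic)
    mix += rng.sample(original, total - n_synthetic)
    rng.shuffle(mix)
    return mix

# Toy corpora of placeholder document ids
synthetic_docs = [f"syn_{i}" for i in range(100)]
original_docs = [f"orig_{i}" for i in range(100)]
train_mix = mix_datasets(synthetic_docs, original_docs)
```

Setting `synthetic_fraction=1.0` would recover the synthetic-only condition; in practice such mixing is done at the token rather than the document level, but the idea is the same.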
 
 
 
 
322
 
323
  <HtmlEmbed
324
  id="synthetic-only"
325
  src="d3-benchmark-comparison.html"
 
326
  desc="Synthetic-only vs mixed training. Use the Setup dropdown to compare across source datasets."
327
  config={{
328
  setups: {
 
350
  }}
351
  />
352
 
353
+ So synthetic data alone does not seem to be enough. But how much does the specific choice of mix-in dataset affect performance?
354
 
355
  #### Does the mix-in dataset matter?
356
 
357
+ We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data. DCLM and FineWeb-Edu-HQ outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
 
 
 
 
358
 
359
  <HtmlEmbed
360
  id="mixin-dataset"
361
  src="d3-benchmark-comparison.html"
 
362
  desc="Effect of different mix-in datasets. Use the Setup dropdown to compare HQ vs LQ source data."
363
  config={{
364
  setups: {
 
395
 
396
  #### Does the source dataset matter?
397
 
398
+ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [tutorial](#tutorial) and [faq](#faq) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.
 
 
 
 
399
 
400
  <HtmlEmbed
401
  id="source-dataset-mixin-source"
402
  src="d3-benchmark-comparison.html"
 
403
  desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
404
  config={{
405
  setups: {
 
  <HtmlEmbed
    id="source-dataset-fixed-mixin"
    src="d3-benchmark-comparison.html"
    desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
    config={{
      setups: {
 
  #### Does increased diversity help?

+ We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies. None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds (see <FigRef target="diversity" />). However, our ablations train on only 20B tokens, so it is possible that diversity benefits only emerge at larger scales where the model can better exploit the varied signal.
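
One hypothetical way to implement the "mix both" strategy is to fan the corpus out over every (prompt, model) pair so each configuration rewrites an equal share of the data. The helper below is an illustrative sketch, not our actual scheduler.

```python
from itertools import cycle, islice

def assign_configs(num_shards, prompts, models):
    """Round-robin every (prompt, model) pair over corpus shards so each
    configuration rewrites an equal share of the data."""
    pairs = [(p, m) for p in prompts for m in models]
    return list(islice(cycle(pairs), num_shards))
```
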
 
 
 
 
  <Sidenote>
    Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
  </Sidenote>
 
 
 
  <HtmlEmbed
    id="diversity"
    src="d3-benchmark-comparison.html"
    desc="Different diversity strategies. Use the Setup dropdown to compare approaches."
    config={{
      setups: {
 
    }}
  />

+ <Note title="Summary: Impact of the Dataset Choices" variant="info">
+ **Synthetic-only**: Not enough. Always mix with original data.<br/>
+ **Mix-in dataset**: Major performance driver, sometimes more important than the synthetic data itself.<br/>
+ **Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
+ **Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
+ **Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
+ </Note>
+
  Let's turn to some unexpected findings from our experiments.

  ### Do Typos in the Prompt Hurt?

+ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) against an [improved version](#guided_rewrite_improved), at both 1B and 12B scale. Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see <FigRef target="typos-effect" />).
 
 
 
 
  <HtmlEmbed
    id="typos-effect"
    src="d3-benchmark-comparison.html"
    desc="REWIRE prompt with original typos vs improved version at 1B and 12B scale."
    config={{
      datasetNames: {
 
  ### Math Rephrasing: When "Worse" Outputs Win

+ We compare two ~1.7B-parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs look objectively worse, yet models trained on them perform better.
 
 
  **Qwen3 produced beautiful, structured outputs:**
 
 
  | Partial | 30+ tokens but missing structure | 25% |
  | Poor | {'<'}30 tokens | 8% |

+ For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
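
The quality buckets in the table above suggest a simple heuristic grader. In the sketch below, the whitespace tokenizer and the question-mark structure check are stand-in assumptions for the real tokenizer and rubric.

```python
def grade_output(text, min_tokens=30):
    """Bucket a generated math problem by length and surface structure.

    Whitespace tokenization and the question-mark check are stand-ins
    for a real tokenizer and rubric.
    """
    tokens = text.split()
    if len(tokens) < min_tokens:
        return "poor"
    poses_question = "?" in text  # crude proxy for "has the expected structure"
    return "good" if poses_question else "partial"
```
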
app/src/content/chapters/infrastructure.mdx CHANGED
@@ -373,7 +373,6 @@ The benchmark config defines **801 unique configurations** across 8 experiment g
  <HtmlEmbed
    id="optimization-sweep"
    src="d3-optimization-sweep.html"
- title="Throughput Optimization Sweep"
    desc="Throughput optimization across 18 models in two tiers. Tier 0 tunes serving parameters (tp, mns, mnbt). Tier 1 adds gpu-memory-utilization and speculative decoding. Shape encodes tier, color encodes model family."
  />
 
 
 
app/src/content/chapters/introduction.mdx CHANGED
@@ -46,7 +46,6 @@ Here's a preview of where we end up: FinePhrase, our best configuration, clearly
  <HtmlEmbed
    id="finephrase-vs-baselines"
    src="d3-benchmark-comparison.html"
- title="FinePhrase vs Synthetic Baselines"
    desc="FinePhrase compared against synthetic data baselines across evaluation metrics."
    config={{
      defaultView: "line",
 