joelniklaus (HF Staff) committed on
Commit c33d5c0 · 1 parent 4138808

made sure prompts are always alphabetically ordered within categories
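The invariant this commit enforces — within each category, prompt entries listed alphabetically by name — can be sketched as follows. This is an illustrative helper, not code from the repo; `sort_prompts` and the sample `config` names are hypothetical stand-ins for the `setups` objects in `3-experiments.mdx`.

```python
def sort_prompts(setups: dict) -> dict:
    """Return a copy of a setups config with prompt categories in A-Z order."""
    # dicts preserve insertion order (Python 3.7+), so rebuilding from
    # sorted keys yields an alphabetically ordered config.
    return {name: setups[name] for name in sorted(setups)}

# Hypothetical category names mirroring the MDX config keys.
config = {
    "Tutorial Prompt": {"datasets": {}},
    "FAQ Prompt": {"datasets": {}},
    "Math Prompt": {"datasets": {}},
}
print(list(sort_prompts(config)))
# → ['FAQ Prompt', 'Math Prompt', 'Tutorial Prompt']
```

The diff below applies exactly this reordering by hand: each hunk removes a block from its old position and re-inserts it at its alphabetical slot.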

app/src/content/chapters/3-experiments.mdx CHANGED
@@ -101,7 +101,7 @@ Can we design prompts that consistently beat DCLM?
101
 
102
  ### Can New Prompts Beat DCLM?
103
 
104
- Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion), [narrative](#narrative), [explanation](#explanation)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [narrative](#narrative), [explanation](#explanation), [commentary](#commentary), and [discussion](#discussion) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
105
 
106
  <HtmlEmbed
107
  id="new-prompts"
@@ -109,15 +109,15 @@ Since most existing prompts fail to beat DCLM, we designed nine novel prompt for
109
  desc="Nine new prompts compared against the DCLM baseline."
110
  config={{
111
  datasets: {
112
- "mix-fw_edu_hq-math_1b_hq": "Math",
113
- "mix-fw_edu_hq-table_1b_hq": "Table",
114
- "mix-fw_edu_hq-faq_1b_hq": "FAQ",
115
- "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
116
- "mix-fw_edu_hq-narrative_1b_hq": "Narrative",
117
  "mix-fw_edu_hq-article_1b_hq": "Article",
118
- "mix-fw_edu_hq-explanation_1b_hq": "Explanation",
119
  "mix-fw_edu_hq-commentary_1b_hq": "Commentary",
120
  "mix-fw_edu_hq-discussion_1b_hq": "Discussion",
 
 
 
 
 
 
121
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
122
  }
123
  }}
@@ -131,8 +131,8 @@ We want to know whether using a stronger model leads to better synthetic data. W
131
 
132
  #### Does the model size matter?
133
 
134
- We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial), [math](#math), and REWIRE's [guided_rewrite](#guided_rewrite_original) prompts (use the Setup dropdown in <FigRef target="model-size" /> to switch between them).
135
- For [tutorial](#tutorial) and [math](#math), the 270M model underperforms, but 1B through 27B show no significant difference.
136
  SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
137
  The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
138
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
@@ -148,16 +148,6 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
148
  desc="Model sizes across Gemma-3 and SmolLM2. Use the Setup dropdown to compare across models and prompts."
149
  config={{
150
  setups: {
151
- "Gemma-3: Tutorial": {
152
- datasets: {
153
- "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
154
- "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
155
- "mix-fw_edu_hq-tutorial_4b_hq": "Gemma-3 4B",
156
- "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3 1B",
157
- "mix-fw_edu_hq-tutorial_270m_hq": "Gemma-3 270M",
158
- dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
159
- }
160
- },
161
  "Gemma-3: Math": {
162
  datasets: {
163
  "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
@@ -178,6 +168,16 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
178
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
179
  }
180
  },
181
  "SmolLM2: Tutorial": {
182
  datasets: {
183
  "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
@@ -194,7 +194,7 @@ On high-quality source data, we see no evidence that larger models help. But REW
194
 
195
  #### Do we need better models for rephrasing low-quality data?
196
 
197
- The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
198
 
199
  <HtmlEmbed
200
  id="size-quality"
@@ -220,15 +220,6 @@ The REWIRE [@rewire] paper claims that upcycling low-quality data requires large
220
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
221
  }
222
  },
223
- "Tutorial Prompt": {
224
- datasets: {
225
- "mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
226
- "mix-fw_edu_hq-tutorial_12b_hq": "12B, HQ Source",
227
- "mix-fw_edu_hq-tutorial_12b_lq": "12B, LQ Source",
228
- "mix-fw_edu_hq-tutorial_1b_lq": "1B, LQ Source",
229
- dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
230
- }
231
- },
232
  "FAQ Prompt": {
233
  datasets: {
234
  "mix-fw_edu_hq-faq_1b_hq": "1B, HQ Source",
@@ -237,6 +228,15 @@ The REWIRE [@rewire] paper claims that upcycling low-quality data requires large
237
  "mix-fw_edu_hq-faq_12b_lq": "12B, LQ Source",
238
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
239
  }
240
  }
241
  }
242
  }}
@@ -280,39 +280,28 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
280
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
281
  }
282
  },
283
- "Tutorial Prompt": {
284
  datasets: {
285
- "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
286
- "mix-fw_edu_hq-tutorial_falcon3_1b_hq": "Falcon3",
287
- "mix-fw_edu_hq-tutorial_qwen3_1.7b_hq": "Qwen3",
288
- "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3",
289
- "mix-fw_edu_hq-tutorial_granite3_1b_hq": "Granite3",
290
- "mix-fw_edu_hq-tutorial_llama3.2_1b_hq": "Llama-3.2",
291
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
292
  }
293
  },
294
  "FAQ Prompt": {
295
  datasets: {
296
  "mix-fw_edu_hq-faq_smollm2_1.7b_hq": "SmolLM2",
297
- "mix-fw_edu_hq-faq_llama3.2_1b_hq": "Llama-3.2",
298
  "mix-fw_edu_hq-faq_falcon3_1b_hq": "Falcon3",
299
- "mix-fw_edu_hq-faq_1b_hq": "Gemma-3",
300
  "mix-fw_edu_hq-faq_granite3_1b_hq": "Granite3",
 
 
301
  "mix-fw_edu_hq-faq_qwen3_1.7b_hq": "Qwen3",
302
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
303
  }
304
  },
305
- "Table Prompt": {
306
- datasets: {
307
- "mix-fw_edu_hq-table_smollm2_1.7b_hq": "SmolLM2",
308
- "mix-fw_edu_hq-table_falcon3_1b_hq": "Falcon3",
309
- "mix-fw_edu_hq-table_granite3_1b_hq": "Granite3",
310
- "mix-fw_edu_hq-table_qwen3_1.7b_hq": "Qwen3",
311
- "mix-fw_edu_hq-table_llama3.2_1b_hq": "Llama-3.2",
312
- "mix-fw_edu_hq-table_1b_hq": "Gemma-3",
313
- dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
314
- }
315
- },
316
  "Math Prompt": {
317
  datasets: {
318
  "mix-fw_edu_hq-math_smollm2_1.7b_hq": "SmolLM2",
@@ -335,14 +324,25 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
335
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
336
  }
337
  },
338
- "Explanation Prompt": {
339
  datasets: {
340
- "mix-fw_edu_hq-explanation_smollm2_1.7b_hq": "SmolLM2",
341
- "mix-fw_edu_hq-explanation_falcon3_1b_hq": "Falcon3",
342
- "mix-fw_edu_hq-explanation_granite3_1b_hq": "Granite3",
343
- "mix-fw_edu_hq-explanation_1b_hq": "Gemma-3",
344
- "mix-fw_edu_hq-explanation_llama3.2_1b_hq": "Llama-3.2",
345
- "mix-fw_edu_hq-explanation_qwen3_1.7b_hq": "Qwen3",
 
 
 
 
 
 
 
 
 
 
 
346
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
347
  }
348
  }
@@ -386,7 +386,7 @@ So far we've always mixed synthetic data with a <Glossary term="source dataset"
386
 
387
  #### Is synthetic data enough?
388
 
389
- We compare synthetic-only training vs mixed training (synthetic + source) for [tutorial](#tutorial) and [faq](#faq) prompts on DCLM and FineWeb-Edu-HQ sources. Synthetic-only training falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines.
390
 
391
  <HtmlEmbed
392
  id="synthetic-only"
@@ -460,7 +460,7 @@ The mix-in dataset matters enormously. But what about the source dataset we feed
460
 
461
  #### Does the source dataset matter?
462
 
463
- We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [tutorial](#tutorial) and [faq](#faq) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.
464
 
465
  <HtmlEmbed
466
  id="source-dataset-mixin-source"
@@ -468,15 +468,6 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
468
  desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
469
  config={{
470
  setups: {
471
- "Tutorial Prompt": {
472
- datasets: {
473
- "mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
474
- "mix-dclm-tutorial_1b_dclm": "Source: DCLM",
475
- "mix-cosmopedia-tutorial_1b_cosmopedia": "Source: Cosmopedia",
476
- "mix-fw_edu_lq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
477
- dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
478
- }
479
- },
480
  "FAQ Prompt": {
481
  datasets: {
482
  "mix-dclm-faq_1b_dclm": "Source: DCLM",
@@ -485,6 +476,15 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
485
  "mix-cosmopedia-faq_1b_cosmopedia": "Source: Cosmopedia",
486
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
487
  }
488
  }
489
  }
490
  }}
@@ -496,15 +496,6 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
496
  desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
497
  config={{
498
  setups: {
499
- "Tutorial Prompt": {
500
- datasets: {
501
- "mix-fw_edu_hq-tutorial_1b_dclm": "Source: DCLM",
502
- "mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
503
- "mix-fw_edu_hq-tutorial_1b_cosmopedia": "Source: Cosmopedia",
504
- "mix-fw_edu_hq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
505
- dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
506
- }
507
- },
508
  "FAQ Prompt": {
509
  datasets: {
510
  "mix-fw_edu_hq-faq_1b_dclm": "Source: DCLM",
@@ -513,6 +504,15 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
513
  "mix-fw_edu_hq-faq_1b_cosmopedia": "Source: Cosmopedia",
514
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
515
  }
516
  }
517
  }
518
  }}
@@ -608,7 +608,7 @@ Here are the key takeaways from our experiments:
608
  - **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
609
  A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
610
  - **Q: Can new prompts beat DCLM?**<br/>
611
- A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM. Narrative, Explanation, Article, Commentary, and Discussion do not.
612
  - **Q: Does model size matter?**<br/>
613
  A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
614
  - **Q: Do we need better models for low-quality data?**<br/>
@@ -628,4 +628,4 @@ Here are the key takeaways from our experiments:
628
  - **Q: Do typos in the prompt hurt?**<br/>
629
  A: No. Typos have no negative effect on downstream performance.
630
 
631
- So what actually matters? Prompt design, above all else. Structured formats like Math, Table, FAQ, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving. A 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
 
101
 
102
  ### Can New Prompts Beat DCLM?
103
 
104
+ Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), [faq](#faq), [math](#math), [narrative](#narrative), [table](#table), [tutorial](#tutorial)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), and [narrative](#narrative) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
105
 
106
  <HtmlEmbed
107
  id="new-prompts"
 
109
  desc="Nine new prompts compared against the DCLM baseline."
110
  config={{
111
  datasets: {
112
  "mix-fw_edu_hq-article_1b_hq": "Article",
 
113
  "mix-fw_edu_hq-commentary_1b_hq": "Commentary",
114
  "mix-fw_edu_hq-discussion_1b_hq": "Discussion",
115
+ "mix-fw_edu_hq-explanation_1b_hq": "Explanation",
116
+ "mix-fw_edu_hq-faq_1b_hq": "FAQ",
117
+ "mix-fw_edu_hq-math_1b_hq": "Math",
118
+ "mix-fw_edu_hq-narrative_1b_hq": "Narrative",
119
+ "mix-fw_edu_hq-table_1b_hq": "Table",
120
+ "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
121
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
122
  }
123
  }}
 
131
 
132
  #### Does the model size matter?
133
 
134
+ We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [math](#math), [tutorial](#tutorial), and REWIRE's [guided_rewrite](#guided_rewrite_original) prompts (use the Setup dropdown in <FigRef target="model-size" /> to switch between them).
135
+ For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1B through 27B show no significant difference.
136
  SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
137
  The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
138
  This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
 
148
  desc="Model sizes across Gemma-3 and SmolLM2. Use the Setup dropdown to compare across models and prompts."
149
  config={{
150
  setups: {
151
  "Gemma-3: Math": {
152
  datasets: {
153
  "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
 
168
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
169
  }
170
  },
171
+ "Gemma-3: Tutorial": {
172
+ datasets: {
173
+ "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
174
+ "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
175
+ "mix-fw_edu_hq-tutorial_4b_hq": "Gemma-3 4B",
176
+ "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3 1B",
177
+ "mix-fw_edu_hq-tutorial_270m_hq": "Gemma-3 270M",
178
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
179
+ }
180
+ },
181
  "SmolLM2: Tutorial": {
182
  datasets: {
183
  "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
 
194
 
195
  #### Do we need better models for rephrasing low-quality data?
196
 
197
+ The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
198
 
199
  <HtmlEmbed
200
  id="size-quality"
 
220
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
221
  }
222
  },
223
  "FAQ Prompt": {
224
  datasets: {
225
  "mix-fw_edu_hq-faq_1b_hq": "1B, HQ Source",
 
228
  "mix-fw_edu_hq-faq_12b_lq": "12B, LQ Source",
229
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
230
  }
231
+ },
232
+ "Tutorial Prompt": {
233
+ datasets: {
234
+ "mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
235
+ "mix-fw_edu_hq-tutorial_12b_hq": "12B, HQ Source",
236
+ "mix-fw_edu_hq-tutorial_12b_lq": "12B, LQ Source",
237
+ "mix-fw_edu_hq-tutorial_1b_lq": "1B, LQ Source",
238
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
239
+ }
240
  }
241
  }
242
  }}
 
280
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
281
  }
282
  },
283
+ "Explanation Prompt": {
284
  datasets: {
285
+ "mix-fw_edu_hq-explanation_smollm2_1.7b_hq": "SmolLM2",
286
+ "mix-fw_edu_hq-explanation_falcon3_1b_hq": "Falcon3",
287
+ "mix-fw_edu_hq-explanation_granite3_1b_hq": "Granite3",
288
+ "mix-fw_edu_hq-explanation_1b_hq": "Gemma-3",
289
+ "mix-fw_edu_hq-explanation_llama3.2_1b_hq": "Llama-3.2",
290
+ "mix-fw_edu_hq-explanation_qwen3_1.7b_hq": "Qwen3",
291
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
292
  }
293
  },
294
  "FAQ Prompt": {
295
  datasets: {
296
  "mix-fw_edu_hq-faq_smollm2_1.7b_hq": "SmolLM2",
 
297
  "mix-fw_edu_hq-faq_falcon3_1b_hq": "Falcon3",
 
298
  "mix-fw_edu_hq-faq_granite3_1b_hq": "Granite3",
299
+ "mix-fw_edu_hq-faq_1b_hq": "Gemma-3",
300
+ "mix-fw_edu_hq-faq_llama3.2_1b_hq": "Llama-3.2",
301
  "mix-fw_edu_hq-faq_qwen3_1.7b_hq": "Qwen3",
302
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
303
  }
304
  },
305
  "Math Prompt": {
306
  datasets: {
307
  "mix-fw_edu_hq-math_smollm2_1.7b_hq": "SmolLM2",
 
324
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
325
  }
326
  },
327
+ "Table Prompt": {
328
  datasets: {
329
+ "mix-fw_edu_hq-table_smollm2_1.7b_hq": "SmolLM2",
330
+ "mix-fw_edu_hq-table_falcon3_1b_hq": "Falcon3",
331
+ "mix-fw_edu_hq-table_granite3_1b_hq": "Granite3",
332
+ "mix-fw_edu_hq-table_1b_hq": "Gemma-3",
333
+ "mix-fw_edu_hq-table_llama3.2_1b_hq": "Llama-3.2",
334
+ "mix-fw_edu_hq-table_qwen3_1.7b_hq": "Qwen3",
335
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
336
+ }
337
+ },
338
+ "Tutorial Prompt": {
339
+ datasets: {
340
+ "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
341
+ "mix-fw_edu_hq-tutorial_falcon3_1b_hq": "Falcon3",
342
+ "mix-fw_edu_hq-tutorial_granite3_1b_hq": "Granite3",
343
+ "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3",
344
+ "mix-fw_edu_hq-tutorial_llama3.2_1b_hq": "Llama-3.2",
345
+ "mix-fw_edu_hq-tutorial_qwen3_1.7b_hq": "Qwen3",
346
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
347
  }
348
  }
 
386
 
387
  #### Is synthetic data enough?
388
 
389
+ We compare synthetic-only training vs mixed training (synthetic + source) for [faq](#faq) and [tutorial](#tutorial) prompts on DCLM and FineWeb-Edu-HQ sources. Synthetic-only training falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines.
390
 
391
  <HtmlEmbed
392
  id="synthetic-only"
 
460
 
461
  #### Does the source dataset matter?
462
 
463
+ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [faq](#faq) and [tutorial](#tutorial) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.
464
 
465
  <HtmlEmbed
466
  id="source-dataset-mixin-source"
 
468
  desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
469
  config={{
470
  setups: {
471
  "FAQ Prompt": {
472
  datasets: {
473
  "mix-dclm-faq_1b_dclm": "Source: DCLM",
 
476
  "mix-cosmopedia-faq_1b_cosmopedia": "Source: Cosmopedia",
477
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
478
  }
479
+ },
480
+ "Tutorial Prompt": {
481
+ datasets: {
482
+ "mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
483
+ "mix-dclm-tutorial_1b_dclm": "Source: DCLM",
484
+ "mix-cosmopedia-tutorial_1b_cosmopedia": "Source: Cosmopedia",
485
+ "mix-fw_edu_lq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
486
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
487
+ }
488
  }
489
  }
490
  }}
 
496
  desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
497
  config={{
498
  setups: {
499
  "FAQ Prompt": {
500
  datasets: {
501
  "mix-fw_edu_hq-faq_1b_dclm": "Source: DCLM",
 
504
  "mix-fw_edu_hq-faq_1b_cosmopedia": "Source: Cosmopedia",
505
  dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
506
  }
507
+ },
508
+ "Tutorial Prompt": {
509
+ datasets: {
510
+ "mix-fw_edu_hq-tutorial_1b_dclm": "Source: DCLM",
511
+ "mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
512
+ "mix-fw_edu_hq-tutorial_1b_cosmopedia": "Source: Cosmopedia",
513
+ "mix-fw_edu_hq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
514
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
515
+ }
516
  }
517
  }
518
  }}
 
608
  - **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
609
  A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
610
  - **Q: Can new prompts beat DCLM?**<br/>
611
+ A: Yes. FAQ, Math, Table, and Tutorial all outperform DCLM. Article, Commentary, Discussion, Explanation, and Narrative do not.
612
  - **Q: Does model size matter?**<br/>
613
  A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
614
  - **Q: Do we need better models for low-quality data?**<br/>
 
628
  - **Q: Do typos in the prompt hurt?**<br/>
629
  A: No. Typos have no negative effect on downstream performance.
630
 
631
+ So what actually matters? Prompt design, above all else. Structured formats like FAQ, Math, Table, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving. A 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.