Commit c33d5c0 · Parent(s): 4138808
made sure prompts are always alphabetically ordered within categories
app/src/content/chapters/3-experiments.mdx CHANGED
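The change this commit makes — alphabetizing prompt entries within each category while keeping the baseline entry last — can be sketched as a small helper. This is an illustrative Python sketch, not tooling from the repository; the function name and the baseline-last convention are assumptions inferred from the configs in the diff below.

```python
# Hypothetical helper mirroring this commit's ordering rule: within a
# category, prompt entries are sorted alphabetically by display label,
# while baseline entries (dict-valued, e.g. dclm) are kept at the end.

def order_datasets(datasets: dict) -> dict:
    """Return a new mapping with prompt entries alphabetized by label."""
    baselines = {k: v for k, v in datasets.items() if isinstance(v, dict)}
    prompts = {k: v for k, v in datasets.items() if not isinstance(v, dict)}
    ordered = dict(sorted(prompts.items(), key=lambda kv: kv[1]))
    ordered.update(baselines)  # baseline (e.g. dclm) stays last
    return ordered

# Subset of the "new prompts" config from this diff, in pre-commit order.
new_prompts = {
    "mix-fw_edu_hq-math_1b_hq": "Math",
    "mix-fw_edu_hq-table_1b_hq": "Table",
    "mix-fw_edu_hq-faq_1b_hq": "FAQ",
    "mix-fw_edu_hq-article_1b_hq": "Article",
    "dclm": {"display": "Baseline (DCLM)", "color": "#8b8b8b", "baseline": True},
}
print(list(order_datasets(new_prompts).values())[:4])  # ['Article', 'FAQ', 'Math', 'Table']
```

Since Python 3.7 `dict` preserves insertion order, so rebuilding the mapping in sorted order is enough to fix the rendered legend order.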
@@ -101,7 +101,7 @@ Can we design prompts that consistently beat DCLM?
| 101 |
| 102 | ### Can New Prompts Beat DCLM?
| 103 |
| 104 | - Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([
| 105 |
| 106 | <HtmlEmbed
| 107 |   id="new-prompts"
@@ -109,15 +109,15 @@ Since most existing prompts fail to beat DCLM, we designed nine novel prompt for
| 109 |   desc="Nine new prompts compared against the DCLM baseline."
| 110 |   config={{
| 111 |     datasets: {
| 112 | -     "mix-fw_edu_hq-math_1b_hq": "Math",
| 113 | -     "mix-fw_edu_hq-table_1b_hq": "Table",
| 114 | -     "mix-fw_edu_hq-faq_1b_hq": "FAQ",
| 115 | -     "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
| 116 | -     "mix-fw_edu_hq-narrative_1b_hq": "Narrative",
| 117 |       "mix-fw_edu_hq-article_1b_hq": "Article",
| 118 | -     "mix-fw_edu_hq-explanation_1b_hq": "Explanation",
| 119 |       "mix-fw_edu_hq-commentary_1b_hq": "Commentary",
| 120 |       "mix-fw_edu_hq-discussion_1b_hq": "Discussion",
| 121 |       dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 122 |     }
| 123 |   }}
@@ -131,8 +131,8 @@ We want to know whether using a stronger model leads to better synthetic data. W
| 131 |
| 132 | #### Does the model size matter?
| 133 |
| 134 | - We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [
| 135 | - For [
| 136 | SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
| 137 | The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
| 138 | This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
@@ -148,16 +148,6 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
| 148 |   desc="Model sizes across Gemma-3 and SmolLM2. Use the Setup dropdown to compare across models and prompts."
| 149 |   config={{
| 150 |     setups: {
| 151 | -     "Gemma-3: Tutorial": {
| 152 | -       datasets: {
| 153 | -         "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
| 154 | -         "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
| 155 | -         "mix-fw_edu_hq-tutorial_4b_hq": "Gemma-3 4B",
| 156 | -         "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3 1B",
| 157 | -         "mix-fw_edu_hq-tutorial_270m_hq": "Gemma-3 270M",
| 158 | -         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 159 | -       }
| 160 | -     },
| 161 |       "Gemma-3: Math": {
| 162 |         datasets: {
| 163 |           "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
@@ -178,6 +168,16 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
| 178 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 179 |         }
| 180 |       },
| 181 |       "SmolLM2: Tutorial": {
| 182 |         datasets: {
| 183 |           "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
@@ -194,7 +194,7 @@ On high-quality source data, we see no evidence that larger models help. But REW
| 194 |
| 195 | #### Do we need better models for rephrasing low-quality data?
| 196 |
| 197 | - The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [
| 198 |
| 199 | <HtmlEmbed
| 200 |   id="size-quality"
@@ -220,15 +220,6 @@ The REWIRE [@rewire] paper claims that upcycling low-quality data requires large
| 220 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 221 |         }
| 222 |       },
| 223 | -     "Tutorial Prompt": {
| 224 | -       datasets: {
| 225 | -         "mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
| 226 | -         "mix-fw_edu_hq-tutorial_12b_hq": "12B, HQ Source",
| 227 | -         "mix-fw_edu_hq-tutorial_12b_lq": "12B, LQ Source",
| 228 | -         "mix-fw_edu_hq-tutorial_1b_lq": "1B, LQ Source",
| 229 | -         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 230 | -       }
| 231 | -     },
| 232 |       "FAQ Prompt": {
| 233 |         datasets: {
| 234 |           "mix-fw_edu_hq-faq_1b_hq": "1B, HQ Source",
@@ -237,6 +228,15 @@ The REWIRE [@rewire] paper claims that upcycling low-quality data requires large
| 237 |           "mix-fw_edu_hq-faq_12b_lq": "12B, LQ Source",
| 238 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 239 |         }
| 240 |       }
| 241 |     }
| 242 |   }}
@@ -280,39 +280,28 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
| 280 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 281 |         }
| 282 |       },
| 283 | -     "
| 284 |         datasets: {
| 285 | -         "mix-fw_edu_hq-
| 286 | -         "mix-fw_edu_hq-
| 287 | -         "mix-fw_edu_hq-
| 288 | -         "mix-fw_edu_hq-
| 289 | -         "mix-fw_edu_hq-
| 290 | -         "mix-fw_edu_hq-
| 291 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 292 |         }
| 293 |       },
| 294 |       "FAQ Prompt": {
| 295 |         datasets: {
| 296 |           "mix-fw_edu_hq-faq_smollm2_1.7b_hq": "SmolLM2",
| 297 | -         "mix-fw_edu_hq-faq_llama3.2_1b_hq": "Llama-3.2",
| 298 |           "mix-fw_edu_hq-faq_falcon3_1b_hq": "Falcon3",
| 299 | -         "mix-fw_edu_hq-faq_1b_hq": "Gemma-3",
| 300 |           "mix-fw_edu_hq-faq_granite3_1b_hq": "Granite3",
| 301 |           "mix-fw_edu_hq-faq_qwen3_1.7b_hq": "Qwen3",
| 302 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 303 |         }
| 304 |       },
| 305 | -     "Table Prompt": {
| 306 | -       datasets: {
| 307 | -         "mix-fw_edu_hq-table_smollm2_1.7b_hq": "SmolLM2",
| 308 | -         "mix-fw_edu_hq-table_falcon3_1b_hq": "Falcon3",
| 309 | -         "mix-fw_edu_hq-table_granite3_1b_hq": "Granite3",
| 310 | -         "mix-fw_edu_hq-table_qwen3_1.7b_hq": "Qwen3",
| 311 | -         "mix-fw_edu_hq-table_llama3.2_1b_hq": "Llama-3.2",
| 312 | -         "mix-fw_edu_hq-table_1b_hq": "Gemma-3",
| 313 | -         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 314 | -       }
| 315 | -     },
| 316 |       "Math Prompt": {
| 317 |         datasets: {
| 318 |           "mix-fw_edu_hq-math_smollm2_1.7b_hq": "SmolLM2",
@@ -335,14 +324,25 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
| 335 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 336 |         }
| 337 |       },
| 338 | -     "
| 339 |         datasets: {
| 340 | -         "mix-fw_edu_hq-
| 341 | -         "mix-fw_edu_hq-
| 342 | -         "mix-fw_edu_hq-
| 343 | -         "mix-fw_edu_hq-
| 344 | -         "mix-fw_edu_hq-
| 345 | -         "mix-fw_edu_hq-
| 346 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 347 |         }
| 348 |       }
@@ -386,7 +386,7 @@ So far we've always mixed synthetic data with a <Glossary term="source dataset"
| 386 |
| 387 | #### Is synthetic data enough?
| 388 |
| 389 | - We compare synthetic-only training vs mixed training (synthetic + source) for [
| 390 |
| 391 | <HtmlEmbed
| 392 |   id="synthetic-only"
@@ -460,7 +460,7 @@ The mix-in dataset matters enormously. But what about the source dataset we feed
| 460 |
| 461 | #### Does the source dataset matter?
| 462 |
| 463 | - We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [
| 464 |
| 465 | <HtmlEmbed
| 466 |   id="source-dataset-mixin-source"
@@ -468,15 +468,6 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
| 468 |   desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
| 469 |   config={{
| 470 |     setups: {
| 471 | -     "Tutorial Prompt": {
| 472 | -       datasets: {
| 473 | -         "mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
| 474 | -         "mix-dclm-tutorial_1b_dclm": "Source: DCLM",
| 475 | -         "mix-cosmopedia-tutorial_1b_cosmopedia": "Source: Cosmopedia",
| 476 | -         "mix-fw_edu_lq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
| 477 | -         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 478 | -       }
| 479 | -     },
| 480 |       "FAQ Prompt": {
| 481 |         datasets: {
| 482 |           "mix-dclm-faq_1b_dclm": "Source: DCLM",
@@ -485,6 +476,15 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
| 485 |           "mix-cosmopedia-faq_1b_cosmopedia": "Source: Cosmopedia",
| 486 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 487 |         }
| 488 |       }
| 489 |     }
| 490 |   }}
@@ -496,15 +496,6 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
| 496 |   desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
| 497 |   config={{
| 498 |     setups: {
| 499 | -     "Tutorial Prompt": {
| 500 | -       datasets: {
| 501 | -         "mix-fw_edu_hq-tutorial_1b_dclm": "Source: DCLM",
| 502 | -         "mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
| 503 | -         "mix-fw_edu_hq-tutorial_1b_cosmopedia": "Source: Cosmopedia",
| 504 | -         "mix-fw_edu_hq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
| 505 | -         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 506 | -       }
| 507 | -     },
| 508 |       "FAQ Prompt": {
| 509 |         datasets: {
| 510 |           "mix-fw_edu_hq-faq_1b_dclm": "Source: DCLM",
@@ -513,6 +504,15 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
| 513 |           "mix-fw_edu_hq-faq_1b_cosmopedia": "Source: Cosmopedia",
| 514 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 515 |         }
| 516 |       }
| 517 |     }
| 518 |   }}
@@ -608,7 +608,7 @@ Here are the key takeaways from our experiments:
| 608 | - **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
| 609 |   A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
| 610 | - **Q: Can new prompts beat DCLM?**<br/>
| 611 | -   A: Yes. Math, Table,
| 612 | - **Q: Does model size matter?**<br/>
| 613 |   A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
| 614 | - **Q: Do we need better models for low-quality data?**<br/>
@@ -628,4 +628,4 @@ Here are the key takeaways from our experiments:
| 628 | - **Q: Do typos in the prompt hurt?**<br/>
| 629 |   A: No. Typos have no negative effect on downstream performance.
| 630 |
| 631 | -   So what actually matters? Prompt design, above all else. Structured formats like Math, Table,
| 101 |
| 102 | ### Can New Prompts Beat DCLM?
| 103 |
| 104 | + Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), [faq](#faq), [math](#math), [narrative](#narrative), [table](#table), [tutorial](#tutorial)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), and [narrative](#narrative) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
| 105 |
| 106 | <HtmlEmbed
| 107 |   id="new-prompts"
| 109 |   desc="Nine new prompts compared against the DCLM baseline."
| 110 |   config={{
| 111 |     datasets: {
| 112 |       "mix-fw_edu_hq-article_1b_hq": "Article",
| 113 |       "mix-fw_edu_hq-commentary_1b_hq": "Commentary",
| 114 |       "mix-fw_edu_hq-discussion_1b_hq": "Discussion",
| 115 | +     "mix-fw_edu_hq-explanation_1b_hq": "Explanation",
| 116 | +     "mix-fw_edu_hq-faq_1b_hq": "FAQ",
| 117 | +     "mix-fw_edu_hq-math_1b_hq": "Math",
| 118 | +     "mix-fw_edu_hq-narrative_1b_hq": "Narrative",
| 119 | +     "mix-fw_edu_hq-table_1b_hq": "Table",
| 120 | +     "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
| 121 |       dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 122 |     }
| 123 |   }}
| 131 |
| 132 | #### Does the model size matter?
| 133 |
| 134 | + We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [math](#math), [tutorial](#tutorial), and REWIRE's [guided_rewrite](#guided_rewrite_original) prompts (use the Setup dropdown in <FigRef target="model-size" /> to switch between them).
| 135 | + For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1B through 27B show no significant difference.
| 136 | SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
| 137 | The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
| 138 | This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
| 148 |   desc="Model sizes across Gemma-3 and SmolLM2. Use the Setup dropdown to compare across models and prompts."
| 149 |   config={{
| 150 |     setups: {
| 151 |       "Gemma-3: Math": {
| 152 |         datasets: {
| 153 |           "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
| … |
| 168 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 169 |         }
| 170 |       },
| 171 | +     "Gemma-3: Tutorial": {
| 172 | +       datasets: {
| 173 | +         "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
| 174 | +         "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
| 175 | +         "mix-fw_edu_hq-tutorial_4b_hq": "Gemma-3 4B",
| 176 | +         "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3 1B",
| 177 | +         "mix-fw_edu_hq-tutorial_270m_hq": "Gemma-3 270M",
| 178 | +         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 179 | +       }
| 180 | +     },
| 181 |       "SmolLM2: Tutorial": {
| 182 |         datasets: {
| 183 |           "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
| 194 |
| 195 | #### Do we need better models for rephrasing low-quality data?
| 196 |
| 197 | + The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
| 198 |
| 199 | <HtmlEmbed
| 200 |   id="size-quality"
| 220 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 221 |         }
| 222 |       },
| 223 |       "FAQ Prompt": {
| 224 |         datasets: {
| 225 |           "mix-fw_edu_hq-faq_1b_hq": "1B, HQ Source",
| … |
| 228 |           "mix-fw_edu_hq-faq_12b_lq": "12B, LQ Source",
| 229 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 230 |         }
| 231 | +     },
| 232 | +     "Tutorial Prompt": {
| 233 | +       datasets: {
| 234 | +         "mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
| 235 | +         "mix-fw_edu_hq-tutorial_12b_hq": "12B, HQ Source",
| 236 | +         "mix-fw_edu_hq-tutorial_12b_lq": "12B, LQ Source",
| 237 | +         "mix-fw_edu_hq-tutorial_1b_lq": "1B, LQ Source",
| 238 | +         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 239 | +       }
| 240 |       }
| 241 |     }
| 242 |   }}
| 280 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 281 |         }
| 282 |       },
| 283 | +     "Explanation Prompt": {
| 284 |         datasets: {
| 285 | +         "mix-fw_edu_hq-explanation_smollm2_1.7b_hq": "SmolLM2",
| 286 | +         "mix-fw_edu_hq-explanation_falcon3_1b_hq": "Falcon3",
| 287 | +         "mix-fw_edu_hq-explanation_granite3_1b_hq": "Granite3",
| 288 | +         "mix-fw_edu_hq-explanation_1b_hq": "Gemma-3",
| 289 | +         "mix-fw_edu_hq-explanation_llama3.2_1b_hq": "Llama-3.2",
| 290 | +         "mix-fw_edu_hq-explanation_qwen3_1.7b_hq": "Qwen3",
| 291 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 292 |         }
| 293 |       },
| 294 |       "FAQ Prompt": {
| 295 |         datasets: {
| 296 |           "mix-fw_edu_hq-faq_smollm2_1.7b_hq": "SmolLM2",
| 297 |           "mix-fw_edu_hq-faq_falcon3_1b_hq": "Falcon3",
| 298 |           "mix-fw_edu_hq-faq_granite3_1b_hq": "Granite3",
| 299 | +         "mix-fw_edu_hq-faq_1b_hq": "Gemma-3",
| 300 | +         "mix-fw_edu_hq-faq_llama3.2_1b_hq": "Llama-3.2",
| 301 |           "mix-fw_edu_hq-faq_qwen3_1.7b_hq": "Qwen3",
| 302 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 303 |         }
| 304 |       },
| 305 |       "Math Prompt": {
| 306 |         datasets: {
| 307 |           "mix-fw_edu_hq-math_smollm2_1.7b_hq": "SmolLM2",
| 324 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 325 |         }
| 326 |       },
| 327 | +     "Table Prompt": {
| 328 |         datasets: {
| 329 | +         "mix-fw_edu_hq-table_smollm2_1.7b_hq": "SmolLM2",
| 330 | +         "mix-fw_edu_hq-table_falcon3_1b_hq": "Falcon3",
| 331 | +         "mix-fw_edu_hq-table_granite3_1b_hq": "Granite3",
| 332 | +         "mix-fw_edu_hq-table_1b_hq": "Gemma-3",
| 333 | +         "mix-fw_edu_hq-table_llama3.2_1b_hq": "Llama-3.2",
| 334 | +         "mix-fw_edu_hq-table_qwen3_1.7b_hq": "Qwen3",
| 335 | +         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 336 | +       }
| 337 | +     },
| 338 | +     "Tutorial Prompt": {
| 339 | +       datasets: {
| 340 | +         "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
| 341 | +         "mix-fw_edu_hq-tutorial_falcon3_1b_hq": "Falcon3",
| 342 | +         "mix-fw_edu_hq-tutorial_granite3_1b_hq": "Granite3",
| 343 | +         "mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3",
| 344 | +         "mix-fw_edu_hq-tutorial_llama3.2_1b_hq": "Llama-3.2",
| 345 | +         "mix-fw_edu_hq-tutorial_qwen3_1.7b_hq": "Qwen3",
| 346 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 347 |         }
| 348 |       }
| 386 |
| 387 | #### Is synthetic data enough?
| 388 |
| 389 | + We compare synthetic-only training vs mixed training (synthetic + source) for [faq](#faq) and [tutorial](#tutorial) prompts on DCLM and FineWeb-Edu-HQ sources. Synthetic-only training falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines.
| 390 |
| 391 | <HtmlEmbed
| 392 |   id="synthetic-only"
| 460 |
| 461 | #### Does the source dataset matter?
| 462 |
| 463 | + We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [faq](#faq) and [tutorial](#tutorial) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When the mix-in varies with the source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.
| 464 |
| 465 | <HtmlEmbed
| 466 |   id="source-dataset-mixin-source"
| 468 |   desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
| 469 |   config={{
| 470 |     setups: {
| 471 |       "FAQ Prompt": {
| 472 |         datasets: {
| 473 |           "mix-dclm-faq_1b_dclm": "Source: DCLM",
| … |
| 476 |           "mix-cosmopedia-faq_1b_cosmopedia": "Source: Cosmopedia",
| 477 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 478 |         }
| 479 | +     },
| 480 | +     "Tutorial Prompt": {
| 481 | +       datasets: {
| 482 | +         "mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
| 483 | +         "mix-dclm-tutorial_1b_dclm": "Source: DCLM",
| 484 | +         "mix-cosmopedia-tutorial_1b_cosmopedia": "Source: Cosmopedia",
| 485 | +         "mix-fw_edu_lq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
| 486 | +         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 487 | +       }
| 488 |       }
| 489 |     }
| 490 |   }}
| 496 |   desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
| 497 |   config={{
| 498 |     setups: {
| 499 |       "FAQ Prompt": {
| 500 |         datasets: {
| 501 |           "mix-fw_edu_hq-faq_1b_dclm": "Source: DCLM",
| … |
| 504 |           "mix-fw_edu_hq-faq_1b_cosmopedia": "Source: Cosmopedia",
| 505 |           dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 506 |         }
| 507 | +     },
| 508 | +     "Tutorial Prompt": {
| 509 | +       datasets: {
| 510 | +         "mix-fw_edu_hq-tutorial_1b_dclm": "Source: DCLM",
| 511 | +         "mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
| 512 | +         "mix-fw_edu_hq-tutorial_1b_cosmopedia": "Source: Cosmopedia",
| 513 | +         "mix-fw_edu_hq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
| 514 | +         dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
| 515 | +       }
| 516 |       }
| 517 |     }
| 518 |   }}
| 608 | - **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
| 609 |   A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
| 610 | - **Q: Can new prompts beat DCLM?**<br/>
| 611 | +   A: Yes. FAQ, Math, Table, and Tutorial all outperform DCLM. Article, Commentary, Discussion, Explanation, and Narrative do not.
| 612 | - **Q: Does model size matter?**<br/>
| 613 |   A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
| 614 | - **Q: Do we need better models for low-quality data?**<br/>
| 628 | - **Q: Do typos in the prompt hurt?**<br/>
| 629 |   A: No. Typos have no negative effect on downstream performance.
| 630 |
| 631 | + So what actually matters? Prompt design, above all else. Structured formats like FAQ, Math, Table, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving. A 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
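The blending step in the closing recipe — combine synthetic data with high-quality original data under a fixed token budget — can be sketched as a tiny budget split. This is a minimal illustrative sketch; the function name and the 50/50 default ratio are assumptions, not values from the chapter.

```python
# Minimal sketch of the final recipe's blending step: split a fixed token
# budget between synthetic and high-quality original data. The 0.5 default
# ratio is an assumed placeholder, not a value from the experiments.

def build_mix(token_budget: int, synthetic_frac: float = 0.5) -> dict:
    """Allocate a token budget between synthetic and original data."""
    synthetic = int(token_budget * synthetic_frac)
    return {
        "synthetic_tokens": synthetic,
        "original_tokens": token_budget - synthetic,  # remainder goes to the mix-in
    }

mix = build_mix(1_000_000_000)
print(mix)  # {'synthetic_tokens': 500000000, 'original_tokens': 500000000}
```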