joelniklaus (HF Staff) committed
Commit 716b5c6 · 1 Parent(s): 1ce2b86

reformatted experiment takeaways into table

app/src/content/chapters/3-experiments.mdx CHANGED
@@ -638,30 +638,29 @@ With that final detail in hand, let's take stock of everything we've found.
 
  Let's step back and summarize what we learned:
 
- - **Q: How do existing datasets compare?**<br/>
- A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
- - **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
- A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
- - **Q: Can new prompts beat DCLM?**<br/>
- A: Yes. FAQ, Math, Table, and Tutorial all outperform DCLM. Article, Commentary, Discussion, Explanation, and Narrative do not.
- - **Q: Does model size matter?**<br/>
- A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
- - **Q: Do we need better models for low-quality data?**<br/>
- A: No consistent advantage from larger models on low-quality sources.
- - **Q: Does the model family matter?**<br/>
- A: Yes. SmolLM2 dominates across all prompts.
- - **Q: Does the model generation matter?**<br/>
- A: Slightly. Newer Qwen versions trend better.
- - **Q: Is synthetic data enough?**<br/>
- A: No. Always mix synthetic with original data.
- - **Q: Does the mix-in dataset matter?**<br/>
- A: Yes, a major performance driver, sometimes more important than the synthetic data.
- - **Q: Does the source dataset matter?**<br/>
- A: Not with a strong mix-in. Even low-quality sources produce competitive results.
- - **Q: Does increased diversity help?**<br/>
- A: No, performance averages rather than compounds.
- - **Q: Do typos in the prompt hurt?**<br/>
- A: No. Typos have no negative effect on downstream performance.
+ <table className="wrap-text" style={{width: '100%', tableLayout: 'fixed', borderCollapse: 'collapse', marginBottom: '1.5rem'}}>
+ <colgroup>
+ <col style={{width: '40%'}} />
+ <col style={{width: '60%'}} />
+ </colgroup>
+ <thead>
+ <tr><th style={{textAlign: 'left'}}>Question</th><th style={{textAlign: 'left'}}>Answer</th></tr>
+ </thead>
+ <tbody>
+ <tr><td>How do existing datasets compare?</td><td>DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.</td></tr>
+ <tr><td>Which individual prompts from the synthetic baselines match DCLM?</td><td>Only Diverse QA Pairs and REWIRE's Guided Rewrite.</td></tr>
+ <tr><td>Can new prompts beat DCLM?</td><td>Yes. FAQ, Math, Table, and Tutorial all outperform DCLM. Article, Commentary, Discussion, Explanation, and Narrative do not.</td></tr>
+ <tr><td>Does model size matter?</td><td>Not much. 1B is sufficient for simple prompts, 4B for complex ones.</td></tr>
+ <tr><td>Do we need better models for low-quality data?</td><td>No consistent advantage from larger models on low-quality sources.</td></tr>
+ <tr><td>Does the model family matter?</td><td>Yes. SmolLM2 dominates across all prompts.</td></tr>
+ <tr><td>Does the model generation matter?</td><td>Slightly. Newer Qwen versions trend better.</td></tr>
+ <tr><td>Is synthetic data enough?</td><td>No. Always mix synthetic with original data.</td></tr>
+ <tr><td>Does the mix-in dataset matter?</td><td>Yes, a major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge), and the best choice depends on source data quality.</td></tr>
+ <tr><td>Does the source dataset matter?</td><td>Not with a strong mix-in. Even low-quality sources produce competitive results.</td></tr>
+ <tr><td>Does increased diversity help?</td><td>No, performance averages rather than compounds.</td></tr>
+ <tr><td>Do typos in the prompt hurt?</td><td>No. Typos have no negative effect on downstream performance.</td></tr>
+ </tbody>
+ </table>
 
  So what actually matters? Prompt design, above all else. Structured formats like FAQ, Math, Table, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving: a 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
app/src/styles/components/_table.css CHANGED
@@ -17,6 +17,12 @@
  vertical-align: top;
  }
 
+ .content-grid main table.wrap-text th,
+ .content-grid main table.wrap-text td {
+ white-space: normal;
+ word-break: break-word;
+ }
+
  .content-grid main thead th {
  border-bottom: 1px solid var(--border-color);
  }
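
The commit rewrites the Q/A bullet list into `<tr>` rows by hand. For illustration only, the same mechanical transformation can be sketched in a few lines of Python; the function name and sample data below are hypothetical, not part of this repo.

```python
# Hypothetical sketch: render (question, answer) takeaway pairs as the
# JSX <tr> rows used by the table in 3-experiments.mdx.

def qa_pairs_to_rows(pairs: list[tuple[str, str]]) -> str:
    """Return one <tr><td>Q</td><td>A</td></tr> line per takeaway."""
    return "\n".join(
        f"<tr><td>{question}</td><td>{answer}</td></tr>"
        for question, answer in pairs
    )

# Two sample rows taken from the table above.
rows = qa_pairs_to_rows([
    ("Does model size matter?",
     "Not much. 1B is sufficient for simple prompts, 4B for complex ones."),
    ("Is synthetic data enough?",
     "No. Always mix synthetic with original data."),
])
print(rows)
```

Generating the rows this way (rather than editing by hand) keeps question/answer content and markup separate, which makes later wording tweaks a one-line change.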