joelniklaus (HF Staff) committed
Commit 716b5c6 · 1 Parent(s): 1ce2b86

reformatted experiment takeaways into table

app/src/content/chapters/3-experiments.mdx CHANGED
@@ -638,30 +638,29 @@ With that final detail in hand, let's take stock of everything we've found.
 
  Let's step back and summarize what we learned:
 
- - **Q: How do existing datasets compare?**<br/>
- A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
- - **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
- A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
- - **Q: Can new prompts beat DCLM?**<br/>
- A: Yes. FAQ, Math, Table, and Tutorial all outperform DCLM. Article, Commentary, Discussion, Explanation, and Narrative do not.
- - **Q: Does model size matter?**<br/>
- A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
- - **Q: Do we need better models for low-quality data?**<br/>
- A: No consistent advantage from larger models on low-quality sources.
- - **Q: Does the model family matter?**<br/>
- A: Yes. SmolLM2 dominates across all prompts.
- - **Q: Does the model generation matter?**<br/>
- A: Slightly. Newer Qwen versions trend better.
- - **Q: Is synthetic data enough?**<br/>
- A: No. Always mix synthetic with original data.
- - **Q: Does the mix-in dataset matter?**<br/>
- A: Yes, a major performance driver, sometimes more important than the synthetic data.
- - **Q: Does the source dataset matter?**<br/>
- A: Not with a strong mix-in. Even low-quality sources produce competitive results.
- - **Q: Does increased diversity help?**<br/>
- A: No, performance averages rather than compounds.
- - **Q: Do typos in the prompt hurt?**<br/>
- A: No. Typos have no negative effect on downstream performance.
+ <table className="wrap-text" style={{width: '100%', tableLayout: 'fixed', borderCollapse: 'collapse', marginBottom: '1.5rem'}}>
+ <colgroup>
+ <col style={{width: '40%'}} />
+ <col style={{width: '60%'}} />
+ </colgroup>
+ <thead>
+ <tr><th style={{textAlign: 'left'}}>Question</th><th style={{textAlign: 'left'}}>Answer</th></tr>
+ </thead>
+ <tbody>
+ <tr><td>How do existing datasets compare?</td><td>DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.</td></tr>
+ <tr><td>Which individual prompts from the synthetic baselines match DCLM?</td><td>Only Diverse QA Pairs and REWIRE's Guided Rewrite.</td></tr>
+ <tr><td>Can new prompts beat DCLM?</td><td>Yes. FAQ, Math, Table, and Tutorial all outperform DCLM. Article, Commentary, Discussion, Explanation, and Narrative do not.</td></tr>
+ <tr><td>Does model size matter?</td><td>Not much. 1B is sufficient for simple prompts, 4B for complex ones.</td></tr>
+ <tr><td>Do we need better models for low-quality data?</td><td>No consistent advantage from larger models on low-quality sources.</td></tr>
+ <tr><td>Does the model family matter?</td><td>Yes. SmolLM2 dominates across all prompts.</td></tr>
+ <tr><td>Does the model generation matter?</td><td>Slightly. Newer Qwen versions trend better.</td></tr>
+ <tr><td>Is synthetic data enough?</td><td>No. Always mix synthetic with original data.</td></tr>
+ <tr><td>Does the mix-in dataset matter?</td><td>Yes, a major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge), and the best choice depends on source data quality.</td></tr>
+ <tr><td>Does the source dataset matter?</td><td>Not with a strong mix-in. Even low-quality sources produce competitive results.</td></tr>
+ <tr><td>Does increased diversity help?</td><td>No, performance averages rather than compounds.</td></tr>
+ <tr><td>Do typos in the prompt hurt?</td><td>No. Typos have no negative effect on downstream performance.</td></tr>
+ </tbody>
+ </table>
 
  So what actually matters? Prompt design, above all else. Structured formats like FAQ, Math, Table, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving: a 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
app/src/styles/components/_table.css CHANGED
@@ -17,6 +17,12 @@
  vertical-align: top;
  }
 
+ .content-grid main table.wrap-text th,
+ .content-grid main table.wrap-text td {
+ white-space: normal;
+ word-break: break-word;
+ }
+
  .content-grid main thead th {
  border-bottom: 1px solid var(--border-color);
  }
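
The commit rewrites the Q/A bullet list into `<tr>` rows by hand. For illustration only, the same mechanical transformation can be sketched in a few lines of Python; the function name and sample data below are hypothetical, not part of this repo.

```python
# Hypothetical sketch: render (question, answer) takeaway pairs as the
# JSX <tr> rows used by the table in 3-experiments.mdx.

def qa_pairs_to_rows(pairs: list[tuple[str, str]]) -> str:
    """Return one <tr><td>Q</td><td>A</td></tr> line per takeaway."""
    return "\n".join(
        f"<tr><td>{question}</td><td>{answer}</td></tr>"
        for question, answer in pairs
    )

# Two sample rows taken from the table above.
rows = qa_pairs_to_rows([
    ("Does model size matter?",
     "Not much. 1B is sufficient for simple prompts, 4B for complex ones."),
    ("Is synthetic data enough?",
     "No. Always mix synthetic with original data."),
])
print(rows)
```

Generating the rows this way (rather than editing by hand) keeps question/answer content and markup separate, which makes later wording tweaks a one-line change.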