joelniklaus HF Staff committed on
Commit
5df08f8
·
1 Parent(s): 7c8644c

add overview of the findings in the conclusions

app/src/content/chapters/conclusions.mdx CHANGED
@@ -1,6 +1,31 @@
 ## Conclusions
 
-TODO: Table with answers to the questions (ablation sections)
+Here are the key takeaways from our experiments:
+
+- **Q: How do existing datasets compare?**<br/>
+  A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
+- **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
+  A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
+- **Q: Can new prompts beat DCLM?**<br/>
+  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM.
+- **Q: Does model size matter?**<br/>
+  A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
+- **Q: Do we need better models for low-quality data?**<br/>
+  A: No. Larger models show no consistent advantage on low-quality sources.
+- **Q: Does the model family matter?**<br/>
+  A: Yes. SmolLM2 dominates across all prompts.
+- **Q: Does the model generation matter?**<br/>
+  A: Slightly. Newer Qwen versions trend better.
+- **Q: Is synthetic data enough?**<br/>
+  A: No. Always mix synthetic with original data.
+- **Q: Does the mix-in dataset matter?**<br/>
+  A: Yes. It is a major performance driver, sometimes more important than the synthetic data itself.
+- **Q: Does the source dataset matter?**<br/>
+  A: Not with a strong mix-in. Even low-quality sources produce competitive results.
+- **Q: Does increased diversity help?**<br/>
+  A: No. Performance averages rather than compounds.
+- **Q: Do typos in the prompt hurt?**<br/>
+  A: No. Typos have no negative effect on downstream performance.
 
 ### Next Steps
 