Commit · 230393c
Parent(s): 49ef907
added conclusions paragraph
app/src/content/chapters/conclusions.mdx
CHANGED
@@ -1,5 +1,6 @@
 ## Conclusions
 
+We ran 65 experiments, generated over 750 billion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there. A 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish: a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.
 
 ### Next Steps
 