Commit · 230393c
Parent(s): 49ef907
added conclusions paragraph
app/src/content/chapters/conclusions.mdx
CHANGED
@@ -1,5 +1,6 @@
 ## Conclusions
 
+We ran 65 experiments, generated over 750 billion tokens, and spent more than 74,000 GPU hours to figure out what actually matters for synthetic pretraining data. The answer is surprisingly simple: **prompt design is the single biggest lever**. Structured formats like Table, Math, FAQ, and Tutorial consistently beat both curated web baselines and prior synthetic methods, producing our best configuration, FinePhrase. You don't need a large rephrasing model to get there. A 1B model is sufficient for most prompts, and even low-quality source data works fine when paired with a strong mix-in dataset. In fact, template diversity matters more than template polish: a messier model that produces varied outputs can outperform a polished one that repeats the same structure. SmolLM2-1.7B emerged as the best rephrasing model across all prompts, beating larger models from other families. And we found no reliable proxy metric that can replace training and evaluating a model, meaning there is no shortcut around the full pipeline. We open-source all infrastructure, prompts, and benchmarking code through DataTrove so others can build on these findings without reinventing the plumbing.
 
 ### Next Steps
 