joelniklaus HF Staff committed on
Commit
1f06392
·
1 Parent(s): 5779b6e

add swallowmathv2 finding

app/src/content/bibliography.bib CHANGED
@@ -90,6 +90,16 @@
   url = {https://arxiv.org/abs/2501.19393}
 }
 
+@misc{swallowmathv2,
+  title = {Rewriting Pre-Training Data Boosts LLM Performance in Math and Code},
+  author = {Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
+  year = {2025},
+  eprint = {2505.02881},
+  archiveprefix = {arXiv},
+  primaryclass = {cs.LG},
+  url = {https://arxiv.org/abs/2505.02881}
+}
+
 % Synthetic data methods
 @inproceedings{demystifyingsynth,
   title = {Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls},
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -184,7 +184,7 @@ For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1
 SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
 The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
 This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
-The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
+The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data, and with SwallowMath-v2 [@swallowmathv2], which reports no downstream gains on math data from scaling the rewriter from Qwen3-30B-A3B to Qwen3-235B-A22B. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
 
 That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
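As a quick sanity check on the new bibliography entry (a minimal sketch, not part of the commit), one can verify that the added `@misc{swallowmathv2, ...}` block has balanced braces and carries the fields an arXiv-style entry typically needs. The `check_entry` helper below is hypothetical, not a function from any BibTeX library:

```python
# Minimal sanity check for the swallowmathv2 entry added in this commit.
# check_entry is an illustrative helper, not a real BibTeX parser.
entry = r"""@misc{swallowmathv2,
  title = {Rewriting Pre-Training Data Boosts LLM Performance in Math and Code},
  author = {Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
  year = {2025},
  eprint = {2505.02881},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  url = {https://arxiv.org/abs/2505.02881}
}"""

def check_entry(src: str) -> list[str]:
    """Return a list of problems found in a single BibTeX entry (empty if none)."""
    problems = []
    # Every field value and the entry itself are brace-delimited, so the
    # counts must match for the entry to be well-formed.
    if src.count("{") != src.count("}"):
        problems.append("unbalanced braces")
    # Fields a citation manager generally expects for an arXiv reference.
    for field in ("title", "author", "year", "eprint", "url"):
        if f"{field} = " not in src:
            problems.append(f"missing field: {field}")
    return problems

print(check_entry(entry))  # → []
```

A check like this can run in CI on `bibliography.bib` so that a malformed entry fails the build instead of silently breaking citations like `[@swallowmathv2]` at render time.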