joelniklaus HF Staff committed on
Commit
1f06392
·
1 Parent(s): 5779b6e

add swallowmathv2 finding

app/src/content/bibliography.bib CHANGED
@@ -90,6 +90,16 @@
   url = {https://arxiv.org/abs/2501.19393}
 }
 
+@misc{swallowmathv2,
+  title = {Rewriting Pre-Training Data Boosts LLM Performance in Math and Code},
+  author = {Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
+  year = {2025},
+  eprint = {2505.02881},
+  archiveprefix = {arXiv},
+  primaryclass = {cs.LG},
+  url = {https://arxiv.org/abs/2505.02881}
+}
+
 % Synthetic data methods
 @inproceedings{demystifyingsynth,
   title = {Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls},
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -184,7 +184,7 @@ For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1
 SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
 The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
 This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
-The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
+The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data, and with SwallowMath-v2 [@swallowmathv2], which reports no downstream gains on math data from scaling the rewriter from Qwen3-30B-A3B to Qwen3-235B-A22B. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
 
 That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
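As a quick sanity check on the new bibliography entry (a minimal sketch, not part of the commit), one can verify that the added `@misc{swallowmathv2, ...}` block has balanced braces and carries the fields an arXiv-style entry typically needs. The `check_entry` helper below is hypothetical, not a function from any BibTeX library:

```python
# Minimal sanity check for the swallowmathv2 entry added in this commit.
# check_entry is an illustrative helper, not a real BibTeX parser.
entry = r"""@misc{swallowmathv2,
  title = {Rewriting Pre-Training Data Boosts LLM Performance in Math and Code},
  author = {Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
  year = {2025},
  eprint = {2505.02881},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  url = {https://arxiv.org/abs/2505.02881}
}"""

def check_entry(src: str) -> list[str]:
    """Return a list of problems found in a single BibTeX entry (empty if none)."""
    problems = []
    # Every field value and the entry itself are brace-delimited, so the
    # counts must match for the entry to be well-formed.
    if src.count("{") != src.count("}"):
        problems.append("unbalanced braces")
    # Fields a citation manager generally expects for an arXiv reference.
    for field in ("title", "author", "year", "eprint", "url"):
        if f"{field} = " not in src:
            problems.append(f"missing field: {field}")
    return problems

print(check_entry(entry))  # → []
```

A check like this can run in CI on `bibliography.bib` so that a malformed entry fails the build instead of silently breaking citations like `[@swallowmathv2]` at render time.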