joelniklaus (HF Staff) committed

Commit 8260aee · Parent(s): bb0d7f1

add some more related work
app/src/content/bibliography.bib CHANGED
@@ -100,6 +100,24 @@
 }
 
 % Synthetic data methods
+@inproceedings{demystifyingsynth,
+  title = {Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls},
+  author = {Feiyang Kang and Newsha Ardalani and Michael Kuchnik and Youssef Emad and Mostafa Elhoushi and Shubhabrata Sengupta and Shang-Wen Li and Ramya Raghavendra and Ruoxi Jia and Carole-Jean Wu},
+  booktitle = {Conference on Empirical Methods in Natural Language Processing},
+  year = {2025},
+  url = {https://aclanthology.org/2025.emnlp-main.544/}
+}
+
+@misc{syntheticcpt,
+  title = {Synthetic Continued Pretraining},
+  author = {Zitong Yang and Neil Band and Shuangping Li and Emmanuel Candès and Tatsunori Hashimoto},
+  year = {2024},
+  eprint = {2409.07431},
+  archiveprefix = {arXiv},
+  primaryclass = {cs.CL},
+  url = {https://arxiv.org/abs/2409.07431}
+}
+
 @inproceedings{wrap,
   title = {Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling},
   author = {Pratyush Maini and Skyler Seto and Richard He Bai and David Grangier and Yizhe Zhang and Navdeep Jaitly},
app/src/content/chapters/1-introduction.mdx CHANGED
@@ -68,7 +68,7 @@ The sections below are fairly self-contained, so feel free to jump around and sk
 <Note variant="info" title="But wait, what about model collapse?">
 You might be wondering: doesn't training on synthetic data inevitably lead to model collapse? This is a common misconception that stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
 
-In practice, nobody trains models this way. Real-world pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically rather than replacing entire training corpora. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data. The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
+In practice, nobody trains models this way. Real-world pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically rather than replacing entire training corpora. A large-scale empirical study training over 1,000 LLMs [@demystifyingsynth] confirms this nuanced picture: training on rephrased synthetic data mixed with natural web text (at around 30% synthetic) can speed up pretraining convergence by 5-10x, with no signs of degradation. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data. The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
 </Note>
 
 Want to learn how to make GPUs go brrr and generate synthetic tokens at scale like this? This blog is for you!
app/src/content/chapters/2-setup.mdx CHANGED
@@ -5,7 +5,7 @@ import Tab from "../../components/Tab.astro";
 
 ## Rephrasing the Web
 
-Several teams have already shown that rephrasing web content into cleaner formats can beat training on raw data: WRAP [@wrap] rewrites text in different styles, Nemotron-CC [@nemotroncc] extracts QA pairs and knowledge lists, REWIRE [@rewire] does guided rewriting, and BeyondWeb [@beyondweb] tries continuation and summarization. But nobody has done a systematic comparison across all these approaches, and the field still lacks a clear framework for what "rephrasing" even means. So let's fix that.
+Several teams have already shown that rephrasing web content into cleaner formats can beat training on raw data: WRAP [@wrap] rewrites text in different styles, Nemotron-CC [@nemotroncc] extracts QA pairs and knowledge lists, REWIRE [@rewire] does guided rewriting, BeyondWeb [@beyondweb] tries continuation and summarization, and EntiGraph [@syntheticcpt] uses entity-centric augmentation to synthesize diverse knowledge representations from small corpora. But nobody has done a systematic comparison across all these approaches, and the field still lacks a clear framework for what "rephrasing" even means. So let's fix that.
 
 ### What is Rephrasing?
 
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -185,7 +185,7 @@ For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1
 SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
 The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
 This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
-The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
+The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
 
 That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
 
@@ -447,7 +447,7 @@ The dream scenario would be generating all your training data synthetically, no
   }}
 />
 
-Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type.
+Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type. This echoes @demystifyingsynth, who found that pure synthetic data never outperforms natural web text alone, but mixing roughly 30% rephrased synthetic data with natural text can accelerate convergence by 5-10x.
 
 The per-benchmark view sharpens the picture. The benchmarks that benefit most from mixing are HellaSwag (+0.5 to +1.3pp) and, for most prompts, SQuAD (+4 to +12pp for Tutorial and FAQ). GSM8K doesn't move at all. The "always mix with original data" takeaway is driven primarily by commonsense recovery, not a uniform lift across all skills.
 
@@ -615,7 +615,7 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
   }}
 />
 
-None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds. This was a bit disappointing. That said, our ablations train on only 20B tokens, so diversity benefits may emerge at larger scales where the model can better exploit the varied signal.
+None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds. This was a bit disappointing. @syntheticcpt found that simple paraphrasing quickly saturates in their continued pretraining setting, while their entity-graph-based EntiGraph approach scales log-linearly by externalizing diversity to a combinatorial structure over entities. Our prompts may already capture enough structural diversity that additional mixing has diminishing returns at 20B tokens, but diversity benefits may emerge at larger scales where the model can better exploit the varied signal.
 
 Putting together our findings on synthetic-only training, mix-in choice, source quality, and diversity:
 
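The equal-proportion mix used in these ablations (and the roughly 30% synthetic ratio reported by @demystifyingsynth) comes down to subsampling one corpus against the other to hit a target fraction. A minimal sketch, assuming hypothetical in-memory document lists rather than the real tokenized datasets:

```python
import random

def mix_corpora(original_docs, synthetic_docs, synthetic_fraction=0.5, seed=0):
    """Build a training mix with the requested fraction of synthetic documents.

    Keeps all original documents and subsamples the synthetic pool so that
    n_synth / (n_synth + n_orig) ~= synthetic_fraction. The 0.5 default
    matches the equal-proportion mix used in the ablations above.
    """
    rng = random.Random(seed)
    n_orig = len(original_docs)
    # Solve n_synth / (n_synth + n_orig) = f  =>  n_synth = f * n_orig / (1 - f)
    n_synth = min(len(synthetic_docs),
                  int(synthetic_fraction * n_orig / (1 - synthetic_fraction)))
    mix = list(original_docs) + rng.sample(synthetic_docs, n_synth)
    rng.shuffle(mix)
    return mix

# Hypothetical toy corpora; a 30% synthetic mix of 100 original docs
# pulls in ~42 synthetic ones.
corpus = mix_corpora([f"orig_{i}" for i in range(100)],
                     [f"synth_{i}" for i in range(100)],
                     synthetic_fraction=0.3)
```

In a real pipeline the mix would be expressed as dataset sampling weights rather than materialized lists, but the arithmetic for hitting a target synthetic fraction is the same.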
app/src/content/chapters/7-conclusions.mdx CHANGED
@@ -9,7 +9,7 @@ The biggest bottleneck to scaling synthetic data experiments is the compute cost
 Beyond faster generation, we answered several questions about best practices but many remain wide open:
 
 - **Data repetition**: Can you repeat data more often without performance loss if the repetitions are rephrased?
-- **Mixing ratio**: We mixed unrephrased source data with synthetic data at equal proportions. How little synthetic data can you get away with: 50%, 20%, 5%? What are the best data mixes for pretraining at scale?
+- **Mixing ratio**: We mixed unrephrased source data with synthetic data at equal proportions. @demystifyingsynth found ~30% rephrased synthetic to be optimal for their setup, but this likely depends on model size, data budget, and synthetic data type. How little synthetic data can you get away with: 50%, 20%, 5%? What are the best data mixes for pretraining at scale?
 - **Generation parameters**: What influence do temperature, `top_p`, and other sampling settings have on rephrasing quality?
 - **Context extension**: Does chunked rollouts context extension during mid-training improve downstream performance?
 - **Best-of-N filtering**: Can we generate multiple rollouts per example and score them to keep only the best one?
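The best-of-N question in the last bullet is straightforward to prototype. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for an LLM sampler and a quality scorer (e.g. a reward model or classifier):

```python
def best_of_n(example, generate, score, n=4):
    """Sample n rollouts for one example and keep the highest-scoring one."""
    rollouts = [generate(example) for _ in range(n)]
    return max(rollouts, key=score)

# Toy demo: each "rollout" appends a growing suffix, and scoring by length
# therefore picks the last one. A real scorer would rank rephrasing quality.
counter = iter(range(100))
best = best_of_n("doc", lambda x: x + " " + "y" * next(counter), score=len)
```

The open question is whether the extra generation cost (N times the tokens) pays for itself in downstream performance, and which scorers correlate with training value rather than surface fluency.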