Commit 8260aee (parent: bb0d7f1): add some more related work
app/src/content/bibliography.bib CHANGED

@@ -100,6 +100,24 @@
 }
 
 % Synthetic data methods
+@inproceedings{demystifyingsynth,
+  title = {Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls},
+  author = {Feiyang Kang and Newsha Ardalani and Michael Kuchnik and Youssef Emad and Mostafa Elhoushi and Shubhabrata Sengupta and Shang-Wen Li and Ramya Raghavendra and Ruoxi Jia and Carole-Jean Wu},
+  booktitle = {Conference on Empirical Methods in Natural Language Processing},
+  year = {2025},
+  url = {https://aclanthology.org/2025.emnlp-main.544/}
+}
+
+@misc{syntheticcpt,
+  title = {Synthetic Continued Pretraining},
+  author = {Zitong Yang and Neil Band and Shuangping Li and Emmanuel Candès and Tatsunori Hashimoto},
+  year = {2024},
+  eprint = {2409.07431},
+  archiveprefix = {arXiv},
+  primaryclass = {cs.CL},
+  url = {https://arxiv.org/abs/2409.07431}
+}
+
 @inproceedings{wrap,
   title = {Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling},
   author = {Pratyush Maini and Skyler Seto and Richard He Bai and David Grangier and Yizhe Zhang and Navdeep Jaitly},
app/src/content/chapters/1-introduction.mdx CHANGED

@@ -68,7 +68,7 @@ The sections below are fairly self-contained, so feel free to jump around and sk
 <Note variant="info" title="But wait, what about model collapse?">
 You might be wondering: doesn't training on synthetic data inevitably lead to model collapse? This is a common misconception that stems from research [@modelcollapse] showing severe degradation when models are trained exclusively and iteratively on their own outputs, without any new information or human data.
 
-In practice, nobody trains models this way. Real-world pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically rather than replacing entire training corpora. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data. The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
+In practice, nobody trains models this way. Real-world pipelines mix synthetic with human data, use diverse reference materials in prompts, and apply synthetic data strategically rather than replacing entire training corpora. A large-scale empirical study training over 1,000 LLMs [@demystifyingsynth] confirms this nuanced picture: training on rephrased synthetic data mixed with natural web text (at around 30% synthetic) can speed up pretraining convergence by 5-10x, with no signs of degradation. Model collapse happens in a closed loop on a model's own outputs without new signal, which is not how practitioners use synthetic data. The real concern is frontier models generating training data for other frontier models in isolation. Thoughtful integration of synthetic data that introduces new knowledge or perspectives is a different story entirely. In FineWeb [@fineweb] we also found no degradation from naturally occurring AI-generated data on the web.
 </Note>
 
 Want to learn how to make GPUs go brrr and generate synthetic tokens at scale like this? This blog is for you!
app/src/content/chapters/2-setup.mdx CHANGED

@@ -5,7 +5,7 @@ import Tab from "../../components/Tab.astro";
 
 ## Rephrasing the Web
 
-Several teams have already shown that rephrasing web content into cleaner formats can beat training on raw data: WRAP [@wrap] rewrites text in different styles, Nemotron-CC [@nemotroncc] extracts QA pairs and knowledge lists, REWIRE [@rewire] does guided rewriting,
+Several teams have already shown that rephrasing web content into cleaner formats can beat training on raw data: WRAP [@wrap] rewrites text in different styles, Nemotron-CC [@nemotroncc] extracts QA pairs and knowledge lists, REWIRE [@rewire] does guided rewriting, BeyondWeb [@beyondweb] tries continuation and summarization, and EntiGraph [@syntheticcpt] uses entity-centric augmentation to synthesize diverse knowledge representations from small corpora. But nobody has done a systematic comparison across all these approaches, and the field still lacks a clear framework for what "rephrasing" even means. So let's fix that.
 
 ### What is Rephrasing?
 
app/src/content/chapters/3-experiments.mdx CHANGED

@@ -185,7 +185,7 @@ For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1
 SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
 The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
 This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
-The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
+The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
 
 That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
 

@@ -447,7 +447,7 @@ The dream scenario would be generating all your training data synthetically, no
 }}
 />
 
-Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type.
+Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type. This echoes @demystifyingsynth, who found that pure synthetic data never outperforms natural web text alone, but mixing roughly 30% rephrased synthetic data with natural text can accelerate convergence by 5-10x.
 
 The per-benchmark view sharpens the picture. The benchmarks that benefit most from mixing are HellaSwag (+0.5 to +1.3pp) and, for most prompts, SQuAD (+4 to +12pp for Tutorial and FAQ). GSM8K doesn't move at all. The "always mix with original data" takeaway is driven primarily by commonsense recovery, not a uniform lift across all skills.
 

@@ -615,7 +615,7 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
 }}
 />
 
-None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds. This was a bit disappointing.
+None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds. This was a bit disappointing. @syntheticcpt found that simple paraphrasing quickly saturates in their continued pretraining setting, while their entity-graph-based EntiGraph approach scales log-linearly by externalizing diversity to a combinatorial structure over entities. Our prompts may already capture enough structural diversity that additional mixing has diminishing returns at 20B tokens, but diversity benefits may emerge at larger scales where the model can better exploit the varied signal.
 
 Putting together our findings on synthetic-only training, mix-in choice, source quality, and diversity:
 
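The diversity experiments above mix several rephrasing prompt styles into one synthetic corpus. As a minimal sketch of that setup (the helper name and prompt texts are illustrative placeholders, not the post's actual prompts, which are far more detailed), each source document can be assigned one style uniformly at random before generation:

```python
import random

# Hypothetical prompt templates; the real guided_rewrite prompt, for example,
# carries detailed rewriting instructions and multi-step formatting requirements.
REPHRASE_PROMPTS = {
    "tutorial": "Rewrite the following text as a step-by-step tutorial:\n\n{doc}",
    "faq": "Rewrite the following text as a list of questions and answers:\n\n{doc}",
    "guided_rewrite": "Rewrite the following text for clarity and quality:\n\n{doc}",
}

def assign_prompts(docs, seed=0):
    """Pair each document with one randomly chosen rephrasing style,
    so the synthetic corpus mixes styles instead of committing to one."""
    rng = random.Random(seed)
    styles = sorted(REPHRASE_PROMPTS)
    return [(doc, rng.choice(styles)) for doc in docs]

pairs = assign_prompts([f"doc-{i}" for i in range(6)])
prompts = [REPHRASE_PROMPTS[style].format(doc=doc) for doc, style in pairs]
```

The seed makes the style assignment reproducible across generation runs, which matters when comparing prompt mixes in controlled ablations.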
app/src/content/chapters/7-conclusions.mdx CHANGED

@@ -9,7 +9,7 @@ The biggest bottleneck to scaling synthetic data experiments is the compute cost
 Beyond faster generation, we answered several questions about best practices but many remain wide open:
 
 - **Data repetition**: Can you repeat data more often without performance loss if the repetitions are rephrased?
-- **Mixing ratio**: We mixed unrephrased source data with synthetic data at equal proportions. How little synthetic data can you get away with: 50%, 20%, 5%? What are the best data mixes for pretraining at scale?
+- **Mixing ratio**: We mixed unrephrased source data with synthetic data at equal proportions. @demystifyingsynth found ~30% rephrased synthetic to be optimal for their setup, but this likely depends on model size, data budget, and synthetic data type. How little synthetic data can you get away with: 50%, 20%, 5%? What are the best data mixes for pretraining at scale?
 - **Generation parameters**: What influence do temperature, `top_p`, and other sampling settings have on rephrasing quality?
 - **Context extension**: Does chunked rollouts context extension during mid-training improve downstream performance?
 - **Best-of-N filtering**: Can we generate multiple rollouts per example and score them to keep only the best one?
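The mixing-ratio question lends itself to a quick sketch. Assuming both corpora are available as document lists (the helper name is hypothetical, and the 0.3 default simply mirrors the ~30% ratio the cited study reports for its setup, not a recommendation from this post), a target synthetic fraction can be hit like this:

```python
import random

def mix_corpora(original, synthetic, synthetic_frac=0.3, seed=0):
    """Build a shuffled training list that is ~synthetic_frac synthetic.

    Keeps all original documents and draws just enough synthetic ones
    to reach the target fraction (capped by what is available).
    """
    rng = random.Random(seed)
    n_syn = int(round(synthetic_frac * len(original) / (1 - synthetic_frac)))
    chosen = rng.sample(synthetic, min(n_syn, len(synthetic)))
    mixed = list(original) + chosen
    rng.shuffle(mixed)
    return mixed

# 70 original docs at a 0.3 target draw 30 synthetic docs -> 100 total
mix = mix_corpora([f"orig-{i}" for i in range(70)],
                  [f"syn-{i}" for i in range(100)])
```

Sweeping `synthetic_frac` over 0.5, 0.2, and 0.05 while holding the token budget fixed would be one way to probe the open question in the bullet above.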
|