Spaces:
Running on CPU Upgrade
Commit 8be4608 · Parent(s): 1f06392
added gdr paper
app/src/content/bibliography.bib
CHANGED
@@ -101,6 +101,16 @@
 }
 
 % Synthetic data methods
+@misc{gdr,
+  title = {Generative Data Refinement: Just Ask for Better Data},
+  author = {Minqi Jiang and João G. M. Araújo and Will Ellsworth and Sian Gooding and Edward Grefenstette},
+  year = {2025},
+  eprint = {2509.08653},
+  archiveprefix = {arXiv},
+  primaryclass = {cs.LG},
+  url = {https://arxiv.org/abs/2509.08653}
+}
+
 @inproceedings{demystifyingsynth,
   title = {Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls},
   author = {Feiyang Kang and Newsha Ardalani and Michael Kuchnik and Youssef Emad and Mostafa Elhoushi and Shubhabrata Sengupta and Shang-Wen Li and Ramya Raghavendra and Ruoxi Jia and Carole-Jean Wu},
app/src/content/chapters/1-introduction.mdx
CHANGED
@@ -40,6 +40,7 @@ If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3
 - After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora, including more and more data from the web. Datasets like C4 [@c4] and The Pile [@thepile] pushed into hundreds of gigabytes. Then FineWeb [@fineweb] and DCLM [@datacomp] brought things to the trillion-token scale, covering most of the crawlable web.
 - When approaching the scaling limits of web data, the discussion shifted from volume to quality. Researchers started with stronger heuristics and deduplication pipelines, then switched to neural classifiers looking for "educational" or "instruction-like" data. FineWeb-Edu used Llama 3 70B [@llama3] to score educational quality, and DCLM used model-based filtering to train a 7B model to 64% MMLU with 2.6T tokens. With higher-quality data, some repetition seemed fine.
 - Now that we have mostly exhausted web text data and concluded that quality matters more, synthetic data has become an interesting option to up-cycle the data that classifiers would normally have excluded and thus increase the volume of data again. Cosmopedia [@cosmopedia] was an early example, generating 25B tokens of textbooks and stories with Mixtral [@mixtral]. Today the latest LLMs are trained on trillions of synthetic tokens, matching the volume of unaltered data.
+- But publicly indexed web data is only part of the picture. Massive amounts of user-generated content (emails, messages, proprietary codebases) remain untapped because they contain PII, toxic content, or copyrighted material. Generative Data Refinement (GDR) [@gdr] shows that LLMs can anonymize and detoxify such data while preserving its utility for training, outperforming industry-grade PII detectors with a single zero-shot prompt. By conditioning rewrites on each real example, GDR also preserves the diversity of the original data, avoiding the mode collapse that plagues purely synthetic generation. This could dramatically expand the usable data pool beyond what's publicly crawlable.
 
 We are seeing a radical shift in compute allocation for model training: while model training dominated the compute budget early on, more and more compute is now allocated to curating and improving the training datasets, both in pretraining and post-training.
 
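The GDR paragraph added above describes conditioning a rewrite on each real example via a single zero-shot prompt. A minimal sketch of that loop, assuming a hypothetical `generate` callable standing in for any LLM backend (the prompt wording and function names are illustrative, not taken from the paper):

```python
# GDR-style refinement sketch: each real document is embedded in a
# zero-shot rewrite prompt, so the model's output stays conditioned on
# (and roughly as diverse as) the original data rather than being
# generated from scratch.

REFINE_TEMPLATE = """You are a data refinement assistant.
Rewrite the document below so that it:
- replaces all personally identifiable information with realistic placeholders,
- removes toxic or abusive language,
- otherwise preserves the content, style, and structure.

Document:
{document}

Refined document:"""


def build_refinement_prompt(document: str) -> str:
    """Wrap one raw training example in the zero-shot refinement prompt."""
    return REFINE_TEMPLATE.format(document=document)


def refine_corpus(documents, generate):
    """Refine a corpus one example at a time.

    `generate` is any callable mapping a prompt string to a model
    completion (e.g. a thin wrapper around an LLM API).
    """
    return [generate(build_refinement_prompt(doc)) for doc in documents]
```

Because every output is anchored to one input document, the refined corpus keeps the original's size and topical spread, unlike free-form synthetic generation.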