joelniklaus HF Staff committed on
Commit
8be4608
·
1 Parent(s): 1f06392

added gdr paper

app/src/content/bibliography.bib CHANGED
@@ -101,6 +101,16 @@
 }
 
 % Synthetic data methods
+@misc{gdr,
+  title = {Generative Data Refinement: Just Ask for Better Data},
+  author = {Minqi Jiang and João G. M. Araújo and Will Ellsworth and Sian Gooding and Edward Grefenstette},
+  year = {2025},
+  eprint = {2509.08653},
+  archiveprefix = {arXiv},
+  primaryclass = {cs.LG},
+  url = {https://arxiv.org/abs/2509.08653}
+}
+
 @inproceedings{demystifyingsynth,
   title = {Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls},
   author = {Feiyang Kang and Newsha Ardalani and Michael Kuchnik and Youssef Emad and Mostafa Elhoushi and Shubhabrata Sengupta and Shang-Wen Li and Ramya Raghavendra and Ruoxi Jia and Carole-Jean Wu},
app/src/content/chapters/1-introduction.mdx CHANGED
@@ -40,6 +40,7 @@ If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3
 - After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora including more and more data from the web. Datasets like C4 [@c4] and The Pile [@thepile] pushed into hundreds of gigabytes. Then FineWeb [@fineweb] and DCLM [@datacomp] brought things to the trillion-token scale, covering most of the crawlable web.
 - When approaching the scaling limits of web data, the discussion shifted from volume to quality. Researchers started with stronger heuristics and deduplication pipelines, then switched to neural classifiers looking for "educational" or "instruction-like" data. FineWeb-Edu used Llama 3 70B [@llama3] to score educational quality, DCLM used model-based filtering to train a 7B model to 64% MMLU with 2.6T tokens. With higher quality data, some repetitions seemed fine.
 - Now that we have mostly exhausted web text data and concluded that quality is more important, synthetic data has become an interesting option to up-cycle the data that the classifiers would have normally excluded and thus increase the volume of data again. Cosmopedia [@cosmopedia] was an early example, generating 25B tokens of textbooks and stories with Mixtral [@mixtral]. Today the latest LLMs are trained on trillions of synthetic tokens, matching the volume of unaltered data.
+- But publicly indexed web data is only part of the picture. Massive amounts of user-generated content (emails, messages, proprietary codebases) remain untapped because they contain PII, toxic content, or copyrighted material. Generative Data Refinement (GDR) [@gdr] shows that LLMs can anonymize and detoxify such data while preserving its utility for training, outperforming industry-grade PII detectors with a single zero-shot prompt. By conditioning rewrites on each real example, GDR also preserves the diversity of the original data, avoiding the mode collapse that plagues purely synthetic generation. This could dramatically expand the usable data pool beyond what's publicly crawlable.
 
 We are seeing a radical shift in compute allocation for model training: while the model training dominated the compute budget early on, we see more and more compute allocated to curate and improve the training datasets, both in pretraining and post-training.
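The GDR bullet added in this commit describes the core mechanism: one zero-shot rewrite prompt, conditioned on each real example so the refined output stays close to the original. A minimal sketch of that loop is below; the prompt wording and the `call_llm` hook are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of GDR-style refinement: build one rewrite prompt per real
# record (conditioning on the example preserves diversity), then map an LLM
# over the corpus. `call_llm` is any prompt -> completion function.

GDR_PROMPT = (
    "Rewrite the following text so that it contains no personally "
    "identifiable information (names, emails, phone numbers, addresses) "
    "and no toxic content, while preserving its meaning, style, and level "
    "of detail as closely as possible. Output only the rewritten text.\n\n"
    "Text:\n{record}"
)

def build_refinement_prompt(record: str) -> str:
    """Build a per-example refinement prompt (one prompt per real record)."""
    return GDR_PROMPT.format(record=record)

def refine_corpus(records, call_llm):
    """Refine every record in the corpus with the zero-shot prompt."""
    return [call_llm(build_refinement_prompt(r)) for r in records]
```

Because each rewrite is anchored to a specific real record rather than sampled from scratch, the refined corpus keeps roughly the same topical spread as the source data, which is the property the bullet contrasts with purely synthetic generation.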