Commit ae68e5f
Parent(s): 77f7fc5
add some concrete examples
app/src/content/bibliography.bib
CHANGED
@@ -1,6 +1,28 @@
 % FinePhrase blog post bibliography

 % Datasets
+@article{c4,
+  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
+  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
+  journal = {Journal of Machine Learning Research},
+  volume = {21},
+  number = {140},
+  pages = {1--67},
+  year = {2020},
+  url = {https://arxiv.org/abs/1910.10683}
+}
+
+@article{thepile,
+  title = {The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
+  author = {Leo Gao and Stella Biderman and Sid Black and Laurence Golding and Travis Hoppe and Charles Foster and Jason Phang and Horace He and Anish Thite and Noa Nabeshima and Shawn Presser and Connor Leahy},
+  journal = {arXiv preprint arXiv:2101.00027},
+  year = {2020},
+  eprint = {2101.00027},
+  archiveprefix = {arXiv},
+  primaryclass = {cs.CL},
+  url = {https://arxiv.org/abs/2101.00027}
+}
+
 @misc{datacomp,
   title = {DataComp-LM: In search of the next generation of training sets for language models},
   author = {Jeffrey Li and Alex Fang and Georgios Smyrnis and Maor Ivgi and Matt Jordan and Samir Gadre and Hritik Bansal and Etash Guha and Sedrick Keh and Kushal Arora and Saurabh Garg and Rui Xin and Niklas Muennighoff and Reinhard Heckel and Jean Mercat and Mayee Chen and Suchin Gururangan and Mitchell Wortsman and Alon Albalak and Yonatan Bitton and Marianna Nezhurina and Amro Abbas and Cheng-Yu Hsieh and Dhruba Ghosh and Josh Gardner and Maciej Kilian and Hanlin Zhang and Rulin Shao and Sarah Pratt and Sunny Sanyal and Gabriel Ilharco and Giannis Daras and Kalyani Marathe and Aaron Gokaslan and Jieyu Zhang and Khyathi Chandu and Thao Nguyen and Igor Vasiljevic and Sham Kakade and Shuran Song and Sujay Sanghavi and Fartash Faghri and Sewoong Oh and Luke Zettlemoyer and Kyle Lo and Alaaeldin El-Nouby and Hadi Pouransari and Alexander Toshev and Stephanie Wang and Dirk Groeneveld and Luca Soldaini and Pang Wei Koh and Jenia Jitsev and Thomas Kollar and Alexandros G. Dimakis and Yair Carmon and Achal Dave and Ludwig Schmidt and Vaishaal Shankar},
app/src/content/chapters/1-introduction.mdx
CHANGED
@@ -10,9 +10,9 @@ import syntheticDataScaleImg from "../assets/image/synthetic-data-scale.jpg";

 If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinity]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:

-- After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora including more and more data from the web.
-- When approaching the scaling limits of web data
-- Now that we have mostly exhausted web text data and concluded that quality is more important, synthetic data has become an interesting option to up-cycle the data that the classifiers would have normally excluded and thus increase the volume of data again.
+- After training the first language models on small-ish datasets like Wikipedia, people started scaling up the pretraining corpora including more and more data from the web. Datasets like C4 [@c4] and The Pile [@thepile] pushed into hundreds of gigabytes. Then FineWeb [@fineweb] and DCLM [@datacomp] brought things to the trillion-token scale, covering most of the crawlable web.
+- When approaching the scaling limits of web data, the discussion shifted from volume to quality. Researchers started with stronger heuristics and deduplication pipelines, then switched to neural classifiers looking for "educational" or "instruction-like" data. FineWeb-Edu used Llama 3 70B [@llama3] to score educational quality, DCLM used model-based filtering to train a 7B model to 64% MMLU with 2.6T tokens. With higher quality data, some repetitions seemed fine.
+- Now that we have mostly exhausted web text data and concluded that quality is more important, synthetic data has become an interesting option to up-cycle the data that the classifiers would have normally excluded and thus increase the volume of data again. Cosmopedia [@cosmopedia] was an early example, generating 25B tokens of textbooks and stories with Mixtral [@mixtral]. Today the latest LLMs are trained on trillions of synthetic tokens, matching the volume of unaltered data.

 We are seeing a radical shift in compute allocation for model training: while the model training dominated the compute budget early on, we see more and more compute allocated to curate and improve the training datasets, both in pretraining and post-training.
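The model-based quality filtering mentioned in the introduction's second bullet can be sketched very roughly. This is a toy illustration only: real pipelines like FineWeb-Edu or DCLM use trained classifiers (e.g., a transformer or fastText model scoring LLM-annotated labels), whereas the keyword-based scorer, the cue-word list, and the threshold below are all invented for demonstration.

```python
# Toy sketch of classifier-based quality filtering for a pretraining corpus.
# The scorer is a stand-in for a trained educational-quality classifier;
# the cue words and threshold are arbitrary assumptions, not any real pipeline.

def educational_score(text: str) -> float:
    """Return a proxy quality score in [0, 1]: fraction of 'educational' cue words."""
    cues = {"theorem", "definition", "example", "explain", "lesson", "chapter"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,:;") in cues for w in words) / len(words)

def filter_corpus(docs: list[str], threshold: float = 0.05) -> list[str]:
    """Keep only documents whose score clears the threshold.

    In a real pipeline, documents below the cut would be candidates for
    synthetic up-cycling (rewriting) rather than outright discarding.
    """
    return [d for d in docs if educational_score(d) >= threshold]

docs = [
    "Definition: a prime number has exactly two divisors. Example: 7.",
    "buy cheap watches now click here limited offer",
]
kept = filter_corpus(docs)  # keeps only the first document
```

The design point the bullet makes is exactly this split: the filter raises average quality but shrinks volume, which is what motivates up-cycling the rejected documents with synthetic rewriting instead of deleting them.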