joelniklaus HF Staff committed
Commit f42775d · 1 Parent(s): 305f4f3

changed links to references, removed old references and cleaned up todos
app/src/content/bibliography.bib CHANGED
The diff for this file is too large to render. See raw diff
 
app/src/content/chapters/conclusions.mdx CHANGED
@@ -11,8 +11,9 @@ While we answered some questions in this work, many still remain such as:
  - Can you "repeat" data more without performance loss if the repetitions are rephrased?
  - We mixed unrephrased source data with synthetic data in our experiments to equal proportions. How little synthetic data can we get away with: 50%, 20%, 5%?
  - What influence do generation parameters such as temperature or top_p have on rephrasing performance?
- - [https://z-lab.ai/projects/dflash/](https://z-lab.ai/projects/dflash/) This as future work for speeding up inference more: currently still a bit cumbersome to use and limited model support
+ - DFlash [@dflash] as future work for speeding up inference further: currently still a bit cumbersome to use, with limited model support
  - Experiment with chunked rollouts context extension in mid-training
  - Experiment with multiple rollouts per example and filtering for the highest-quality one
  - In REWIRE, they show larger gains for bigger models trained on their data. Can we reproduce this?
  - Does automatic prompt optimization with tools like dspy improve rephrasing performance?
+ - The ablations only trained for 21B tokens. It is still unclear how these findings transfer to larger scales in terms of both model parameters and data.
app/src/content/chapters/experiments.mdx CHANGED
@@ -8,16 +8,10 @@ TODO: Benchmarking: plot compare against default, mention how expensive one swe
 
 TODO: With Thibaud look how to visualize the 1T tokens of ablation data (similar to fineweb visualization)
 
- TODO: rename the experiment names so it is easier to understand
-
 TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2
 
- TODO: larger ablation with 100b tokens (how little synthetic data can we get away with)
-
 TODO: more conversational style (less condensed than papers)
 
- TODO: read recent long blog posts as inspiration
-
 TODO: add a visualization for the infrastructure
 
 TODO: add a plot for the table with the benchmark results
@@ -26,6 +20,8 @@ TODO: Analyze if certain models are more verbose than others (how many tokens di
 
 TODO: Add appendix section of weird unexplainable results?
 
+ TODO: Go through blog post and fix references
+
 ### FinePhrase vs Synthetic Baselines
 
 We see that FinePhrase clearly outperforms the synthetic baselines.
@@ -169,7 +165,7 @@ Potentially, writing a tutorial is easy enough and we only need larger models fo
 TODO: also run this experiment for the REWIRE prompt since the original authors claim that larger models are necessary there
 
 **Do we need better models for rephrasing low-quality data?**
- The [REWIRE](https://arxiv.org/abs/2506.04689) paper claims that for upcycling low quality data we need large models (Llama-3.3 70B in their case). Is this true?
+ The REWIRE [@rewire] paper claims that for upcycling low-quality data we need large models (Llama-3.3 70B in their case). Is this true?
 Continue prompt: For the 1b model the source data does not seem to matter, but the 12b model can make use of the hq data better.
 
 <HtmlEmbed
app/src/content/chapters/infrastructure.mdx CHANGED
@@ -6,7 +6,7 @@ import Screenshot_2026_01_20_at_09_42_21_2f81384e_bcac_80e6_b3fa_d06567e56b15 fr
 
 When you start generating your first synthetic tokens with LLMs, you will quickly notice that this is an extremely slow and compute-heavy process. Even though we can cache the KV values of previous tokens, we still need one forward pass for *EVERY* generated token, and a web document typically has a few thousand tokens. So the first step before we can run any large-scale experiments is to set up infrastructure that lets us generate as efficiently and scalably as possible. Let's have a look at what is involved!
 
- Synthetic data has emerged as a key ingredient in training modern LLMs, providing a path past the pretraining data wall, where high-quality text (or ["fossil fuel"](https://youtu.be/1yvBqasHLZs?si=YgaaCSfngJNi3OSb&t=475)) becomes scarce and collecting more internet data yields diminishing returns. For example, NVIDIA used LLMs to rephrase around 2 trillion tokens (!) of web text in their [Nemotron-CC dataset](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2), while Z.ai generated 500 billion reasoning tokens to mid-train the [GLM-4.5 series of models](https://huggingface.co/collections/zai-org/glm-45):
+ Synthetic data has emerged as a key ingredient in training modern LLMs, providing a path past the pretraining data wall, where high-quality text (or ["fossil fuel"](https://youtu.be/1yvBqasHLZs?si=YgaaCSfngJNi3OSb&t=475)) becomes scarce and collecting more internet data yields diminishing returns. For example, NVIDIA used LLMs to rephrase around 2 trillion tokens (!) of web text in their [Nemotron-CC dataset](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) [@nemotroncc], while Z.ai generated 500 billion reasoning tokens to mid-train the GLM-4.5 series of models [@glm45]:
 
 <Image src={SyDLepVveg_2f81384e_bcac_806f_acb7_fd65c71dd9df} alt="Image" />
 
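To make "extremely slow and compute-heavy" concrete, here is a back-of-envelope sketch of the generation cost at Nemotron-CC scale. The per-GPU throughput figure is an assumed illustrative number, not a measurement; real throughput depends heavily on model size, batch size, and serving stack.

```python
def rephrasing_gpu_hours(total_tokens: float, tokens_per_sec_per_gpu: float) -> float:
    """Rough GPU-hours needed to generate `total_tokens` of synthetic text,
    assuming a sustained per-GPU decode throughput. With KV caching the
    prompt is processed once (prefill), but every generated token still
    costs one forward pass, so generation time scales with output tokens."""
    return total_tokens / tokens_per_sec_per_gpu / 3600

# Hypothetical numbers: 2T tokens (roughly Nemotron-CC's rephrased volume)
# at an assumed 5,000 tokens/s per GPU for a mid-sized, well-batched model.
hours = rephrasing_gpu_hours(2e12, 5_000)
print(round(hours))  # on the order of 100k GPU-hours
```

Even under this optimistic assumed throughput, the bill lands in the hundreds of thousands of GPU-hours, which is why the serving infrastructure matters so much.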
app/src/content/chapters/introduction.mdx CHANGED
@@ -13,7 +13,7 @@ If you read some of the latest LLM papers [add some refs, e.g. Nemotron 3, Arcee
 - When approaching the scaling limits of web data, people started to filter the data more aggressively, and the discussion shifted from volume to quality: starting with stronger heuristics, including deduplication pipelines, and eventually switching to neural classifiers looking for "educational" or "instruction-like" data. The first model trainings were conservative about repeating data, but with higher-quality data some repetition seemed fine.
 - Now that we have mostly exhausted web text data and concluded that quality is more important, synthetic data has become an interesting option to up-cycle the data that the classifiers would normally have excluded and thus increase the volume of data again. The latest LLMs were trained on trillions of synthetic tokens, matching the volume of unaltered data.
 
- Besides pretraining, synthetic data generation also has become a useful tool for post-training. It is applied to fill gaps identified in models. A fun anecdote is the SmolLM2 training, where we noticed the model was decent at coding and math, but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a small talk dataset ([https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0)) quickly solved this issue.
+ Besides pretraining, synthetic data generation has also become a useful tool for post-training, where it is applied to fill gaps identified in models. A fun anecdote is the SmolLM2 [@smollm2] training, where we noticed the model was decent at coding and math but totally went off the rails on small-talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a small-talk dataset ([https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0)) quickly solved this issue.
 
 We are seeing a radical shift in compute allocation for model training: while model training dominated the compute budget early on, more and more compute is being allocated to curating and improving the training datasets, both in pretraining and post-training.
 
app/src/content/chapters/setup.mdx CHANGED
@@ -2,7 +2,7 @@
 
 ### Synthetic Data for Pretraining
 
- Language model development has encountered a fundamental data wall as high-quality web data becomes increasingly scarce, pushing researchers toward synthetic data generation as a complement to traditional internet-scraped datasets. While recent work has demonstrated that synthetic data can dramatically improve model quality, with approaches like WRAP, Nemotron-CC, and BeyondWeb showing that rephrasing existing web content into higher-quality formats can outperform training on raw data alone, the field lacks both a clear conceptual framework for understanding what "synthetic data" and "rephrasing" actually mean and systematic investigations of the factors that determine their effectiveness.
+ Language model development has encountered a fundamental data wall as high-quality web data becomes increasingly scarce, pushing researchers toward synthetic data generation as a complement to traditional internet-scraped datasets. While recent work has demonstrated that synthetic data can dramatically improve model quality, with approaches like WRAP [@wrap], Nemotron-CC [@nemotroncc], and BeyondWeb [@beyondweb] showing that rephrasing existing web content into higher-quality formats can outperform training on raw data alone, the field lacks both a clear conceptual framework for understanding what "synthetic data" and "rephrasing" actually mean and systematic investigations of the factors that determine their effectiveness.
 
 **What is Rephrasing?** At its core, rephrasing involves transforming existing documents through language models to produce variants that preserve semantic content while modifying presentation, structure, or style. However, this simple definition masks considerable complexity. Rephrasing exists along a spectrum, from conservative transformations (style transfer, format conversion) to more aggressive interventions (content expansion, pedagogical restructuring, knowledge extraction). A document might be reformatted as a tutorial with worked examples, restructured as FAQ pairs, expanded with explanatory commentary, condensed into knowledge lists, or rewritten in Wikipedia style. Each transformation targets different downstream objectives: tutorials may enhance step-by-step reasoning, FAQs might improve question-answering capabilities, and mathematical reformulations could strengthen quantitative skills. Understanding which transformations work, when they work, and why they work remains an open challenge.
 
@@ -29,19 +29,19 @@ TODO: in the blog, we could make this into a widget where you have a tab for eac
 
 We compare against several baseline datasets for pretraining and data rephrasing:
 
- [ **DCLM (DataComp-LM)** ](https://arxiv.org/abs/2406.11794) **:** A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM-Baseline enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens.
+ **DCLM (DataComp-LM)** [@datacomp]: A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM-Baseline enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens.
 
- [ **Fineweb-Edu-HQ and Fineweb-Edu-LQ** ](https://arxiv.org/html/2406.17557v1) **:** Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered using Llama-3-70B-Instruct scoring samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing.
+ **Fineweb-Edu-HQ and Fineweb-Edu-LQ** [@fineweb]: Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered using Llama-3-70B-Instruct scoring samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing.
 
- [ **Ultra-Fineweb-1.4** ](https://arxiv.org/abs/2505.05427) **:** A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality.
+ **Ultra-Fineweb-1.4** [@ultrafineweb]: A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality.
 
- [ **Nemotron-HQ-Synth** ](https://arxiv.org/abs/2412.02595) **:** Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B.
+ **Nemotron-HQ-Synth** [@nemotroncc]: Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B.
 
- [ **Cosmopedia** ](https://huggingface.co/blog/cosmopedia) **:** A 30 million file synthetic dataset with 25 billion tokens generated by Mixtral-8x7B-Instruct, containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering conditioning on curated educational sources and web data clusters.
+ **Cosmopedia** [@cosmopedia]: A 30-million-file synthetic dataset with 25 billion tokens generated by Mixtral-8x7B-Instruct, containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering conditioning on curated educational sources and web data clusters.
 
- [ **SYNTH** ](https://pleias.fr/blog/blogsynth-the-new-data-frontier) **:** A fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths including math exercises, creative writing, and information extraction. Uses multiple specialized synthetic pipelines with fine-tuned models and grounding in encyclopedic content.
+ **SYNTH** [@synthpleias]: A fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths, including math exercises, creative writing, and information extraction. Uses multiple specialized synthetic pipelines with fine-tuned models and grounding in encyclopedic content.
 
- [ **REWIRE** ](https://arxiv.org/abs/2506.04689) **:** A method for recycling the web with guided rewrite that enriches low-quality documents discarded by filtering pipelines to make them useful for training. Experiments show that mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales respectively across 22 tasks.
+ **REWIRE** [@rewire]: A method for recycling the web with guided rewrites that enriches low-quality documents discarded by filtering pipelines to make them useful for training. Experiments show that mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales respectively across 22 tasks.
 
 We use source data and seed data interchangeably.
 
@@ -82,4 +82,4 @@ I am pretty sure you can use this as a prompt for claude to help rename things.
 
 ### A Note on Synthetic Data and Model Collapse
 
- A common misconception about model collapse is that any use of synthetic data in training will inevitably degrade model performance, leading many to view AI-generated training data with blanket suspicion. This misunderstanding stems from [influential ](https://www.nature.com/articles/s41586-024-07566-y)[research](https://www.ft.com/content/ae507468-7f5b-440b-8512-aea81c6bf4a5) that demonstrated severe degradation when models were trained exclusively and iteratively on outputs from previous model generations, without any injection of new information or human-generated content. In practice, however, the AI research community doesn't train models this way. Real-world applications of synthetic data typically involve mixing it with genuine human data, using diverse reference materials in prompts to ensure variety, and employing synthetic data strategically for specific purposes like domain adaptation or augmenting limited datasets rather than replacing entire training corpora. The key distinction is that model collapse occurs specifically when models are trained in a closed loop on their own outputs without introducing new signal or information, a scenario that practitioners actively avoid. The concern should be focused on frontier models generating training data for other frontier models in isolation, not on the thoughtful integration of synthetic data that introduces new knowledge or perspectives into the training process. In [Fineweb v1](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) we also did not find degradation from naturally occurring data on the web likely created by ChatGPT.
+ A common misconception about model collapse is that any use of synthetic data in training will inevitably degrade model performance, leading many to view AI-generated training data with blanket suspicion. This misunderstanding stems from influential research [@modelcollapse] that demonstrated severe degradation when models were trained exclusively and iteratively on outputs from previous model generations, without any injection of new information or human-generated content. In practice, however, the AI research community doesn't train models this way. Real-world applications of synthetic data typically involve mixing it with genuine human data, using diverse reference materials in prompts to ensure variety, and employing synthetic data strategically for specific purposes like domain adaptation or augmenting limited datasets, rather than replacing entire training corpora. The key distinction is that model collapse occurs specifically when models are trained in a closed loop on their own outputs without introducing new signal or information, a scenario that practitioners actively avoid. The concern should be focused on frontier models generating training data for other frontier models in isolation, not on the thoughtful integration of synthetic data that introduces new knowledge or perspectives into the training process. In FineWeb [@fineweb] we also did not find degradation from naturally occurring data on the web, likely created by ChatGPT.
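The rephrasing spectrum described in setup.mdx above (tutorial, FAQ, knowledge list, Wikipedia style) can be sketched as a prompt-construction step. The style names, prompt wording, and `build_rephrasing_prompt` function below are hypothetical illustrations, not the pipeline used in this work:

```python
# Hypothetical style prompts spanning the rephrasing spectrum, from
# conservative (style transfer) to more aggressive (restructuring).
STYLE_PROMPTS = {
    "wikipedia": "Rewrite the following web page in a neutral, encyclopedic style.",
    "faq": "Restructure the following web page into question-and-answer pairs.",
    "tutorial": "Rewrite the following web page as a step-by-step tutorial with worked examples.",
    "knowledge_list": "Condense the following web page into a list of factual statements.",
}

def build_rephrasing_prompt(document: str, style: str = "wikipedia") -> str:
    """Build a rephrasing prompt for an LLM; the template is illustrative only."""
    instruction = STYLE_PROMPTS[style]
    return f"{instruction}\n\nDocument:\n{document}\n\nRephrased version:"

prompt = build_rephrasing_prompt(
    "The mitochondrion is the powerhouse of the cell.", style="faq"
)
```

The resulting `prompt` would then be sent to the rephrasing model once per document, which is exactly why the per-token generation cost discussed in the infrastructure chapter dominates the budget.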