joelniklaus (HF Staff) committed on
Commit 1e2edcd · 1 Parent(s): e569197

improved textual flow with sidenotes

app/src/content/chapters/1-introduction.mdx CHANGED
@@ -26,7 +26,10 @@ The scale is staggering: NVIDIA used LLMs to rephrase around 2 trillion tokens o
 <figcaption>Scale of synthetic data usage in recent LLM training runs. Several recent models were trained on hundreds of billions to trillions of synthetic tokens.</figcaption>
 </figure>
 
-Synthetic data also plays a central role in post-training via *distillation*, where a capable model generates targeted training data for domains like reasoning, instruction-following, and tool-use. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3. Another fun anecdote is the SmolLM2 [@smollm2] training, where we noticed the model was decent at coding and math, but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a [small talk dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0) quickly solved this issue.
+Synthetic data also plays a central role in post-training via *distillation*, where a capable model generates targeted training data for domains like reasoning, instruction-following, and tool-use. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3.
+<Sidenote>
+During SmolLM2 [@smollm2] training, the model was decent at coding and math but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a [small talk dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0) quickly solved this.
+</Sidenote>
 
 However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
 
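The distillation workflow this hunk describes — a capable teacher model generating targeted training data for a smaller student — can be sketched roughly as below. The `teacher` callable and the minimal quality filter are illustrative assumptions, not the actual SmolLM3 pipeline:

```python
from typing import Callable

def generate_distillation_set(
    prompts: list[str],
    teacher: Callable[[str], str],
    min_words: int = 1,
) -> list[dict]:
    # A capable teacher answers targeted prompts; the (prompt, completion)
    # pairs become supervised training data for the student model.
    dataset = []
    for p in prompts:
        completion = teacher(p).strip()
        # Light quality filter (assumption): drop empty/too-short generations.
        if len(completion.split()) >= min_words:
            dataset.append({"prompt": p, "completion": completion})
    return dataset

# Usage with a stub teacher; a real pipeline would call a strong model
# such as DeepSeek-R1 or Qwen3 here.
smalltalk_prompts = ["How are you?", "Hi", "What's up?"]
stub_teacher = lambda p: f"Thanks for asking! Regarding '{p}', I'm doing well."
data = generate_distillation_set(smalltalk_prompts, stub_teacher)
```

This mirrors the small-talk fix mentioned in the hunk: pick the failure domain, write targeted prompts, and let the teacher fill in the answers.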
 
app/src/content/chapters/2-setup.mdx CHANGED
@@ -1,4 +1,5 @@
 import Accordion from "../../components/Accordion.astro";
+import Sidenote from "../../components/Sidenote.astro";
 import ReadingTime from "../../components/ReadingTime.astro";
 
 ## Rephrasing the Web
@@ -64,13 +65,18 @@ We compare against several baseline datasets for pretraining and data rephrasing
 
 ### Ablation Setup
 
-Following the ablation methodology from FineWeb [@fineweb], we train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the Appendix) and evaluate on 12 benchmarks spanning reasoning, question answering, and math:
+Following the ablation methodology from FineWeb [@fineweb], we train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
+<Sidenote>
+Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
+</Sidenote>
 
-- **Reasoning**: ARC [@arc], HellaSwag [@hellaswag], MMLU Redux [@mmluredux], XCSQA [@xcsqa], OpenBookQA [@openbookqa], Winogrande [@winogrande], PIQA [@piqa]
-- **Question answering**: SQuAD v2 [@squad2], DROP [@drop], WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
+- **General Knowledge**: ARC [@arc], MMLU Redux [@mmluredux]
+- **Reading Comprehension**: SQuAD v2 [@squad2], DROP [@drop]
+- **Reasoning**: OpenBookQA [@openbookqa], XCSQA [@xcsqa]
+- **Natural Language Understanding**: WinoGrande [@winogrande], PIQA [@piqa], HellaSwag [@hellaswag]
 - **Math**: GSM8K [@gsm8k]
+- **Table Understanding**: WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
 
-Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting. All evaluations use 3-shot prompting with a single seed.
 
 ### A Note on Model Collapse
 
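The cloze-format (CF) evaluation described in the ablation setup — scoring each answer option as a next-token continuation of the prompt and taking the most likely one — can be sketched as follows. Here `toy_logprob` is a hypothetical stand-in for a real language model's log-likelihood; evaluation frameworks implement this with actual model scores:

```python
import math

def toy_logprob(prompt: str, continuation: str) -> float:
    # Hypothetical scorer standing in for a real LM: words already seen
    # in the prompt get a higher (less negative) log-probability.
    prompt_words = set(prompt.lower().split())
    score = 0.0
    for w in continuation.lower().split():
        score += math.log(0.5) if w in prompt_words else math.log(0.1)
    return score

def cloze_pick(prompt: str, options: list[str]) -> str:
    # Score every option as a continuation of the prompt and take the most
    # likely one, length-normalized so longer options are not penalized.
    def norm_score(o: str) -> float:
        return toy_logprob(prompt, o) / max(len(o.split()), 1)
    return max(options, key=norm_score)

picked = cloze_pick(
    "The quick brown fox jumps over the",
    ["the lazy dog", "quantum physics"],
)
```

Because the model only needs to assign likelihoods rather than follow multiple-choice formatting, this setup yields usable signal even from a small 1.2B model.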