joelniklaus (HF Staff) committed on
Commit 1e2edcd · 1 Parent(s): e569197

improved textual flow with sidenotes

app/src/content/chapters/1-introduction.mdx CHANGED
@@ -26,7 +26,10 @@ The scale is staggering: NVIDIA used LLMs to rephrase around 2 trillion tokens o
 <figcaption>Scale of synthetic data usage in recent LLM training runs. Several recent models were trained on hundreds of billions to trillions of synthetic tokens.</figcaption>
 </figure>
 
-Synthetic data also plays a central role in post-training via *distillation*, where a capable model generates targeted training data for domains like reasoning, instruction-following, and tool-use. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3. Another fun anecdote is the SmolLM2 [@smollm2] training, where we noticed the model was decent at coding and math, but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a [small talk dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0) quickly solved this issue.
+Synthetic data also plays a central role in post-training via *distillation*, where a capable model generates targeted training data for domains like reasoning, instruction-following, and tool-use. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3.
+<Sidenote>
+During SmolLM2 [@smollm2] training, the model was decent at coding and math but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a [small talk dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0) quickly solved this.
+</Sidenote>
 
 However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
 
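The distillation workflow this hunk describes — a capable teacher model generating targeted training data for a smaller student — can be sketched roughly as below. The `teacher` callable and the minimal quality filter are illustrative assumptions, not the actual SmolLM3 pipeline:

```python
from typing import Callable

def generate_distillation_set(
    prompts: list[str],
    teacher: Callable[[str], str],
    min_words: int = 1,
) -> list[dict]:
    # A capable teacher answers targeted prompts; the (prompt, completion)
    # pairs become supervised training data for the student model.
    dataset = []
    for p in prompts:
        completion = teacher(p).strip()
        # Light quality filter (assumption): drop empty/too-short generations.
        if len(completion.split()) >= min_words:
            dataset.append({"prompt": p, "completion": completion})
    return dataset

# Usage with a stub teacher; a real pipeline would call a strong model
# such as DeepSeek-R1 or Qwen3 here.
smalltalk_prompts = ["How are you?", "Hi", "What's up?"]
stub_teacher = lambda p: f"Thanks for asking! Regarding '{p}', I'm doing well."
data = generate_distillation_set(smalltalk_prompts, stub_teacher)
```

This mirrors the small-talk fix mentioned in the hunk: pick the failure domain, write targeted prompts, and let the teacher fill in the answers.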
 
app/src/content/chapters/2-setup.mdx CHANGED
@@ -1,4 +1,5 @@
 import Accordion from "../../components/Accordion.astro";
+import Sidenote from "../../components/Sidenote.astro";
 import ReadingTime from "../../components/ReadingTime.astro";
 
 ## Rephrasing the Web
@@ -64,13 +65,18 @@ We compare against several baseline datasets for pretraining and data rephrasing
 
 ### Ablation Setup
 
-Following the ablation methodology from FineWeb [@fineweb], we train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the Appendix) and evaluate on 12 benchmarks spanning reasoning, question answering, and math:
+Following the ablation methodology from FineWeb [@fineweb], we train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
+<Sidenote>
+Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
+</Sidenote>
 
-- **Reasoning**: ARC [@arc], HellaSwag [@hellaswag], MMLU Redux [@mmluredux], XCSQA [@xcsqa], OpenBookQA [@openbookqa], Winogrande [@winogrande], PIQA [@piqa]
-- **Question answering**: SQuAD v2 [@squad2], DROP [@drop], WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
+- **General Knowledge**: ARC [@arc], MMLU Redux [@mmluredux]
+- **Reading Comprehension**: SQuAD v2 [@squad2], DROP [@drop]
+- **Reasoning**: OpenBookQA [@openbookqa], XCSQA [@xcsqa]
+- **Natural Language Understanding**: WinoGrande [@winogrande], PIQA [@piqa], HellaSwag [@hellaswag]
 - **Math**: GSM8K [@gsm8k]
+- **Table Understanding**: WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
 
-Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting. All evaluations use 3-shot prompting with a single seed.
 
 ### A Note on Model Collapse
 
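The cloze-format (CF) evaluation described in the ablation setup — scoring each answer option as a next-token continuation of the prompt and taking the most likely one — can be sketched as follows. Here `toy_logprob` is a hypothetical stand-in for a real language model's log-likelihood; evaluation frameworks implement this with actual model scores:

```python
import math

def toy_logprob(prompt: str, continuation: str) -> float:
    # Hypothetical scorer standing in for a real LM: words already seen
    # in the prompt get a higher (less negative) log-probability.
    prompt_words = set(prompt.lower().split())
    score = 0.0
    for w in continuation.lower().split():
        score += math.log(0.5) if w in prompt_words else math.log(0.1)
    return score

def cloze_pick(prompt: str, options: list[str]) -> str:
    # Score every option as a continuation of the prompt and take the most
    # likely one, length-normalized so longer options are not penalized.
    def norm_score(o: str) -> float:
        return toy_logprob(prompt, o) / max(len(o.split()), 1)
    return max(options, key=norm_score)

picked = cloze_pick(
    "The quick brown fox jumps over the",
    ["the lazy dog", "quantum physics"],
)
```

Because the model only needs to assign likelihoods rather than follow multiple-choice formatting, this setup yields usable signal even from a small 1.2B model.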