Commit: 1e2edcd
Parent(s): e569197

improved textual flow with sidenotes
app/src/content/chapters/1-introduction.mdx CHANGED

@@ -26,7 +26,10 @@ The scale is staggering: NVIDIA used LLMs to rephrase around 2 trillion tokens o
 <figcaption>Scale of synthetic data usage in recent LLM training runs. Several recent models were trained on hundreds of billions to trillions of synthetic tokens.</figcaption>
 </figure>
 
-Synthetic data also plays a central role in post-training via *distillation*, where a capable model generates targeted training data for domains like reasoning, instruction-following, and tool-use. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3.
+Synthetic data also plays a central role in post-training via *distillation*, where a capable model generates targeted training data for domains like reasoning, instruction-following, and tool-use. For example, [SmolLM3](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook) [@smollm3] was post-trained almost entirely on data generated from models like DeepSeek-R1 [@deepseekr1] and Qwen3.
+<Sidenote>
+During SmolLM2 [@smollm2] training, the model was decent at coding and math but totally went off the rails with small talk queries (e.g. "How are you?", "Hi", "What's up?"). Synthetically generating a [small talk dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/train_sft?row=0) quickly solved this.
+</Sidenote>
 
 However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
 
app/src/content/chapters/2-setup.mdx CHANGED

@@ -1,4 +1,5 @@
 import Accordion from "../../components/Accordion.astro";
+import Sidenote from "../../components/Sidenote.astro";
 import ReadingTime from "../../components/ReadingTime.astro";
 
 ## Rephrasing the Web

@@ -64,13 +65,18 @@ We compare against several baseline datasets for pretraining and data rephrasing
 
 ### Ablation Setup
 
-Following the ablation methodology from FineWeb [@fineweb], we train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the Appendix) and evaluate on 12 benchmarks
+Following the ablation methodology from FineWeb [@fineweb], we train a 1.2B parameter language model with a Qwen2-style architecture [@qwen2] (details in the [Appendix](#details-on-the-experiments)) and evaluate on 12 benchmarks across six categories using 3-shot prompting with a single seed:
+<Sidenote>
+Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting.
+</Sidenote>
 
-- **
-- **
+- **General Knowledge**: ARC [@arc], MMLU Redux [@mmluredux]
+- **Reading Comprehension**: SQuAD v2 [@squad2], DROP [@drop]
+- **Reasoning**: OpenBookQA [@openbookqa], XCSQA [@xcsqa]
+- **Natural Language Understanding**: WinoGrande [@winogrande], PIQA [@piqa], HellaSwag [@hellaswag]
 - **Math**: GSM8K [@gsm8k]
+- **Table Understanding**: WikiTableQuestions [@wikitablequestions], TriviaQA [@triviaqa]
 
-Since our model is small and trained on only 20B tokens, we use the **cloze format** (CF) for most tasks rather than standard multiple-choice. CF frames evaluation as next-token prediction, which gives more reliable signal for smaller models that may struggle with instruction following or multiple-choice formatting. All evaluations use 3-shot prompting with a single seed.
 
 ### A Note on Model Collapse
 