joelniklaus (HF Staff) committed on
Commit 1271748 · 1 Parent(s): ac60d42

moved conclusions to experiments, polished transitions and filled in fancy numbers
app/src/content/chapters/conclusions.mdx CHANGED
@@ -1,33 +1,5 @@
 ## Conclusions
 
-Here are the key takeaways from our experiments:
-
-- **Q: How do existing datasets compare?**<br/>
-  A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
-- **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
-  A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
-- **Q: Can new prompts beat DCLM?**<br/>
-  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM.
-- **Q: Does model size matter?**<br/>
-  A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
-- **Q: Do we need better models for low-quality data?**<br/>
-  A: No consistent advantage from larger models on low-quality sources.
-- **Q: Does the model family matter?**<br/>
-  A: Yes. SmolLM2 dominates across all prompts.
-- **Q: Does the model generation matter?**<br/>
-  A: Slightly. Newer Qwen versions trend better.
-- **Q: Is synthetic data enough?**<br/>
-  A: No. Always mix synthetic with original data.
-- **Q: Does the mix-in dataset matter?**<br/>
-  A: Yes, a major performance driver, sometimes more important than the synthetic data.
-- **Q: Does the source dataset matter?**<br/>
-  A: Not with a strong mix-in. Even low-quality sources produce competitive results.
-- **Q: Does increased diversity help?**<br/>
-  A: No, performance averages rather than compounds.
-- **Q: Do typos in the prompt hurt?**<br/>
-  A: No. Typos have no negative effect on downstream performance.
-
-The bottom line: the details of synthetic rephrasing matter a lot, and knowing which ones matter is the key to scaling it up. Prompt design is the single biggest lever, with structured formats like Math, Table, FAQ, and Tutorial consistently beating curated baselines. But equally important is knowing where you can cut corners without losing quality. You don't need a large rephrasing model (1B is enough for simple prompts, 4B for complex ones). You don't need pristine source data (even low-quality sources work with a strong mix-in). Smaller models generate faster, directly translating into higher throughput. And tolerating lower-quality sources opens up a much bigger and more diverse data pool to draw from. The practical recipe is straightforward: pick a strong structured prompt, use the smallest model that handles it, blend with high-quality original data, and spend your remaining compute on volume.
 
 ### Next Steps
 
app/src/content/chapters/experiments.mdx CHANGED
@@ -9,11 +9,6 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: potentially make a widget for data exploration: look at the same few samples generated by different models or transformed with different prompts */}
 {/* TODO: Standardize colors across charts in the blog post more */}
 {/* TODO: Check all the charts again in dark mode */}
-{/* TODO: In analyses make transitions better */}
-{/* TODO: Combine verbosity and compression analysis under one title */}
-{/* TODO: Combine quality score and edu/dclm score analysis under one title */}
-{/* TODO: Move conclusions to after the experiments into conclusions subsection */}
-{/* TODO: put the new analyses into context and update the intro paragraphs of the experiments and analyses accordingly */}
 
 ## Experiments
 
@@ -552,7 +547,7 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
 **Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
 </Note>
 
-Let's turn to some unexpected findings from our experiments.
+We've covered prompts, models, and datasets. One last question: how sensitive is all of this to small details in the prompt itself?
 
 ### Do Typos in the Prompt Hurt?
 
@@ -574,3 +569,34 @@ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) aga
 }}
 />
 
+### Takeaways
+
+Here are the key takeaways from our experiments:
+
+- **Q: How do existing datasets compare?**<br/>
+  A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
+- **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
+  A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
+- **Q: Can new prompts beat DCLM?**<br/>
+  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM.
+- **Q: Does model size matter?**<br/>
+  A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
+- **Q: Do we need better models for low-quality data?**<br/>
+  A: No consistent advantage from larger models on low-quality sources.
+- **Q: Does the model family matter?**<br/>
+  A: Yes. SmolLM2 dominates across all prompts.
+- **Q: Does the model generation matter?**<br/>
+  A: Slightly. Newer Qwen versions trend better.
+- **Q: Is synthetic data enough?**<br/>
+  A: No. Always mix synthetic with original data.
+- **Q: Does the mix-in dataset matter?**<br/>
+  A: Yes, a major performance driver, sometimes more important than the synthetic data.
+- **Q: Does the source dataset matter?**<br/>
+  A: Not with a strong mix-in. Even low-quality sources produce competitive results.
+- **Q: Does increased diversity help?**<br/>
+  A: No, performance averages rather than compounds.
+- **Q: Do typos in the prompt hurt?**<br/>
+  A: No. Typos have no negative effect on downstream performance.
+
+The bottom line: the details of synthetic rephrasing matter a lot, and knowing which ones matter is the key to scaling it up. Prompt design is the single biggest lever, with structured formats like Math, Table, FAQ, and Tutorial consistently beating curated baselines. But equally important is knowing where you can cut corners without losing quality. You don't need a large rephrasing model (1B is enough for simple prompts, 4B for complex ones). You don't need pristine source data (even low-quality sources work with a strong mix-in). Smaller models generate faster, directly translating into higher throughput. And tolerating lower-quality sources opens up a much bigger and more diverse data pool to draw from. The practical recipe is straightforward: pick a strong structured prompt, use the smallest model that handles it, blend with high-quality original data, and spend your remaining compute on volume.
+
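The "practical recipe" in the closing paragraph above (blend synthetic rephrasings with high-quality original data) can be sketched as a minimal mixing step. This is an illustrative Python sketch, not code from the repository: the function name, the default 0.5 ratio, and the in-memory document lists are all assumptions.

```python
import random


def mix_datasets(synthetic_docs, original_docs, synthetic_fraction=0.5, seed=0):
    """Blend synthetic rephrasings with original documents at a fixed ratio.

    `synthetic_fraction` is the share of synthetic documents in the final
    mix; 0.5 is a placeholder, not a ratio recommended by the post.
    """
    rng = random.Random(seed)
    n_synthetic = len(synthetic_docs)
    # Number of original docs needed so synthetic docs hit the target share.
    n_original = round(n_synthetic * (1 - synthetic_fraction) / synthetic_fraction)
    n_original = min(n_original, len(original_docs))
    mixed = list(synthetic_docs) + rng.sample(original_docs, n_original)
    rng.shuffle(mixed)  # interleave so training batches see both kinds of data
    return mixed
```

A real pipeline would do this at the shard or token level rather than over per-document Python lists, but the ratio bookkeeping is the same idea.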
app/src/content/chapters/introduction.mdx CHANGED
@@ -31,7 +31,7 @@ Synthetic data also plays a central role in post-training via *distillation*, wh
 
 However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
 
-In this blog post we take a journey to answer all these questions systematically. We ran XXX experiments and generated YYY tokens in total to find the ideal settings for synthetic data.
+In this blog post we take a journey to answer all these questions systematically. We ran 65 experiments, generated over 750 billion tokens, and spent {'>'}74,000 GPU hours (~8.5 GPU years) on rephrasing alone to find the ideal settings for synthetic data.
 
 Here's the plan:
 
@@ -39,7 +39,7 @@ We start with the [Infrastructure](#infrastructure) needed for synthetic data ge
 
 We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.
 
-Finally we present the suite of XXX [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.
+Finally, we present the suite of 65 [Experiments](#experiments) we ran to figure out best practices regarding which models, prompts, and settings work well.
 
 Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
 