Commit 1271748 (parent: ac60d42)
moved conclusions to experiments, polished transitions and filled in fancy numbers
app/src/content/chapters/conclusions.mdx CHANGED

@@ -1,33 +1,5 @@
 ## Conclusions
 
-Here are the key takeaways from our experiments:
-
-- **Q: How do existing datasets compare?**<br/>
-  A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
-- **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
-  A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
-- **Q: Can new prompts beat DCLM?**<br/>
-  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM.
-- **Q: Does model size matter?**<br/>
-  A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
-- **Q: Do we need better models for low-quality data?**<br/>
-  A: No consistent advantage from larger models on low-quality sources.
-- **Q: Does the model family matter?**<br/>
-  A: Yes. SmolLM2 dominates across all prompts.
-- **Q: Does the model generation matter?**<br/>
-  A: Slightly. Newer Qwen versions trend better.
-- **Q: Is synthetic data enough?**<br/>
-  A: No. Always mix synthetic with original data.
-- **Q: Does the mix-in dataset matter?**<br/>
-  A: Yes, a major performance driver, sometimes more important than the synthetic data.
-- **Q: Does the source dataset matter?**<br/>
-  A: Not with a strong mix-in. Even low-quality sources produce competitive results.
-- **Q: Does increased diversity help?**<br/>
-  A: No, performance averages rather than compounds.
-- **Q: Do typos in the prompt hurt?**<br/>
-  A: No. Typos have no negative effect on downstream performance.
-
-The bottom line: the details of synthetic rephrasing matter a lot, and knowing which ones matter is the key to scaling it up. Prompt design is the single biggest lever, with structured formats like Math, Table, FAQ, and Tutorial consistently beating curated baselines. But equally important is knowing where you can cut corners without losing quality. You don't need a large rephrasing model (1B is enough for simple prompts, 4B for complex ones). You don't need pristine source data (even low-quality sources work with a strong mix-in). Smaller models generate faster, directly translating into higher throughput. And tolerating lower-quality sources opens up a much bigger and more diverse data pool to draw from. The practical recipe is straightforward: pick a strong structured prompt, use the smallest model that handles it, blend with high-quality original data, and spend your remaining compute on volume.
-
 
 ### Next Steps
 
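The "practical recipe" paragraph above (pick a strong prompt, use the smallest capable model, blend synthetic output with high-quality original data) can be sketched roughly as follows. This is a minimal illustration with hypothetical helper and variable names, not the post's actual pipeline; in particular, the 50/50 blend ratio is an assumption for demonstration, since the text only says to always mix synthetic with original data.

```python
def blend(synthetic_docs, original_docs, synthetic_ratio=0.5):
    """Interleave synthetic and original documents at a fixed ratio.

    NOTE: hypothetical sketch -- the ratio and interleaving strategy are
    illustrative assumptions, not values taken from the experiments.
    """
    mixed = []
    si, oi = 0, 0
    while si < len(synthetic_docs) or oi < len(original_docs):
        total = si + oi
        # Emit a synthetic doc whenever the running synthetic share
        # has fallen below the requested ratio.
        want_synth = (total == 0 and synthetic_ratio > 0) or (
            total > 0 and si / total < synthetic_ratio
        )
        if want_synth and si < len(synthetic_docs):
            mixed.append(synthetic_docs[si])
            si += 1
        elif oi < len(original_docs):
            mixed.append(original_docs[oi])
            oi += 1
        else:
            # Original pool exhausted: drain remaining synthetic docs.
            mixed.append(synthetic_docs[si])
            si += 1
    return mixed
```

A greedy interleave like this keeps the synthetic share close to the target ratio at every prefix of the stream, which matters when training consumes the mix sequentially rather than after a global shuffle.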
app/src/content/chapters/experiments.mdx CHANGED

@@ -9,11 +9,6 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: potentially make a widget for data exploration: look at the same few samples generated by different models or transformed with different prompts */}
 {/* TODO: Standardize colors across charts in the blog post more */}
 {/* TODO: Check all the charts again in dark mode */}
-{/* TODO: In analyses make transitions better */}
-{/* TODO: Combine verbosity and compression analysis under one title */}
-{/* TODO: Combine quality score and edu/dclm score analysis under one title */}
-{/* TODO: Move conclusions to after the experiments into conclusions subsection */}
-{/* TODO: put the new analyses into context and update the intro paragraphs of the experiments and analyses accordingly */}
 
 ## Experiments
 
@@ -552,7 +547,7 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
 **Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
 </Note>
 
-
+We've covered prompts, models, and datasets. One last question: how sensitive is all of this to small details in the prompt itself?
 
 ### Do Typos in the Prompt Hurt?
 
@@ -574,3 +569,34 @@ We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) aga
 }}
 />
 
+### Takeaways
+
+Here are the key takeaways from our experiments:
+
+- **Q: How do existing datasets compare?**<br/>
+  A: DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.
+- **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
+  A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
+- **Q: Can new prompts beat DCLM?**<br/>
+  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM.
+- **Q: Does model size matter?**<br/>
+  A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
+- **Q: Do we need better models for low-quality data?**<br/>
+  A: No consistent advantage from larger models on low-quality sources.
+- **Q: Does the model family matter?**<br/>
+  A: Yes. SmolLM2 dominates across all prompts.
+- **Q: Does the model generation matter?**<br/>
+  A: Slightly. Newer Qwen versions trend better.
+- **Q: Is synthetic data enough?**<br/>
+  A: No. Always mix synthetic with original data.
+- **Q: Does the mix-in dataset matter?**<br/>
+  A: Yes, a major performance driver, sometimes more important than the synthetic data.
+- **Q: Does the source dataset matter?**<br/>
+  A: Not with a strong mix-in. Even low-quality sources produce competitive results.
+- **Q: Does increased diversity help?**<br/>
+  A: No, performance averages rather than compounds.
+- **Q: Do typos in the prompt hurt?**<br/>
+  A: No. Typos have no negative effect on downstream performance.
+
+The bottom line: the details of synthetic rephrasing matter a lot, and knowing which ones matter is the key to scaling it up. Prompt design is the single biggest lever, with structured formats like Math, Table, FAQ, and Tutorial consistently beating curated baselines. But equally important is knowing where you can cut corners without losing quality. You don't need a large rephrasing model (1B is enough for simple prompts, 4B for complex ones). You don't need pristine source data (even low-quality sources work with a strong mix-in). Smaller models generate faster, directly translating into higher throughput. And tolerating lower-quality sources opens up a much bigger and more diverse data pool to draw from. The practical recipe is straightforward: pick a strong structured prompt, use the smallest model that handles it, blend with high-quality original data, and spend your remaining compute on volume.
+
app/src/content/chapters/introduction.mdx CHANGED

@@ -31,7 +31,7 @@ Synthetic data also plays a central role in post-training via *distillation*, wh
 
 However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best and how many do you need? And how do you even scale this effectively?
 
-In this blog post we take a journey to answer all these questions systematically. We ran
+In this blog post we take a journey to answer all these questions systematically. We ran 65 experiments, generated over 750 billion tokens and spent {'>'}74,000 GPU hours (~8.5 GPU years) for rephrasing alone to find the ideal settings for synthetic data.
 
 Here's the plan:
 
@@ -39,7 +39,7 @@ We start with the [Infrastructure](#infrastructure) needed for synthetic data ge
 
 We continue with the [Setup](#setup), a walkthrough of the different approaches for synthetic data in pretraining, from explaining what prior work did to the prompts we are experimenting with.
 
-Finally we present the suite of
+Finally we present the suite of 65 [Experiments](#experiments) we ran to figure out best practices regarding what models, prompts and settings work well.
 
 Here's a preview of where we end up: FinePhrase, our best configuration, clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains what's needed to get there.
 
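The introduction's headline numbers include a conversion from GPU hours to GPU years. A quick back-of-the-envelope check (assuming a non-leap 365-day year) confirms the rounding:

```python
# Sanity check of the "~8.5 GPU years" figure quoted in the intro.
gpu_hours = 74_000
hours_per_year = 24 * 365  # 8,760 hours in a non-leap year
gpu_years = gpu_hours / hours_per_year
print(round(gpu_years, 2))  # -> 8.45, i.e. roughly 8.5 GPU years
```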