Commit b42aabd · Parent: 77b87fe
moved the plot as a hook in the beginning
app/src/content/chapters/1-introduction.mdx CHANGED
@@ -11,6 +11,24 @@ import syntheticDataScaleImg from "../assets/image/synthetic-data-scale.jpg";
 
 <ReadingTime words={756} visuals={3} />
 
+We ran 90 experiments, generated over 1 trillion tokens, and spent 12.7 GPU years to find the best recipe for synthetic pretraining data. The result is **FinePhrase**, a dataset that clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). It's [available on the Hub](https://huggingface.co/datasets/HuggingFaceFW/finephrase), and this post walks you through everything we learned along the way.
+
+<HtmlEmbed
+  id="finephrase-vs-baselines"
+  src="d3-benchmark-comparison.html"
+  desc="FinePhrase compared against synthetic data baselines across evaluation metrics."
+  config={{
+    defaultView: "line",
+    datasets: {
+      cosmopedia: "Cosmopedia",
+      "mix-fw_edu_hq-table_smollm2_1.7b_hq": { display: "FinePhrase", color: "#EBA937" },
+      nemotron_hq_synth: "Nemotron-HQ-Synth",
+      rewire: "REWIRE",
+      synth_query_reasoning_answer: "SYNTH"
+    }
+  }}
+/>
+
 If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinity]), you may have noticed that synthetic data has become a key component of LLM training data. It is quickly becoming one of the standard tools for building high-quality datasets for LLM training. Looking back, we can see several paradigm shifts in LLM data, especially for pretraining, and synthetic data is the natural latest step:
 
 - After training the first language models on small-ish datasets like Wikipedia, people started scaling up pretraining corpora to include more and more data from the web. Datasets like C4 [@c4] and The Pile [@thepile] pushed into hundreds of gigabytes. Then FineWeb [@fineweb] and DCLM [@datacomp] brought things to the trillion-token scale, covering most of the crawlable web.
@@ -33,8 +51,6 @@ During SmolLM2 [@smollm2] training, the model was decent at coding and math but
 
 However, how to do synthetic data generation properly still resembles alchemy these days: Which model should you use? Which prompts work best, and how many do you need? And how do you even scale this effectively?
 
-In this blog post we take a journey to answer all these questions systematically. We ran 90 experiments, generated over 1 trillion tokens and spent {'>'}111,000 GPU hours (~12.7 GPU years) for rephrasing alone to find the ideal settings for synthetic data.
-
 Here's the plan:
 <Sidenote>
 The sections are fairly self-contained, so feel free to jump around and skip whatever seems less interesting to you.
@@ -42,20 +58,3 @@ The sections are fairly self-contained, so feel free to jump around and skip wha
 
 We start by [setting up the problem](#rephrasing-the-web): what rephrasing is, which approaches exist, and what we want to test. Then we dive into the 90 [Experiments](#experiments) we ran to figure out which prompts, models, and datasets actually work. The [Analyses](#analyses) section zooms out to ask *why* things work the way they do. Next comes the [Infrastructure](#infrastructure) that made all of this possible, including detailed throughput benchmarking of popular models (super important for getting the most data for your buck). Finally, we [put it all together](#applying-the-recipe-at-scale) into FinePhrase, our best configuration.
 
-Here's a preview of where we end up: FinePhrase clearly outperforms all existing synthetic data baselines (<FigRef target="finephrase-vs-baselines" />). The rest of this post explains the journey to get there.
-
-<HtmlEmbed
-  id="finephrase-vs-baselines"
-  src="d3-benchmark-comparison.html"
-  desc="FinePhrase compared against synthetic data baselines across evaluation metrics."
-  config={{
-    defaultView: "line",
-    datasets: {
-      cosmopedia: "Cosmopedia",
-      "mix-fw_edu_hq-table_smollm2_1.7b_hq": { display: "FinePhrase", color: "#EBA937" },
-      nemotron_hq_synth: "Nemotron-HQ-Synth",
-      rewire: "REWIRE",
-      synth_query_reasoning_answer: "SYNTH"
-    }
-  }}
-/>