Commit ·
6c3f55b
1
Parent(s): 3742813
removed figrefs and changed the setup for charts: first set the scene, then show the plot and finally analyze the results and do the transition.
app/src/content/chapters/1-introduction.mdx
CHANGED
@@ -4,14 +4,13 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
 import Wide from "../../components/Wide.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import Note from "../../components/Note.astro";
-import FigRef from "../../components/FigRef.astro";
 import syntheticDataScaleImg from "../assets/image/synthetic-data-scale.jpg";

 ## Introduction

-We ran 90 experiments, generated over 1 trillion tokens, and spent 12.7 GPU years to find the best recipe for synthetic pretraining data. The result is **FinePhrase**, a 486B token dataset that clearly outperforms all existing synthetic data baselines
 <Sidenote>
 Reading time: One weekend
 </Sidenote>

@@ -40,7 +39,7 @@ If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3
 We are seeing a radical shift in compute allocation for model training: while model training dominated the compute budget early on, more and more compute is now allocated to curating and improving the training datasets, both in pretraining and post-training.

-The scale is staggering: NVIDIA used LLMs to rephrase around 2 trillion tokens of web text for their [Nemotron-CC dataset](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) [@nemotroncc], while Z.ai generated 500 billion reasoning tokens to mid-train the GLM-4.5 series [@glm45].

 <figure id="synthetic-data-scale">
 <Image src={syntheticDataScaleImg} alt="Scale of synthetic data in recent LLM training runs" />
 import Wide from "../../components/Wide.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import Note from "../../components/Note.astro";
 import syntheticDataScaleImg from "../assets/image/synthetic-data-scale.jpg";

 ## Introduction

+We ran 90 experiments, generated over 1 trillion tokens, and spent 12.7 GPU years to find the best recipe for synthetic pretraining data. The result is **FinePhrase**, a 486B token dataset that clearly outperforms all existing synthetic data baselines. It's [available on the Hub](https://huggingface.co/datasets/HuggingFaceFW/finephrase), and this post walks you through everything we learned along the way.
 <Sidenote>
 Reading time: One weekend
 </Sidenote>

 We are seeing a radical shift in compute allocation for model training: while model training dominated the compute budget early on, more and more compute is now allocated to curating and improving the training datasets, both in pretraining and post-training.

+The scale is staggering: NVIDIA used LLMs to rephrase around 2 trillion tokens of web text for their [Nemotron-CC dataset](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2) [@nemotroncc], while Z.ai generated 500 billion reasoning tokens to mid-train the GLM-4.5 series [@glm45]. Here's how much synthetic data recent models are using:

 <figure id="synthetic-data-scale">
 <Image src={syntheticDataScaleImg} alt="Scale of synthetic data in recent LLM training runs" />

app/src/content/chapters/3-experiments.mdx
CHANGED
@@ -2,8 +2,6 @@ import HtmlEmbed from "../../components/HtmlEmbed.astro";
 import Note from "../../components/Note.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import Glossary from "../../components/Glossary.astro";
-import FigRef from "../../components/FigRef.astro";
-
 {/* TODO: Integrate decay experiment as another analysis for proxy */}
 {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}

@@ -18,7 +16,7 @@ Notes:
 ## Experiments

-Time to put all of this to the test. We ran 90 experiments to systematically answer our questions, and the journey took some unexpected turns.

 <HtmlEmbed
 id="experiment-overview"

@@ -27,9 +25,11 @@ Time to put all of this to the test. We ran 90 experiments to systematically ans
 desc="Flow of experiments from source datasets through prompt strategies to model families. Hover over nodes and links to see experiment counts."
 />

 ### How Do Existing Datasets Compare?

-First things first: where does the bar sit? We train on eight datasets under identical conditions and compare their

 <HtmlEmbed
 id="baselines-comparison"

@@ -50,11 +50,13 @@ First things first: where does the bar sit? We train on eight datasets under ide
 }}
 />

 Nemotron-HQ-Synth and REWIRE are both mixes of several prompts. So what's actually doing the heavy lifting inside them?

 #### Which Individual Prompts Match DCLM?

-We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source

 <Sidenote>
 The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.

@@ -81,11 +83,11 @@ The BeyondWeb dataset was never released and the paper omits key details, yet cl
 }}
 />

-That's a pretty underwhelming hit rate. Can we do better with our own prompts?

 ### Can New Prompts Beat DCLM?

-Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), [faq](#faq), [math](#math), [narrative](#narrative), [table](#table), [tutorial](#tutorial)), all using Gemma-3-1B on FineWeb-Edu-HQ

 <HtmlEmbed
 id="new-prompts"

@@ -107,6 +109,8 @@ Since most existing prompts fail to beat DCLM, we designed nine novel prompt for
 }}
 />

 So far we've been using Gemma-3-1B for everything. A natural question is: can we squeeze out more performance by throwing a bigger or better model at the problem?

 ### Impact of the Rephrasing Model

@@ -115,12 +119,7 @@ We look at this from three angles: model size, model family, and model generatio
 #### Does the model size matter?

-We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [math](#math), [tutorial](#tutorial), and REWIRE's [guided_rewrite](#guided_rewrite_original) prompts
-For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1B through 27B show no significant difference.
-SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
-The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
-This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
-The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.

 <Sidenote>
 It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.

@@ -174,11 +173,17 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
 }}
 />

 That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?

 #### Do we need better models for rephrasing low-quality data?

-REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data requires large models. We put this to the test by comparing 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts

 <HtmlEmbed
 id="size-quality"

@@ -226,11 +231,13 @@ REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data r
 }}
 />

 So model size doesn't matter much. But what if you're using the wrong model family entirely?

 #### Does the model family matter?

-We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts

 <Sidenote>
 We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.

@@ -334,6 +341,8 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
 }}
 />

 SmolLM2 is already over a year old at this point. If model quality matters, should we just wait for the next generation?
 <Sidenote>
 [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) was released during our experiments but is not compatible with the vLLM version we used for inference, and dependency hell prevented us from updating vLLM.

@@ -341,7 +350,7 @@ SmolLM2 is already over a year old at this point. If model quality matters, shou
 #### Does the model generation matter?

-We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt

 <HtmlEmbed
 id="model-generation"

@@ -358,6 +367,8 @@ We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and
 }}
 />

 <Note title="Summary: Impact of the Rephrasing Model" variant="info">
 **Model size**: 1B is sufficient. Larger models do not help.<br/>
 **Model family**: SmolLM2 dominates across all prompts.<br/>

@@ -373,7 +384,7 @@ So far we've always mixed synthetic data with a <Glossary term="source dataset"
 #### Is synthetic data enough?

-The dream scenario would be generating all your training data synthetically, no curation needed. We test this by comparing synthetic-only training vs mixed training (synthetic + source) across all our prompts on DCLM and FineWeb-Edu-HQ sources

 <HtmlEmbed
 id="synthetic-only"

@@ -424,11 +435,13 @@ The dream scenario would be generating all your training data synthetically, no
 }}
 />

 OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?

 #### Does the mix-in dataset matter?

-We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data

 <HtmlEmbed
 id="mixin-dataset"

@@ -464,11 +477,13 @@ We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, th
 }}
 />

 If the mix-in dataset matters so much, what about the source dataset we're actually rephrasing?

 #### Does the source dataset matter?

-We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [faq](#faq) and [tutorial](#tutorial) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ).

 <HtmlEmbed
 id="source-dataset-mixin-source"

@@ -498,6 +513,8 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
 }}
 />

 <HtmlEmbed
 id="source-dataset-fixed-mixin"
 src="d3-benchmark-comparison.html"

@@ -526,11 +543,11 @@ We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) wit
 }}
 />

-That opens up a much larger pool of source data to draw from. But can we squeeze out even more performance by increasing diversity in the synthetic portion?

 #### Does increased diversity help?

-We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies

 <Sidenote>
 Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.

@@ -578,6 +595,8 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
 }}
 />

 <Note title="Summary: Impact of the Dataset Choices" variant="info">
 **Synthetic-only**: Not enough. Always mix with original data.<br/>
 **Mix-in dataset**: Major performance driver, sometimes more important than the synthetic data itself.<br/>

@@ -590,7 +609,7 @@ We've covered prompts, models, and datasets. One last fun question: how sensitiv
 ### Do Typos in the Prompt Hurt?

-While implementing the REWIRE prompt, we noticed it contained several typos and grammatical errors. So we cleaned it up and ran both versions

 <HtmlEmbed
 id="typos-effect"

@@ -607,6 +626,8 @@ While implementing the REWIRE prompt, we noticed it contained several typos and
 }}
 />

 With that final detail in hand, let's take stock of everything we've found.

 ### Takeaways

 import Note from "../../components/Note.astro";
 import Sidenote from "../../components/Sidenote.astro";
 import Glossary from "../../components/Glossary.astro";

 {/* TODO: Integrate decay experiment as another analysis for proxy */}
 {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}

 ## Experiments

+Time to put all of this to the test. We ran 90 experiments to systematically answer our questions, and the journey took some unexpected turns. Here's the full landscape of what we explored, with source datasets flowing through prompt strategies to model families:

 <HtmlEmbed
 id="experiment-overview"

 desc="Flow of experiments from source datasets through prompt strategies to model families. Hover over nodes and links to see experiment counts."
 />

+We start by seeing how existing datasets stack up, then dissect what makes their prompts tick. From there we design our own prompts, explore how the rephrasing model affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
+
 ### How Do Existing Datasets Compare?

+First things first: where does the bar sit? We establish baselines by training on eight popular datasets under identical conditions and comparing their evaluation performance:

 <HtmlEmbed
 id="baselines-comparison"

 }}
 />

+DCLM, Nemotron-HQ-Synth, and REWIRE come out on top by a clear margin. The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, SYNTH, and EssentialWeb, fall notably behind. DCLM is the strongest baseline and becomes our target to beat for everything that follows.
+
 Nemotron-HQ-Synth and REWIRE are both mixes of several prompts. So what's actually doing the heavy lifting inside them?

 #### Which Individual Prompts Match DCLM?

+We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source:

 <Sidenote>
 The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.

 }}
 />

+Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM. The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline. That's a pretty underwhelming hit rate. Can we do better with our own prompts?

 ### Can New Prompts Beat DCLM?

+Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), [faq](#faq), [math](#math), [narrative](#narrative), [table](#table), [tutorial](#tutorial)), all using Gemma-3-1B on FineWeb-Edu-HQ:

 <HtmlEmbed
 id="new-prompts"

 }}
 />

+Four of them ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) clearly outperform DCLM, while the other five sit at or below DCLM level. The winning prompts share a common trait: they all restructure the source content into pedagogically rich formats rather than just paraphrasing it.
+
 So far we've been using Gemma-3-1B for everything. A natural question is: can we squeeze out more performance by throwing a bigger or better model at the problem?

 ### Impact of the Rephrasing Model

 #### Does the model size matter?

+We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [math](#math), [tutorial](#tutorial), and REWIRE's [guided_rewrite](#guided_rewrite_original) prompts. Use the Setup dropdown to switch between them:

 <Sidenote>
 It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.

 }}
 />

+For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1B through 27B show no significant difference.
+SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
+The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
+This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
+The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
+
 That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?

 #### Do we need better models for rephrasing low-quality data?

+REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data requires large models. We put this to the test by comparing 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts:

 <HtmlEmbed
 id="size-quality"

 }}
 />

+The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [faq](#faq) prompt the 1B model actually wins. We see no consistent advantage of using larger models for low-quality data.
+
 So model size doesn't matter much. But what if you're using the wrong model family entirely?

 #### Does the model family matter?

+We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts:

 <Sidenote>
 We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.

 }}
 />

+The result is striking: SmolLM2 consistently and clearly outperforms all others across every single prompt.
+
 SmolLM2 is already over a year old at this point. If model quality matters, should we just wait for the next generation?
 <Sidenote>
 [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) was released during our experiments but is not compatible with the vLLM version we used for inference, and dependency hell prevented us from updating vLLM.

 #### Does the model generation matter?

+We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt:

 <HtmlEmbed
 id="model-generation"

 }}
 />

+The differences are small, but there is a consistent upward trend: newer versions lead to slightly higher evaluation performance, with the gains accumulating from version 1.5 to 3.
+
 <Note title="Summary: Impact of the Rephrasing Model" variant="info">
 **Model size**: 1B is sufficient. Larger models do not help.<br/>
 **Model family**: SmolLM2 dominates across all prompts.<br/>

 #### Is synthetic data enough?

+The dream scenario would be generating all your training data synthetically, no curation needed. We test this by comparing synthetic-only training vs mixed training (synthetic + source) across all our prompts on DCLM and FineWeb-Edu-HQ sources:

 <HtmlEmbed
 id="synthetic-only"

 }}
 />

+Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type.
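The mixed-training setup above boils down to filling a token budget partly with synthetic and partly with original documents. A minimal sketch, with entirely made-up document sizes, names, and a 50/50 ratio (not our actual pipeline code):

```python
# Toy sketch of mixed training data: fill a token budget half with synthetic,
# half with original documents. All numbers and names are illustrative.
def build_mix(synthetic_docs, original_docs, budget_tokens, synthetic_frac=0.5):
    """Greedily take documents until each portion reaches its token share."""
    mix = []
    synthetic_target = int(budget_tokens * synthetic_frac)
    targets = [
        (synthetic_docs, synthetic_target),
        (original_docs, budget_tokens - synthetic_target),
    ]
    for docs, target in targets:
        used = 0
        for doc in docs:
            if used >= target:
                break
            mix.append(doc)
            used += doc["n_tokens"]
    return mix

synthetic = [{"id": f"syn-{i}", "n_tokens": 400} for i in range(10)]
original = [{"id": f"web-{i}", "n_tokens": 600} for i in range(10)]
mix = build_mix(synthetic, original, budget_tokens=4000)
```

Swapping the `original_docs` list is the knob the mix-in experiments below turn; varying `synthetic_frac` would trade off the two portions.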
+
 OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?

 #### Does the mix-in dataset matter?

+We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data:

 <HtmlEmbed
 id="mixin-dataset"

 }}
 />

+DCLM outperforms other mix-in datasets across the board. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones. This was one of our bigger surprises: the mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
+
 If the mix-in dataset matters so much, what about the source dataset we're actually rephrasing?

 #### Does the source dataset matter?

+We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [faq](#faq) and [tutorial](#tutorial) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). First, here's what happens when the mix-in varies with the source:

 <HtmlEmbed
 id="source-dataset-mixin-source"

 }}
 />

+Source quality appears to matter here: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia. But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes:
+
 <HtmlEmbed
 id="source-dataset-fixed-mixin"
 src="d3-benchmark-comparison.html"

 }}
 />

+This is exciting: it means you can rephrase even low-quality data and still get competitive results, as long as you pair it with a strong mix-in dataset. That opens up a much larger pool of source data to draw from. But can we squeeze out even more performance by increasing diversity in the synthetic portion?

 #### Does increased diversity help?

+We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies:

 <Sidenote>
 Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.

 }}
 />

+None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds. This was a bit disappointing. That said, our ablations train on only 20B tokens, so diversity benefits may emerge at larger scales where the model can better exploit the varied signal.
+
 <Note title="Summary: Impact of the Dataset Choices" variant="info">
 **Synthetic-only**: Not enough. Always mix with original data.<br/>
 **Mix-in dataset**: Major performance driver, sometimes more important than the synthetic data itself.<br/>

 ### Do Typos in the Prompt Hurt?

+While implementing the REWIRE prompt, we noticed it contained several typos and grammatical errors. So we cleaned it up and ran both versions:

 <HtmlEmbed
 id="typos-effect"

 }}
 />

+Typos don't hurt at all. For the 1B model, the typo-laden [original](#guided_rewrite_original) actually performs slightly better than the [improved version](#guided_rewrite_improved). So much for prompt polish.
+
 With that final detail in hand, let's take stock of everything we've found.

 ### Takeaways

app/src/content/chapters/4-analyses.mdx
CHANGED
@@ -1,5 +1,4 @@
 import HtmlEmbed from "../../components/HtmlEmbed.astro";
-import FigRef from "../../components/FigRef.astro";
 import Note from "../../components/Note.astro";
 import Wide from "../../components/Wide.astro";

@@ -10,11 +9,7 @@ The experiments tell us *what* works. Now let's zoom out and ask *why*. We look
 ### Is More Compute Worth It?

-Running 90 experiments is not cheap. GPU time varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time.
-
-**The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time *decreasing* performance.
-
-**The message is clear: invest in prompt design, not model size.** A well-chosen prompt on a 1B model will outperform a generic prompt on a 27B model at a tiny fraction of the cost. The only scenario where larger models might be justified is for complex prompts (like Guided Rewrite) that require more capable instruction following, but even there the gains are marginal.

 <Wide>
 <HtmlEmbed
@@ -25,24 +20,18 @@ Running 90 experiments is not cheap. GPU time varies by two orders of magnitude:
 />
 </Wide>

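The cost-performance comparison here is a Pareto-frontier computation: keep a configuration only if no other one is both cheaper and at least as good. A minimal sketch with made-up GPU-time and score values (the real numbers live in the interactive chart above):

```python
# Keep only Pareto-optimal (cost, score) configurations: a config is dropped
# if some other config is no worse on both axes and strictly better on one.
# All names and numbers below are hypothetical illustrations.
configs = [
    ("math + Gemma-3-1B", 9, 0.55),            # (name, GPU-days, agg score)
    ("table + SmolLM2-1.7B", 8, 0.54),
    ("guided_rewrite + Gemma-3-27B", 450, 0.53),
    ("tutorial + Gemma-3-12B", 90, 0.55),
]

def pareto_frontier(configs):
    return [
        (name, cost, score)
        for name, cost, score in configs
        if not any(
            c2 <= cost and s2 >= score and (c2 < cost or s2 > score)
            for _, c2, s2 in configs
        )
    ]

frontier = pareto_frontier(configs)
```

With these toy numbers, the two cheap 1B-class configs dominate: the 27B run is beaten on both cost and score, and the 12B run matches a 1B score at ten times the cost.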
| 28 |
Even the cheapest configurations still take over a week of GPU time, and we only know which ones work *after* rephrasing 10B tokens and then training a model. Wouldn't it be nice if we could just score the rephrased outputs directly and skip the expensive train-then-evaluate loop?
|
| 29 |
|
| 30 |
### Can Quality Scores Predict Performance?
|
| 31 |
|
| 32 |
-
FineWeb-Edu-score and DCLM-score are great quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and iterate on prompts without running the full pipeline each time. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores]
|
| 33 |
|
| 34 |
[^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.

-**DCLM-score is a moderate predictor of aggregate performance.** The DCLM-score difference (output minus input) shows the strongest correlation with `agg_score_macro` (ρ = 0.61, p {'<'} 0.001), followed by the output DCLM-score (ρ = 0.56). These are moderate correlations at best. The DCLM-score variants are particularly predictive for table understanding (ρ = 0.47–0.54) and reading comprehension (ρ = 0.49–0.52).
-
-**Edu-score tells a more nuanced story.** The input edu-score (the score of the original data before rephrasing) correlates with aggregate performance (ρ = 0.27, p {'<'} 0.05), but the output edu-score (the score of the rephrased data) shows essentially no correlation (ρ = −0.08, not significant). Starting with higher-quality source data matters, but the edu-score of the synthetic output is not a reliable proxy at all.
-
-{/*
-**The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
-*/}
-
-**Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **The bottom line: for synthetic data, there is no shortcut. You have to train models and evaluate them.**
-
{/*
Seven early runs have incorrect input quality scores due to a scoring pipeline bug and
are excluded in both charts' JS via BROKEN_INPUT_SCORES rather than patched in the JSON:
@@ -58,11 +47,17 @@ article/commentary/discussion/tutorial-1b-hq, tutorial-12b-hq, faq-1b-lq, faq-12
/>
</Wide>

-

-**

-*

<Wide>
<HtmlEmbed
@@ -73,11 +68,15 @@ The correlation matrix tells us that quality scores are weak predictors, but not
/>
</Wide>

So quality scores designed for filtering web data don't transfer to synthetic data. Maybe looking at the outputs more directly helps. For instance, does the length of the rephrased output tell us anything?

### Do Chatty Models Make Better Data?

-Different prompt formats produce wildly different output lengths.

<Wide>
<HtmlEmbed
@@ -88,9 +87,9 @@ Different prompt formats produce wildly different output lengths. <FigRef target
/>
</Wide>

-

-

<Wide>
<HtmlEmbed
@@ -101,6 +100,8 @@ But does this variation actually affect downstream performance? Our prompts prod
/>
</Wide>

So output length doesn't predict quality either. But we stumbled onto something more interesting while looking at output *diversity*: a case where a model that follows instructions poorly actually produces better training data.

### Math Rephrasing: When "Worse" Outputs Win

import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Note from "../../components/Note.astro";
import Wide from "../../components/Wide.astro";

### Is More Compute Worth It?

+Running 90 experiments is not cheap. GPU time varies by two orders of magnitude: the cheapest run (Table with SmolLM2) took 8 days, while the most expensive (Guided Rewrite with Gemma-3 27B) consumed over 15 months of GPU time. Here's each experiment's downstream performance plotted against its GPU cost on a log scale, with a Pareto frontier connecting the most efficient configurations:

<Wide>
<HtmlEmbed
/>
</Wide>

+**The Pareto frontier is dominated by small models with simple prompts.** The best cost-performance tradeoffs come from 1B-class models (Gemma-3-1B, SmolLM2-1.7B) paired with format prompts like Math, Table, and FAQ. Scaling up to 12B or 27B models pushes GPU time by 5-10x while at the same time *decreasing* performance.
+
+**The message is clear: invest in prompt design, not model size.** A well-chosen prompt on a 1B model will outperform a generic prompt on a 27B model at a tiny fraction of the cost. The only scenario where larger models might be justified is for complex prompts (like Guided Rewrite) that require more capable instruction following, but even there the gains are marginal.
+
Even the cheapest configurations still take over a week of GPU time, and we only know which ones work *after* rephrasing 10B tokens and then training a model. Wouldn't it be nice if we could just score the rephrased outputs directly and skip the expensive train-then-evaluate loop?

### Can Quality Scores Predict Performance?

+FineWeb-Edu-score and DCLM-score are great quality filters for human-written web data. If they also work for synthetic data, we could score rephrased outputs directly and iterate on prompts without running the full pipeline each time. We computed Spearman rank correlations between various edu-score and DCLM-score metrics (input scores, output scores, score differences, and relative improvements) and all downstream benchmark results across our 90 experiments.[^broken-scores] Here's the full correlation matrix:
[^broken-scores]: Seven early runs had incorrect input quality scores due to a scoring pipeline bug and are excluded from the quality score analyses: `article-1b-hq`, `commentary-1b-hq`, `discussion-1b-hq`, `tutorial-1b-hq`, `tutorial-12b-hq`, `faq-1b-lq`, and `faq-12b-lq`. Their downstream benchmark results are unaffected and included in all other analyses.

{/*
Seven early runs have incorrect input quality scores due to a scoring pipeline bug and
are excluded in both charts' JS via BROKEN_INPUT_SCORES rather than patched in the JSON:
/>
</Wide>

+**DCLM-score is a moderate predictor of aggregate performance.** The DCLM-score difference (output minus input) shows the strongest correlation with `agg_score_macro` (ρ = 0.61, p {'<'} 0.001), followed by the output DCLM-score (ρ = 0.56). These are moderate correlations at best. The DCLM-score variants are particularly predictive for table understanding (ρ = 0.47–0.54) and reading comprehension (ρ = 0.49–0.52).

+**Edu-score tells a more nuanced story.** The input edu-score (the score of the original data before rephrasing) correlates with aggregate performance (ρ = 0.27, p {'<'} 0.05), but the output edu-score (the score of the rephrased data) shows essentially no correlation (ρ = −0.08, not significant). Starting with higher-quality source data matters, but the edu-score of the synthetic output is not a reliable proxy at all.

+{/*
+**The HellaSwag/PIQA anomaly deserves a closer look.** Edu-score improvement shows strong *positive* correlations with HellaSwag (ρ = 0.60) and PIQA (ρ = 0.58), while being *negatively* correlated with math (ρ = −0.39) and reading comprehension (ρ = −0.30). We investigated whether this was a confound from prompt type (FAQ and tutorial prompts both increase edu-scores and might independently help NLU). The correlation survives partial correlation controlling for prompt type (ρ = 0.65 for HellaSwag, ρ = 0.56 for PIQA, both p {'<'} 0.001) and for model size within the Gemma family (ρ = 0.60 and 0.68). So the effect is real. However, the practical magnitude is tiny: HellaSwag scores range from 0.066 to 0.092 across all 90 experiments (CV = 5.8%), compared to `agg_score_macro` ranging from 0.096 to 0.172 (CV = 10.5%). The edu-score captures something about sentence-completion and physical-intuition quality, but the absolute differences are so small that optimizing for it would be chasing noise.
+*/}
+
+**Neither score is a reliable universal proxy.** WinoGrande shows essentially zero correlation with any predictor. The strongest individual correlations (ρ ≈ 0.56–0.61) are still only moderate, explaining roughly 30% of the variance at best. **The bottom line: for synthetic data, there is no shortcut. You have to train models and evaluate them.**
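For readers who want to run this kind of check on their own experiments, Spearman's ρ is just Pearson correlation computed on ranks (with average ranks for ties). A minimal pure-Python sketch; the data below is illustrative, not our actual scores:

```python
def rank(values):
    # Assign 1-based ranks, averaging over ties (standard Spearman convention)
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Pearson correlation on the rank vectors
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical DCLM-score deltas vs. aggregate benchmark scores
dclm_delta = [0.12, 0.45, 0.30, 0.05, 0.51, 0.22]
agg_score = [0.101, 0.163, 0.140, 0.096, 0.158, 0.125]
print(f"Spearman rho = {spearman(dclm_delta, agg_score):.2f}")  # → Spearman rho = 0.94
```

In practice you would also want the p-value (we used the standard `scipy.stats.spearmanr` for that); the sketch only shows where the ρ numbers in the matrix come from.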
+
+The correlation matrix tells us that quality scores are weak predictors, but not *how* scores change through rephrasing. The slope chart below visualizes this: each experiment is a line connecting its input score (left), output score (middle), and downstream `agg_score_macro` (right). Toggle between DCLM and edu-score views to see both perspectives:

<Wide>
<HtmlEmbed
/>
</Wide>

+**DCLM scores almost universally increase through rephrasing.** Nearly every experiment shows an upward slope from input to output DCLM score, regardless of prompt type or model. The rephrasing models produce cleaner, more structured text that the DCLM classifier rewards. But the slope from output DCLM score to downstream performance is much flatter and noisier, confirming that a high DCLM score does not guarantee good training data.
+
+**Edu-scores tell the opposite story.** Most experiments *decrease* the edu-score through rephrasing, particularly those starting from high-quality sources (FineWeb-Edu-HQ has high baseline edu-scores). The edu-score classifier penalizes format changes like tables, FAQs, and math notation that our best prompts produce. This is a case where the proxy metric actively misleads: the "quality degradation" measured by edu-score corresponds to format transformations that *improve* downstream performance.
+
So quality scores designed for filtering web data don't transfer to synthetic data. Maybe looking at the outputs more directly helps. For instance, does the length of the rephrased output tell us anything?

### Do Chatty Models Make Better Data?

+Different prompt formats produce wildly different output lengths. Here are the output tokens per document across four prompt types, broken down by model family:

<Wide>
<HtmlEmbed
/>
</Wide>

+Table and Math prompts tend to be concise, while FAQ and Tutorial prompts generate significantly more tokens per document. The spread within each prompt type varies across model families: some models are consistently verbose regardless of the prompt, while others adapt their output length to the task.

+But does this variation actually affect downstream performance? Our prompts produce outputs ranging from 25% of the input length (Commentary) to 150% (Guided Rewrite at 12B). Here's each experiment's compression ratio plotted against its benchmark score:

<Wide>
<HtmlEmbed
/>
</Wide>

+**There is no meaningful relationship between compression ratio and performance.** Highly compressive prompts (Commentary at 0.26x, Table at 0.25x) and expansive ones (Guided Rewrite at 1.5x) both appear across the full range of performance scores. The best-performing experiments cluster around 0.3x–0.8x compression, but this likely reflects the distribution of prompt types rather than any causal effect of compression itself. FAQ and Tutorial prompts, which happen to compress moderately, also happen to be the strongest prompts for other reasons (pedagogical restructuring, diverse output formats). What matters is the content and structure of the output, not its length relative to the input.
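The compression ratio itself is simple bookkeeping over token counts. A small sketch; the ratios match the ones quoted above, but the absolute token counts are made up for illustration:

```python
# Hypothetical per-experiment token counts; ratios mirror the text above
input_tokens = 10_000_000_000  # 10B input tokens per experiment

output_tokens = {
    "commentary": 2_600_000_000,          # ~0.26x: heavy compression
    "table": 2_500_000_000,               # ~0.25x
    "guided-rewrite-12b": 15_000_000_000, # ~1.5x: expansion
}

for name, out_toks in output_tokens.items():
    ratio = out_toks / input_tokens
    kind = "compresses" if ratio < 1 else "expands"
    print(f"{name}: {ratio:.2f}x ({kind})")
```

As the scatter plot shows, nothing in this number predicts downstream quality; it only tells you how many output tokens to budget for.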
+
So output length doesn't predict quality either. But we stumbled onto something more interesting while looking at output *diversity*: a case where a model that follows instructions poorly actually produces better training data.

### Math Rephrasing: When "Worse" Outputs Win

app/src/content/chapters/5-infrastructure.mdx
CHANGED
@@ -1,6 +1,5 @@
import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Sidenote from "../../components/Sidenote.astro";
-import FigRef from "../../components/FigRef.astro";
import Accordion from "../../components/Accordion.astro";
import Wide from "../../components/Wide.astro";

@@ -13,7 +12,7 @@ Thanks to fast inference engines like [vLLM](https://github.com/vllm-project/vll

We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to handle this. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue. We used it for every experiment in this blog post, from 10k-example test runs to the full FinePhrase production pipeline.

-

<HtmlEmbed
id="datatrove-pipeline"
@@ -21,6 +20,8 @@ We made major extensions to [DataTrove](https://github.com/huggingface/datatrove
caption="Overview of the DataTrove synthetic data generation pipeline. Documents flow through a three-stage pipeline (Read, Transform, Write), with the InferenceRunner dispatching rollout functions to vLLM/SGLang. The system supports local and Slurm-based execution with automatic upload and progress monitoring."
/>

### Generating synthetic data at scale

At the core is `examples/inference/benchmark/generate_data.py`, a [Typer](https://typer.tiangolo.com/)-powered entry point that orchestrates the full synthetic data loop:
@@ -281,7 +282,7 @@ So what did all 801 configurations tell us?

#### Results

-

<HtmlEmbed
id="optimization-sweep"
@@ -435,9 +436,7 @@ Further improvement ideas:
- add a second model below so we can compare.
*/}

-To get a feel for what these throughput numbers actually mean,
-
-With all these infrastructure pieces in place, we have everything we need to build FinePhrase: the right prompts, the right model, and the machinery to run it all at scale.

<Wide>
<HtmlEmbed
@@ -446,3 +445,5 @@ With all these infrastructure pieces in place, we have everything we need to bui
caption="Interactive GPU throughput simulator. Select a model and adjust the number of H100 GPUs to see how fast text is generated. Scale mapping: 1 page = 500 toks, 1 book = 500 pages, 1 shelf = 500 books."
/>
</Wide>

import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Sidenote from "../../components/Sidenote.astro";
import Accordion from "../../components/Accordion.astro";
import Wide from "../../components/Wide.astro";

We made major extensions to [DataTrove](https://github.com/huggingface/datatrove) [@datatrove] to handle this. DataTrove supports both local generation and large-scale distributed runs on Slurm clusters, handling chunking, checkpointing, distributed queueing, and Hugging Face dataset management so you can focus on synthetic data design rather than operational glue. We used it for every experiment in this blog post, from 10k-example test runs to the full FinePhrase production pipeline.

+Here's an overview of the pipeline:

<HtmlEmbed
id="datatrove-pipeline"
caption="Overview of the DataTrove synthetic data generation pipeline. Documents flow through a three-stage pipeline (Read, Transform, Write), with the InferenceRunner dispatching rollout functions to vLLM/SGLang. The system supports local and Slurm-based execution with automatic upload and progress monitoring."
/>

+Let's walk through it.
+
### Generating synthetic data at scale

At the core is `examples/inference/benchmark/generate_data.py`, a [Typer](https://typer.tiangolo.com/)-powered entry point that orchestrates the full synthetic data loop:

#### Results

+Here's the progression from baseline (vLLM defaults) through tier 0 and tier 1 optimization for all 18 models. Hover over any point to see the exact configuration and throughput:

<HtmlEmbed
id="optimization-sweep"
- add a second model below so we can compare.
*/}

+To get a feel for what these throughput numbers actually mean, pick a model below and scale up the number of GPUs. Each page represents roughly 500 tokens of generated text. At high enough throughput, pages roll up into books (500 pages each), and books into shelves (500 books each).

<Wide>
<HtmlEmbed
caption="Interactive GPU throughput simulator. Select a model and adjust the number of H100 GPUs to see how fast text is generated. Scale mapping: 1 page = 500 toks, 1 book = 500 pages, 1 shelf = 500 books."
/>
</Wide>
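The pages/books/shelves mapping in the simulator caption is straight arithmetic. A quick sketch of the conversion; the throughput figure below is a placeholder, not a measured number:

```python
# Scale mapping from the simulator caption
TOKENS_PER_PAGE = 500
PAGES_PER_BOOK = 500
BOOKS_PER_SHELF = 500

def tokens_to_shelves(tokens_per_second: float, seconds: float) -> dict:
    """Convert a generation budget into the simulator's pages/books/shelves units."""
    total_tokens = tokens_per_second * seconds
    pages = total_tokens / TOKENS_PER_PAGE
    books = pages / PAGES_PER_BOOK
    shelves = books / BOOKS_PER_SHELF
    return {"pages": pages, "books": books, "shelves": shelves}

# E.g., a hypothetical 20,000 tok/s across a cluster, running for one day:
day = tokens_to_shelves(20_000, 24 * 3600)
print(f"{day['pages']:.0f} pages = {day['books']:.1f} books = {day['shelves']:.3f} shelves")
```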
+
+With all these infrastructure pieces in place, we have everything we need to build FinePhrase: the right prompts, the right model, and the machinery to run it all at scale.

app/src/content/chapters/6-finephrase.mdx
CHANGED
@@ -1,7 +1,6 @@
import Image from "../../components/Image.astro";
import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Sidenote from "../../components/Sidenote.astro";
-import FigRef from "../../components/FigRef.astro";
import Wide from "../../components/Wide.astro";

import datasetCardImg from "../assets/image/auto-dataset-card.png";
@@ -66,14 +65,14 @@ On Slurm, a single `generate_data` call orchestrates three coordinated jobs: the

### Automatic HF Upload and Progress Monitoring

-We want you to be able to just press a button, let the GPUs go brrrr, and check back in to the finished dataset. DataTrove continuously uploads data to your specified Hugging Face dataset repo whenever a chunk is finished, using `ParquetWriter` with `hf://` paths so data appears on the Hub within minutes of generation, not after the full run completes. At the end, the `InferenceDatasetCardGenerator` pipeline step checks the logs directory, collects information about the throughput, and uploads a dataset card to document your new synthetic dataset.

<figure id="auto-dataset-card">
<Image src={datasetCardImg} alt="Auto-generated dataset card on the Hugging Face Hub" />
<figcaption>Example of an auto-generated dataset card with throughput metrics, uploaded to the Hugging Face Hub after inference completes.</figcaption>
</figure>

-For long-running inference jobs like FinePhrase (which runs for about two weeks), the `InferenceProgressMonitor` runs as a separate Slurm job alongside the inference workers. It periodically scans the output directory, counts completed chunks across all 100 tasks, and updates the dataset card on the Hub with a progress bar and ETA for each prompt template.

<figure id="finephrase-progress">
<Image src={finephraseProgressImg} alt="Live progress monitoring of the FinePhrase generation run" />
@@ -143,7 +142,7 @@ mkdir -p "$HF_XET_CACHE"

#### Multi-config dataset support

-FinePhrase runs four prompt templates that produce four independent dataset configs (faq, math, table, tutorial). Without config-awareness, all four templates would fight over a single dataset card and progress counters would exceed 100%. [PR #447](https://github.com/huggingface/datatrove/pull/447) adds first-class config support: outputs go to config-specific folders (`hf://datasets/HuggingFaceFW/finephrase/faq/`, `.../math/`, etc.), the dataset card merges information from all configs, and the progress monitor tracks each config independently so you see four separate progress bars (as in

#### Configurable server startup

@@ -164,7 +163,7 @@ With all these fixes in place, the pipeline ran to completion. So what does the

### What's in the Dataset?

-

<Wide>
<HtmlEmbed
import Image from "../../components/Image.astro";
import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Sidenote from "../../components/Sidenote.astro";
import Wide from "../../components/Wide.astro";

import datasetCardImg from "../assets/image/auto-dataset-card.png";

### Automatic HF Upload and Progress Monitoring

+We want you to be able to just press a button, let the GPUs go brrrr, and check back in to the finished dataset. DataTrove continuously uploads data to your specified Hugging Face dataset repo whenever a chunk is finished, using `ParquetWriter` with `hf://` paths so data appears on the Hub within minutes of generation, not after the full run completes. At the end, the `InferenceDatasetCardGenerator` pipeline step checks the logs directory, collects information about the throughput, and uploads a dataset card to document your new synthetic dataset. Here's an example of the auto-generated dataset card:

<figure id="auto-dataset-card">
<Image src={datasetCardImg} alt="Auto-generated dataset card on the Hugging Face Hub" />
<figcaption>Example of an auto-generated dataset card with throughput metrics, uploaded to the Hugging Face Hub after inference completes.</figcaption>
</figure>

+For long-running inference jobs like FinePhrase (which runs for about two weeks), the `InferenceProgressMonitor` runs as a separate Slurm job alongside the inference workers. It periodically scans the output directory, counts completed chunks across all 100 tasks, and updates the dataset card on the Hub with a progress bar and ETA for each prompt template. Here's the live progress dashboard during the FinePhrase generation run:

<figure id="finephrase-progress">
<Image src={finephraseProgressImg} alt="Live progress monitoring of the FinePhrase generation run" />

#### Multi-config dataset support

+FinePhrase runs four prompt templates that produce four independent dataset configs (faq, math, table, tutorial). Without config-awareness, all four templates would fight over a single dataset card and progress counters would exceed 100%. [PR #447](https://github.com/huggingface/datatrove/pull/447) adds first-class config support: outputs go to config-specific folders (`hf://datasets/HuggingFaceFW/finephrase/faq/`, `.../math/`, etc.), the dataset card merges information from all configs, and the progress monitor tracks each config independently so you see four separate progress bars (as in the [progress dashboard above](#finephrase-progress)).
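To make the per-config bookkeeping concrete, here's a toy sketch of the counting logic (the chunk counts and bar rendering are hypothetical; the real monitor scans the output directory and writes the result into the dataset card):

```python
# Each config is split into 100 tasks, as described above
TASKS_PER_CONFIG = 100

# Hypothetical completed-chunk counts per config at some point mid-run
completed = {"faq": 73, "math": 100, "table": 41, "tutorial": 8}

def progress_bar(done: int, total: int, width: int = 20) -> str:
    """Render a simple text progress bar."""
    filled = round(width * done / total)
    return "#" * filled + "-" * (width - filled)

# Config-aware tracking: one bar per config, each capped at 100%
for config, done in completed.items():
    pct = 100 * done / TASKS_PER_CONFIG
    print(f"{config:>8} [{progress_bar(done, TASKS_PER_CONFIG)}] {pct:.0f}%")

# Without config-awareness, a single counter lumps all configs together
# and can exceed 100%, which is exactly the failure mode PR #447 fixes:
naive_pct = 100 * sum(completed.values()) / TASKS_PER_CONFIG
print(f"naive single counter: {naive_pct:.0f}%")  # → naive single counter: 222%
```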

#### Configurable server startup

### What's in the Dataset?

+Browse some real examples from FinePhrase below. Each sample shows the original FineWeb-Edu source document alongside all four rephrased versions. Navigate through samples to see how the same web document becomes a FAQ, a math problem, a structured table, and a step-by-step tutorial:

<Wide>
<HtmlEmbed