Commit · 09c855a
Parent(s): 7adb03a

remove titles from charts, add summary note, remove highlighting and rephrase experiment paragraphs
app/src/components/HtmlEmbed.astro
CHANGED
@@ -1,7 +1,6 @@
 ---
 interface Props {
   src: string;
-  title?: string;
   desc?: string;
   caption?: string;
   frameless?: boolean;
@@ -13,7 +12,6 @@ interface Props {
 }
 const {
   src,
-  title,
   desc,
   caption,
   frameless = false,
@@ -69,11 +67,6 @@ const htmlWithId =
 {
   html ? (
     <figure class={`html-embed${wide ? " html-embed--wide" : ""}`} id={id}>
-      {title && (
-        <figcaption class="html-embed__title" style={`text-align:${align}`}>
-          {title}
-        </figcaption>
-      )}
       <div class={`html-embed__card${frameless ? " is-frameless" : ""}`}>
         <div
           id={mountId}
@@ -272,20 +265,6 @@ const htmlWithId =
   }
 }
 
-.html-embed__title {
-  text-align: left;
-  font-weight: 600;
-  font-size: 0.95rem;
-  color: var(--text-color);
-  margin: 0;
-  padding: 0;
-  padding-bottom: var(--spacing-1);
-  position: relative;
-  display: block;
-  width: 100%;
-  background: var(--page-bg);
-  z-index: var(--z-elevated);
-}
 .html-embed__card {
   background-color: var(--surface-bg);
   border: 1px solid var(--border-color);
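With the `title` prop removed, an `HtmlEmbed` call site now carries only `desc`, `caption`, and `config`. A minimal usage sketch of the updated component (prop values are illustrative, taken from the experiments chapter in this same commit):

```mdx
<HtmlEmbed
  id="baselines-comparison"
  src="d3-benchmark-comparison.html"
  desc="Comparison of baseline datasets across different evaluation metrics."
  config={{ baselines: [] }}
/>
```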
app/src/content/chapters/experiments.mdx
CHANGED
|
@@ -6,6 +6,7 @@ import FigRef from "../../components/FigRef.astro";
|
|
| 6 |
|
| 7 |
{/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
|
| 8 |
{/* TODO: shorten the vllm inference benchmark or put stuff into the appendix */}
|
|
|
|
| 9 |
{/* TODO: add a plot for the table with the benchmark results */}
|
| 10 |
{/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
|
| 11 |
{/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
|
|
@@ -18,18 +19,13 @@ import FigRef from "../../components/FigRef.astro";
|
|
| 18 |
|
| 19 |
With the infrastructure and setup in place, we now systematically work through our research questions. We start by benchmarking existing datasets and dissecting what makes their prompts tick. Then we test our own prompt designs, explore how the rephrasing model (size, family, generation) affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
|
| 20 |
|
| 21 |
-
###
|
| 22 |
|
| 23 |
-
We
|
| 24 |
-
|
| 25 |
-
We train on eight datasets under identical conditions and compare their final evaluation performance.
|
| 26 |
-
|
| 27 |
-
DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. <mark>TLDR: DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.</mark>
|
| 28 |
|
| 29 |
<HtmlEmbed
|
| 30 |
id="baselines-comparison"
|
| 31 |
src="d3-benchmark-comparison.html"
|
| 32 |
-
title="Baseline Comparison"
|
| 33 |
desc="Comparison of baseline datasets across different evaluation metrics. Use the dropdown to switch metrics."
|
| 34 |
config={{
|
| 35 |
baselines: [],
|
|
@@ -48,24 +44,17 @@ DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef ta
|
|
| 48 |
|
| 49 |
The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
|
| 50 |
|
| 51 |
-
####
|
| 52 |
-
|
| 53 |
-
Prior synthetic datasets bundle multiple prompts together. We want to understand what makes them tick.
|
| 54 |
-
|
| 55 |
-
<mark>Which individual prompts from existing synthetic methods actually match DCLM?</mark>
|
| 56 |
|
| 57 |
-
We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source.
|
| 58 |
|
| 59 |
<Sidenote>
|
| 60 |
The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
|
| 61 |
</Sidenote>
|
| 62 |
|
| 63 |
-
Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. <mark>TLDR: Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.</mark>
|
| 64 |
-
|
| 65 |
<HtmlEmbed
|
| 66 |
id="dissecting-baselines"
|
| 67 |
src="d3-benchmark-comparison.html"
|
| 68 |
-
title="Dissecting Synthetic Baselines"
|
| 69 |
desc="Individual prompt performance from existing synthetic datasets compared to DCLM and FineWeb-Edu-HQ."
|
| 70 |
config={{
|
| 71 |
baselines: ["dclm", "nemotron_hq_synth", "rewire"],
|
|
@@ -99,18 +88,13 @@ Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performa
|
|
| 99 |
|
| 100 |
Can we design prompts that consistently beat DCLM?
|
| 101 |
|
| 102 |
-
###
|
| 103 |
-
|
| 104 |
-
Since most existing prompts fail to beat DCLM, we designed new prompt formats targeting different skills. <mark>Can any of them outperform the baseline?</mark>
|
| 105 |
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) fall short (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats. <mark>TLDR: Math, table, FAQ, and tutorial prompts beat the DCLM baseline, while article, commentary, and discussion are at or below DCLM level.</mark>
|
| 109 |
|
| 110 |
<HtmlEmbed
|
| 111 |
id="new-prompts"
|
| 112 |
src="d3-benchmark-comparison.html"
|
| 113 |
-
title="New Prompt Performance"
|
| 114 |
desc="Seven new prompts compared against DCLM and FineWeb-Edu-HQ."
|
| 115 |
config={{
|
| 116 |
datasetNames: {
|
|
@@ -135,11 +119,7 @@ We want to know whether using a stronger model leads to better synthetic data. W
|
|
| 135 |
|
| 136 |
#### Does the model size matter?
|
| 137 |
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts.
|
| 141 |
-
|
| 142 |
-
The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. <mark>TLDR: Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.</mark>
|
| 143 |
|
| 144 |
<Sidenote>
|
| 145 |
It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
|
|
@@ -148,7 +128,6 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
|
|
| 148 |
<HtmlEmbed
|
| 149 |
id="model-size"
|
| 150 |
src="d3-benchmark-comparison.html"
|
| 151 |
-
title="Model Size"
|
| 152 |
desc="Gemma-3 model sizes (270M to 27B). Use the Setup dropdown to compare across prompts."
|
| 153 |
config={{
|
| 154 |
setups: {
|
|
@@ -182,16 +161,11 @@ On high-quality source data, we see no evidence that larger models help. But REW
|
|
| 182 |
|
| 183 |
#### Do we need better models for rephrasing low-quality data?
|
| 184 |
|
| 185 |
-
The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case).
|
| 186 |
-
|
| 187 |
-
We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts.
|
| 188 |
-
|
| 189 |
-
The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>
|
| 190 |
|
| 191 |
<HtmlEmbed
|
| 192 |
id="size-quality"
|
| 193 |
src="d3-benchmark-comparison.html"
|
| 194 |
-
title="Model Size vs Data Quality"
|
| 195 |
desc="1B vs 12B model on HQ vs LQ data. Use the Setup dropdown to compare across prompts."
|
| 196 |
config={{
|
| 197 |
setups: {
|
|
@@ -243,20 +217,15 @@ Since model size barely matters, does the model family make a difference?
|
|
| 243 |
|
| 244 |
#### Does the model family matter?
|
| 245 |
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on four prompts. Use the Setup dropdown to compare across prompts.
|
| 249 |
-
|
| 250 |
-
SmolLM2 consistently and clearly outperforms all others across all four prompts (see <FigRef target="model-family" />). <mark>TLDR: Model family matters a lot. SmolLM2 dominates, likely due to [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its training data.</mark>
|
| 251 |
|
| 252 |
<Sidenote>
|
| 253 |
-
We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit rewrite tasks in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
|
| 254 |
</Sidenote>
|
| 255 |
|
| 256 |
<HtmlEmbed
|
| 257 |
id="model-family"
|
| 258 |
src="d3-benchmark-comparison.html"
|
| 259 |
-
title="Model Family"
|
| 260 |
desc="Model families compared at ~1B scale. Use the Setup dropdown to compare across prompts."
|
| 261 |
config={{
|
| 262 |
setups: {
|
|
@@ -316,16 +285,11 @@ SmolLM2 is already a year old. Are newer model generations better?
|
|
| 316 |
|
| 317 |
#### Does the model generation matter?
|
| 318 |
|
| 319 |
-
We
|
| 320 |
-
|
| 321 |
-
We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt.
|
| 322 |
-
|
| 323 |
-
While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see <FigRef target="model-generation" />). <mark>TLDR: Newer model generations tend to produce slightly better synthetic data.</mark>
|
| 324 |
|
| 325 |
<HtmlEmbed
|
| 326 |
id="model-generation"
|
| 327 |
src="d3-benchmark-comparison.html"
|
| 328 |
-
title="Model Generation: Qwen Tutorial"
|
| 329 |
desc="Qwen model generations (1.5 to 3) on the tutorial prompt."
|
| 330 |
config={{
|
| 331 |
datasetNames: {
|
|
@@ -340,9 +304,9 @@ While the differences are small, we find a consistent trend: newer versions lead
|
|
| 340 |
/>
|
| 341 |
|
| 342 |
<Note title="Summary: Impact of the Rephrasing Model" variant="info">
|
| 343 |
-
**Model size**: 1B is sufficient. Larger models do not help.
|
| 344 |
-
**Model family**: SmolLM2 dominates across all prompts.
|
| 345 |
-
**Model generation**: Newer is slightly better.
|
| 346 |
**Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
|
| 347 |
</Note>
|
| 348 |
|
|
@@ -354,16 +318,11 @@ So far we've always mixed synthetic data with a <Glossary term="source dataset"
|
|
| 354 |
|
| 355 |
#### Is synthetic data enough?
|
| 356 |
|
| 357 |
-
We
|
| 358 |
-
|
| 359 |
-
We compare synthetic-only training vs mixed training (synthetic + source) for [tutorial](#tutorial) and [faq](#faq) prompts on DCLM and FineWeb-Edu-HQ sources.
|
| 360 |
-
|
| 361 |
-
Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines. <mark>TLDR: Synthetic data alone is not enough. Mixing with original data consistently improves performance.</mark>
|
| 362 |
|
| 363 |
<HtmlEmbed
|
| 364 |
id="synthetic-only"
|
| 365 |
src="d3-benchmark-comparison.html"
|
| 366 |
-
title="Is Synthetic Data Enough?"
|
| 367 |
desc="Synthetic-only vs mixed training. Use the Setup dropdown to compare across source datasets."
|
| 368 |
config={{
|
| 369 |
setups: {
|
|
@@ -391,20 +350,15 @@ Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mi
|
|
| 391 |
}}
|
| 392 |
/>
|
| 393 |
|
| 394 |
-
So
|
| 395 |
|
| 396 |
#### Does the mix-in dataset matter?
|
| 397 |
|
| 398 |
-
We
|
| 399 |
-
|
| 400 |
-
We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data.
|
| 401 |
-
|
| 402 |
-
DCLM and FineWeb-Edu-HQ outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). <mark>TLDR: The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.</mark>
|
| 403 |
|
| 404 |
<HtmlEmbed
|
| 405 |
id="mixin-dataset"
|
| 406 |
src="d3-benchmark-comparison.html"
|
| 407 |
-
title="Mix-in Dataset Effect"
|
| 408 |
desc="Effect of different mix-in datasets. Use the Setup dropdown to compare HQ vs LQ source data."
|
| 409 |
config={{
|
| 410 |
setups: {
|
|
@@ -441,16 +395,11 @@ The mix-in dataset matters enormously. But what about the source dataset we feed
|
|
| 441 |
|
| 442 |
#### Does the source dataset matter?
|
| 443 |
|
| 444 |
-
We
|
| 445 |
-
|
| 446 |
-
We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [tutorial](#tutorial) and [faq](#faq) prompts. We test two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ).
|
| 447 |
-
|
| 448 |
-
When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). This corroborates our finding that the mix-in matters much more than the source. <mark>TLDR: Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.</mark>
|
| 449 |
|
| 450 |
<HtmlEmbed
|
| 451 |
id="source-dataset-mixin-source"
|
| 452 |
src="d3-benchmark-comparison.html"
|
| 453 |
-
title="Source Dataset (Mix-in = Source)"
|
| 454 |
desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
|
| 455 |
config={{
|
| 456 |
setups: {
|
|
@@ -481,7 +430,6 @@ When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ
|
|
| 481 |
<HtmlEmbed
|
| 482 |
id="source-dataset-fixed-mixin"
|
| 483 |
src="d3-benchmark-comparison.html"
|
| 484 |
-
title="Source Dataset (Fixed Mix-in: FineWeb-Edu-HQ)"
|
| 485 |
desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
|
| 486 |
config={{
|
| 487 |
setups: {
|
|
@@ -513,22 +461,15 @@ This is exciting because it shows the potential of upcycling low-quality data th
|
|
| 513 |
|
| 514 |
#### Does increased diversity help?
|
| 515 |
|
| 516 |
-
|
| 517 |
-
|
| 518 |
-
We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies.
|
| 519 |
-
|
| 520 |
-
No significant improvement from any diversity strategy. Performance averages rather than compounds (see <FigRef target="diversity" />). However, our ablations train on only 20B tokens, so it is possible that diversity benefits only emerge at larger scales where the model can better exploit the varied signal.
|
| 521 |
|
| 522 |
<Sidenote>
|
| 523 |
Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
|
| 524 |
</Sidenote>
|
| 525 |
|
| 526 |
-
<mark>TLDR: At our 20B token scale, diversity does not compound. Mixing datasets averages rather than improves performance, though larger-scale experiments may tell a different story.</mark>
|
| 527 |
-
|
| 528 |
<HtmlEmbed
|
| 529 |
id="diversity"
|
| 530 |
src="d3-benchmark-comparison.html"
|
| 531 |
-
title="Diversity"
|
| 532 |
desc="Different diversity strategies. Use the Setup dropdown to compare approaches."
|
| 533 |
config={{
|
| 534 |
setups: {
|
|
@@ -571,20 +512,23 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
|
|
| 571 |
}}
|
| 572 |
/>
|
| 573 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 574 |
Let's turn to some unexpected findings from our experiments.
|
| 575 |
|
| 576 |
### Do Typos in the Prompt Hurt?
|
| 577 |
|
| 578 |
-
|
| 579 |
-
|
| 580 |
-
We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) against an [improved version](#guided_rewrite_improved), at both 1B and 12B scale.
|
| 581 |
-
|
| 582 |
-
Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see <FigRef target="typos-effect" />). <mark>TLDR: Typos in prompts do not hurt downstream performance.</mark>
|
| 583 |
|
| 584 |
<HtmlEmbed
|
| 585 |
id="typos-effect"
|
| 586 |
src="d3-benchmark-comparison.html"
|
| 587 |
-
title="Effect of Typos in Prompt"
|
| 588 |
desc="REWIRE prompt with original typos vs improved version at 1B and 12B scale."
|
| 589 |
config={{
|
| 590 |
datasetNames: {
|
|
@@ -612,9 +556,7 @@ TODO: Run this analysis and add a small report
|
|
| 612 |
|
| 613 |
### Math Rephrasing: When "Worse" Outputs Win
|
| 614 |
|
| 615 |
-
We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
|
| 616 |
-
|
| 617 |
-
We compare SmolLM2 (messy, variable outputs) vs Qwen3 (clean, structured outputs) for [math](#math) rephrasing.
|
| 618 |
|
| 619 |
**Qwen3 produced beautiful, structured outputs:**
|
| 620 |
|
|
@@ -663,4 +605,4 @@ SmolLM2's quality distribution was actually reasonable:
|
|
| 663 |
| Partial | 30+ tokens but missing structure | 25% |
|
| 664 |
| Poor | {'<'}30 tokens | 8% |
|
| 665 |
|
| 666 |
-
|
|
|
|
| 6 |
|
| 7 |
{/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
|
| 8 |
{/* TODO: shorten the vllm inference benchmark or put stuff into the appendix */}
|
| 9 |
+
{/* TODO: potentially make a widget for data exploration: look at the same few samples generated by different models or transformed with different prompts */}
|
| 10 |
{/* TODO: add a plot for the table with the benchmark results */}
|
| 11 |
{/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
|
| 12 |
{/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
|
|
|
|
| 19 |
|
| 20 |
With the infrastructure and setup in place, we now systematically work through our research questions. We start by benchmarking existing datasets and dissecting what makes their prompts tick. Then we test our own prompt designs, explore how the rephrasing model (size, family, generation) affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
|
| 21 |
|
| 22 |
+
### How Do Existing Datasets Compare?
|
| 23 |
|
| 24 |
+
We train on eight datasets under identical conditions and compare their final evaluation performance. DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
|
| 26 |
<HtmlEmbed
|
| 27 |
id="baselines-comparison"
|
| 28 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 29 |
desc="Comparison of baseline datasets across different evaluation metrics. Use the dropdown to switch metrics."
|
| 30 |
config={{
|
| 31 |
baselines: [],
|
|
|
|
| 44 |
|
| 45 |
The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
|
| 46 |
|
| 47 |
+
#### Which Individual Prompts Match DCLM?
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
|
| 49 |
+
We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source. Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.
|
| 50 |
|
| 51 |
<Sidenote>
|
| 52 |
The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
|
| 53 |
</Sidenote>
|
| 54 |
|
|
|
|
|
|
|
| 55 |
<HtmlEmbed
|
| 56 |
id="dissecting-baselines"
|
| 57 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 58 |
desc="Individual prompt performance from existing synthetic datasets compared to DCLM and FineWeb-Edu-HQ."
|
| 59 |
config={{
|
| 60 |
baselines: ["dclm", "nemotron_hq_synth", "rewire"],
|
|
|
|
| 88 |
|
| 89 |
Can we design prompts that consistently beat DCLM?
|
| 90 |
|
| 91 |
+
### Can New Prompts Beat DCLM?
|
|
|
|
|
|
|
| 92 |
|
| 93 |
+
Since most existing prompts fail to beat DCLM, we designed seven novel prompt formats targeting different skills ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
|
|
|
|
|
|
|
| 94 |
|
| 95 |
<HtmlEmbed
|
| 96 |
id="new-prompts"
|
| 97 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 98 |
desc="Seven new prompts compared against DCLM and FineWeb-Edu-HQ."
|
| 99 |
config={{
|
| 100 |
datasetNames: {
|
|
|
|
| 119 |
|
| 120 |
#### Does the model size matter?
|
| 121 |
|
| 122 |
+
We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts. The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 123 |
|
| 124 |
<Sidenote>
|
| 125 |
It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
|
|
|
|
| 128 |
<HtmlEmbed
|
| 129 |
id="model-size"
|
| 130 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 131 |
desc="Gemma-3 model sizes (270M to 27B). Use the Setup dropdown to compare across prompts."
|
| 132 |
config={{
|
| 133 |
setups: {
|
|
|
|
| 161 |
|
| 162 |
#### Do we need better models for rephrasing low-quality data?
|
| 163 |
|
| 164 |
+
The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 165 |
|
| 166 |
<HtmlEmbed
|
| 167 |
id="size-quality"
|
| 168 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 169 |
desc="1B vs 12B model on HQ vs LQ data. Use the Setup dropdown to compare across prompts."
|
| 170 |
config={{
|
| 171 |
setups: {
|
|
|
|
| 217 |
|
| 218 |
#### Does the model family matter?
|
| 219 |
|
| 220 |
+
We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on four prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all four prompts (see <FigRef target="model-family" />).
|
|
|
|
|
|
|
|
|
|
|
|
|
| 221 |
|
| 222 |
<Sidenote>
|
| 223 |
+
We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
|
| 224 |
</Sidenote>
|
| 225 |
|
| 226 |
<HtmlEmbed
|
| 227 |
id="model-family"
|
| 228 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 229 |
desc="Model families compared at ~1B scale. Use the Setup dropdown to compare across prompts."
|
| 230 |
config={{
|
| 231 |
setups: {
|
|
|
|
| 285 |
|
| 286 |
#### Does the model generation matter?
|
| 287 |
|
| 288 |
+
We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt. While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see <FigRef target="model-generation" />) especially cumulative from version 1.5 to 3.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 289 |
|
| 290 |
<HtmlEmbed
|
| 291 |
id="model-generation"
|
| 292 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 293 |
desc="Qwen model generations (1.5 to 3) on the tutorial prompt."
|
| 294 |
config={{
|
| 295 |
datasetNames: {
|
|
|
|
| 304 |
/>
|
| 305 |
|
| 306 |
<Note title="Summary: Impact of the Rephrasing Model" variant="info">
|
| 307 |
+
**Model size**: 1B is sufficient. Larger models do not help.<br/>
|
| 308 |
+
**Model family**: SmolLM2 dominates across all prompts.<br/>
|
| 309 |
+
**Model generation**: Newer is slightly better.<br/>
|
| 310 |
**Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
|
| 311 |
</Note>
|
| 312 |
|
|
|
|
| 318 |
|
| 319 |
#### Is synthetic data enough?
|
| 320 |
|
| 321 |
+
We compare synthetic-only training vs mixed training (synthetic + source) for [tutorial](#tutorial) and [faq](#faq) prompts on DCLM and FineWeb-Edu-HQ sources. Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 322 |
|
| 323 |
<HtmlEmbed
|
| 324 |
id="synthetic-only"
|
| 325 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 326 |
desc="Synthetic-only vs mixed training. Use the Setup dropdown to compare across source datasets."
|
| 327 |
config={{
|
| 328 |
setups: {
|
|
|
|
| 350 |
}}
|
| 351 |
/>
|
| 352 |
|
| 353 |
+
So synthetic data alone does not seem to be enough. But how much does the specific choice of mix-in dataset affect performance?
|
| 354 |
|
| 355 |
#### Does the mix-in dataset matter?
|
| 356 |
|
| 357 |
+
We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data. DCLM and FineWeb-Edu-HQ outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 358 |
|
| 359 |
<HtmlEmbed
|
| 360 |
id="mixin-dataset"
|
| 361 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 362 |
desc="Effect of different mix-in datasets. Use the Setup dropdown to compare HQ vs LQ source data."
|
| 363 |
config={{
|
| 364 |
setups: {
|
|
|
|
| 395 |
|
| 396 |
#### Does the source dataset matter?
|
| 397 |
|
| 398 |
+
We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [tutorial](#tutorial) and [faq](#faq) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.

<HtmlEmbed
  id="source-dataset-mixin-source"
  src="d3-benchmark-comparison.html"
  desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
  config={{
    setups: {
      …
  }}
/>

<HtmlEmbed
  id="source-dataset-fixed-mixin"
  src="d3-benchmark-comparison.html"
  desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
  config={{
    setups: {
      …
  }}
/>
#### Does increased diversity help?
We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies. None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds (see <FigRef target="diversity" />). However, our ablations train on only 20B tokens, so it is possible that diversity benefits only emerge at larger scales where the model can better exploit the varied signal.
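A rough sketch of how such diversity strategies could be wired up, assuming per-document sampling of a (prompt, model) pair; the prompt names follow the text, while the model names and the sampling scheme are placeholders:

```python
import random

PROMPTS = ["tutorial", "faq"]     # rephrasing prompt styles from the text
MODELS = ["model-a", "model-b"]   # generator families (placeholder names)

def assign_configs(num_docs, mix_prompts=True, mix_models=True, seed=0):
    """Sample a (prompt, model) pair per source document; disabling an
    axis pins it to the first entry, giving the single-config baseline."""
    rng = random.Random(seed)
    return [
        (
            rng.choice(PROMPTS) if mix_prompts else PROMPTS[0],
            rng.choice(MODELS) if mix_models else MODELS[0],
        )
        for _ in range(num_docs)
    ]
```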
<Sidenote>
Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
</Sidenote>

<HtmlEmbed
  id="diversity"
  src="d3-benchmark-comparison.html"
  desc="Different diversity strategies. Use the Setup dropdown to compare approaches."
  config={{
    setups: {
      …
  }}
/>
<Note title="Summary: Impact of the Dataset Choices" variant="info">
**Synthetic-only**: Not enough. Always mix with original data.<br/>
**Mix-in dataset**: Major performance driver, sometimes more important than the synthetic data itself.<br/>
**Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
**Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
**Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
</Note>

Let's turn to some unexpected findings from our experiments.
### Do Typos in the Prompt Hurt?
We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) against an [improved version](#guided_rewrite_improved) at both 1B and 12B scale. Surprisingly, the typos show no negative effect on downstream model performance; for the 1B model, the typo-laden original actually performs slightly better (see <FigRef target="typos-effect" />).

<HtmlEmbed
  id="typos-effect"
  src="d3-benchmark-comparison.html"
  desc="REWIRE prompt with original typos vs improved version at 1B and 12B scale."
  config={{
    datasetNames: {
      …
  }}
/>

### Math Rephrasing: When "Worse" Outputs Win
We compare two ~1.7B-parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs look objectively worse, yet models trained on them perform better.
**Qwen3 produced beautiful, structured outputs:**

…

| Partial | 30+ tokens but missing structure | 25% |
| Poor | {'<'}30 tokens | 8% |
For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
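The visible rubric rows suggest a simple length-and-structure heuristic. A sketch under that reading (the 30-token threshold follows the table; the `structure_marker` check is a hypothetical stand-in for whatever structural criterion was actually used):

```python
def grade_output(text, min_tokens=30, structure_marker="Answer:"):
    """Bucket a generated problem by the rubric: under 30 tokens is
    "poor", 30+ tokens without the expected structure is "partial",
    otherwise "good". The structure check is a placeholder marker."""
    tokens = text.split()  # crude whitespace tokenization
    if len(tokens) < min_tokens:
        return "poor"
    if structure_marker not in text:
        return "partial"
    return "good"
```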
app/src/content/chapters/infrastructure.mdx
CHANGED
@@ -373,7 +373,6 @@ The benchmark config defines **801 unique configurations** across 8 experiment g
<HtmlEmbed
  id="optimization-sweep"
  src="d3-optimization-sweep.html"
- title="Throughput Optimization Sweep"
  desc="Throughput optimization across 18 models in two tiers. Tier 0 tunes serving parameters (tp, mns, mnbt). Tier 1 adds gpu-memory-utilization and speculative decoding. Shape encodes tier, color encodes model family."
/>
app/src/content/chapters/introduction.mdx
CHANGED
@@ -46,7 +46,6 @@ Here's a preview of where we end up: FinePhrase, our best configuration, clearly
<HtmlEmbed
  id="finephrase-vs-baselines"
  src="d3-benchmark-comparison.html"
- title="FinePhrase vs Synthetic Baselines"
  desc="FinePhrase compared against synthetic data baselines across evaluation metrics."
  config={{
    defaultView: "line",