import HtmlEmbed from "../../components/HtmlEmbed.astro";
import Note from "../../components/Note.astro";
import Sidenote from "../../components/Sidenote.astro";
import Glossary from "../../components/Glossary.astro";
{/* TODO: Integrate decay experiment as another analysis for proxy */}
{/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
{/* TODO: run variance experiments with pretraining from scratch */}
{/* TODO: baselines mixed with fw-edu-hq usually improve upon just baselines, but not sure if/how to present this */}
{/*
Notes:
- Finepdfs-edu outperforms even DCLM quite clearly. This would change the whole story completely so it would be quite time consuming to adapt. Therefore we leave it out for now.
*/}
## Experiments
Time to put all of this to the test. We ran 90 experiments to systematically answer our questions, and the journey took some unexpected turns. Here's the full landscape of what we explored, with source datasets flowing through prompt strategies to model families:
<HtmlEmbed
id="experiment-overview"
src="experiment-overview.html"
data="rephrasing_metadata.json"
desc="Flow of experiments from source datasets through prompt strategies to model families. Hover over nodes and links to see experiment counts."
/>
We start by seeing how existing datasets stack up, then dissect what makes their prompts tick. From there we design our own prompts, explore how the rephrasing model affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse. Each major section ends with a summary box highlighting the key takeaways.
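Throughout these experiments, "rephrasing" means handing each source document to an instruct model together with a prompt template and keeping the model's output as synthetic training text. A minimal sketch of that step (the template wording and truncation limit below are illustrative placeholders, not our exact ones):

```python
# Illustrative sketch only: template text and limit are not the exact ones used here.
MAX_SOURCE_CHARS = 8000  # keep the document within the rephrasing model's context window

TUTORIAL_TEMPLATE = (
    "Rewrite the following web document as a step-by-step tutorial. "
    "Preserve all facts; do not invent information.\n\nDocument:\n{document}"
)

def build_rephrase_request(document: str, template: str = TUTORIAL_TEMPLATE) -> list[dict]:
    """Turn one source document into a chat request for the rephrasing model."""
    prompt = template.format(document=document[:MAX_SOURCE_CHARS])
    return [{"role": "user", "content": prompt}]
```

Requests like this are then batched through an inference engine (vLLM in our setup), once per document and prompt combination.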
### How Do Existing Datasets Compare?
First things first: where does the bar sit? We establish baselines by training on eight popular datasets under identical conditions and comparing their evaluation performance:
<HtmlEmbed
id="baselines-comparison"
src="d3-benchmark-comparison.html"
desc="Comparison of baseline datasets across different evaluation metrics. Use the dropdown to switch metrics."
config={{
datasets: {
cosmopedia: "Cosmopedia",
dclm: "DCLM",
fw_edu_hq: "FineWeb-Edu-HQ",
fw_edu_lq: "FineWeb-Edu-LQ",
nemotron_hq_synth: "Nemotron-HQ-Synth",
rewire: "REWIRE",
synth_query_reasoning_answer: "SYNTH",
"ultra-fineweb": "Ultra-FineWeb"
}
}}
/>
DCLM, Nemotron-HQ-Synth, and REWIRE come out on top by a clear margin. The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. DCLM is the strongest baseline and becomes our target to beat for everything that follows.
Nemotron-HQ-Synth and REWIRE are both mixes of several prompts. So what's actually doing the heavy lifting inside them?
#### Which Individual Prompts Match DCLM?
We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source:
<Sidenote>
The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
</Sidenote>
<HtmlEmbed
id="dissecting-baselines"
src="d3-benchmark-comparison.html"
desc="Individual prompt performance from existing synthetic datasets compared to the DCLM baseline."
config={{
datasets: {
"mix-fw_edu_hq-diverse_qa_pairs_1b_hq": { display: "Diverse QA Pairs", color: "#c5e384" },
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true },
"mix-fw_edu_hq-extract_knowledge_1b_hq": { display: "Extract Knowledge", color: "#3d6b00" },
"mix-fw_edu_hq-guided_rewrite_original_1b_hq": { display: "Guided Rewrite", color: "#6aabff" },
nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900", shaded: true },
rewire: { display: "REWIRE", color: "#1877F2", shaded: true },
"mix-fw_edu_hq-distill_1b_hq": { display: "Distill", color: "#a0c95c" },
"mix-fw_edu_hq-wikipedia_style_rephrasing_1b_hq": { display: "Wikipedia Rephrasing", color: "#7fb034" },
"mix-fw_edu_hq-knowledge_list_1b_hq": { display: "Knowledge List", color: "#5e960e" },
"mix-fw_edu_hq-continue_1b_hq": { display: "Continue", color: "#e8713a" },
"mix-fw_edu_hq-summarize_1b_hq": { display: "Summarize", color: "#c4451c" }
}
}}
/>
On aggregate, only [diverse_qa_pairs](#diverse_qa_pairs) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM. The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline. That's a pretty underwhelming hit rate.
But the aggregate hides a striking pattern. Switch to individual benchmarks with the dropdown and you'll see that DCLM dominates on HellaSwag and PIQA (commonsense reasoning), beating every single synthetic prompt. Meanwhile, almost all synthetic prompts comfortably beat DCLM on ARC (science knowledge) and SQuAD (reading comprehension). Rephrasing is essentially trading commonsense reasoning for factual recall. The aggregate score papers over this because gains on one side roughly cancel losses on the other. Keep an eye on this trade-off as you read on: it explains why mixing in original data matters, why DCLM is the best mix-in, and why synthetic-only training underperforms.
Can we do better with our own prompts?
### Can New Prompts Beat DCLM?
Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([article](#article), [commentary](#commentary), [discussion](#discussion), [explanation](#explanation), [faq](#faq), [math](#math), [narrative](#narrative), [table](#table), [tutorial](#tutorial)), all using Gemma-3-1B on FineWeb-Edu-HQ:
<HtmlEmbed
id="new-prompts"
src="d3-benchmark-comparison.html"
desc="Nine new prompts compared against the DCLM baseline."
config={{
datasets: {
"mix-fw_edu_hq-article_1b_hq": "Article",
"mix-fw_edu_hq-commentary_1b_hq": "Commentary",
"mix-fw_edu_hq-discussion_1b_hq": "Discussion",
"mix-fw_edu_hq-explanation_1b_hq": "Explanation",
"mix-fw_edu_hq-faq_1b_hq": "FAQ",
"mix-fw_edu_hq-math_1b_hq": "Math",
"mix-fw_edu_hq-narrative_1b_hq": "Narrative",
"mix-fw_edu_hq-table_1b_hq": "Table",
"mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
}}
/>
Four of them ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) clearly outperform DCLM, while the other five sit at or below DCLM level. The winning prompts share a common trait: they all restructure the source content into pedagogically rich formats rather than just paraphrasing it.
The commonsense-vs-knowledge trade-off from the previous section persists here too: switch to HellaSwag or PIQA and every single prompt, including the four winners, falls below DCLM. The new prompts win on aggregate because their ARC and SQuAD gains outweigh the commonsense losses, not because they improve across the board.
Each prompt also has a distinct benchmark signature. [Table](#table) produces the strongest ARC boost (+7.5pp over DCLM), [math](#math) is the only prompt that meaningfully moves GSM8K (+1.5pp, all others are within ±0.5pp) and also has the largest SQuAD gain (+11.2pp), and [tutorial](#tutorial) is the only prompt that improves DROP (+1.4pp). GSM8K's resistance is notable: math reasoning appears to require math-specific content, not just any pedagogical restructuring.
So far we've been using Gemma-3-1B for everything. A natural question is: can we squeeze out more performance by throwing a bigger or better model at the problem?
### Impact of the Rephrasing Model
We look at this from three angles: model size, model family, and model generation.
#### Does the model size matter?
We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [math](#math), [tutorial](#tutorial), and REWIRE's [guided_rewrite](#guided_rewrite_original) prompts. Use the Setup dropdown to switch between them:
<Sidenote>
It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
</Sidenote>
<HtmlEmbed
id="model-size"
src="d3-benchmark-comparison.html"
desc="Model sizes across Gemma-3 and SmolLM2. Use the Setup dropdown to compare across models and prompts."
config={{
setups: {
"Gemma-3: Math": {
datasets: {
"mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
"mix-fw_edu_hq-math_12b_hq": "Gemma-3 12B",
"mix-fw_edu_hq-math_4b_hq": "Gemma-3 4B",
"mix-fw_edu_hq-math_1b_hq": "Gemma-3 1B",
"mix-fw_edu_hq-math_270m_hq": "Gemma-3 270M",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Gemma-3: REWIRE": {
datasets: {
"mix-fw_edu_hq-guided_rewrite_original_27b_hq": "Gemma-3 27B",
"mix-fw_edu_hq-guided_rewrite_original_12b_hq": "Gemma-3 12B",
"mix-fw_edu_hq-guided_rewrite_original_4b_hq": "Gemma-3 4B",
"mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Gemma-3 1B",
"mix-fw_edu_hq-guided_rewrite_original_270m_hq": "Gemma-3 270M",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Gemma-3: Tutorial": {
datasets: {
"mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
"mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
"mix-fw_edu_hq-tutorial_4b_hq": "Gemma-3 4B",
"mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3 1B",
"mix-fw_edu_hq-tutorial_270m_hq": "Gemma-3 270M",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"SmolLM2: Tutorial": {
datasets: {
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
"mix-fw_edu_hq-tutorial_smollm2_360m_hq": "SmolLM2 360M",
"mix-fw_edu_hq-tutorial_smollm2_135m_hq": "SmolLM2 135M",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
}
}
}}
/>
For [math](#math) and [tutorial](#tutorial), the 270M model underperforms, but 1B through 27B show no significant difference.
SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), bigger models don't buy you better synthetic data. This aligns with findings from @demystifyingsynth, who showed that scaling generators from 8B to 70B parameters did not yield superior pretraining data. This is great news for cost: you can use cheap, fast models for most rephrasing tasks.
That raises an interesting follow-up. REWIRE claims that you specifically need large models to salvage low-quality data. Does that hold up?
#### Do we need better models for rephrasing low-quality data?
REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data requires large models. We put this to the test by comparing Gemma-3-1B vs Gemma-3-12B on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts:
<HtmlEmbed
id="size-quality"
src="d3-benchmark-comparison.html"
desc="1B vs 12B model on HQ vs LQ data. Use the Setup dropdown to compare across prompts."
config={{
setups: {
"Continue Prompt": {
datasets: {
"mix-fw_edu_hq-continue_12b_hq": "12B, HQ Source",
"mix-fw_edu_hq-continue_1b_hq": "1B, HQ Source",
"mix-fw_edu_hq-continue_1b_lq": "1B, LQ Source",
"mix-fw_edu_hq-continue_12b_lq": "12B, LQ Source",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Summarize Prompt": {
datasets: {
"mix-fw_edu_hq-summarize_1b_hq": "1B, HQ Source",
"mix-fw_edu_hq-summarize_12b_hq": "12B, HQ Source",
"mix-fw_edu_hq-summarize_1b_lq": "1B, LQ Source",
"mix-fw_edu_hq-summarize_12b_lq": "12B, LQ Source",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"FAQ Prompt": {
datasets: {
"mix-fw_edu_hq-faq_1b_hq": "1B, HQ Source",
"mix-fw_edu_hq-faq_1b_lq": "1B, LQ Source",
"mix-fw_edu_hq-faq_12b_hq": "12B, HQ Source",
"mix-fw_edu_hq-faq_12b_lq": "12B, LQ Source",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Tutorial Prompt": {
datasets: {
"mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
"mix-fw_edu_hq-tutorial_12b_hq": "12B, HQ Source",
"mix-fw_edu_hq-tutorial_12b_lq": "12B, LQ Source",
"mix-fw_edu_hq-tutorial_1b_lq": "1B, LQ Source",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
}
}
}}
/>
The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins. We see no consistent advantage of using larger models for low-quality data.
So model size doesn't matter much. But what if you're using the wrong model family entirely?
#### Does the model family matter?
We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts:
<Sidenote>
We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
</Sidenote>
<HtmlEmbed
id="model-family"
src="d3-benchmark-comparison.html"
desc="Model families compared at ~1B scale. Use the Setup dropdown to compare across prompts."
config={{
setups: {
"Article Prompt": {
datasets: {
"mix-fw_edu_hq-article_smollm2_1.7b_hq": "SmolLM2",
"mix-fw_edu_hq-article_falcon3_1b_hq": "Falcon3",
"mix-fw_edu_hq-article_granite3_1b_hq": "Granite3",
"mix-fw_edu_hq-article_1b_hq": "Gemma-3",
"mix-fw_edu_hq-article_llama3.2_1b_hq": "Llama-3.2",
"mix-fw_edu_hq-article_qwen3_1.7b_hq": "Qwen3",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Discussion Prompt": {
datasets: {
"mix-fw_edu_hq-discussion_smollm2_1.7b_hq": "SmolLM2",
"mix-fw_edu_hq-discussion_falcon3_1b_hq": "Falcon3",
"mix-fw_edu_hq-discussion_granite3_1b_hq": "Granite3",
"mix-fw_edu_hq-discussion_1b_hq": "Gemma-3",
"mix-fw_edu_hq-discussion_llama3.2_1b_hq": "Llama-3.2",
"mix-fw_edu_hq-discussion_qwen3_1.7b_hq": "Qwen3",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Explanation Prompt": {
datasets: {
"mix-fw_edu_hq-explanation_smollm2_1.7b_hq": "SmolLM2",
"mix-fw_edu_hq-explanation_falcon3_1b_hq": "Falcon3",
"mix-fw_edu_hq-explanation_granite3_1b_hq": "Granite3",
"mix-fw_edu_hq-explanation_1b_hq": "Gemma-3",
"mix-fw_edu_hq-explanation_llama3.2_1b_hq": "Llama-3.2",
"mix-fw_edu_hq-explanation_qwen3_1.7b_hq": "Qwen3",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"FAQ Prompt": {
datasets: {
"mix-fw_edu_hq-faq_smollm2_1.7b_hq": "SmolLM2",
"mix-fw_edu_hq-faq_falcon3_1b_hq": "Falcon3",
"mix-fw_edu_hq-faq_granite3_1b_hq": "Granite3",
"mix-fw_edu_hq-faq_1b_hq": "Gemma-3",
"mix-fw_edu_hq-faq_llama3.2_1b_hq": "Llama-3.2",
"mix-fw_edu_hq-faq_qwen3_1.7b_hq": "Qwen3",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Math Prompt": {
datasets: {
"mix-fw_edu_hq-math_smollm2_1.7b_hq": "SmolLM2",
"mix-fw_edu_hq-math_falcon3_1b_hq": "Falcon3",
"mix-fw_edu_hq-math_granite3_1b_hq": "Granite3",
"mix-fw_edu_hq-math_1b_hq": "Gemma-3",
"mix-fw_edu_hq-math_llama3.2_1b_hq": "Llama-3.2",
"mix-fw_edu_hq-math_qwen3_1.7b_hq": "Qwen3",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Narrative Prompt": {
datasets: {
"mix-fw_edu_hq-narrative_smollm2_1.7b_hq": "SmolLM2",
"mix-fw_edu_hq-narrative_falcon3_1b_hq": "Falcon3",
"mix-fw_edu_hq-narrative_granite3_1b_hq": "Granite3",
"mix-fw_edu_hq-narrative_1b_hq": "Gemma-3",
"mix-fw_edu_hq-narrative_llama3.2_1b_hq": "Llama-3.2",
"mix-fw_edu_hq-narrative_qwen3_1.7b_hq": "Qwen3",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Table Prompt": {
datasets: {
"mix-fw_edu_hq-table_smollm2_1.7b_hq": "SmolLM2",
"mix-fw_edu_hq-table_falcon3_1b_hq": "Falcon3",
"mix-fw_edu_hq-table_granite3_1b_hq": "Granite3",
"mix-fw_edu_hq-table_1b_hq": "Gemma-3",
"mix-fw_edu_hq-table_llama3.2_1b_hq": "Llama-3.2",
"mix-fw_edu_hq-table_qwen3_1.7b_hq": "Qwen3",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Tutorial Prompt": {
datasets: {
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
"mix-fw_edu_hq-tutorial_falcon3_1b_hq": "Falcon3",
"mix-fw_edu_hq-tutorial_granite3_1b_hq": "Granite3",
"mix-fw_edu_hq-tutorial_1b_hq": "Gemma-3",
"mix-fw_edu_hq-tutorial_llama3.2_1b_hq": "Llama-3.2",
"mix-fw_edu_hq-tutorial_qwen3_1.7b_hq": "Qwen3",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
}
}
}}
/>
The result is striking: SmolLM2 consistently and clearly outperforms all others across every single prompt.
But where does that advantage actually come from? Switch to SQuAD: SmolLM2 leads by roughly +10pp over the average of the other model families, consistently across all prompts. It also pulls ahead on TriviaQA (+1 to +5pp). On HellaSwag, PIQA, and GSM8K, the differences between model families are tiny (1-2pp). SmolLM2's aggregate dominance is largely a QA story.
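The percentage-point gaps quoted throughout are plain score differences against a baseline run. A minimal sketch of that comparison (benchmark names and scores below are made-up placeholders):

```python
def benchmark_deltas(run: dict[str, float], baseline: dict[str, float]) -> dict[str, float]:
    """Per-benchmark difference vs a baseline, in percentage points (pp)."""
    return {b: round((run[b] - baseline[b]) * 100, 1) for b in baseline if b in run}
```

Applied to a run's evaluation scores against DCLM's, this yields the kind of +10pp SQuAD or -1pp HellaSwag figures discussed above.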
SmolLM2 is already over a year old at this point. If model quality matters, should we just wait for the next generation?
<Sidenote>
[SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) was released during our experiments but is not compatible with the vLLM version we used for inference, and dependency conflicts prevented us from upgrading vLLM.
</Sidenote>
#### Does the model generation matter?
We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt:
<HtmlEmbed
id="model-generation"
src="d3-benchmark-comparison.html"
desc="Qwen model generations (1.5 to 3) on the tutorial prompt."
config={{
datasets: {
"mix-fw_edu_hq-tutorial_qwen3_1.7b_hq": "Qwen3 (1.7B)",
"mix-fw_edu_hq-tutorial_qwen2.5_1.5b_hq": "Qwen2.5 (1.5B)",
"mix-fw_edu_hq-tutorial_qwen2_1.5b_hq": "Qwen2 (1.5B)",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true },
"mix-fw_edu_hq-tutorial_qwen1.5_1.8b_hq": "Qwen1.5 (1.8B)"
}
}}
/>
The differences are small, but there is a consistent upward trend: each newer version yields slightly higher evaluation performance, and the cumulative gain from version 1.5 to 3 is clear.
Putting together our findings on model size, family, and generation:
<Note title="Summary: Impact of the Rephrasing Model" variant="info">
**Model size**: 1B is sufficient for simple prompts, ~4B for complex ones. Larger models do not help.<br/>
**Model family**: SmolLM2 dominates across all prompts.<br/>
**Model generation**: Newer is slightly better.<br/>
**Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
</Note>
We've thoroughly explored the model dimension. The next obvious question: how much do the dataset choices matter?
### Impact of the Dataset Choices
So far we've always mixed synthetic data with a <Glossary term="source dataset" definition="The original dataset that gets rephrased by the language model to produce synthetic data." /> and a <Glossary term="mix-in dataset" definition="The non-synthetic dataset mixed with the rephrased data during training. This can be the same as or different from the source dataset." />. But do we even need the original data? And if so, which dataset should we mix in?
#### Is synthetic data enough?
The dream scenario would be generating all your training data synthetically, no curation needed. We test this by comparing synthetic-only training vs mixed training (synthetic + source) across all our prompts on DCLM and FineWeb-Edu-HQ sources:
<HtmlEmbed
id="synthetic-only"
src="d3-benchmark-comparison.html"
desc="Synthetic-only vs mixed training. Use the Setup dropdown to compare across source datasets."
config={{
hideAverage: true,
setups: {
"DCLM Source": {
datasets: {
"mix-dclm-article_1b_dclm": "Mix: Article + DCLM",
"mix-dclm-commentary_1b_dclm": "Mix: Commentary + DCLM",
"mix-dclm-discussion_1b_dclm": "Mix: Discussion + DCLM",
"mix-dclm-faq_1b_dclm": "Mix: FAQ + DCLM",
"mix-dclm-math_1b_dclm": "Mix: Math + DCLM",
"mix-dclm-table_1b_dclm": "Mix: Table + DCLM",
"mix-dclm-tutorial_1b_dclm": "Mix: Tutorial + DCLM",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true },
article_1b_dclm: "Article Only",
commentary_1b_dclm: "Commentary Only",
discussion_1b_dclm: "Discussion Only",
faq_1b_dclm: "FAQ Only",
math_1b_dclm: "Math Only",
table_1b_dclm: "Table Only",
tutorial_1b_dclm: "Tutorial Only"
}
},
"FineWeb-Edu-HQ Source": {
datasets: {
"mix-fw_edu_hq-article_1b_hq": "Mix: Article + FineWeb-Edu-HQ",
"mix-fw_edu_hq-commentary_1b_hq": "Mix: Commentary + FineWeb-Edu-HQ",
"mix-fw_edu_hq-discussion_1b_hq": "Mix: Discussion + FineWeb-Edu-HQ",
"mix-fw_edu_hq-faq_1b_hq": "Mix: FAQ + FineWeb-Edu-HQ",
"mix-fw_edu_hq-math_1b_hq": "Mix: Math + FineWeb-Edu-HQ",
"mix-fw_edu_hq-table_1b_hq": "Mix: Table + FineWeb-Edu-HQ",
"mix-fw_edu_hq-tutorial_1b_hq": "Mix: Tutorial + FineWeb-Edu-HQ",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true },
article_1b_hq: "Article Only",
commentary_1b_hq: "Commentary Only",
discussion_1b_hq: "Discussion Only",
faq_1b_hq: "FAQ Only",
math_1b_hq: "Math Only",
table_1b_hq: "Table Only",
tutorial_1b_hq: "Tutorial Only"
}
}
}
}}
/>
Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type. This echoes @demystifyingsynth, who found that pure synthetic data never outperforms natural web text alone, but mixing roughly 30% rephrased synthetic data with natural text can accelerate convergence by 5-10x.
The per-benchmark view sharpens the picture. The benchmarks that benefit most from mixing are HellaSwag (+0.5 to +1.3pp) and, for most prompts, SQuAD (+4 to +12pp for Tutorial and FAQ). GSM8K doesn't move at all. The "always mix with original data" takeaway is driven primarily by commonsense recovery, not a uniform lift across all skills.
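As a rough reference for the mixing recipe, a token-budget split in the spirit of the ~30% synthetic ratio from @demystifyingsynth can be sketched as follows (the ratio actually used per run in our experiments is a separate, tunable choice):

```python
def plan_mix(total_tokens: int, synthetic_fraction: float = 0.3) -> dict[str, int]:
    """Split a pretraining token budget between synthetic and original text.

    synthetic_fraction=0.3 follows the ~30% ratio reported by prior work;
    treat it as a starting point, not a universal constant.
    """
    synthetic = round(total_tokens * synthetic_fraction)
    return {"synthetic": synthetic, "original": total_tokens - synthetic}
```

For a 20B-token run at that ratio, this allocates 6B synthetic and 14B original tokens.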
OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?
#### Does the mix-in dataset matter?
We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data:
<HtmlEmbed
id="mixin-dataset"
src="d3-benchmark-comparison.html"
desc="Effect of different mix-in datasets. Use the Setup dropdown to compare HQ vs LQ source data."
config={{
setups: {
"HQ Source": {
datasets: {
"mix-dclm-tutorial_1b_hq": { display: "Mix-in: DCLM", color: "#4e79a7" },
"mix-fw_edu_hq-tutorial_1b_hq": { display: "Mix-in: FineWeb-Edu-HQ", color: "#59a14f" },
dclm: { display: "DCLM", color: "#4e79a7", shaded: true },
"mix-fw_edu_lq-tutorial_1b_hq": { display: "Mix-in: FineWeb-Edu-LQ", color: "#e15759" },
"mix-cosmopedia-tutorial_1b_hq": { display: "Mix-in: Cosmopedia", color: "#f28e2b" },
cosmopedia: { display: "Cosmopedia", color: "#f28e2b", shaded: true },
fw_edu_hq: { display: "FineWeb-Edu-HQ", color: "#59a14f", shaded: true },
fw_edu_lq: { display: "FineWeb-Edu-LQ", color: "#e15759", shaded: true }
}
},
"LQ Source": {
datasets: {
dclm: { display: "DCLM", color: "#4e79a7", shaded: true },
"mix-fw_edu_hq-tutorial_1b_lq": { display: "Mix-in: FineWeb-Edu-HQ", color: "#59a14f" },
"mix-dclm-tutorial_1b_lq": { display: "Mix-in: DCLM", color: "#4e79a7" },
"mix-cosmopedia-tutorial_1b_lq": { display: "Mix-in: Cosmopedia", color: "#f28e2b" },
cosmopedia: { display: "Cosmopedia", color: "#f28e2b", shaded: true },
"mix-fw_edu_lq-tutorial_1b_lq": { display: "Mix-in: FineWeb-Edu-LQ", color: "#e15759" },
fw_edu_hq: { display: "FineWeb-Edu-HQ", color: "#59a14f", shaded: true },
fw_edu_lq: { display: "FineWeb-Edu-LQ", color: "#e15759", shaded: true }
}
}
}
}}
/>
DCLM outperforms other mix-in datasets across the board. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones. This was one of our bigger surprises: the mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
The per-benchmark view reveals that DCLM and FineWeb-Edu-HQ as mix-ins have complementary strengths, and the balance between them shifts depending on the source data quality. With HQ source, switch to HellaSwag and PIQA: DCLM as mix-in recovers most of the commonsense signal that rephrasing destroys, while FineWeb-Edu-HQ does not. Switch to SQuAD and DROP: FineWeb-Edu-HQ pulls ahead on reading comprehension. Their macro scores are virtually identical (0.143 vs 0.143), but DCLM edges ahead on micro because its commonsense gains are spread across more benchmarks.
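Macro and micro here follow the usual definitions: macro averages benchmarks with equal weight, micro weights each benchmark by its number of examples, so large benchmarks count more. A sketch under that standard assumption (our exact aggregation weights may differ):

```python
def macro_score(scores: dict[str, float]) -> float:
    """Unweighted mean over benchmarks: every benchmark counts equally."""
    return sum(scores.values()) / len(scores)

def micro_score(scores: dict[str, float], n_examples: dict[str, int]) -> float:
    """Example-weighted mean: benchmarks with more questions count more."""
    total = sum(n_examples[b] for b in scores)
    return sum(scores[b] * n_examples[b] for b in scores) / total
```

This is why two runs can tie on macro while one edges ahead on micro: gains spread across many (or larger) benchmarks move the micro average more.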
DCLM's commonsense recovery is remarkably stable: across all 15 runs with DCLM as mix-in, HellaSwag scores land in a tight range of 0.086-0.092, while the 124 FW-Edu-HQ mix-in runs spread much wider (0.069-0.098). DCLM essentially clamps commonsense performance to a narrow band regardless of what you do with the synthetic portion.
Now switch to the LQ Source setup. Here FineWeb-Edu-HQ actually overtakes DCLM on both macro and micro. The reason is visible on ARC: FineWeb-Edu-HQ as mix-in scores +6pp over DCLM as mix-in, a gap far larger than with HQ source (+1pp). When the source data is low-quality, the rephrased output carries less knowledge on its own, so the mix-in's knowledge content matters more, and FineWeb-Edu-HQ's educational focus pays off. Meanwhile the HellaSwag gap narrows (-0.8pp vs -1.2pp with HQ source). The practical takeaway: DCLM is the better mix-in for high-quality sources, but FineWeb-Edu-HQ can be the better choice when rephrasing low-quality data.
If the mix-in dataset matters so much, what about the source dataset we're actually rephrasing?
#### Does the source dataset matter?
We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [faq](#faq) and [tutorial](#tutorial) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). First, here's what happens when mix-in varies with source:
<HtmlEmbed
id="source-dataset-mixin-source"
src="d3-benchmark-comparison.html"
desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
config={{
setups: {
"FAQ Prompt": {
datasets: {
"mix-dclm-faq_1b_dclm": "Source: DCLM",
"mix-fw_edu_hq-faq_1b_hq": "Source: FineWeb-Edu-HQ",
"mix-fw_edu_lq-faq_1b_lq": "Source: FineWeb-Edu-LQ",
"mix-cosmopedia-faq_1b_cosmopedia": "Source: Cosmopedia",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Tutorial Prompt": {
datasets: {
"mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
"mix-dclm-tutorial_1b_dclm": "Source: DCLM",
"mix-cosmopedia-tutorial_1b_cosmopedia": "Source: Cosmopedia",
"mix-fw_edu_lq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
}
}
}}
/>
Source quality appears to matter here: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia. But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes:
<HtmlEmbed
id="source-dataset-fixed-mixin"
src="d3-benchmark-comparison.html"
desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
config={{
setups: {
"FAQ Prompt": {
datasets: {
"mix-fw_edu_hq-faq_1b_dclm": "Source: DCLM",
"mix-fw_edu_hq-faq_1b_hq": "Source: FineWeb-Edu-HQ",
"mix-fw_edu_hq-faq_1b_lq": "Source: FineWeb-Edu-LQ",
"mix-fw_edu_hq-faq_1b_cosmopedia": "Source: Cosmopedia",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Tutorial Prompt": {
datasets: {
"mix-fw_edu_hq-tutorial_1b_dclm": "Source: DCLM",
"mix-fw_edu_hq-tutorial_1b_hq": "Source: FineWeb-Edu-HQ",
"mix-fw_edu_hq-tutorial_1b_cosmopedia": "Source: Cosmopedia",
"mix-fw_edu_hq-tutorial_1b_lq": "Source: FineWeb-Edu-LQ",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
}
}
}}
/>
This is exciting: it means you can rephrase even low-quality data and still get competitive results, as long as you pair it with a strong mix-in dataset. That opens up a much larger pool of source data to draw from. But can we squeeze out even more performance by increasing diversity in the synthetic portion?
#### Does increased diversity help?
We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies:
<Sidenote>
Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
</Sidenote>
<HtmlEmbed
id="diversity"
src="d3-benchmark-comparison.html"
desc="Different diversity strategies. Use the Setup dropdown to compare approaches."
config={{
setups: {
"Mixing Prompts": {
datasets: {
"mix-fw_edu_hq-tutorial_1b_hq-fw_edu_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts + FineWeb-Edu-HQ",
"mix-fw_edu_hq-math_1b_hq": "Math + FineWeb-Edu-HQ",
"mix-tutorial_1b_hq-faq_1b_hq-table_1b_hq-math_1b_hq": "All Prompts (No Source)",
"mix-fw_edu_hq-table_1b_hq": "Table + FineWeb-Edu-HQ",
"mix-fw_edu_hq-faq_1b_hq": "FAQ + FineWeb-Edu-HQ",
"mix-fw_edu_hq-tutorial_1b_hq": "Tutorial + FineWeb-Edu-HQ",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Mixing Models": {
datasets: {
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2",
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq-tutorial_falcon3_1b_hq": "SmolLM2 + Falcon3",
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq-tutorial_llama3.2_1b_hq": "SmolLM2 + Llama-3.2",
"mix-fw_edu_hq-tutorial_llama3.2_1b_hq-tutorial_granite3_1b_hq": "Llama-3.2 + Granite3",
"mix-fw_edu_hq-tutorial_llama3.2_1b_hq": "Llama-3.2",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
},
"Mixing Both": {
datasets: {
"mix-fw_edu_hq-faq_smollm2_1.7b_hq": "FAQ (SmolLM2)",
"mix-fw_edu_hq-faq_smollm2_1.7b_hq-tutorial_falcon3_1b_hq": "FAQ (SmolLM2) + Tutorial (Falcon3)",
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "Tutorial (SmolLM2)",
"mix-fw_edu_hq-tutorial_smollm2_1.7b_hq-tutorial_falcon3_1b_hq": "Tutorial (SmolLM2) + Tutorial (Falcon3)",
"mix-fw_edu_hq-tutorial_falcon3_1b_hq": "Tutorial (Falcon3)",
"mix-fw_edu_hq-faq_falcon3_1b_hq": "FAQ (Falcon3)",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
}
}
}
}}
/>
None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds. This was a bit disappointing. @syntheticcpt found that simple paraphrasing quickly saturates in their continued pretraining setting, while their entity-graph-based EntiGraph approach scales log-linearly by externalizing diversity to a combinatorial structure over entities. Our prompts may already capture enough structural diversity that additional mixing has diminishing returns at 20B tokens, but diversity benefits may emerge at larger scales where the model can better exploit the varied signal.
Putting together our findings on synthetic-only training, mix-in choice, source quality, and diversity:
<Note title="Summary: Impact of the Dataset Choices" variant="info">
**Synthetic-only**: Not enough. Always mix with original data.<br/>
**Mix-in dataset**: Major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge). Best choice depends on source quality.<br/>
**Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
**Diversity**: Does not compound at the 20B-token scale. Performance averages rather than compounds.<br/>
**Practical takeaway**: Invest in a high-quality mix-in dataset. DCLM for high-quality sources, FineWeb-Edu-HQ for low-quality ones.
</Note>
We've covered prompts, models, and datasets. One last fun question: how sensitive is all of this to tiny details in the prompt itself?
### Do Typos in the Prompt Hurt?
While implementing the REWIRE prompt, we noticed it contained several typos and grammatical errors. So we cleaned it up and ran both versions:
<HtmlEmbed
id="typos-effect"
src="d3-benchmark-comparison.html"
desc="REWIRE prompt with original typos vs improved version at 1B and 12B scale."
config={{
datasets: {
"mix-fw_edu_hq-guided_rewrite_original_12b_hq": "Original (12B)",
"mix-fw_edu_hq-guided_rewrite_improved_12b_hq": "Improved (12B)",
dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true },
"mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Original (1B)",
"mix-fw_edu_hq-guided_rewrite_improved_1b_hq": "Improved (1B)"
}
}}
/>
Typos don't hurt at all. For the 1B model, the typo-laden [original](#guided_rewrite_original) actually performs slightly better than the [improved version](#guided_rewrite_improved). So much for prompt polish.
With that final detail in hand, let's take stock of everything we've found.
### Takeaways
Let's step back and summarize what we learned:
<table className="wrap-text" style={{width: '100%', tableLayout: 'fixed', borderCollapse: 'collapse', marginBottom: '1.5rem'}}>
<colgroup>
<col style={{width: '40%'}} />
<col style={{width: '60%'}} />
</colgroup>
<thead>
<tr><th style={{textAlign: 'left'}}>Question</th><th style={{textAlign: 'left'}}>Answer</th></tr>
</thead>
<tbody>
<tr><td>How do existing datasets compare?</td><td>DCLM, Nemotron-HQ-Synth, and REWIRE lead. Most synthetic baselines fall behind.</td></tr>
<tr><td>Which individual prompts from the synthetic baselines match DCLM?</td><td>Only Diverse QA Pairs and REWIRE's Guided Rewrite.</td></tr>
<tr><td>Can new prompts beat DCLM?</td><td>Yes. FAQ, Math, Table, and Tutorial all outperform DCLM. Article, Commentary, Discussion, Explanation, and Narrative do not.</td></tr>
<tr><td>Does model size matter?</td><td>Not much. 1B is sufficient for simple prompts, 4B for complex ones.</td></tr>
<tr><td>Do we need better models for low-quality data?</td><td>No consistent advantage from larger models on low-quality sources.</td></tr>
<tr><td>Does the model family matter?</td><td>Yes. SmolLM2 dominates across all prompts.</td></tr>
<tr><td>Does the model generation matter?</td><td>Slightly. Newer Qwen versions trend better.</td></tr>
<tr><td>Is synthetic data enough?</td><td>No. Always mix synthetic with original data.</td></tr>
<tr><td>Does the mix-in dataset matter?</td><td>Yes, a major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge), and the best choice depends on source data quality.</td></tr>
<tr><td>Does the source dataset matter?</td><td>Not with a strong mix-in. Even low-quality sources produce competitive results.</td></tr>
<tr><td>Does increased diversity help?</td><td>No, performance averages rather than compounds.</td></tr>
<tr><td>Do typos in the prompt hurt?</td><td>No. Typos have no negative effect on downstream performance.</td></tr>
</tbody>
</table>
So what actually matters? Prompt design, above all else. Structured formats like FAQ, Math, Table, and Tutorial consistently beat curated baselines. Everything else is surprisingly forgiving: a 1B model handles simple prompts just fine, 4B covers the complex ones, and going bigger buys you nothing. Source data quality barely matters either, as long as you mix in strong original data. That last point is worth emphasizing: low-quality sources with a good mix-in match high-quality sources, which means you can draw from a much larger and more diverse data pool. The recipe we landed on is simple: pick a structured prompt, use the smallest model that handles it, blend with high-quality original data, and pour the saved compute into volume.
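As a rough sketch of that recipe's final step, blending could look like the following (an illustration under our own assumptions, not code from these experiments; the 50/50 split and the `build_mixture` helper are hypothetical, and the best fraction would need tuning):

```python
import random

def build_mixture(synthetic_docs, original_docs, synth_fraction=0.5, seed=0):
    """Blend rephrased synthetic documents with high-quality original
    documents so that synthetic data makes up `synth_fraction` of the
    mixture (fraction must be strictly between 0 and 1), limited by
    whichever pool runs out first."""
    rng = random.Random(seed)
    n_synth = min(
        len(synthetic_docs),
        int(len(original_docs) * synth_fraction / (1 - synth_fraction)),
    )
    n_orig = int(n_synth * (1 - synth_fraction) / synth_fraction)
    mixture = rng.sample(synthetic_docs, n_synth) + rng.sample(original_docs, n_orig)
    rng.shuffle(mixture)
    return mixture

# 10 synthetic + 10 original docs at a 50/50 split -> 20-doc mixture.
mix = build_mixture(list(range(10)), list(range(100, 110)))
print(len(mix))  # 20
```

In practice the budget would be counted in tokens rather than documents, but the shape is the same: fix the ratio, shuffle, and spend the compute saved on small rephrasing models on more data.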
Now let's look more closely at *why* these things work the way they do.