Commit · 09c855a
Parent(s): 7adb03a

remove titles from charts, add summary note, remove highlighting and rephrase experiment paragraphs
app/src/components/HtmlEmbed.astro
CHANGED
@@ -1,7 +1,6 @@
 ---
 interface Props {
   src: string;
-  title?: string;
   desc?: string;
   caption?: string;
   frameless?: boolean;
@@ -13,7 +12,6 @@ interface Props {
 }
 const {
   src,
-  title,
   desc,
   caption,
   frameless = false,
@@ -69,11 +67,6 @@ const htmlWithId =
 {
   html ? (
     <figure class={`html-embed${wide ? " html-embed--wide" : ""}`} id={id}>
-      {title && (
-        <figcaption class="html-embed__title" style={`text-align:${align}`}>
-          {title}
-        </figcaption>
-      )}
       <div class={`html-embed__card${frameless ? " is-frameless" : ""}`}>
         <div
           id={mountId}
@@ -272,20 +265,6 @@ const htmlWithId =
   }
 }
 
-.html-embed__title {
-  text-align: left;
-  font-weight: 600;
-  font-size: 0.95rem;
-  color: var(--text-color);
-  margin: 0;
-  padding: 0;
-  padding-bottom: var(--spacing-1);
-  position: relative;
-  display: block;
-  width: 100%;
-  background: var(--page-bg);
-  z-index: var(--z-elevated);
-}
 .html-embed__card {
   background-color: var(--surface-bg);
   border: 1px solid var(--border-color);
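With the `title` prop removed, an `HtmlEmbed` call site now carries only `desc`, `caption`, and `config`. A minimal usage sketch of the updated component (prop values are illustrative, taken from the experiments chapter in this same commit):

```mdx
<HtmlEmbed
  id="baselines-comparison"
  src="d3-benchmark-comparison.html"
  desc="Comparison of baseline datasets across different evaluation metrics."
  config={{ baselines: [] }}
/>
```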
app/src/content/chapters/experiments.mdx
CHANGED
|
@@ -6,6 +6,7 @@ import FigRef from "../../components/FigRef.astro";
|
|
| 6 |
|
| 7 |
{/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
|
| 8 |
{/* TODO: shorten the vllm inference benchmark or put stuff into the appendix */}
|
|
|
|
| 9 |
{/* TODO: add a plot for the table with the benchmark results */}
|
| 10 |
{/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
|
| 11 |
{/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
|
|
@@ -18,18 +19,13 @@ import FigRef from "../../components/FigRef.astro";
|
|
| 18 |
|
| 19 |
With the infrastructure and setup in place, we now systematically work through our research questions. We start by benchmarking existing datasets and dissecting what makes their prompts tick. Then we test our own prompt designs, explore how the rephrasing model (size, family, generation) affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
|
| 20 |
|
| 21 |
-
###
|
| 22 |
|
| 23 |
-
We
|
| 24 |
-
|
| 25 |
-
We train on eight datasets under identical conditions and compare their final evaluation performance.
|
| 26 |
-
|
| 27 |
-
DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. <mark>TLDR: DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.</mark>
|
| 28 |
|
| 29 |
<HtmlEmbed
|
| 30 |
id="baselines-comparison"
|
| 31 |
src="d3-benchmark-comparison.html"
|
| 32 |
-
title="Baseline Comparison"
|
| 33 |
desc="Comparison of baseline datasets across different evaluation metrics. Use the dropdown to switch metrics."
|
| 34 |
config={{
|
| 35 |
baselines: [],
|
|
@@ -48,24 +44,17 @@ DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef ta
|
|
| 48 |
|
| 49 |
The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
|
| 50 |
|
| 51 |
-
####
|
| 52 |
-
|
| 53 |
-
Prior synthetic datasets bundle multiple prompts together. We want to understand what makes them tick.
|
| 54 |
-
|
| 55 |
-
<mark>Which individual prompts from existing synthetic methods actually match DCLM?</mark>
|
| 56 |
|
| 57 |
-
We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source.
|
| 58 |
|
| 59 |
<Sidenote>
|
| 60 |
The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
|
| 61 |
</Sidenote>
|
| 62 |
|
| 63 |
-
Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. <mark>TLDR: Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.</mark>
|
| 64 |
-
|
| 65 |
<HtmlEmbed
|
| 66 |
id="dissecting-baselines"
|
| 67 |
src="d3-benchmark-comparison.html"
|
| 68 |
-
title="Dissecting Synthetic Baselines"
|
| 69 |
desc="Individual prompt performance from existing synthetic datasets compared to DCLM and FineWeb-Edu-HQ."
|
| 70 |
config={{
|
| 71 |
baselines: ["dclm", "nemotron_hq_synth", "rewire"],
|
|
@@ -99,18 +88,13 @@ Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performa
|
|
| 99 |
|
| 100 |
Can we design prompts that consistently beat DCLM?
|
| 101 |
|
| 102 |
-
###
|
| 103 |
-
|
| 104 |
-
Since most existing prompts fail to beat DCLM, we designed new prompt formats targeting different skills. <mark>Can any of them outperform the baseline?</mark>
|
| 105 |
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) fall short (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats. <mark>TLDR: Math, table, FAQ, and tutorial prompts beat the DCLM baseline, while article, commentary, and discussion are at or below DCLM level.</mark>
|
| 109 |
|
| 110 |
<HtmlEmbed
|
| 111 |
id="new-prompts"
|
| 112 |
src="d3-benchmark-comparison.html"
|
| 113 |
-
title="New Prompt Performance"
|
| 114 |
desc="Seven new prompts compared against DCLM and FineWeb-Edu-HQ."
|
| 115 |
config={{
|
| 116 |
datasetNames: {
|
|
@@ -135,11 +119,7 @@ We want to know whether using a stronger model leads to better synthetic data. W
|
|
| 135 |
|
| 136 |
#### Does the model size matter?
|
| 137 |
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts.
|
| 141 |
-
|
| 142 |
-
The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. <mark>TLDR: Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.</mark>
|
| 143 |
|
| 144 |
<Sidenote>
|
| 145 |
It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
|
|
@@ -148,7 +128,6 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
|
|
| 148 |
<HtmlEmbed
|
| 149 |
id="model-size"
|
| 150 |
src="d3-benchmark-comparison.html"
|
| 151 |
-
title="Model Size"
|
| 152 |
desc="Gemma-3 model sizes (270M to 27B). Use the Setup dropdown to compare across prompts."
|
| 153 |
config={{
|
| 154 |
setups: {
|
|
@@ -182,16 +161,11 @@ On high-quality source data, we see no evidence that larger models help. But REW
|
|
| 182 |
|
| 183 |
#### Do we need better models for rephrasing low-quality data?
|
| 184 |
|
| 185 |
-
The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case).
|
| 186 |
-
|
| 187 |
-
We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts.
|
| 188 |
-
|
| 189 |
-
The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>
|
| 190 |
|
| 191 |
<HtmlEmbed
|
| 192 |
id="size-quality"
|
| 193 |
src="d3-benchmark-comparison.html"
|
| 194 |
-
title="Model Size vs Data Quality"
|
| 195 |
desc="1B vs 12B model on HQ vs LQ data. Use the Setup dropdown to compare across prompts."
|
| 196 |
config={{
|
| 197 |
setups: {
|
|
@@ -243,20 +217,15 @@ Since model size barely matters, does the model family make a difference?
|
|
| 243 |
|
| 244 |
#### Does the model family matter?
|
| 245 |
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on four prompts. Use the Setup dropdown to compare across prompts.
|
| 249 |
-
|
| 250 |
-
SmolLM2 consistently and clearly outperforms all others across all four prompts (see <FigRef target="model-family" />). <mark>TLDR: Model family matters a lot. SmolLM2 dominates, likely due to [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its training data.</mark>
|
| 251 |
|
| 252 |
<Sidenote>
|
| 253 |
-
We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit rewrite tasks in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
|
| 254 |
</Sidenote>
|
| 255 |
|
| 256 |
<HtmlEmbed
|
| 257 |
id="model-family"
|
| 258 |
src="d3-benchmark-comparison.html"
|
| 259 |
-
title="Model Family"
|
| 260 |
desc="Model families compared at ~1B scale. Use the Setup dropdown to compare across prompts."
|
| 261 |
config={{
|
| 262 |
setups: {
|
|
@@ -316,16 +285,11 @@ SmolLM2 is already a year old. Are newer model generations better?
|
|
| 316 |
|
| 317 |
#### Does the model generation matter?
|
| 318 |
|
| 319 |
-
We
|
| 320 |
-
|
| 321 |
-
We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt.
|
| 322 |
-
|
| 323 |
-
While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see <FigRef target="model-generation" />). <mark>TLDR: Newer model generations tend to produce slightly better synthetic data.</mark>
|
| 324 |
|
| 325 |
<HtmlEmbed
|
| 326 |
id="model-generation"
|
| 327 |
src="d3-benchmark-comparison.html"
|
| 328 |
-
title="Model Generation: Qwen Tutorial"
|
| 329 |
desc="Qwen model generations (1.5 to 3) on the tutorial prompt."
|
| 330 |
config={{
|
| 331 |
datasetNames: {
|
|
@@ -340,9 +304,9 @@ While the differences are small, we find a consistent trend: newer versions lead
|
|
| 340 |
/>
|
| 341 |
|
| 342 |
<Note title="Summary: Impact of the Rephrasing Model" variant="info">
|
| 343 |
-
**Model size**: 1B is sufficient. Larger models do not help.
|
| 344 |
-
**Model family**: SmolLM2 dominates across all prompts.
|
| 345 |
-
**Model generation**: Newer is slightly better.
|
| 346 |
**Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
|
| 347 |
</Note>
|
| 348 |
|
|
@@ -354,16 +318,11 @@ So far we've always mixed synthetic data with a <Glossary term="source dataset"
|
|
| 354 |
|
| 355 |
#### Is synthetic data enough?
|
| 356 |
|
| 357 |
-
We
|
| 358 |
-
|
| 359 |
-
We compare synthetic-only training vs mixed training (synthetic + source) for [tutorial](#tutorial) and [faq](#faq) prompts on DCLM and FineWeb-Edu-HQ sources.
|
| 360 |
-
|
| 361 |
-
Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines. <mark>TLDR: Synthetic data alone is not enough. Mixing with original data consistently improves performance.</mark>
|
| 362 |
|
| 363 |
<HtmlEmbed
|
| 364 |
id="synthetic-only"
|
| 365 |
src="d3-benchmark-comparison.html"
|
| 366 |
-
title="Is Synthetic Data Enough?"
|
| 367 |
desc="Synthetic-only vs mixed training. Use the Setup dropdown to compare across source datasets."
|
| 368 |
config={{
|
| 369 |
setups: {
|
|
@@ -391,20 +350,15 @@ Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mi
|
|
| 391 |
}}
|
| 392 |
/>
|
| 393 |
|
| 394 |
-
So
|
| 395 |
|
| 396 |
#### Does the mix-in dataset matter?
|
| 397 |
|
| 398 |
-
We
|
| 399 |
-
|
| 400 |
-
We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data.
|
| 401 |
-
|
| 402 |
-
DCLM and FineWeb-Edu-HQ outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). <mark>TLDR: The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.</mark>
|
| 403 |
|
| 404 |
<HtmlEmbed
|
| 405 |
id="mixin-dataset"
|
| 406 |
src="d3-benchmark-comparison.html"
|
| 407 |
-
title="Mix-in Dataset Effect"
|
| 408 |
desc="Effect of different mix-in datasets. Use the Setup dropdown to compare HQ vs LQ source data."
|
| 409 |
config={{
|
| 410 |
setups: {
|
|
@@ -441,16 +395,11 @@ The mix-in dataset matters enormously. But what about the source dataset we feed
|
|
| 441 |
|
| 442 |
#### Does the source dataset matter?
|
| 443 |
|
| 444 |
-
We
|
| 445 |
-
|
| 446 |
-
We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [tutorial](#tutorial) and [faq](#faq) prompts. We test two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ).
|
| 447 |
-
|
| 448 |
-
When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). This corroborates our finding that the mix-in matters much more than the source. <mark>TLDR: Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.</mark>
|
| 449 |
|
| 450 |
<HtmlEmbed
|
| 451 |
id="source-dataset-mixin-source"
|
| 452 |
src="d3-benchmark-comparison.html"
|
| 453 |
-
title="Source Dataset (Mix-in = Source)"
|
| 454 |
desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
|
| 455 |
config={{
|
| 456 |
setups: {
|
|
@@ -481,7 +430,6 @@ When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ
|
|
| 481 |
<HtmlEmbed
|
| 482 |
id="source-dataset-fixed-mixin"
|
| 483 |
src="d3-benchmark-comparison.html"
|
| 484 |
-
title="Source Dataset (Fixed Mix-in: FineWeb-Edu-HQ)"
|
| 485 |
desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
|
| 486 |
config={{
|
| 487 |
setups: {
|
|
@@ -513,22 +461,15 @@ This is exciting because it shows the potential of upcycling low-quality data th
|
|
| 513 |
|
| 514 |
#### Does increased diversity help?
|
| 515 |
|
| 516 |
-
|
| 517 |
-
|
| 518 |
-
We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies.
|
| 519 |
-
|
| 520 |
-
No significant improvement from any diversity strategy. Performance averages rather than compounds (see <FigRef target="diversity" />). However, our ablations train on only 20B tokens, so it is possible that diversity benefits only emerge at larger scales where the model can better exploit the varied signal.
|
| 521 |
|
| 522 |
<Sidenote>
|
| 523 |
Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
|
| 524 |
</Sidenote>
|
| 525 |
|
| 526 |
-
<mark>TLDR: At our 20B token scale, diversity does not compound. Mixing datasets averages rather than improves performance, though larger-scale experiments may tell a different story.</mark>
|
| 527 |
-
|
| 528 |
<HtmlEmbed
|
| 529 |
id="diversity"
|
| 530 |
src="d3-benchmark-comparison.html"
|
| 531 |
-
title="Diversity"
|
| 532 |
desc="Different diversity strategies. Use the Setup dropdown to compare approaches."
|
| 533 |
config={{
|
| 534 |
setups: {
|
|
@@ -571,20 +512,23 @@ Interestingly, when mixing enough different prompts together, we don't seem to n
|
|
| 571 |
}}
|
| 572 |
/>
|
| 573 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 574 |
Let's turn to some unexpected findings from our experiments.
|
| 575 |
|
| 576 |
### Do Typos in the Prompt Hurt?
|
| 577 |
|
| 578 |
-
|
| 579 |
-
|
| 580 |
-
We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) against an [improved version](#guided_rewrite_improved), at both 1B and 12B scale.
|
| 581 |
-
|
| 582 |
-
Surprisingly, typos don't have a negative effect on downstream model performance. For the 1B model, the typo-laden original actually performs slightly better (see <FigRef target="typos-effect" />). <mark>TLDR: Typos in prompts do not hurt downstream performance.</mark>
|
| 583 |
|
| 584 |
<HtmlEmbed
|
| 585 |
id="typos-effect"
|
| 586 |
src="d3-benchmark-comparison.html"
|
| 587 |
-
title="Effect of Typos in Prompt"
|
| 588 |
desc="REWIRE prompt with original typos vs improved version at 1B and 12B scale."
|
| 589 |
config={{
|
| 590 |
datasetNames: {
|
|
@@ -612,9 +556,7 @@ TODO: Run this analysis and add a small report
|
|
| 612 |
|
| 613 |
### Math Rephrasing: When "Worse" Outputs Win
|
| 614 |
|
| 615 |
-
We compared two ~1.7B parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs looked objectively worse, yet models trained on them performed better.
|
| 616 |
-
|
| 617 |
-
We compare SmolLM2 (messy, variable outputs) vs Qwen3 (clean, structured outputs) for [math](#math) rephrasing.
|
| 618 |
|
| 619 |
**Qwen3 produced beautiful, structured outputs:**
|
| 620 |
|
|
@@ -663,4 +605,4 @@ SmolLM2's quality distribution was actually reasonable:
|
|
| 663 |
| Partial | 30+ tokens but missing structure | 25% |
|
| 664 |
| Poor | {'<'}30 tokens | 8% |
|
| 665 |
|
| 666 |
-
|
|
|
|
| 6 |
|
| 7 |
{/* TODO: think about what dataset to build and release as artifact: do more rephrasing with smollm2 */}
|
| 8 |
{/* TODO: shorten the vllm inference benchmark or put stuff into the appendix */}
|
| 9 |
+
{/* TODO: potentially make a widget for data exploration: look at the same few samples generated by different models or transformed with different prompts */}
|
| 10 |
{/* TODO: add a plot for the table with the benchmark results */}
|
| 11 |
{/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
|
| 12 |
{/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
|
|
|
|
| 19 |
|
| 20 |
With the infrastructure and setup in place, we now systematically work through our research questions. We start by benchmarking existing datasets and dissecting what makes their prompts tick. Then we test our own prompt designs, explore how the rephrasing model (size, family, generation) affects quality, and investigate the interplay between synthetic and original data. Along the way, we stumble into some surprising findings about typos and template collapse.
|
| 21 |
|
| 22 |
+
### How Do Existing Datasets Compare?
|
| 23 |
|
| 24 |
+
We train on eight datasets under identical conditions and compare their final evaluation performance. DCLM, Nemotron-HQ-Synth, and REWIRE lead by a significant margin (see <FigRef target="baselines-comparison" />). The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. DCLM is the strongest baseline and becomes our primary comparison target for all following experiments.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
|
| 26 |
<HtmlEmbed
|
| 27 |
id="baselines-comparison"
|
| 28 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 29 |
desc="Comparison of baseline datasets across different evaluation metrics. Use the dropdown to switch metrics."
|
| 30 |
config={{
|
| 31 |
baselines: [],
|
|
|
|
| 44 |
|
| 45 |
The synthetic baselines use different prompts internally. Which individual prompts actually carry the weight?
|
| 46 |
|
| 47 |
+
#### Which Individual Prompts Match DCLM?
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
|
| 49 |
+
We isolate each prompt from Nemotron-HQ-Synth ([diverse_qa_pairs](#diverse_qa_pairs), [extract_knowledge](#extract_knowledge), [distill](#distill), [wikipedia_style_rephrasing](#wikipedia_style_rephrasing), [knowledge_list](#knowledge_list)), the REWIRE [guided_rewrite](#guided_rewrite_original) prompt, and the two prompts from BeyondWeb [@beyondweb] ([continue](#continue), [summarize](#summarize)), all using Gemma-3-1B on FineWeb-Edu-HQ as source. Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM (see <FigRef target="dissecting-baselines" />). The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts do not reach DCLM level. Apart from two prompts, no existing synthetic method outperforms the DCLM baseline.
|
| 50 |
|
| 51 |
<Sidenote>
|
| 52 |
The BeyondWeb dataset was never released and the paper omits key details, yet claims strong performance. We tested their [continue](#continue) and [summarize](#summarize) prompts to verify those claims and make the knowledge publicly available.
|
| 53 |
</Sidenote>
|
| 54 |
|
|
|
|
|
|
|
| 55 |
<HtmlEmbed
|
| 56 |
id="dissecting-baselines"
|
| 57 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 58 |
desc="Individual prompt performance from existing synthetic datasets compared to DCLM and FineWeb-Edu-HQ."
|
| 59 |
config={{
|
| 60 |
baselines: ["dclm", "nemotron_hq_synth", "rewire"],
|
|
|
|
| 88 |
|
| 89 |
Can we design prompts that consistently beat DCLM?
|
| 90 |
|
| 91 |
+
### Can New Prompts Beat DCLM?
|
|
|
|
|
|
|
| 92 |
|
| 93 |
+
Since most existing prompts fail to beat DCLM, we designed seven novel prompt formats targeting different skills ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform both FineWeb-Edu-HQ and DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
|
|
|
|
|
|
|
| 94 |
|
| 95 |
<HtmlEmbed
|
| 96 |
id="new-prompts"
|
| 97 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 98 |
desc="Seven new prompts compared against DCLM and FineWeb-Edu-HQ."
|
| 99 |
config={{
|
| 100 |
datasetNames: {
|
|
|
|
| 119 |
|
| 120 |
#### Does the model size matter?
|
| 121 |
|
| 122 |
+
We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts. The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 123 |
|
| 124 |
<Sidenote>
|
| 125 |
It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
|
|
|
|
| 128 |
<HtmlEmbed
|
| 129 |
id="model-size"
|
| 130 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 131 |
desc="Gemma-3 model sizes (270M to 27B). Use the Setup dropdown to compare across prompts."
|
| 132 |
config={{
|
| 133 |
setups: {
|
|
|
|
| 161 |
|
| 162 |
#### Do we need better models for rephrasing low-quality data?
|
| 163 |
|
| 164 |
+
The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts. The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see <FigRef target="size-quality" />). We see no consistent advantage of using larger models for low-quality data.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 165 |
|
| 166 |
<HtmlEmbed
|
| 167 |
id="size-quality"
|
| 168 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 169 |
desc="1B vs 12B model on HQ vs LQ data. Use the Setup dropdown to compare across prompts."
|
| 170 |
config={{
|
| 171 |
setups: {
|
|
|
|
| 217 |
|
| 218 |
#### Does the model family matter?
|
| 219 |
|
| 220 |
+
We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on four prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all four prompts (see <FigRef target="model-family" />).
|
|
|
|
|
|
|
|
|
|
|
|
|
| 221 |
|
| 222 |
<Sidenote>
|
| 223 |
+
We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
|
| 224 |
</Sidenote>
|
| 225 |
|
| 226 |
<HtmlEmbed
|
| 227 |
id="model-family"
|
| 228 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 229 |
desc="Model families compared at ~1B scale. Use the Setup dropdown to compare across prompts."
|
| 230 |
config={{
|
| 231 |
setups: {
|
|
|
|
| 285 |
|
| 286 |
#### Does the model generation matter?
|
| 287 |
|
| 288 |
+
We compare Qwen models from versions 1.5 [@qwen], 2 [@qwen2], 2.5 [@qwen25], and 3 on the [tutorial](#tutorial) prompt. While the differences are small, we find a consistent trend: newer versions lead to higher evaluation performance (see <FigRef target="model-generation" />) especially cumulative from version 1.5 to 3.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 289 |
|
| 290 |
<HtmlEmbed
|
| 291 |
id="model-generation"
|
| 292 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 293 |
desc="Qwen model generations (1.5 to 3) on the tutorial prompt."
|
| 294 |
config={{
|
| 295 |
datasetNames: {
|
|
|
|
| 304 |
/>
|
| 305 |
|
| 306 |
<Note title="Summary: Impact of the Rephrasing Model" variant="info">
|
| 307 |
+
**Model size**: 1B is sufficient. Larger models do not help.<br/>
|
| 308 |
+
**Model family**: SmolLM2 dominates across all prompts.<br/>
|
| 309 |
+
**Model generation**: Newer is slightly better.<br/>
|
| 310 |
**Practical takeaway**: Use the newest, best-rephrasing 1B model you can find.
|
| 311 |
</Note>
|
| 312 |
|
|
|
|
| 318 |
|
| 319 |
#### Is synthetic data enough?
|
| 320 |
|
| 321 |
+
We compare synthetic-only training vs mixed training (synthetic + source) for [tutorial](#tutorial) and [faq](#faq) prompts on DCLM and FineWeb-Edu-HQ sources. Synthetic-only training beats FineWeb-Edu-HQ but falls short of both DCLM and mixed training (see <FigRef target="synthetic-only" />). Mixed training consistently improves over both the synthetic-only and original-data-only baselines.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 322 |
|
| 323 |
<HtmlEmbed
|
| 324 |
id="synthetic-only"
|
| 325 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 326 |
desc="Synthetic-only vs mixed training. Use the Setup dropdown to compare across source datasets."
|
| 327 |
config={{
|
| 328 |
setups: {
|
|
|
|
| 350 |
}}
|
| 351 |
/>
|
| 352 |
|
| 353 |
+
So synthetic data alone does not seem to be enough. But how much does the specific choice of mix-in dataset affect performance?
|
| 354 |
|
| 355 |
#### Does the mix-in dataset matter?
|
| 356 |
|
| 357 |
+
We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, then mix in one of four datasets: DCLM, Cosmopedia, FineWeb-Edu-HQ, or FineWeb-Edu-LQ. Use the Setup dropdown to also see results with LQ source data. DCLM and FineWeb-Edu-HQ outperform Cosmopedia and FineWeb-Edu-LQ as mix-in datasets. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones (see <FigRef target="mixin-dataset" />). The mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 358 |
|
| 359 |
<HtmlEmbed
|
| 360 |
id="mixin-dataset"
|
| 361 |
src="d3-benchmark-comparison.html"
|
|
|
|
| 362 |
desc="Effect of different mix-in datasets. Use the Setup dropdown to compare HQ vs LQ source data."
|
| 363 |
config={{
|
| 364 |
setups: {
|
|
|
|
| 395 |
|
| 396 |
#### Does the source dataset matter?
|
| 397 |
|
| 398 |
+
We rephrase four datasets (DCLM, Cosmopedia, FineWeb-Edu-HQ, FineWeb-Edu-LQ) with [tutorial](#tutorial) and [faq](#faq) prompts, testing two regimes: (a) mix-in equals source, and (b) fixed mix-in (FineWeb-Edu-HQ). When mix-in varies with source, source quality appears to matter: FineWeb-Edu-HQ and DCLM clearly outperform FineWeb-Edu-LQ and Cosmopedia (see <FigRef target="source-dataset-mixin-source" />). But when we fix the mix-in to FineWeb-Edu-HQ, the source effect nearly vanishes (see <FigRef target="source-dataset-fixed-mixin" />). Source dataset quality is secondary to mix-in dataset quality. With a strong mix-in, even low-quality sources produce competitive synthetic data.

<HtmlEmbed
  id="source-dataset-mixin-source"
  src="d3-benchmark-comparison.html"
  desc="Effect of source dataset when mix-in equals source. Use the Setup dropdown to compare prompts."
  config={{
    setups: {
      …
  }}
/>

<HtmlEmbed
  id="source-dataset-fixed-mixin"
  src="d3-benchmark-comparison.html"
  desc="Effect of source dataset with FineWeb-Edu-HQ as fixed mix-in. Use the Setup dropdown to compare prompts."
  config={{
    setups: {
      …
  }}
/>
#### Does increased diversity help?
We test three diversity strategies: mixing prompts, mixing model families, and mixing both. Use the Setup dropdown to compare strategies. None of them show a significant improvement over the best individual configuration. Performance averages rather than compounds (see <FigRef target="diversity" />). However, our ablations train on only 20B tokens, so it is possible that diversity benefits only emerge at larger scales where the model can better exploit the varied signal.
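A rough sketch of how such diversity strategies could be wired up, assuming per-document sampling of a (prompt, model) pair; the prompt names follow the text, while the model names and the sampling scheme are placeholders:

```python
import random

PROMPTS = ["tutorial", "faq"]     # rephrasing prompt styles from the text
MODELS = ["model-a", "model-b"]   # generator families (placeholder names)

def assign_configs(num_docs, mix_prompts=True, mix_models=True, seed=0):
    """Sample a (prompt, model) pair per source document; disabling an
    axis pins it to the first entry, giving the single-config baseline."""
    rng = random.Random(seed)
    return [
        (
            rng.choice(PROMPTS) if mix_prompts else PROMPTS[0],
            rng.choice(MODELS) if mix_models else MODELS[0],
        )
        for _ in range(num_docs)
    ]
```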
<Sidenote>
Interestingly, when mixing enough different prompts together, we don't seem to need the source dataset for good performance. This could mean that diverse synthetic data can substitute for the original data, but a single synthetic dataset cannot.
</Sidenote>

<HtmlEmbed
  id="diversity"
  src="d3-benchmark-comparison.html"
  desc="Different diversity strategies. Use the Setup dropdown to compare approaches."
  config={{
    setups: {
      …
  }}
/>
<Note title="Summary: Impact of the Dataset Choices" variant="info">
**Synthetic-only**: Not enough. Always mix with original data.<br/>
**Mix-in dataset**: Major performance driver, sometimes more important than the synthetic data itself.<br/>
**Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
**Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
**Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
</Note>

Let's turn to some unexpected findings from our experiments.
### Do Typos in the Prompt Hurt?
We compare REWIRE's [original prompt](#guided_rewrite_original) (with typos) against an [improved version](#guided_rewrite_improved) at both 1B and 12B scale. Surprisingly, the typos show no negative effect on downstream model performance; for the 1B model, the typo-laden original actually performs slightly better (see <FigRef target="typos-effect" />).

<HtmlEmbed
  id="typos-effect"
  src="d3-benchmark-comparison.html"
  desc="REWIRE prompt with original typos vs improved version at 1B and 12B scale."
  config={{
    datasetNames: {
      …
  }}
/>

### Math Rephrasing: When "Worse" Outputs Win
We compare two ~1.7B-parameter models for generating math word problems: SmolLM2 and Qwen3. SmolLM2's outputs look objectively worse, yet models trained on them perform better.
**Qwen3 produced beautiful, structured outputs:**

…

| Partial | 30+ tokens but missing structure | 25% |
| Poor | {'<'}30 tokens | 8% |
For pretraining data, diversity beats consistency. Models that don't follow instructions perfectly can produce better training data than those that do.
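The visible rubric rows suggest a simple length-and-structure heuristic. A sketch under that reading (the 30-token threshold follows the table; the `structure_marker` check is a hypothetical stand-in for whatever structural criterion was actually used):

```python
def grade_output(text, min_tokens=30, structure_marker="Answer:"):
    """Bucket a generated problem by the rubric: under 30 tokens is
    "poor", 30+ tokens without the expected structure is "partial",
    otherwise "good". The structure check is a placeholder marker."""
    tokens = text.split()  # crude whitespace tokenization
    if len(tokens) < min_tokens:
        return "poor"
    if structure_marker not in text:
        return "partial"
    return "good"
```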
app/src/content/chapters/infrastructure.mdx
CHANGED
@@ -373,7 +373,6 @@ The benchmark config defines **801 unique configurations** across 8 experiment g
<HtmlEmbed
  id="optimization-sweep"
  src="d3-optimization-sweep.html"
- title="Throughput Optimization Sweep"
  desc="Throughput optimization across 18 models in two tiers. Tier 0 tunes serving parameters (tp, mns, mnbt). Tier 1 adds gpu-memory-utilization and speculative decoding. Shape encodes tier, color encodes model family."
/>
app/src/content/chapters/introduction.mdx
CHANGED
@@ -46,7 +46,6 @@ Here's a preview of where we end up: FinePhrase, our best configuration, clearly
<HtmlEmbed
  id="finephrase-vs-baselines"
  src="d3-benchmark-comparison.html"
- title="FinePhrase vs Synthetic Baselines"
  desc="FinePhrase compared against synthetic data baselines across evaluation metrics."
  config={{
    defaultView: "line",