joelniklaus HF Staff committed on
Commit
33a5dfc
·
1 Parent(s): 4d3a248

add more analyses for specific benchmarks

app/src/content/chapters/3-experiments.mdx CHANGED
@@ -83,7 +83,11 @@ The BeyondWeb dataset was never released and the paper omits key details, yet cl
83
  }}
84
  />
85
 
86
- Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM. The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline. That's a pretty underwhelming hit rate. Can we do better with our own prompts?
87
 
88
  ### Can New Prompts Beat DCLM?
89
 
@@ -111,6 +115,10 @@ Since most existing prompts fail to beat DCLM, we designed nine novel prompt for
111
 
112
  Four of them ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) clearly outperform DCLM, while the other five sit at or below DCLM level. The winning prompts share a common trait: they all restructure the source content into pedagogically rich formats rather than just paraphrasing it.
113
 
114
  So far we've been using Gemma-3-1B for everything. A natural question is: can we squeeze out more performance by throwing a bigger or better model at the problem?
115
 
116
  ### Impact of the Rephrasing Model
@@ -183,7 +191,7 @@ That raises an interesting follow-up. REWIRE claims that you specifically need l
183
 
184
  #### Do we need better models for rephrasing low-quality data?
185
 
186
- REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data requires large models. We put this to the test by comparing 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts:
187
 
188
  <HtmlEmbed
189
  id="size-quality"
@@ -343,6 +351,8 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
343
 
344
  The result is striking: SmolLM2 consistently and clearly outperforms all others across every single prompt.
345
 
346
  SmolLM2 is already over a year old at this point. If model quality matters, should we just wait for the next generation?
347
  <Sidenote>
348
  [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) was released during our experiments but is not compatible with the vLLM version we used for inference and dependency hell prohibited us from updating vLLM.
@@ -439,6 +449,8 @@ The dream scenario would be generating all your training data synthetically, no
439
 
440
  Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type.
441
 
442
  OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?
443
 
444
  #### Does the mix-in dataset matter?
@@ -481,6 +493,12 @@ We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, th
481
 
482
  DCLM outperforms other mix-in datasets across the board. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones. This was one of our bigger surprises: the mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
483
 
484
  If the mix-in dataset matters so much, what about the source dataset we're actually rephrasing?
485
 
486
  #### Does the source dataset matter?
@@ -603,10 +621,10 @@ Putting together our findings on synthetic-only training, mix-in choice, source
603
 
604
  <Note title="Summary: Impact of the Dataset Choices" variant="info">
605
  **Synthetic-only**: Not enough. Always mix with original data.<br/>
606
- **Mix-in dataset**: Major performance driver, sometimes more important than the synthetic data itself.<br/>
607
  **Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
608
  **Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
609
- **Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
610
  </Note>
611
 
612
  We've covered prompts, models, and datasets. One last fun question: how sensitive is all of this to tiny details in the prompt itself?
 
83
  }}
84
  />
85
 
86
+ On aggregate, only [diverse_qa_pairs](#diverse_qa_pairs) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM. The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline. That's a pretty underwhelming hit rate.
87
+
88
+ But the aggregate hides a striking pattern. Switch to individual benchmarks with the dropdown and you'll see that DCLM dominates on HellaSwag and PIQA (commonsense reasoning), beating every single synthetic prompt. Meanwhile, almost all synthetic prompts comfortably beat DCLM on ARC (science knowledge) and SQuAD (reading comprehension). Rephrasing is essentially trading commonsense reasoning for factual recall. The aggregate score papers over this because gains on one side roughly cancel losses on the other. Keep an eye on this trade-off as you read on: it explains why mixing in original data matters, why DCLM is the best mix-in, and why synthetic-only training underperforms.
89
+
90
+ Can we do better with our own prompts?
91
 
92
  ### Can New Prompts Beat DCLM?
93
 
 
115
 
116
  Four of them ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) clearly outperform DCLM, while the other five sit at or below DCLM level. The winning prompts share a common trait: they all restructure the source content into pedagogically rich formats rather than just paraphrasing it.
117
 
118
+ The commonsense-vs-knowledge trade-off from the previous section persists here too: switch to HellaSwag or PIQA and every single prompt, including the four winners, falls below DCLM. The new prompts win on aggregate because their ARC and SQuAD gains outweigh the commonsense losses, not because they improve across the board.
119
+
120
+ Each prompt also has a distinct benchmark signature. [Table](#table) produces the strongest ARC boost (+7.5pp over DCLM). [Math](#math) is the only prompt that meaningfully moves GSM8K (+1.5pp; all others stay within ±0.5pp) and also posts the largest SQuAD gain (+11.2pp). [Tutorial](#tutorial) is the only prompt that improves DROP (+1.4pp). GSM8K's resistance is notable: math reasoning appears to require math-specific content, not just any pedagogical restructuring.
121
+
122
  So far we've been using Gemma-3-1B for everything. A natural question is: can we squeeze out more performance by throwing a bigger or better model at the problem?
123
 
124
  ### Impact of the Rephrasing Model
 
191
 
192
  #### Do we need better models for rephrasing low-quality data?
193
 
194
+ REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data requires large models. We put this to the test by comparing Gemma-3-1B vs Gemma-3-12B on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts:
195
 
196
  <HtmlEmbed
197
  id="size-quality"
 
351
 
352
  The result is striking: SmolLM2 consistently and clearly outperforms all others across every single prompt.
353
 
354
+ But where does that advantage actually come from? Switch to SQuAD: SmolLM2 leads by roughly +10pp over the average of the other model families, consistently across all prompts. It also pulls ahead on TriviaQA (+1 to +5pp). On HellaSwag, PIQA, and GSM8K, the differences between model families are tiny (1-2pp). SmolLM2's aggregate dominance is largely a QA story.
355
+
356
  SmolLM2 is already over a year old at this point. If model quality matters, should we just wait for the next generation?
357
  <Sidenote>
358
  [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) was released during our experiments but is not compatible with the vLLM version we used for inference and dependency hell prohibited us from updating vLLM.
 
449
 
450
  Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type.
451
 
452
+ The per-benchmark view sharpens the picture. The benchmarks that benefit most from mixing are HellaSwag (+0.5 to +1.3pp) and, for most prompts, SQuAD (+4 to +12pp for [tutorial](#tutorial) and [faq](#faq)). GSM8K doesn't move at all. The "always mix with original data" takeaway is driven primarily by commonsense recovery, not by a uniform lift across all skills.
453
+
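Mechanically, mixing at a fixed ratio is simple. A minimal sketch of document-level interleaving, purely for illustration (the function name, the 50/50 default, and the sampling-without-replacement scheme are assumptions, not our exact pipeline):

```python
import random

def mix_datasets(original, synthetic, synthetic_fraction=0.5, seed=0):
    """Draw documents from two pools at a target ratio, without replacement.

    `synthetic_fraction` is the share of draws taken from the synthetic
    pool; with similarly sized documents this approximates the token-level
    mix ratio (an illustrative simplification).
    """
    rng = random.Random(seed)
    pools = [list(original), list(synthetic)]
    for pool in pools:
        rng.shuffle(pool)
    mixed = []
    while pools[0] or pools[1]:
        # Pick the synthetic pool with probability synthetic_fraction,
        # falling back to whichever pool still has documents left.
        want_synth = rng.random() < synthetic_fraction
        pool = pools[1] if (want_synth and pools[1]) or not pools[0] else pools[0]
        mixed.append(pool.pop())
    return mixed
```

In a real training setup the mixing would happen at the token or shard level inside the data loader, but the ratio logic is the same.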
454
  OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?
455
 
456
  #### Does the mix-in dataset matter?
 
493
 
494
  DCLM outperforms other mix-in datasets across the board. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones. This was one of our bigger surprises: the mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
495
 
496
+ The per-benchmark view reveals that DCLM and FineWeb-Edu-HQ as mix-ins have complementary strengths, and the balance between them shifts with source data quality. With HQ source, switch to HellaSwag and PIQA: DCLM as mix-in recovers most of the commonsense signal that rephrasing destroys, while FineWeb-Edu-HQ does not. Switch to SQuAD and DROP: FineWeb-Edu-HQ pulls ahead on reading comprehension. Their macro scores are virtually identical (both 0.143), but DCLM edges ahead on micro because its commonsense gains are spread across more benchmarks.
497
+
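For readers unfamiliar with the macro/micro distinction: macro averages the per-benchmark scores directly, while micro (assuming the usual example-weighted definition) weights each benchmark by its number of questions, so large benchmarks count more. A minimal sketch with invented benchmark sizes and scores (not our results) showing how two runs can tie on macro yet diverge on micro:

```python
def macro_micro(scores, sizes):
    """Macro: unweighted mean of per-benchmark scores.
    Micro: mean weighted by each benchmark's example count."""
    macro = sum(scores.values()) / len(scores)
    total = sum(sizes[b] for b in scores)
    micro = sum(scores[b] * sizes[b] for b in scores) / total
    return macro, micro

# Invented example: two runs with identical macro but different micro.
sizes = {"hellaswag": 10000, "arc": 1000}   # hypothetical benchmark sizes
run_a = {"hellaswag": 0.60, "arc": 0.40}    # strong on the large benchmark
run_b = {"hellaswag": 0.40, "arc": 0.60}    # strong on the small benchmark
```

Both runs score 0.50 macro, but run_a reaches roughly 0.58 micro versus 0.42 for run_b, because its gains sit where most of the examples are.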
498
+ DCLM's commonsense recovery is remarkably stable: across all 15 runs with DCLM as mix-in, HellaSwag scores land in a tight range of 0.086-0.092, while the 124 runs with FineWeb-Edu-HQ as mix-in spread much wider (0.069-0.098). DCLM essentially clamps commonsense performance to a narrow band regardless of what you do with the synthetic portion.
499
+
500
+ Now switch to the LQ Source setup. Here FineWeb-Edu-HQ actually overtakes DCLM on both macro and micro. The reason is visible on ARC: FineWeb-Edu-HQ as mix-in scores +6pp over DCLM as mix-in, a gap far larger than with HQ source (+1pp). When the source data is low-quality, the rephrased output carries less knowledge on its own, so the mix-in's knowledge content matters more, and FineWeb-Edu-HQ's educational focus pays off. Meanwhile the HellaSwag gap narrows (-0.8pp vs -1.2pp with HQ source). The practical takeaway: DCLM is the better mix-in for high-quality sources, but FineWeb-Edu-HQ can be the better choice when rephrasing low-quality data.
501
+
502
  If the mix-in dataset matters so much, what about the source dataset we're actually rephrasing?
503
 
504
  #### Does the source dataset matter?
 
621
 
622
  <Note title="Summary: Impact of the Dataset Choices" variant="info">
623
  **Synthetic-only**: Not enough. Always mix with original data.<br/>
624
+ **Mix-in dataset**: Major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge). Best choice depends on source quality.<br/>
625
  **Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
626
  **Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
627
+ **Practical takeaway**: Invest in a high-quality mix-in dataset. DCLM for high-quality sources, FineWeb-Edu-HQ for low-quality ones.
628
  </Note>
629
 
630
  We've covered prompts, models, and datasets. One last fun question: how sensitive is all of this to tiny details in the prompt itself?