joelniklaus HF Staff committed on
Commit
33a5dfc
·
1 Parent(s): 4d3a248

add more analyses for specific benchmarks

app/src/content/chapters/3-experiments.mdx CHANGED
@@ -83,7 +83,11 @@ The BeyondWeb dataset was never released and the paper omits key details, yet cl
83
  }}
84
  />
85
 
86
- Only [diverse_qa_pairs](#diverse_qa_pairs) (driven by very strong SQuAD performance) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM. The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline. That's a pretty underwhelming hit rate. Can we do better with our own prompts?
87
 
88
  ### Can New Prompts Beat DCLM?
89
 
@@ -111,6 +115,10 @@ Since most existing prompts fail to beat DCLM, we designed nine novel prompt for
111
 
112
  Four of them ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) clearly outperform DCLM, while the other five sit at or below DCLM level. The winning prompts share a common trait: they all restructure the source content into pedagogically rich formats rather than just paraphrasing it.
113
 
114
  So far we've been using Gemma-3-1B for everything. A natural question is: can we squeeze out more performance by throwing a bigger or better model at the problem?
115
 
116
  ### Impact of the Rephrasing Model
@@ -183,7 +191,7 @@ That raises an interesting follow-up. REWIRE claims that you specifically need l
183
 
184
  #### Do we need better models for rephrasing low-quality data?
185
 
186
- REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data requires large models. We put this to the test by comparing 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts:
187
 
188
  <HtmlEmbed
189
  id="size-quality"
@@ -343,6 +351,8 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
343
 
344
  The result is striking: SmolLM2 consistently and clearly outperforms all others across every single prompt.
345
 
346
  SmolLM2 is already over a year old at this point. If model quality matters, should we just wait for the next generation?
347
  <Sidenote>
348
  [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) was released during our experiments but is not compatible with the vLLM version we used for inference and dependency hell prohibited us from updating vLLM.
@@ -439,6 +449,8 @@ The dream scenario would be generating all your training data synthetically, no
439
 
440
  Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type.
441
 
442
  OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?
443
 
444
  #### Does the mix-in dataset matter?
@@ -481,6 +493,12 @@ We apply the [tutorial](#tutorial) prompt using Gemma-3-1B on FineWeb-Edu-HQ, th
481
 
482
  DCLM outperforms other mix-in datasets across the board. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones. This was one of our bigger surprises: the mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
483
 
484
  If the mix-in dataset matters so much, what about the source dataset we're actually rephrasing?
485
 
486
  #### Does the source dataset matter?
@@ -603,10 +621,10 @@ Putting together our findings on synthetic-only training, mix-in choice, source
603
 
604
  <Note title="Summary: Impact of the Dataset Choices" variant="info">
605
  **Synthetic-only**: Not enough. Always mix with original data.<br/>
606
- **Mix-in dataset**: Major performance driver, sometimes more important than the synthetic data itself.<br/>
607
  **Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
608
  **Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
609
- **Practical takeaway**: Invest in a high-quality mix-in dataset. The source quality matters less.
610
  </Note>
611
 
612
  We've covered prompts, models, and datasets. One last fun question: how sensitive is all of this to tiny details in the prompt itself?
 
83
  }}
84
  />
85
 
86
+ On aggregate, only [diverse_qa_pairs](#diverse_qa_pairs) and REWIRE's [guided_rewrite](#guided_rewrite_original) match DCLM. The BeyondWeb-inspired [continue](#continue) and [summarize](#summarize) prompts don't reach DCLM level. So out of all the prompts from prior work, only two actually match our baseline. That's a pretty underwhelming hit rate.
87
+
88
+ But the aggregate hides a striking pattern. Switch to individual benchmarks with the dropdown and you'll see that DCLM dominates on HellaSwag and PIQA (commonsense reasoning), beating every single synthetic prompt. Meanwhile, almost all synthetic prompts comfortably beat DCLM on ARC (science knowledge) and SQuAD (reading comprehension). Rephrasing is essentially trading commonsense reasoning for factual recall. The aggregate score papers over this because gains on one side roughly cancel losses on the other. Keep an eye on this trade-off as you read on: it explains why mixing in original data matters, why DCLM is the best mix-in, and why synthetic-only training underperforms.
89
+
90
+ Can we do better with our own prompts?
91
 
92
  ### Can New Prompts Beat DCLM?
93
 
 
115
 
116
  Four of them ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)) clearly outperform DCLM, while the other five sit at or below DCLM level. The winning prompts share a common trait: they all restructure the source content into pedagogically rich formats rather than just paraphrasing it.
117
 
118
+ The commonsense-vs-knowledge trade-off from the previous section persists here too: switch to HellaSwag or PIQA and every single prompt, including the four winners, falls below DCLM. The new prompts win on aggregate because their ARC and SQuAD gains outweigh the commonsense losses, not because they improve across the board.
119
+
120
+ Each prompt also has a distinct benchmark signature. [Table](#table) produces the strongest ARC boost (+7.5pp over DCLM). [Math](#math) is the only prompt that meaningfully moves GSM8K (+1.5pp; all others stay within ±0.5pp) and also posts the largest SQuAD gain (+11.2pp). [Tutorial](#tutorial) is the only prompt that improves DROP (+1.4pp). GSM8K's resistance is notable: math reasoning appears to require math-specific content, not just any pedagogical restructuring.
121
+
122
  So far we've been using Gemma-3-1B for everything. A natural question is: can we squeeze out more performance by throwing a bigger or better model at the problem?
123
 
124
  ### Impact of the Rephrasing Model
 
191
 
192
  #### Do we need better models for rephrasing low-quality data?
193
 
194
+ REWIRE [@rewire] used Llama-3.3 70B and argued that upcycling low-quality data requires large models. We put this to the test by comparing Gemma-3-1B vs Gemma-3-12B on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [faq](#faq), [tutorial](#tutorial)). Use the Setup dropdown to switch between prompts:
195
 
196
  <HtmlEmbed
197
  id="size-quality"
 
351
 
352
  The result is striking: SmolLM2 consistently and clearly outperforms all others across every single prompt.
353
 
354
+ But where does that advantage actually come from? Switch to SQuAD: SmolLM2 leads by roughly +10pp over the average of the other model families, consistently across all prompts. It also pulls ahead on TriviaQA (+1 to +5pp). On HellaSwag, PIQA, and GSM8K, the differences between model families are tiny (1-2pp). SmolLM2's aggregate dominance is largely a QA story.
355
+
356
  SmolLM2 is already over a year old at this point. If model quality matters, should we just wait for the next generation?
357
  <Sidenote>
358
  [SmolLM3](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) was released during our experiments but is not compatible with the vLLM version we used for inference and dependency hell prohibited us from updating vLLM.
 
449
 
450
  Unfortunately, synthetic-only training falls short of both DCLM and mixed training. Mixing consistently improves over both the synthetic-only and original-data-only baselines, regardless of prompt type.
451
 
452
+ The per-benchmark view sharpens the picture. The benchmarks that benefit most from mixing are HellaSwag (+0.5 to +1.3pp) and, for most prompts, SQuAD (+4 to +12pp for [tutorial](#tutorial) and [faq](#faq)). GSM8K doesn't move at all. The "always mix with original data" takeaway is driven primarily by commonsense recovery, not by a uniform lift across all skills.
453
+
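Mechanically, mixing at a fixed ratio is simple. A minimal sketch of document-level interleaving, purely for illustration (the function name, the 50/50 default, and the sampling-without-replacement scheme are assumptions, not our exact pipeline):

```python
import random

def mix_datasets(original, synthetic, synthetic_fraction=0.5, seed=0):
    """Draw documents from two pools at a target ratio, without replacement.

    `synthetic_fraction` is the share of draws taken from the synthetic
    pool; with similarly sized documents this approximates the token-level
    mix ratio (an illustrative simplification).
    """
    rng = random.Random(seed)
    pools = [list(original), list(synthetic)]
    for pool in pools:
        rng.shuffle(pool)
    mixed = []
    while pools[0] or pools[1]:
        # Pick the synthetic pool with probability synthetic_fraction,
        # falling back to whichever pool still has documents left.
        want_synth = rng.random() < synthetic_fraction
        pool = pools[1] if (want_synth and pools[1]) or not pools[0] else pools[0]
        mixed.append(pool.pop())
    return mixed
```

In a real training setup the mixing would happen at the token or shard level inside the data loader, but the ratio logic is the same.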
454
  OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?
455
 
456
  #### Does the mix-in dataset matter?
 
493
 
494
  DCLM outperforms other mix-in datasets across the board. Adding synthetic data improves performance for all mix-in datasets, with the effect especially pronounced for the weaker ones. This was one of our bigger surprises: the mix-in dataset is a major performance driver, sometimes more important than the synthetic data itself.
495
 
496
+ The per-benchmark view reveals that DCLM and FineWeb-Edu-HQ as mix-ins have complementary strengths, and the balance between them shifts with source data quality. With HQ source, switch to HellaSwag and PIQA: DCLM as mix-in recovers most of the commonsense signal that rephrasing destroys, while FineWeb-Edu-HQ does not. Switch to SQuAD and DROP: FineWeb-Edu-HQ pulls ahead on reading comprehension. Their macro scores are virtually identical (both 0.143), but DCLM edges ahead on micro because its commonsense gains are spread across more benchmarks.
497
+
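For readers unfamiliar with the macro/micro distinction: macro averages the per-benchmark scores directly, while micro (assuming the usual example-weighted definition) weights each benchmark by its number of questions, so large benchmarks count more. A minimal sketch with invented benchmark sizes and scores (not our results) showing how two runs can tie on macro yet diverge on micro:

```python
def macro_micro(scores, sizes):
    """Macro: unweighted mean of per-benchmark scores.
    Micro: mean weighted by each benchmark's example count."""
    macro = sum(scores.values()) / len(scores)
    total = sum(sizes[b] for b in scores)
    micro = sum(scores[b] * sizes[b] for b in scores) / total
    return macro, micro

# Invented example: two runs with identical macro but different micro.
sizes = {"hellaswag": 10000, "arc": 1000}   # hypothetical benchmark sizes
run_a = {"hellaswag": 0.60, "arc": 0.40}    # strong on the large benchmark
run_b = {"hellaswag": 0.40, "arc": 0.60}    # strong on the small benchmark
```

Both runs score 0.50 macro, but run_a reaches roughly 0.58 micro versus 0.42 for run_b, because its gains sit where most of the examples are.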
498
+ DCLM's commonsense recovery is remarkably stable: across all 15 runs with DCLM as mix-in, HellaSwag scores land in a tight range of 0.086-0.092, while the 124 runs with FineWeb-Edu-HQ as mix-in spread much wider (0.069-0.098). DCLM essentially clamps commonsense performance to a narrow band regardless of what you do with the synthetic portion.
499
+
500
+ Now switch to the LQ Source setup. Here FineWeb-Edu-HQ actually overtakes DCLM on both macro and micro. The reason is visible on ARC: FineWeb-Edu-HQ as mix-in scores +6pp over DCLM as mix-in, a gap far larger than with HQ source (+1pp). When the source data is low-quality, the rephrased output carries less knowledge on its own, so the mix-in's knowledge content matters more, and FineWeb-Edu-HQ's educational focus pays off. Meanwhile the HellaSwag gap narrows (-0.8pp vs -1.2pp with HQ source). The practical takeaway: DCLM is the better mix-in for high-quality sources, but FineWeb-Edu-HQ can be the better choice when rephrasing low-quality data.
501
+
502
  If the mix-in dataset matters so much, what about the source dataset we're actually rephrasing?
503
 
504
  #### Does the source dataset matter?
 
621
 
622
  <Note title="Summary: Impact of the Dataset Choices" variant="info">
623
  **Synthetic-only**: Not enough. Always mix with original data.<br/>
624
+ **Mix-in dataset**: Major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge). Best choice depends on source quality.<br/>
625
  **Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
626
  **Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
627
+ **Practical takeaway**: Invest in a high-quality mix-in dataset. DCLM for high-quality sources, FineWeb-Edu-HQ for low-quality ones.
628
  </Note>
629
 
630
  We've covered prompts, models, and datasets. One last fun question: how sensitive is all of this to tiny details in the prompt itself?