finephrase

joelniklaus (HF Staff) committed 28 days ago

Commit d22ec15 (1 parent: b3215eb)

add synthetic data proportion experiment

Files changed (4)

app/src/content/chapters/1-introduction.mdx +5 -2
app/src/content/chapters/3-experiments.mdx +86 -1
app/src/content/chapters/6-finephrase.mdx +6 -6
app/src/content/embeds/d3-benchmark-comparison.html +10 -2

app/src/content/chapters/1-introduction.mdx CHANGED

@@ -22,7 +22,7 @@ Reading time: One weekend
   config={{
     defaultView: "line",
     datasets: {
-      "mix-fw_edu_hq-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
       cosmopedia: { display: "Cosmopedia", color: "#e15759" },
       nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
       rewire: { display: "REWIRE", color: "#1877F2" },
@@ -30,10 +30,13 @@ Reading time: One weekend
     },
     speedupAnnotation: {
       baselineRun: "nemotron_hq_synth",
-      targetRun: "mix-fw_edu_hq-table_smollm2_1.7b_hq"
     }
   }}
 />
 If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinitymanifesto; @arceetrinitylarge]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:

   config={{
     defaultView: "line",
     datasets: {
+      "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
       cosmopedia: { display: "Cosmopedia", color: "#e15759" },
       nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
       rewire: { display: "REWIRE", color: "#1877F2" },
     },
     speedupAnnotation: {
       baselineRun: "nemotron_hq_synth",
+      targetRun: "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq"
     }
   }}
 />
+<Sidenote>
+FinePhrase (table) here uses the best mixing ratio we found, 70% synthetic; see [how much synthetic data to mix](#how-much-synthetic-data-should-you-mix) for the full sweep.
+</Sidenote>
 If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinitymanifesto; @arceetrinitylarge]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:

app/src/content/chapters/3-experiments.mdx CHANGED

@@ -452,7 +452,90 @@ Unfortunately, synthetic-only training falls short of both DCLM and mixed traini
 The per-benchmark view sharpens the picture. The benchmarks that benefit most from mixing are HellaSwag (+0.5 to +1.3pp) and, for most prompts, SQuAD (+4 to +12pp for Tutorial and FAQ). GSM8K doesn't move at all. The "always mix with original data" takeaway is driven primarily by commonsense recovery, not a uniform lift across all skills.
-OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?
 #### Does the mix-in dataset matter?
@@ -622,6 +705,7 @@ Putting together our findings on synthetic-only training, mix-in choice, source
 <Note title="Summary: Impact of the Dataset Choices" variant="info">
 **Synthetic-only**: Not enough. Always mix with original data.<br/>
 **Mix-in dataset**: Major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge). Best choice depends on source quality.<br/>
 **Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
 **Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
@@ -674,6 +758,7 @@ Let's step back and summarize what we learned:
     <tr><td>Does the model family matter?</td><td>Yes. SmolLM2 dominates across all prompts.</td></tr>
     <tr><td>Does the model generation matter?</td><td>Slightly. Newer Qwen versions trend better.</td></tr>
     <tr><td>Is synthetic data enough?</td><td>No. Always mix synthetic with original data.</td></tr>
     <tr><td>Does the mix-in dataset matter?</td><td>Yes, a major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge), and the best choice depends on source data quality.</td></tr>
     <tr><td>Does the source dataset matter?</td><td>Not with a strong mix-in. Even low-quality sources produce competitive results.</td></tr>
     <tr><td>Does increased diversity help?</td><td>No, performance averages rather than compounds.</td></tr>

 The per-benchmark view sharpens the picture. The benchmarks that benefit most from mixing are HellaSwag (+0.5 to +1.3pp) and, for most prompts, SQuAD (+4 to +12pp for Tutorial and FAQ). GSM8K doesn't move at all. The "always mix with original data" takeaway is driven primarily by commonsense recovery, not a uniform lift across all skills.
+OK, so we always mix in original data. But every experiment so far has split synthetic and original evenly, 50/50, without ever questioning that ratio. How much synthetic data is actually optimal?
+#### How Much Synthetic Data Should You Mix?
+To find out, we sweep the synthetic fraction from 10% to 90% in 10% steps for each of our four winning prompts ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)), holding the token budget fixed and changing only the blend. The generator is SmolLM2-1.7B, already our best rephraser by this point, and the mix-in stays FineWeb-Edu-HQ. Use the Setup dropdown to switch prompts and the Metric dropdown to drill into individual benchmarks:
+<HtmlEmbed
+  id="proportion-sweep"
+  src="d3-benchmark-comparison.html"
+  desc="Downstream performance as the synthetic fraction sweeps from 10% to 90% for each of the four winning prompts, mixed with FineWeb-Edu-HQ. Final Score shows the last checkpoint per fraction; Training Progression shows the full training curves. Use the Setup dropdown to switch prompts and the Metric dropdown for individual benchmarks."
+  config={{
+    defaultView: "bar",
+    defaultMetric: "agg_score_macro",
+    sortBars: false,
+    setups: {
+      "Math": {
+        datasets: {
+          "mix-0.9-fw_edu_hq-0.1-math_smollm2_1.7b_hq": { display: "10%", color: "#4575b4" },
+          "mix-0.8-fw_edu_hq-0.2-math_smollm2_1.7b_hq": { display: "20%", color: "#576dab" },
+          "mix-0.7-fw_edu_hq-0.3-math_smollm2_1.7b_hq": { display: "30%", color: "#6964a2" },
+          "mix-0.6-fw_edu_hq-0.4-math_smollm2_1.7b_hq": { display: "40%", color: "#7b5c99" },
+          "mix-0.5-fw_edu_hq-0.5-math_smollm2_1.7b_hq": { display: "50%", color: "#8d5490" },
+          "mix-0.4-fw_edu_hq-0.6-math_smollm2_1.7b_hq": { display: "60%", color: "#a04c87" },
+          "mix-0.3-fw_edu_hq-0.7-math_smollm2_1.7b_hq": { display: "70%", color: "#b2437e" },
+          "mix-0.2-fw_edu_hq-0.8-math_smollm2_1.7b_hq": { display: "80%", color: "#c43b75" },
+          "mix-0.1-fw_edu_hq-0.9-math_smollm2_1.7b_hq": { display: "90%", color: "#d6336c" },
+          dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+        }
+      },
+      "Table": {
+        datasets: {
+          "mix-0.9-fw_edu_hq-0.1-table_smollm2_1.7b_hq": { display: "10%", color: "#4575b4" },
+          "mix-0.8-fw_edu_hq-0.2-table_smollm2_1.7b_hq": { display: "20%", color: "#576dab" },
+          "mix-0.7-fw_edu_hq-0.3-table_smollm2_1.7b_hq": { display: "30%", color: "#6964a2" },
+          "mix-0.6-fw_edu_hq-0.4-table_smollm2_1.7b_hq": { display: "40%", color: "#7b5c99" },
+          "mix-0.5-fw_edu_hq-0.5-table_smollm2_1.7b_hq": { display: "50%", color: "#8d5490" },
+          "mix-0.4-fw_edu_hq-0.6-table_smollm2_1.7b_hq": { display: "60%", color: "#a04c87" },
+          "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq": { display: "70%", color: "#b2437e" },
+          "mix-0.2-fw_edu_hq-0.8-table_smollm2_1.7b_hq": { display: "80%", color: "#c43b75" },
+          "mix-0.1-fw_edu_hq-0.9-table_smollm2_1.7b_hq": { display: "90%", color: "#d6336c" },
+          dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+        }
+      },
+      "FAQ": {
+        datasets: {
+          "mix-0.9-fw_edu_hq-0.1-faq_smollm2_1.7b_hq": { display: "10%", color: "#4575b4" },
+          "mix-0.8-fw_edu_hq-0.2-faq_smollm2_1.7b_hq": { display: "20%", color: "#576dab" },
+          "mix-0.7-fw_edu_hq-0.3-faq_smollm2_1.7b_hq": { display: "30%", color: "#6964a2" },
+          "mix-0.6-fw_edu_hq-0.4-faq_smollm2_1.7b_hq": { display: "40%", color: "#7b5c99" },
+          "mix-0.5-fw_edu_hq-0.5-faq_smollm2_1.7b_hq": { display: "50%", color: "#8d5490" },
+          "mix-0.4-fw_edu_hq-0.6-faq_smollm2_1.7b_hq": { display: "60%", color: "#a04c87" },
+          "mix-0.3-fw_edu_hq-0.7-faq_smollm2_1.7b_hq": { display: "70%", color: "#b2437e" },
+          "mix-0.2-fw_edu_hq-0.8-faq_smollm2_1.7b_hq": { display: "80%", color: "#c43b75" },
+          "mix-0.1-fw_edu_hq-0.9-faq_smollm2_1.7b_hq": { display: "90%", color: "#d6336c" },
+          dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+        }
+      },
+      "Tutorial": {
+        datasets: {
+          "mix-0.9-fw_edu_hq-0.1-tutorial_smollm2_1.7b_hq": { display: "10%", color: "#4575b4" },
+          "mix-0.8-fw_edu_hq-0.2-tutorial_smollm2_1.7b_hq": { display: "20%", color: "#576dab" },
+          "mix-0.7-fw_edu_hq-0.3-tutorial_smollm2_1.7b_hq": { display: "30%", color: "#6964a2" },
+          "mix-0.6-fw_edu_hq-0.4-tutorial_smollm2_1.7b_hq": { display: "40%", color: "#7b5c99" },
+          "mix-0.5-fw_edu_hq-0.5-tutorial_smollm2_1.7b_hq": { display: "50%", color: "#8d5490" },
+          "mix-0.4-fw_edu_hq-0.6-tutorial_smollm2_1.7b_hq": { display: "60%", color: "#a04c87" },
+          "mix-0.3-fw_edu_hq-0.7-tutorial_smollm2_1.7b_hq": { display: "70%", color: "#b2437e" },
+          "mix-0.2-fw_edu_hq-0.8-tutorial_smollm2_1.7b_hq": { display: "80%", color: "#c43b75" },
+          "mix-0.1-fw_edu_hq-0.9-tutorial_smollm2_1.7b_hq": { display: "90%", color: "#d6336c" },
+          dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+        }
+      }
+    }
+  }}
+/>
+Every prompt has a real optimum strictly between pure original and pure synthetic, and where that optimum sits is format-dependent: [math](#math) wants the most synthetic data (80%), [table](#table) peaks at 70%, and [faq](#faq) and [tutorial](#tutorial) top out at 60%. So the uniform 1/N weighting we (and most prior work) reach for by default lands in a sensible range only by accident. Encouragingly, the curves climb to their peak and then plateau or dip slightly rather than collapsing, so anywhere in the 60-80% band is a safe choice across all four formats.
+[Table](#table) is the strongest 2-way pairing overall: it peaks clearly above the DCLM baseline and edges out [math](#math), with [faq](#faq) and [tutorial](#tutorial) trailing. Switch the Metric dropdown to see where the gains come from. They concentrate in reading comprehension, where SQuAD climbs steadily as the synthetic share grows, and in GSM8K for the math prompt, while commonsense and world-knowledge benchmarks like HellaSwag, PIQA, and MMLU stay essentially flat. It is the same knowledge-for-commonsense trade we have seen all along: synthetic data buys targeted reading and math skills, not broad world knowledge.
+<Sidenote>
+Why did every other experiment use a flat 50/50 split? A fixed ratio keeps those comparisons fair, otherwise we would be measuring ratio tuning as much as the prompt or model itself. And 50/50 is benign: it sits on the plateau within a percentage point of every optimum and preserves the prompt ranking.
+</Sidenote>
+We now know roughly how much synthetic data to mix in. But we have only varied the amount, not the partner: does the specific choice of mix-in dataset matter as much as the ratio?
 #### Does the mix-in dataset matter?
 <Note title="Summary: Impact of the Dataset Choices" variant="info">
 **Synthetic-only**: Not enough. Always mix with original data.<br/>
+**Synthetic fraction**: 60-80% synthetic is optimal and format-dependent (math 80%, table 70%, faq/tutorial 60%). Uniform 1/N is only mid-range by accident.<br/>
 **Mix-in dataset**: Major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge). Best choice depends on source quality.<br/>
 **Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
 **Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
     <tr><td>Does the model family matter?</td><td>Yes. SmolLM2 dominates across all prompts.</td></tr>
     <tr><td>Does the model generation matter?</td><td>Slightly. Newer Qwen versions trend better.</td></tr>
     <tr><td>Is synthetic data enough?</td><td>No. Always mix synthetic with original data.</td></tr>
+    <tr><td>How much synthetic data should you mix?</td><td>60-80%, format-dependent (math 80%, table 70%, faq/tutorial 60%). Uniform 1/N was only mid-range by accident.</td></tr>
     <tr><td>Does the mix-in dataset matter?</td><td>Yes, a major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge), and the best choice depends on source data quality.</td></tr>
     <tr><td>Does the source dataset matter?</td><td>Not with a strong mix-in. Even low-quality sources produce competitive results.</td></tr>
     <tr><td>Does increased diversity help?</td><td>No, performance averages rather than compounds.</td></tr>
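
Reviewer note on the run names used in the sweep above: the keys appear to follow the pattern `mix-<original fraction>-<mix-in dataset>-<synthetic fraction>-<prompt>_<rest>`, since `mix-0.9-fw_edu_hq-0.1-math_smollm2_1.7b_hq` is displayed as "10%". A minimal sketch of a parser under that assumption follows. `parseMixRun` and `syntheticLabel` are hypothetical helpers for illustration only and are not part of this commit:

```javascript
// Hypothetical helper, NOT part of the commit.
// Assumes run names shaped like:
//   mix-<original fraction>-<mix-in>-<synthetic fraction>-<prompt>_<rest>
// e.g. "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq"
function parseMixRun(runName) {
  const match = /^mix-(\d*\.?\d+)-(.+?)-(\d*\.?\d+)-([a-z]+)_(.+)$/.exec(runName);
  if (!match) return null; // baselines ("dclm") and old-style names do not match
  const [, original, mixIn, synthetic, prompt, rest] = match;
  return {
    mixIn,                                // e.g. "fw_edu_hq"
    prompt,                               // e.g. "table"
    rest,                                 // e.g. "smollm2_1.7b_hq"
    originalFraction: Number(original),   // e.g. 0.3
    syntheticFraction: Number(synthetic), // e.g. 0.7
  };
}

// Turn a parsed run into the percentage label used in the sweep config.
function syntheticLabel(runName) {
  const parsed = parseMixRun(runName);
  return parsed ? `${Math.round(parsed.syntheticFraction * 100)}%` : null;
}

console.log(syntheticLabel("mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq"));
```

Deriving the label from the key this way would keep the `display` strings and the run names from drifting apart, but the hand-written labels in the diff are equally valid. The pre-commit names without fractions (for example `mix-fw_edu_hq-table_smollm2_1.7b_hq`) return `null` here.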

app/src/content/chapters/6-finephrase.mdx CHANGED

@@ -175,19 +175,19 @@ Browse some real examples from FinePhrase below. Each sample shows the original
 ### How Does FinePhrase Compare?
-In the introduction we showed a single FinePhrase prompt (table) against the baselines. Now that the full dataset is built, here's how all four FinePhrase prompts stack up against the strongest synthetic data baselines:
 <HtmlEmbed
   id="finephrase-all-prompts"
   src="d3-benchmark-comparison.html"
-  desc="All four FinePhrase prompts compared against synthetic data baselines across evaluation metrics."
   config={{
     defaultView: "line",
     datasets: {
-      "mix-fw_edu_hq-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
-      "mix-fw_edu_hq-math_smollm2_1.7b_hq": { display: "FinePhrase (math)", color: "#E09530" },
-      "mix-fw_edu_hq-faq_smollm2_1.7b_hq": { display: "FinePhrase (faq)", color: "#D58228" },
-      "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": { display: "FinePhrase (tutorial)", color: "#CA7020" },
       cosmopedia: { display: "Cosmopedia", color: "#e15759" },
       nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
       rewire: { display: "REWIRE", color: "#1877F2" },

 ### How Does FinePhrase Compare?
+In the introduction we teased our single best configuration, the table prompt at its optimal 70% synthetic ratio. Now that the full dataset is built, here's how all four FinePhrase prompts stack up against the strongest synthetic data baselines, each mixed with FineWeb-Edu-HQ at its own best ratio from the [synthetic-fraction sweep](#how-much-synthetic-data-should-you-mix):
 <HtmlEmbed
   id="finephrase-all-prompts"
   src="d3-benchmark-comparison.html"
+  desc="All four FinePhrase prompts, each at its best mixing ratio, compared against synthetic data baselines across evaluation metrics."
   config={{
     defaultView: "line",
     datasets: {
+      "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
+      "mix-0.2-fw_edu_hq-0.8-math_smollm2_1.7b_hq": { display: "FinePhrase (math)", color: "#E09530" },
+      "mix-0.4-fw_edu_hq-0.6-faq_smollm2_1.7b_hq": { display: "FinePhrase (faq)", color: "#D58228" },
+      "mix-0.4-fw_edu_hq-0.6-tutorial_smollm2_1.7b_hq": { display: "FinePhrase (tutorial)", color: "#CA7020" },
       cosmopedia: { display: "Cosmopedia", color: "#e15759" },
       nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
       rewire: { display: "REWIRE", color: "#1877F2" },

app/src/content/embeds/d3-benchmark-comparison.html CHANGED

@@ -18,6 +18,7 @@
     },
     "defaultMetric":  "agg_score_macro",                                    // optional, default: "agg_score_macro"
     "defaultView":    "bar",                                                // optional, "bar" | "line", default: "bar"
     "defaultSetup":   "average",                                            // optional, setup name or "average", default: "average" when ≥2 setups
     "tokensPerStep":  2100000,                                              // optional, default: 2.1e6
     "runColumn":      "runname",                                            // optional, CSV column for series, default: "runname"
@@ -282,6 +283,7 @@
       const TOKENS_PER_STEP = cfg.tokensPerStep || 2.1e6;
       const defaultMetric = cfg.defaultMetric || 'agg_score_macro';
       const defaultView   = cfg.defaultView   || 'bar';
       const uid = Math.random().toString(36).slice(2, 8);
       // ─── DATASET ACCESSORS ───
@@ -522,7 +524,12 @@
           const row = rows.find(r => +r[STEP_COL] === maxStep);
           if (row) finalData.push({ name: displayName(raw), rawName: raw, value: +row[currentMetric] });
         }
-        finalData.sort((a, b) => b.value - a.value);
         const barData = finalData.filter(d => !isBaseline(d.rawName));
         const baselineData = finalData.filter(d => isBaseline(d.rawName));
@@ -1019,13 +1026,14 @@
         if (!items) return;
         items.innerHTML = '';
         const grouped = d3.group(allData, d => d[RUN_COL]);
         const sorted = Array.from(grouped.entries())
           .map(([raw, rows]) => {
             const maxStep = d3.max(rows, r => +r[STEP_COL]);
             const row = rows.find(r => +r[STEP_COL] === maxStep);
             return { raw, score: row ? +row[defaultMetric] : 0 };
           })
-          .sort((a, b) => b.score - a.score)
           .map(d => d.raw);
         sorted.filter(raw => !isBaseline(raw)).forEach(raw => {
           const name = displayName(raw);

     },
     "defaultMetric":  "agg_score_macro",                                    // optional, default: "agg_score_macro"
     "defaultView":    "bar",                                                // optional, "bar" | "line", default: "bar"
+    "sortBars":       true,                                                 // optional, bar view: true sorts bars by value (default), false keeps datasets declaration order
     "defaultSetup":   "average",                                            // optional, setup name or "average", default: "average" when ≥2 setups
     "tokensPerStep":  2100000,                                              // optional, default: 2.1e6
     "runColumn":      "runname",                                            // optional, CSV column for series, default: "runname"
       const TOKENS_PER_STEP = cfg.tokensPerStep || 2.1e6;
       const defaultMetric = cfg.defaultMetric || 'agg_score_macro';
       const defaultView   = cfg.defaultView   || 'bar';
+      const SORT_BARS     = cfg.sortBars !== false;
       const uid = Math.random().toString(36).slice(2, 8);
       // ─── DATASET ACCESSORS ───
           const row = rows.find(r => +r[STEP_COL] === maxStep);
           if (row) finalData.push({ name: displayName(raw), rawName: raw, value: +row[currentMetric] });
         }
+        if (SORT_BARS) {
+          finalData.sort((a, b) => b.value - a.value);
+        } else {
+          const order = Object.keys(DATASETS);
+          finalData.sort((a, b) => order.indexOf(a.rawName) - order.indexOf(b.rawName));
+        }
         const barData = finalData.filter(d => !isBaseline(d.rawName));
         const baselineData = finalData.filter(d => isBaseline(d.rawName));
         if (!items) return;
         items.innerHTML = '';
         const grouped = d3.group(allData, d => d[RUN_COL]);
+        const declOrder = Object.keys(DATASETS);
         const sorted = Array.from(grouped.entries())
           .map(([raw, rows]) => {
             const maxStep = d3.max(rows, r => +r[STEP_COL]);
             const row = rows.find(r => +r[STEP_COL] === maxStep);
             return { raw, score: row ? +row[defaultMetric] : 0 };
           })
+          .sort((a, b) => SORT_BARS ? b.score - a.score : declOrder.indexOf(a.raw) - declOrder.indexOf(b.raw))
           .map(d => d.raw);
         sorted.filter(raw => !isBaseline(raw)).forEach(raw => {
           const name = displayName(raw);
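
Reviewer note on the new `sortBars` option: the diff treats anything other than an explicit `false` as "sort by value", and otherwise falls back to the declaration order of the `datasets` object. A standalone sketch of that ordering rule is below. `orderBars` is a hypothetical function written for illustration and the scores are made up; only the comparator logic mirrors the diff:

```javascript
// Standalone sketch of the ordering rule, NOT the embed's actual code.
// finalData: [{ rawName, value }], datasets: config object keyed by run name.
function orderBars(finalData, datasets, sortBars) {
  const sortByValue = sortBars !== false; // undefined or true keeps the old behavior
  const out = [...finalData];             // copy so the input is not mutated
  if (sortByValue) {
    out.sort((a, b) => b.value - a.value); // highest score first
  } else {
    // Object.keys returns non-integer string keys in insertion (declaration) order.
    const order = Object.keys(datasets);
    out.sort((a, b) => order.indexOf(a.rawName) - order.indexOf(b.rawName));
  }
  return out;
}

// Made-up scores, for illustration only.
const demoDatasets = { "run-10": {}, "run-20": {}, "run-30": {} };
const demoData = [
  { rawName: "run-20", value: 0.42 },
  { rawName: "run-30", value: 0.40 },
  { rawName: "run-10", value: 0.38 },
];

console.log(orderBars(demoData, demoDatasets, false).map(d => d.rawName));
console.log(orderBars(demoData, demoDatasets, undefined).map(d => d.rawName));
```

With `sortBars: false` the bars come back as `run-10`, `run-20`, `run-30`, which is what lets the proportion sweep read left to right from 10% to 90%. One property worth keeping in mind: a run missing from `datasets` gets index `-1` from `indexOf` and therefore sorts first in declaration-order mode.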