joelniklaus HF Staff committed on
Commit
d22ec15
·
1 Parent(s): b3215eb

add synthetic data proportion experiment

app/src/content/chapters/1-introduction.mdx CHANGED
@@ -22,7 +22,7 @@ Reading time: One weekend
22
  config={{
23
  defaultView: "line",
24
  datasets: {
25
- "mix-fw_edu_hq-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
26
  cosmopedia: { display: "Cosmopedia", color: "#e15759" },
27
  nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
28
  rewire: { display: "REWIRE", color: "#1877F2" },
@@ -30,10 +30,13 @@ Reading time: One weekend
30
  },
31
  speedupAnnotation: {
32
  baselineRun: "nemotron_hq_synth",
33
- targetRun: "mix-fw_edu_hq-table_smollm2_1.7b_hq"
34
  }
35
  }}
36
  />
 
 
 
37
 
38
  If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinitymanifesto; @arceetrinitylarge]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:
39
 
 
22
  config={{
23
  defaultView: "line",
24
  datasets: {
25
+ "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
26
  cosmopedia: { display: "Cosmopedia", color: "#e15759" },
27
  nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
28
  rewire: { display: "REWIRE", color: "#1877F2" },
 
30
  },
31
  speedupAnnotation: {
32
  baselineRun: "nemotron_hq_synth",
33
+ targetRun: "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq"
34
  }
35
  }}
36
  />
37
+ <Sidenote>
38
+ FinePhrase (table) here uses the best mixing ratio we found, 70% synthetic; see [how much synthetic data to mix](#how-much-synthetic-data-should-you-mix) for the full sweep.
39
+ </Sidenote>
40
 
41
  If you read some of the latest LLM papers (e.g., Nemotron 3 [@nemotron3], Qwen3 [@qwen3], Phi-4 [@phi4], Arcee Trinity [@arceetrinitymanifesto; @arceetrinitylarge]), you may have noticed that synthetic data has become a key component for LLM training data. It is quickly becoming one of the standard tools for building high quality datasets for LLM training. If we look back we can see several paradigm shifts for LLM data, especially for pretraining, and synthetic data is the natural latest step:
42
 
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -452,7 +452,90 @@ Unfortunately, synthetic-only training falls short of both DCLM and mixed traini
452
 
453
  The per-benchmark view sharpens the picture. The benchmarks that benefit most from mixing are HellaSwag (+0.5 to +1.3pp) and, for most prompts, SQuAD (+4 to +12pp for Tutorial and FAQ). GSM8K doesn't move at all. The "always mix with original data" takeaway is driven primarily by commonsense recovery, not a uniform lift across all skills.
454
 
455
- OK, so we need to mix in original data. But how much does the specific choice of mix-in dataset affect performance?
456
 
457
  #### Does the mix-in dataset matter?
458
 
@@ -622,6 +705,7 @@ Putting together our findings on synthetic-only training, mix-in choice, source
622
 
623
  <Note title="Summary: Impact of the Dataset Choices" variant="info">
624
  **Synthetic-only**: Not enough. Always mix with original data.<br/>
 
625
  **Mix-in dataset**: Major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge). Best choice depends on source quality.<br/>
626
  **Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
627
  **Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
@@ -674,6 +758,7 @@ Let's step back and summarize what we learned:
674
  <tr><td>Does the model family matter?</td><td>Yes. SmolLM2 dominates across all prompts.</td></tr>
675
  <tr><td>Does the model generation matter?</td><td>Slightly. Newer Qwen versions trend better.</td></tr>
676
  <tr><td>Is synthetic data enough?</td><td>No. Always mix synthetic with original data.</td></tr>
 
677
  <tr><td>Does the mix-in dataset matter?</td><td>Yes, a major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge), and the best choice depends on source data quality.</td></tr>
678
  <tr><td>Does the source dataset matter?</td><td>Not with a strong mix-in. Even low-quality sources produce competitive results.</td></tr>
679
  <tr><td>Does increased diversity help?</td><td>No, performance averages rather than compounds.</td></tr>
 
452
 
453
  The per-benchmark view sharpens the picture. The benchmarks that benefit most from mixing are HellaSwag (+0.5 to +1.3pp) and, for most prompts, SQuAD (+4 to +12pp for Tutorial and FAQ). GSM8K doesn't move at all. The "always mix with original data" takeaway is driven primarily by commonsense recovery, not a uniform lift across all skills.
454
 
455
+ OK, so we always mix in original data. But every experiment so far has split synthetic and original evenly, 50/50, without ever questioning that ratio. How much synthetic data is actually optimal?
456
+
457
+ #### How Much Synthetic Data Should You Mix?
458
+
459
+ To find out, we sweep the synthetic fraction from 10% to 90% in 10% steps for each of our four winning prompts ([faq](#faq), [math](#math), [table](#table), [tutorial](#tutorial)), holding the token budget fixed and changing only the blend. The generator is SmolLM2-1.7B, already our best rephraser by this point, and the mix-in stays FineWeb-Edu-HQ. Use the Setup dropdown to switch prompts and the Metric dropdown to drill into individual benchmarks:
460
+
461
+ <HtmlEmbed
462
+ id="proportion-sweep"
463
+ src="d3-benchmark-comparison.html"
464
+ desc="Downstream performance as the synthetic fraction sweeps from 10% to 90% for each of the four winning prompts, mixed with FineWeb-Edu-HQ. Final Score shows the last checkpoint per fraction; Training Progression shows the full training curves. Use the Setup dropdown to switch prompts and the Metric dropdown for individual benchmarks."
465
+ config={{
466
+ defaultView: "bar",
467
+ defaultMetric: "agg_score_macro",
468
+ sortBars: false,
469
+ setups: {
470
+ "Math": {
471
+ datasets: {
472
+ "mix-0.9-fw_edu_hq-0.1-math_smollm2_1.7b_hq": { display: "10%", color: "#4575b4" },
473
+ "mix-0.8-fw_edu_hq-0.2-math_smollm2_1.7b_hq": { display: "20%", color: "#576dab" },
474
+ "mix-0.7-fw_edu_hq-0.3-math_smollm2_1.7b_hq": { display: "30%", color: "#6964a2" },
475
+ "mix-0.6-fw_edu_hq-0.4-math_smollm2_1.7b_hq": { display: "40%", color: "#7b5c99" },
476
+ "mix-0.5-fw_edu_hq-0.5-math_smollm2_1.7b_hq": { display: "50%", color: "#8d5490" },
477
+ "mix-0.4-fw_edu_hq-0.6-math_smollm2_1.7b_hq": { display: "60%", color: "#a04c87" },
478
+ "mix-0.3-fw_edu_hq-0.7-math_smollm2_1.7b_hq": { display: "70%", color: "#b2437e" },
479
+ "mix-0.2-fw_edu_hq-0.8-math_smollm2_1.7b_hq": { display: "80%", color: "#c43b75" },
480
+ "mix-0.1-fw_edu_hq-0.9-math_smollm2_1.7b_hq": { display: "90%", color: "#d6336c" },
481
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
482
+ }
483
+ },
484
+ "Table": {
485
+ datasets: {
486
+ "mix-0.9-fw_edu_hq-0.1-table_smollm2_1.7b_hq": { display: "10%", color: "#4575b4" },
487
+ "mix-0.8-fw_edu_hq-0.2-table_smollm2_1.7b_hq": { display: "20%", color: "#576dab" },
488
+ "mix-0.7-fw_edu_hq-0.3-table_smollm2_1.7b_hq": { display: "30%", color: "#6964a2" },
489
+ "mix-0.6-fw_edu_hq-0.4-table_smollm2_1.7b_hq": { display: "40%", color: "#7b5c99" },
490
+ "mix-0.5-fw_edu_hq-0.5-table_smollm2_1.7b_hq": { display: "50%", color: "#8d5490" },
491
+ "mix-0.4-fw_edu_hq-0.6-table_smollm2_1.7b_hq": { display: "60%", color: "#a04c87" },
492
+ "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq": { display: "70%", color: "#b2437e" },
493
+ "mix-0.2-fw_edu_hq-0.8-table_smollm2_1.7b_hq": { display: "80%", color: "#c43b75" },
494
+ "mix-0.1-fw_edu_hq-0.9-table_smollm2_1.7b_hq": { display: "90%", color: "#d6336c" },
495
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
496
+ }
497
+ },
498
+ "FAQ": {
499
+ datasets: {
500
+ "mix-0.9-fw_edu_hq-0.1-faq_smollm2_1.7b_hq": { display: "10%", color: "#4575b4" },
501
+ "mix-0.8-fw_edu_hq-0.2-faq_smollm2_1.7b_hq": { display: "20%", color: "#576dab" },
502
+ "mix-0.7-fw_edu_hq-0.3-faq_smollm2_1.7b_hq": { display: "30%", color: "#6964a2" },
503
+ "mix-0.6-fw_edu_hq-0.4-faq_smollm2_1.7b_hq": { display: "40%", color: "#7b5c99" },
504
+ "mix-0.5-fw_edu_hq-0.5-faq_smollm2_1.7b_hq": { display: "50%", color: "#8d5490" },
505
+ "mix-0.4-fw_edu_hq-0.6-faq_smollm2_1.7b_hq": { display: "60%", color: "#a04c87" },
506
+ "mix-0.3-fw_edu_hq-0.7-faq_smollm2_1.7b_hq": { display: "70%", color: "#b2437e" },
507
+ "mix-0.2-fw_edu_hq-0.8-faq_smollm2_1.7b_hq": { display: "80%", color: "#c43b75" },
508
+ "mix-0.1-fw_edu_hq-0.9-faq_smollm2_1.7b_hq": { display: "90%", color: "#d6336c" },
509
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
510
+ }
511
+ },
512
+ "Tutorial": {
513
+ datasets: {
514
+ "mix-0.9-fw_edu_hq-0.1-tutorial_smollm2_1.7b_hq": { display: "10%", color: "#4575b4" },
515
+ "mix-0.8-fw_edu_hq-0.2-tutorial_smollm2_1.7b_hq": { display: "20%", color: "#576dab" },
516
+ "mix-0.7-fw_edu_hq-0.3-tutorial_smollm2_1.7b_hq": { display: "30%", color: "#6964a2" },
517
+ "mix-0.6-fw_edu_hq-0.4-tutorial_smollm2_1.7b_hq": { display: "40%", color: "#7b5c99" },
518
+ "mix-0.5-fw_edu_hq-0.5-tutorial_smollm2_1.7b_hq": { display: "50%", color: "#8d5490" },
519
+ "mix-0.4-fw_edu_hq-0.6-tutorial_smollm2_1.7b_hq": { display: "60%", color: "#a04c87" },
520
+ "mix-0.3-fw_edu_hq-0.7-tutorial_smollm2_1.7b_hq": { display: "70%", color: "#b2437e" },
521
+ "mix-0.2-fw_edu_hq-0.8-tutorial_smollm2_1.7b_hq": { display: "80%", color: "#c43b75" },
522
+ "mix-0.1-fw_edu_hq-0.9-tutorial_smollm2_1.7b_hq": { display: "90%", color: "#d6336c" },
523
+ dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
524
+ }
525
+ }
526
+ }
527
+ }}
528
+ />
529
+
530
+ Every prompt has a real optimum strictly between pure original and pure synthetic, and where that optimum sits is format-dependent: [math](#math) wants the most synthetic data (80%), [table](#table) peaks at 70%, and [faq](#faq) and [tutorial](#tutorial) top out at 60%. So the uniform 1/N weighting we (and most prior work) reach for by default lands in a sensible range only by accident. Encouragingly, the curves climb to their peak and then plateau or dip slightly rather than collapsing, so anywhere in the 60-80% band is a safe choice across all four formats.
531
+
532
+ [Table](#table) is the strongest 2-way pairing overall: it peaks clearly above the DCLM baseline and edges out [math](#math), with [faq](#faq) and [tutorial](#tutorial) trailing. Switch the Metric dropdown to see where the gains come from. They concentrate in reading comprehension, where SQuAD climbs steadily as the synthetic share grows, and in GSM8K for the math prompt, while commonsense and world-knowledge benchmarks like HellaSwag, PIQA, and MMLU stay essentially flat. It is the same knowledge-for-commonsense trade we have seen all along: synthetic data buys targeted reading and math skills, not broad world knowledge.
533
+
534
+ <Sidenote>
535
+ Why did every other experiment use a flat 50/50 split? A fixed ratio keeps those comparisons fair, otherwise we would be measuring ratio tuning as much as the prompt or model itself. And 50/50 is benign: it sits on the plateau within a percentage point of every optimum and preserves the prompt ranking.
536
+ </Sidenote>
537
+
538
+ We now know roughly how much synthetic data to mix in. But we have only varied the amount, not the partner: does the specific choice of mix-in dataset matter as much as the ratio?
539
 
540
  #### Does the mix-in dataset matter?
541
 
 
705
 
706
  <Note title="Summary: Impact of the Dataset Choices" variant="info">
707
  **Synthetic-only**: Not enough. Always mix with original data.<br/>
708
+ **Synthetic fraction**: 60-80% synthetic is optimal and format-dependent (math 80%, table 70%, faq/tutorial 60%). Uniform 1/N is only mid-range by accident.<br/>
709
  **Mix-in dataset**: Major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge). Best choice depends on source quality.<br/>
710
  **Source dataset**: Secondary. With a strong mix-in, even low-quality sources work.<br/>
711
  **Diversity**: Does not compound at 20B token scale. Performance averages rather than improves.<br/>
 
758
  <tr><td>Does the model family matter?</td><td>Yes. SmolLM2 dominates across all prompts.</td></tr>
759
  <tr><td>Does the model generation matter?</td><td>Slightly. Newer Qwen versions trend better.</td></tr>
760
  <tr><td>Is synthetic data enough?</td><td>No. Always mix synthetic with original data.</td></tr>
761
+ <tr><td>How much synthetic data should you mix?</td><td>60-80%, format-dependent (math 80%, table 70%, faq/tutorial 60%). Uniform 1/N was only mid-range by accident.</td></tr>
762
  <tr><td>Does the mix-in dataset matter?</td><td>Yes, a major performance driver. DCLM and FineWeb-Edu-HQ have complementary strengths (commonsense vs knowledge), and the best choice depends on source data quality.</td></tr>
763
  <tr><td>Does the source dataset matter?</td><td>Not with a strong mix-in. Even low-quality sources produce competitive results.</td></tr>
764
  <tr><td>Does increased diversity help?</td><td>No, performance averages rather than compounds.</td></tr>
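
The run names in the `proportion-sweep` config above follow a regular pattern: original fraction, mix-in dataset, synthetic fraction, prompt, generator. As a rough illustration, the nine entries per prompt could be generated rather than typed out. This is only a sketch: `sweepDatasets` is a hypothetical helper that is not part of the commit, the naming pattern is inferred from the keys shown in the diff, and the per-step colors are omitted.

```javascript
// Hypothetical helper (not in the commit): builds the `datasets` map for one
// prompt of the sweep. The run-name pattern is inferred from the diff above.
function sweepDatasets(prompt, generator = "smollm2_1.7b_hq", mixIn = "fw_edu_hq") {
  const datasets = {};
  for (let i = 1; i <= 9; i++) {
    // Step over integers to avoid floating-point drift such as 0.30000000000000004.
    const synthetic = (i / 10).toFixed(1);
    const original = ((10 - i) / 10).toFixed(1);
    datasets[`mix-${original}-${mixIn}-${synthetic}-${prompt}_${generator}`] = {
      display: `${i * 10}%`, // label shows the synthetic share
    };
  }
  return datasets;
}

const tableSweep = sweepDatasets("table");
console.log(Object.keys(tableSweep)[6]);
// prints "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq"
```

Because the keys are inserted from 10% to 90%, the resulting object keeps the same declaration order that `sortBars: false` relies on.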
app/src/content/chapters/6-finephrase.mdx CHANGED
@@ -175,19 +175,19 @@ Browse some real examples from FinePhrase below. Each sample shows the original
175
 
176
  ### How Does FinePhrase Compare?
177
 
178
- In the introduction we showed a single FinePhrase prompt (table) against the baselines. Now that the full dataset is built, here's how all four FinePhrase prompts stack up against the strongest synthetic data baselines:
179
 
180
  <HtmlEmbed
181
  id="finephrase-all-prompts"
182
  src="d3-benchmark-comparison.html"
183
- desc="All four FinePhrase prompts compared against synthetic data baselines across evaluation metrics."
184
  config={{
185
  defaultView: "line",
186
  datasets: {
187
- "mix-fw_edu_hq-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
188
- "mix-fw_edu_hq-math_smollm2_1.7b_hq": { display: "FinePhrase (math)", color: "#E09530" },
189
- "mix-fw_edu_hq-faq_smollm2_1.7b_hq": { display: "FinePhrase (faq)", color: "#D58228" },
190
- "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": { display: "FinePhrase (tutorial)", color: "#CA7020" },
191
  cosmopedia: { display: "Cosmopedia", color: "#e15759" },
192
  nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
193
  rewire: { display: "REWIRE", color: "#1877F2" },
 
175
 
176
  ### How Does FinePhrase Compare?
177
 
178
+ In the introduction we teased our single best configuration, the table prompt at its optimal 70% synthetic ratio. Now that the full dataset is built, here's how all four FinePhrase prompts stack up against the strongest synthetic data baselines, each mixed with FineWeb-Edu-HQ at its own best ratio from the [synthetic-fraction sweep](#how-much-synthetic-data-should-you-mix):
179
 
180
  <HtmlEmbed
181
  id="finephrase-all-prompts"
182
  src="d3-benchmark-comparison.html"
183
+ desc="All four FinePhrase prompts, each at its best mixing ratio, compared against synthetic data baselines across evaluation metrics."
184
  config={{
185
  defaultView: "line",
186
  datasets: {
187
+ "mix-0.3-fw_edu_hq-0.7-table_smollm2_1.7b_hq": { display: "FinePhrase (table)", color: "#EBA937" },
188
+ "mix-0.2-fw_edu_hq-0.8-math_smollm2_1.7b_hq": { display: "FinePhrase (math)", color: "#E09530" },
189
+ "mix-0.4-fw_edu_hq-0.6-faq_smollm2_1.7b_hq": { display: "FinePhrase (faq)", color: "#D58228" },
190
+ "mix-0.4-fw_edu_hq-0.6-tutorial_smollm2_1.7b_hq": { display: "FinePhrase (tutorial)", color: "#CA7020" },
191
  cosmopedia: { display: "Cosmopedia", color: "#e15759" },
192
  nemotron_hq_synth: { display: "Nemotron-HQ-Synth", color: "#76b900" },
193
  rewire: { display: "REWIRE", color: "#1877F2" },
app/src/content/embeds/d3-benchmark-comparison.html CHANGED
@@ -18,6 +18,7 @@
18
  },
19
  "defaultMetric": "agg_score_macro", // optional, default: "agg_score_macro"
20
  "defaultView": "bar", // optional, "bar" | "line", default: "bar"
 
21
  "defaultSetup": "average", // optional, setup name or "average", default: "average" when ≥2 setups
22
  "tokensPerStep": 2100000, // optional, default: 2.1e6
23
  "runColumn": "runname", // optional, CSV column for series, default: "runname"
@@ -282,6 +283,7 @@
282
  const TOKENS_PER_STEP = cfg.tokensPerStep || 2.1e6;
283
  const defaultMetric = cfg.defaultMetric || 'agg_score_macro';
284
  const defaultView = cfg.defaultView || 'bar';
 
285
  const uid = Math.random().toString(36).slice(2, 8);
286
 
287
  // ─── DATASET ACCESSORS ───
@@ -522,7 +524,12 @@
522
  const row = rows.find(r => +r[STEP_COL] === maxStep);
523
  if (row) finalData.push({ name: displayName(raw), rawName: raw, value: +row[currentMetric] });
524
  }
525
- finalData.sort((a, b) => b.value - a.value);
 
 
 
 
 
526
 
527
  const barData = finalData.filter(d => !isBaseline(d.rawName));
528
  const baselineData = finalData.filter(d => isBaseline(d.rawName));
@@ -1019,13 +1026,14 @@
1019
  if (!items) return;
1020
  items.innerHTML = '';
1021
  const grouped = d3.group(allData, d => d[RUN_COL]);
 
1022
  const sorted = Array.from(grouped.entries())
1023
  .map(([raw, rows]) => {
1024
  const maxStep = d3.max(rows, r => +r[STEP_COL]);
1025
  const row = rows.find(r => +r[STEP_COL] === maxStep);
1026
  return { raw, score: row ? +row[defaultMetric] : 0 };
1027
  })
1028
- .sort((a, b) => b.score - a.score)
1029
  .map(d => d.raw);
1030
  sorted.filter(raw => !isBaseline(raw)).forEach(raw => {
1031
  const name = displayName(raw);
 
18
  },
19
  "defaultMetric": "agg_score_macro", // optional, default: "agg_score_macro"
20
  "defaultView": "bar", // optional, "bar" | "line", default: "bar"
21
+ "sortBars": true, // optional, bar view: true sorts bars by value (default), false keeps datasets declaration order
22
  "defaultSetup": "average", // optional, setup name or "average", default: "average" when ≥2 setups
23
  "tokensPerStep": 2100000, // optional, default: 2.1e6
24
  "runColumn": "runname", // optional, CSV column for series, default: "runname"
 
283
  const TOKENS_PER_STEP = cfg.tokensPerStep || 2.1e6;
284
  const defaultMetric = cfg.defaultMetric || 'agg_score_macro';
285
  const defaultView = cfg.defaultView || 'bar';
286
+ const SORT_BARS = cfg.sortBars !== false;
287
  const uid = Math.random().toString(36).slice(2, 8);
288
 
289
  // ─── DATASET ACCESSORS ───
 
524
  const row = rows.find(r => +r[STEP_COL] === maxStep);
525
  if (row) finalData.push({ name: displayName(raw), rawName: raw, value: +row[currentMetric] });
526
  }
527
+ if (SORT_BARS) {
528
+ finalData.sort((a, b) => b.value - a.value);
529
+ } else {
530
+ const order = Object.keys(DATASETS);
531
+ finalData.sort((a, b) => order.indexOf(a.rawName) - order.indexOf(b.rawName));
532
+ }
533
 
534
  const barData = finalData.filter(d => !isBaseline(d.rawName));
535
  const baselineData = finalData.filter(d => isBaseline(d.rawName));
 
1026
  if (!items) return;
1027
  items.innerHTML = '';
1028
  const grouped = d3.group(allData, d => d[RUN_COL]);
1029
+ const declOrder = Object.keys(DATASETS);
1030
  const sorted = Array.from(grouped.entries())
1031
  .map(([raw, rows]) => {
1032
  const maxStep = d3.max(rows, r => +r[STEP_COL]);
1033
  const row = rows.find(r => +r[STEP_COL] === maxStep);
1034
  return { raw, score: row ? +row[defaultMetric] : 0 };
1035
  })
1036
+ .sort((a, b) => SORT_BARS ? b.score - a.score : declOrder.indexOf(a.raw) - declOrder.indexOf(b.raw))
1037
  .map(d => d.raw);
1038
  sorted.filter(raw => !isBaseline(raw)).forEach(raw => {
1039
  const name = displayName(raw);
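
The ordering change in `d3-benchmark-comparison.html` can be read in isolation as a small pure function. The sketch below is a hypothetical standalone version, not the embed's actual code: `orderBars` and its parameters are illustrative names. It differs from the diff in one labeled way: it looks up positions through a `Map` and places runs missing from `datasets` last, whereas the `indexOf` comparison in the commit would give such runs `-1` and sort them first.

```javascript
// Hypothetical standalone version of the bar-ordering logic from the diff.
// `finalData` items look like { rawName, value }; `datasets` is the config map.
function orderBars(finalData, datasets, sortBars = true) {
  const rows = [...finalData]; // copy so the caller's array is not mutated
  if (sortBars !== false) {
    // Default behaviour: highest value first.
    return rows.sort((a, b) => b.value - a.value);
  }
  // Declaration order: Object.keys preserves insertion order for
  // non-integer-like string keys such as the run names used here.
  const rank = new Map(Object.keys(datasets).map((name, i) => [name, i]));
  // Assumption of this sketch: runs not declared in `datasets` go last.
  const pos = (name) => (rank.has(name) ? rank.get(name) : rank.size);
  return rows.sort((a, b) => pos(a.rawName) - pos(b.rawName));
}

const demo = [
  { rawName: "mix-0.8", value: 0.41 },
  { rawName: "mix-0.9", value: 0.39 },
  { rawName: "mix-0.7", value: 0.43 },
];
const declared = { "mix-0.9": {}, "mix-0.8": {}, "mix-0.7": {} };
console.log(orderBars(demo, declared, false).map((d) => d.rawName));
console.log(orderBars(demo, declared, true).map((d) => d.rawName));
```

With `sortBars: false` the bars follow the 10% to 90% declaration order, which is what lets the sweep chart read as a curve instead of a ranking.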