joelniklaus (HF Staff) committed
Commit 4138808 · Parent(s): 5e8e08b

add new results

app/src/content/assets/data/benchmark-results.csv CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0359f44cbbe97ee8f7ea598152a5053a322a81af818de890606e0daa6c15fd3a
-size 1378100
+oid sha256:19eac7b4c7d51ef51fde0893bd2e5f646c501eafb4da20cff189bad2e2d45262
+size 1513555
app/src/content/assets/data/rephrasing_metadata.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cac779aca41bc6f868d99a7c7fcc43343591b40ace727098341d52285c1ff856
-size 152802
+oid sha256:232a5ab20c1eb108a06dac83379c404a5ae0489831cc7fe7011edf4d3d237afb
+size 181172
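Both data files above are tracked with Git LFS, so the diff shows only their pointer files: the actual blobs changed, and with them the recorded SHA-256 digest and byte size. As a minimal sketch of how those two fields are derived (using a hypothetical stand-in file `demo.csv`, not the real benchmark data, which lives in LFS storage):

```shell
# Derive the fields of a Git LFS pointer file from a local blob.
# demo.csv is a stand-in for illustration only.
printf 'a,b\n1,2\n' > demo.csv
oid=$(sha256sum demo.csv | cut -d' ' -f1)   # SHA-256 of the blob's raw bytes
size=$(wc -c < demo.csv)                    # blob size in bytes
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' \
  "$oid" "$size" > demo.csv.ptr
cat demo.csv.ptr
```

Git LFS writes exactly this three-line pointer into the repository and uploads the blob itself to LFS storage, which is why the commit diff changes only `oid` and `size`.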
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -10,6 +10,7 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: Integrate decay experiment as another analysis for proxy */}
 {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
 {/* TODO: brainstorm better banner, be artsy */}
+{/* TODO: add essential web as baseline (raw) */}
 {/* TODO: run variance experiments with pretraining from scratch */}
 {/* TODO: run scaling experiments with longer pretraining phase */}
 {/* TODO: filter docs before/after rephrasing (non-mathematical document for math prompt) */}
@@ -100,19 +101,21 @@ Can we design prompts that consistently beat DCLM?
 
 ### Can New Prompts Beat DCLM?
 
-Since most existing prompts fail to beat DCLM, we designed seven novel prompt formats targeting different skills ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
+Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion), [narrative](#narrative), [explanation](#explanation)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [narrative](#narrative), [explanation](#explanation), [commentary](#commentary), and [discussion](#discussion) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
 
 <HtmlEmbed
   id="new-prompts"
   src="d3-benchmark-comparison.html"
-  desc="Seven new prompts compared against the DCLM baseline."
+  desc="Nine new prompts compared against the DCLM baseline."
   config={{
     datasets: {
       "mix-fw_edu_hq-math_1b_hq": "Math",
       "mix-fw_edu_hq-table_1b_hq": "Table",
       "mix-fw_edu_hq-faq_1b_hq": "FAQ",
       "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
+      "mix-fw_edu_hq-narrative_1b_hq": "Narrative",
       "mix-fw_edu_hq-article_1b_hq": "Article",
+      "mix-fw_edu_hq-explanation_1b_hq": "Explanation",
       "mix-fw_edu_hq-commentary_1b_hq": "Commentary",
       "mix-fw_edu_hq-discussion_1b_hq": "Discussion",
       dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
@@ -243,7 +246,7 @@ Since model size barely matters, does the model family make a difference?
 
 #### Does the model family matter?
 
-We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on six prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all six prompts (see <FigRef target="model-family" />).
+We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all eight prompts (see <FigRef target="model-family" />).
 
 <Sidenote>
 We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
@@ -320,6 +323,28 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
       "mix-fw_edu_hq-math_qwen3_1.7b_hq": "Qwen3",
       dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
     }
+  },
+  "Narrative Prompt": {
+    datasets: {
+      "mix-fw_edu_hq-narrative_smollm2_1.7b_hq": "SmolLM2",
+      "mix-fw_edu_hq-narrative_falcon3_1b_hq": "Falcon3",
+      "mix-fw_edu_hq-narrative_granite3_1b_hq": "Granite3",
+      "mix-fw_edu_hq-narrative_1b_hq": "Gemma-3",
+      "mix-fw_edu_hq-narrative_llama3.2_1b_hq": "Llama-3.2",
+      "mix-fw_edu_hq-narrative_qwen3_1.7b_hq": "Qwen3",
+      dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+    }
+  },
+  "Explanation Prompt": {
+    datasets: {
+      "mix-fw_edu_hq-explanation_smollm2_1.7b_hq": "SmolLM2",
+      "mix-fw_edu_hq-explanation_falcon3_1b_hq": "Falcon3",
+      "mix-fw_edu_hq-explanation_granite3_1b_hq": "Granite3",
+      "mix-fw_edu_hq-explanation_1b_hq": "Gemma-3",
+      "mix-fw_edu_hq-explanation_llama3.2_1b_hq": "Llama-3.2",
+      "mix-fw_edu_hq-explanation_qwen3_1.7b_hq": "Qwen3",
+      dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+    }
   }
   }
   }}
@@ -583,7 +608,7 @@ Here are the key takeaways from our experiments:
 - **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
   A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
 - **Q: Can new prompts beat DCLM?**<br/>
-  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM.
+  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM. Narrative, Explanation, Article, Commentary, and Discussion do not.
 - **Q: Does model size matter?**<br/>
   A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
 - **Q: Do we need better models for low-quality data?**<br/>