Commit 4138808
Parent(s): 5e8e08b
add new results
app/src/content/assets/data/benchmark-results.csv CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:19eac7b4c7d51ef51fde0893bd2e5f646c501eafb4da20cff189bad2e2d45262
+size 1513555
app/src/content/assets/data/rephrasing_metadata.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:232a5ab20c1eb108a06dac83379c404a5ae0489831cc7fe7011edf4d3d237afb
+size 181172
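The two data files above are tracked with Git LFS, so the repository stores only a small pointer file per asset (spec version, `oid sha256:<hash>`, `size <bytes>`), and the diff shows pointer changes rather than the CSV/JSON contents. A minimal sketch of reading such a pointer, with a hypothetical `parseLfsPointer` helper that is not part of this repo:

```typescript
// Parse a Git LFS pointer file into its key/value fields.
// Each line of a pointer is "<key> <value>", e.g. "size 1513555".
function parseLfsPointer(text: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of text.trim().split("\n")) {
    const idx = line.indexOf(" ");
    fields[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return fields;
}

// The new pointer for benchmark-results.csv, as it appears in the diff.
const pointer = [
  "version https://git-lfs.github.com/spec/v1",
  "oid sha256:19eac7b4c7d51ef51fde0893bd2e5f646c501eafb4da20cff189bad2e2d45262",
  "size 1513555",
].join("\n");

const info = parseLfsPointer(pointer);
```

The actual ~1.5 MB CSV behind this oid is fetched by `git lfs pull` after cloning.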
app/src/content/chapters/3-experiments.mdx CHANGED

@@ -10,6 +10,7 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: Integrate decay experiment as another analysis for proxy */}
 {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
 {/* TODO: brainstorm better banner, be artsy */}
+{/* TODO: add essential web as baseline (raw) */}
 {/* TODO: run variance experiments with pretraining from scratch */}
 {/* TODO: run scaling experiments with longer pretraining phase */}
 {/* TODO: filter docs before/after rephrasing (non-mathematical document for math prompt) */}
@@ -100,19 +101,21 @@ Can we design prompts that consistently beat DCLM?
 
 ### Can New Prompts Beat DCLM?
 
-Since most existing prompts fail to beat DCLM, we designed
+Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion), [narrative](#narrative), [explanation](#explanation)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [narrative](#narrative), [explanation](#explanation), [commentary](#commentary), and [discussion](#discussion) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
 
 <HtmlEmbed
   id="new-prompts"
   src="d3-benchmark-comparison.html"
-  desc="
+  desc="Nine new prompts compared against the DCLM baseline."
   config={{
     datasets: {
       "mix-fw_edu_hq-math_1b_hq": "Math",
       "mix-fw_edu_hq-table_1b_hq": "Table",
       "mix-fw_edu_hq-faq_1b_hq": "FAQ",
       "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
+      "mix-fw_edu_hq-narrative_1b_hq": "Narrative",
       "mix-fw_edu_hq-article_1b_hq": "Article",
+      "mix-fw_edu_hq-explanation_1b_hq": "Explanation",
       "mix-fw_edu_hq-commentary_1b_hq": "Commentary",
       "mix-fw_edu_hq-discussion_1b_hq": "Discussion",
       dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
@@ -243,7 +246,7 @@ Since model size barely matters, does the model family make a difference?
 
 #### Does the model family matter?
 
-We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on
+We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all eight prompts (see <FigRef target="model-family" />).
 
 <Sidenote>
 We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
@@ -320,6 +323,28 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
       "mix-fw_edu_hq-math_qwen3_1.7b_hq": "Qwen3",
       dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
     }
+    },
+    "Narrative Prompt": {
+      datasets: {
+        "mix-fw_edu_hq-narrative_smollm2_1.7b_hq": "SmolLM2",
+        "mix-fw_edu_hq-narrative_falcon3_1b_hq": "Falcon3",
+        "mix-fw_edu_hq-narrative_granite3_1b_hq": "Granite3",
+        "mix-fw_edu_hq-narrative_1b_hq": "Gemma-3",
+        "mix-fw_edu_hq-narrative_llama3.2_1b_hq": "Llama-3.2",
+        "mix-fw_edu_hq-narrative_qwen3_1.7b_hq": "Qwen3",
+        dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+      }
+    },
+    "Explanation Prompt": {
+      datasets: {
+        "mix-fw_edu_hq-explanation_smollm2_1.7b_hq": "SmolLM2",
+        "mix-fw_edu_hq-explanation_falcon3_1b_hq": "Falcon3",
+        "mix-fw_edu_hq-explanation_granite3_1b_hq": "Granite3",
+        "mix-fw_edu_hq-explanation_1b_hq": "Gemma-3",
+        "mix-fw_edu_hq-explanation_llama3.2_1b_hq": "Llama-3.2",
+        "mix-fw_edu_hq-explanation_qwen3_1.7b_hq": "Qwen3",
+        dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+      }
   }
   }
 }}
@@ -583,7 +608,7 @@ Here are the key takeaways from our experiments:
 - **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
   A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
 - **Q: Can new prompts beat DCLM?**<br/>
-  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM.
+  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM. Narrative, Explanation, Article, Commentary, and Discussion do not.
 - **Q: Does model size matter?**<br/>
   A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
 - **Q: Do we need better models for low-quality data?**<br/>
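The `HtmlEmbed` config in the diff above maps internal dataset run ids to display labels, with the `dclm` baseline entry carrying extra display options. A hypothetical TypeScript shape for this config, inferred from the diff alone (the component's real prop types may differ):

```typescript
// A dataset entry is either a plain display label, or an object with
// extra display options (used here for the DCLM baseline).
type DatasetEntry =
  | string
  | { display: string; color: string; baseline: boolean };

interface BenchmarkConfig {
  datasets: Record<string, DatasetEntry>;
}

// Abbreviated example mirroring the "new prompts" embed in the diff.
const config: BenchmarkConfig = {
  datasets: {
    "mix-fw_edu_hq-math_1b_hq": "Math",
    "mix-fw_edu_hq-narrative_1b_hq": "Narrative",
    dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true },
  },
};
```

The `baseline: true` flag is what lets the chart render DCLM as the reference line that the other datasets are compared against.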
|