joelniklaus (HF Staff) committed
Commit 4138808 · Parent(s): 5e8e08b

add new results

app/src/content/assets/data/benchmark-results.csv CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0359f44cbbe97ee8f7ea598152a5053a322a81af818de890606e0daa6c15fd3a
-size 1378100
+oid sha256:19eac7b4c7d51ef51fde0893bd2e5f646c501eafb4da20cff189bad2e2d45262
+size 1513555
app/src/content/assets/data/rephrasing_metadata.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:cac779aca41bc6f868d99a7c7fcc43343591b40ace727098341d52285c1ff856
-size 152802
+oid sha256:232a5ab20c1eb108a06dac83379c404a5ae0489831cc7fe7011edf4d3d237afb
+size 181172
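Both data files above are tracked with Git LFS, so the diff shows only their pointer files: the actual blobs changed, and with them the recorded SHA-256 digest and byte size. As a minimal sketch of how those two fields are derived (using a hypothetical stand-in file `demo.csv`, not the real benchmark data, which lives in LFS storage):

```shell
# Derive the fields of a Git LFS pointer file from a local blob.
# demo.csv is a stand-in for illustration only.
printf 'a,b\n1,2\n' > demo.csv
oid=$(sha256sum demo.csv | cut -d' ' -f1)   # SHA-256 of the blob's raw bytes
size=$(wc -c < demo.csv)                    # blob size in bytes
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' \
  "$oid" "$size" > demo.csv.ptr
cat demo.csv.ptr
```

Git LFS writes exactly this three-line pointer into the repository and uploads the blob itself to LFS storage, which is why the commit diff changes only `oid` and `size`.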
app/src/content/chapters/3-experiments.mdx CHANGED
@@ -10,6 +10,7 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: Integrate decay experiment as another analysis for proxy */}
 {/* TODO: share on a bunch of discords/slacks/hackernews/locallama */}
 {/* TODO: brainstorm better banner, be artsy */}
+{/* TODO: add essential web as baseline (raw) */}
 {/* TODO: run variance experiments with pretraining from scratch */}
 {/* TODO: run scaling experiments with longer pretraining phase */}
 {/* TODO: filter docs before/after rephrasing (non-mathematical document for math prompt) */}
@@ -100,19 +101,21 @@ Can we design prompts that consistently beat DCLM?
 
 ### Can New Prompts Beat DCLM?
 
-Since most existing prompts fail to beat DCLM, we designed seven novel prompt formats targeting different skills ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [commentary](#commentary), and [discussion](#discussion) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
+Since most existing prompts fail to beat DCLM, we designed nine novel prompt formats targeting different skills ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial), [article](#article), [commentary](#commentary), [discussion](#discussion), [narrative](#narrative), [explanation](#explanation)), all using Gemma-3-1B on FineWeb-Edu-HQ. Four prompts ([math](#math), [table](#table), [faq](#faq), [tutorial](#tutorial)) outperform DCLM, while [article](#article), [narrative](#narrative), [explanation](#explanation), [commentary](#commentary), and [discussion](#discussion) are at or below DCLM level (see <FigRef target="new-prompts" />). The best-performing prompts all restructure the source content into pedagogically rich formats.
 
 <HtmlEmbed
   id="new-prompts"
   src="d3-benchmark-comparison.html"
-  desc="Seven new prompts compared against the DCLM baseline."
+  desc="Nine new prompts compared against the DCLM baseline."
   config={{
     datasets: {
       "mix-fw_edu_hq-math_1b_hq": "Math",
       "mix-fw_edu_hq-table_1b_hq": "Table",
       "mix-fw_edu_hq-faq_1b_hq": "FAQ",
       "mix-fw_edu_hq-tutorial_1b_hq": "Tutorial",
+      "mix-fw_edu_hq-narrative_1b_hq": "Narrative",
       "mix-fw_edu_hq-article_1b_hq": "Article",
+      "mix-fw_edu_hq-explanation_1b_hq": "Explanation",
       "mix-fw_edu_hq-commentary_1b_hq": "Commentary",
       "mix-fw_edu_hq-discussion_1b_hq": "Discussion",
       dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
@@ -243,7 +246,7 @@ Since model size barely matters, does the model family make a difference?
 
 #### Does the model family matter?
 
-We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on six prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all six prompts (see <FigRef target="model-family" />).
+We test six model families (SmolLM2, Falcon3 [@falcon3], Qwen3, Gemma-3, Granite3 [@granite3], Llama-3.2) at ~1B scale on eight prompts. Use the Setup dropdown to compare across prompts. SmolLM2 consistently and clearly outperforms all others across all eight prompts (see <FigRef target="model-family" />).
 
 <Sidenote>
 We hypothesize that SmolLM2's consistently strong rephrasing performance originates from explicit [rewrite tasks](https://huggingface.co/datasets/HuggingFaceTB/smoltalk/viewer/smol-rewrite?row=0&views%5B%5D=smol_rewrite_train) in its instruction tuning data (smoltalk). This would mean the model already "knows" how to rewrite well before we even prompt it.
@@ -320,6 +323,28 @@ We hypothesize that SmolLM2's consistently strong rephrasing performance origina
       "mix-fw_edu_hq-math_qwen3_1.7b_hq": "Qwen3",
       dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
     }
+  },
+  "Narrative Prompt": {
+    datasets: {
+      "mix-fw_edu_hq-narrative_smollm2_1.7b_hq": "SmolLM2",
+      "mix-fw_edu_hq-narrative_falcon3_1b_hq": "Falcon3",
+      "mix-fw_edu_hq-narrative_granite3_1b_hq": "Granite3",
+      "mix-fw_edu_hq-narrative_1b_hq": "Gemma-3",
+      "mix-fw_edu_hq-narrative_llama3.2_1b_hq": "Llama-3.2",
+      "mix-fw_edu_hq-narrative_qwen3_1.7b_hq": "Qwen3",
+      dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+    }
+  },
+  "Explanation Prompt": {
+    datasets: {
+      "mix-fw_edu_hq-explanation_smollm2_1.7b_hq": "SmolLM2",
+      "mix-fw_edu_hq-explanation_falcon3_1b_hq": "Falcon3",
+      "mix-fw_edu_hq-explanation_granite3_1b_hq": "Granite3",
+      "mix-fw_edu_hq-explanation_1b_hq": "Gemma-3",
+      "mix-fw_edu_hq-explanation_llama3.2_1b_hq": "Llama-3.2",
+      "mix-fw_edu_hq-explanation_qwen3_1.7b_hq": "Qwen3",
+      dclm: { display: "Baseline (DCLM)", color: "#8b8b8b", baseline: true }
+    }
   }
   }
   }}
@@ -583,7 +608,7 @@ Here are the key takeaways from our experiments:
 - **Q: Which individual prompts from the synthetic baselines match DCLM?**<br/>
   A: Only Diverse QA Pairs and REWIRE's Guided Rewrite.
 - **Q: Can new prompts beat DCLM?**<br/>
-  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM.
+  A: Yes. Math, Table, FAQ, and Tutorial all outperform DCLM. Narrative, Explanation, Article, Commentary, and Discussion do not.
 - **Q: Does model size matter?**<br/>
   A: Not much. 1B is sufficient for simple prompts, 4B for complex ones.
 - **Q: Do we need better models for low-quality data?**<br/>