Commit 5240fe4 · Parent(s): 09c855a
added smollm size experiment
app/src/content/assets/data/benchmark-results.csv CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:192c9e3c649b9be030f1c84f5795badc76d6f22261c4bd33c6ac219a4e0cbc45
+size 1218699
app/src/content/chapters/experiments.mdx CHANGED

@@ -11,7 +11,6 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
 {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
 {/* TODO: Add appendix section of weird unexplainable results? */}
-{/* TODO: Add the experiment with smaller smollm2 models */}
 {/* TODO: Add the experiment with the rewire prompt at larger scales */}
 {/* TODO: also run the model size experiment for the REWIRE prompt since the original authors claim that larger models are necessary there */}
 
@@ -121,6 +120,8 @@ We want to know whether using a stronger model leads to better synthetic data. W
 
 We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts. The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.
 
+We see a similar pattern with SmolLM2 (135M, 360M, 1.7B) on the [tutorial](#tutorial) prompt: up to roughly 1B parameters there is a clear performance gradient from smaller to larger models. This confirms across model families that you need at least a ~1B-parameter model to get meaningful gains from rephrasing, but that beyond this size larger models bring no further improvements.
+
 <Sidenote>
 It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
 </Sidenote>
 
@@ -128,10 +129,10 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
 <HtmlEmbed
   id="model-size"
   src="d3-benchmark-comparison.html"
-  desc="
+  desc="Model sizes across Gemma-3 and SmolLM2. Use the Setup dropdown to compare across models and prompts."
   config={{
     setups: {
-      "
+      "Gemma-3: Tutorial": {
         datasetNames: {
           "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
           "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
 
@@ -142,7 +143,7 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
           fw_edu_hq: "FineWeb-Edu-HQ"
         }
       },
-      "
+      "Gemma-3: Math": {
         datasetNames: {
           "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
           "mix-fw_edu_hq-math_12b_hq": "Gemma-3 12B",
 
@@ -152,6 +153,15 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
           dclm: "DCLM",
           fw_edu_hq: "FineWeb-Edu-HQ"
         }
+      },
+      "SmolLM2: Tutorial": {
+        datasetNames: {
+          "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
+          "mix-fw_edu_hq-tutorial_smollm2_360m_hq": "SmolLM2 360M",
+          "mix-fw_edu_hq-tutorial_smollm2_135m_hq": "SmolLM2 135M",
+          dclm: "DCLM",
+          fw_edu_hq: "FineWeb-Edu-HQ"
+        }
       }
     }
   }}