joelniklaus (HF Staff) committed
Commit 0f45b18 · 1 Parent(s): 5240fe4

add guided rewrite experiment

app/src/content/assets/data/benchmark-results.csv CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:192c9e3c649b9be030f1c84f5795badc76d6f22261c4bd33c6ac219a4e0cbc45
- size 1218699
+ oid sha256:4ff88dedc4e0c1d7dd13f29a3bd9a68072119f1c2c5a9c48f7a6f2c893778615
+ size 1245861
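The CSV above is tracked with Git LFS, so the diff only touches the pointer file: a `version` line, the `oid` (SHA-256 digest of the real file), and its `size` in bytes. A minimal sketch of parsing such a spec-v1 pointer — the `parseLfsPointer` helper is an illustration, not part of this repo:

```typescript
// Sketch: parse a Git LFS pointer file (spec v1), assuming the
// three "key value" lines shown in the diff above.
function parseLfsPointer(text: string): { oid: string; size: number } {
  const fields = new Map<string, string>();
  for (const line of text.trim().split("\n")) {
    const space = line.indexOf(" ");
    fields.set(line.slice(0, space), line.slice(space + 1));
  }
  if (fields.get("version") !== "https://git-lfs.github.com/spec/v1") {
    throw new Error("not a spec-v1 LFS pointer");
  }
  // oid is "sha256:<hex digest>"; size is the byte count of the actual file.
  return {
    oid: fields.get("oid")!.replace("sha256:", ""),
    size: Number(fields.get("size")!),
  };
}

const pointer = [
  "version https://git-lfs.github.com/spec/v1",
  "oid sha256:4ff88dedc4e0c1d7dd13f29a3bd9a68072119f1c2c5a9c48f7a6f2c893778615",
  "size 1245861",
].join("\n");

console.log(parseLfsPointer(pointer).size); // 1245861
```

The size bump (1218699 → 1245861 bytes) is consistent with new benchmark rows being appended for the guided-rewrite runs.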
app/src/content/chapters/experiments.mdx CHANGED
@@ -11,8 +11,6 @@ import FigRef from "../../components/FigRef.astro";
  {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
  {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
  {/* TODO: Add appendix section of weird unexplainable results? */}
- {/* TODO: Add the experiment with the rewire prompt at larger scales */}
- {/* TODO: also run the model size experiment for the REWIRE prompt since the original authors claim that larger models are necessary there */}

  ## Experiments

@@ -118,9 +116,12 @@ We want to know whether using a stronger model leads to better synthetic data. W

  #### Does the model size matter?

- We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts. The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.
-
- We see a similar pattern with SmolLM2 (135M, 360M, 1.7B) on the [tutorial](#tutorial) prompt: up to the 1B size, we see a clear performance gradient from smaller to larger models. This confirms across model families that you need at least a ~1B parameter model to get meaningful gains from rephrasing but after that there are no further improvements with larger models.
+ We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial), [math](#math), and REWIRE's [guided_rewrite](#guided_rewrite_original) prompts (use the Setup dropdown in <FigRef target="model-size" /> to switch between them).
+ For [tutorial](#tutorial) and [math](#math), the 270M model underperforms, but 1B through 27B show no significant difference.
+ SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
+ The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
+ This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
+ The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), larger models do not improve synthetic data quality.

  <Sidenote>
  It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
@@ -154,6 +155,17 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
  fw_edu_hq: "FineWeb-Edu-HQ"
  }
  },
+ "Gemma-3: REWIRE": {
+ datasetNames: {
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq": "Gemma-3 27B",
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq": "Gemma-3 12B",
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq": "Gemma-3 4B",
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Gemma-3 1B",
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq": "Gemma-3 270M",
+ dclm: "DCLM",
+ fw_edu_hq: "FineWeb-Edu-HQ"
+ }
+ },
  "SmolLM2: Tutorial": {
  datasetNames: {
  "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
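The hunk above adds a new entry to the figure's setup config, mapping raw dataset keys to human-readable chart labels. A minimal sketch of how such a `datasetNames` mapping might be consumed — the `setups` shape and `resolveLabel` helper are assumptions for illustration, not the actual chart component's API:

```typescript
// Hypothetical sketch of resolving display names from a setup config
// like the "Gemma-3: REWIRE" entry added in this commit.
const setups: Record<string, { datasetNames: Record<string, string> }> = {
  "Gemma-3: REWIRE": {
    datasetNames: {
      "mix-fw_edu_hq-guided_rewrite_original_27b_hq": "Gemma-3 27B",
      "mix-fw_edu_hq-guided_rewrite_original_4b_hq": "Gemma-3 4B",
      dclm: "DCLM",
      fw_edu_hq: "FineWeb-Edu-HQ",
    },
  },
};

// Fall back to the raw dataset key when no display name is configured,
// so unmapped series still render with an identifiable label.
function resolveLabel(setup: string, datasetKey: string): string {
  return setups[setup]?.datasetNames[datasetKey] ?? datasetKey;
}

console.log(resolveLabel("Gemma-3: REWIRE", "dclm")); // "DCLM"
```

Keying setups by dropdown title keeps each prompt/model-family comparison self-contained, which is why adding the REWIRE experiment only requires appending one config block.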