joelniklaus (HF Staff) committed
Commit 0f45b18 · 1 Parent(s): 5240fe4

add guided rewrite experiment

app/src/content/assets/data/benchmark-results.csv CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:192c9e3c649b9be030f1c84f5795badc76d6f22261c4bd33c6ac219a4e0cbc45
- size 1218699
+ oid sha256:4ff88dedc4e0c1d7dd13f29a3bd9a68072119f1c2c5a9c48f7a6f2c893778615
+ size 1245861
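The CSV above is tracked with Git LFS, so the diff only touches the pointer file: a `version` line, the `oid` (SHA-256 digest of the real file), and its `size` in bytes. A minimal sketch of parsing such a spec-v1 pointer — the `parseLfsPointer` helper is an illustration, not part of this repo:

```typescript
// Sketch: parse a Git LFS pointer file (spec v1), assuming the
// three "key value" lines shown in the diff above.
function parseLfsPointer(text: string): { oid: string; size: number } {
  const fields = new Map<string, string>();
  for (const line of text.trim().split("\n")) {
    const space = line.indexOf(" ");
    fields.set(line.slice(0, space), line.slice(space + 1));
  }
  if (fields.get("version") !== "https://git-lfs.github.com/spec/v1") {
    throw new Error("not a spec-v1 LFS pointer");
  }
  // oid is "sha256:<hex digest>"; size is the byte count of the actual file.
  return {
    oid: fields.get("oid")!.replace("sha256:", ""),
    size: Number(fields.get("size")!),
  };
}

const pointer = [
  "version https://git-lfs.github.com/spec/v1",
  "oid sha256:4ff88dedc4e0c1d7dd13f29a3bd9a68072119f1c2c5a9c48f7a6f2c893778615",
  "size 1245861",
].join("\n");

console.log(parseLfsPointer(pointer).size); // 1245861
```

The size bump (1218699 → 1245861 bytes) is consistent with new benchmark rows being appended for the guided-rewrite runs.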
app/src/content/chapters/experiments.mdx CHANGED
@@ -11,8 +11,6 @@ import FigRef from "../../components/FigRef.astro";
  {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
  {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
  {/* TODO: Add appendix section of weird unexplainable results? */}
- {/* TODO: Add the experiment with the rewire prompt at larger scales */}
- {/* TODO: also run the model size experiment for the REWIRE prompt since the original authors claim that larger models are necessary there */}

  ## Experiments

@@ -118,9 +116,12 @@ We want to know whether using a stronger model leads to better synthetic data. W

  #### Does the model size matter?

- We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts. The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.
-
- We see a similar pattern with SmolLM2 (135M, 360M, 1.7B) on the [tutorial](#tutorial) prompt: up to the 1B size, we see a clear performance gradient from smaller to larger models. This confirms across model families that you need at least a ~1B parameter model to get meaningful gains from rephrasing but after that there are no further improvements with larger models.
+ We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial), [math](#math), and REWIRE's [guided_rewrite](#guided_rewrite_original) prompts (use the Setup dropdown in <FigRef target="model-size" /> to switch between them).
+ For [tutorial](#tutorial) and [math](#math), the 270M model underperforms, but 1B through 27B show no significant difference.
+ SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
+ The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
+ This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
+ The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), larger models do not improve synthetic data quality.

  <Sidenote>
  It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
@@ -154,6 +155,17 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
  fw_edu_hq: "FineWeb-Edu-HQ"
  }
  },
+ "Gemma-3: REWIRE": {
+ datasetNames: {
+ "mix-fw_edu_hq-guided_rewrite_original_27b_hq": "Gemma-3 27B",
+ "mix-fw_edu_hq-guided_rewrite_original_12b_hq": "Gemma-3 12B",
+ "mix-fw_edu_hq-guided_rewrite_original_4b_hq": "Gemma-3 4B",
+ "mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Gemma-3 1B",
+ "mix-fw_edu_hq-guided_rewrite_original_270m_hq": "Gemma-3 270M",
+ dclm: "DCLM",
+ fw_edu_hq: "FineWeb-Edu-HQ"
+ }
+ },
  "SmolLM2: Tutorial": {
  datasetNames: {
  "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
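The hunk above adds a new entry to the figure's setup config, mapping raw dataset keys to human-readable chart labels. A minimal sketch of how such a `datasetNames` mapping might be consumed — the `setups` shape and `resolveLabel` helper are assumptions for illustration, not the actual chart component's API:

```typescript
// Hypothetical sketch of resolving display names from a setup config
// like the "Gemma-3: REWIRE" entry added in this commit.
const setups: Record<string, { datasetNames: Record<string, string> }> = {
  "Gemma-3: REWIRE": {
    datasetNames: {
      "mix-fw_edu_hq-guided_rewrite_original_27b_hq": "Gemma-3 27B",
      "mix-fw_edu_hq-guided_rewrite_original_4b_hq": "Gemma-3 4B",
      dclm: "DCLM",
      fw_edu_hq: "FineWeb-Edu-HQ",
    },
  },
};

// Fall back to the raw dataset key when no display name is configured,
// so unmapped series still render with an identifiable label.
function resolveLabel(setup: string, datasetKey: string): string {
  return setups[setup]?.datasetNames[datasetKey] ?? datasetKey;
}

console.log(resolveLabel("Gemma-3: REWIRE", "dclm")); // "DCLM"
```

Keying setups by dropdown title keeps each prompt/model-family comparison self-contained, which is why adding the REWIRE experiment only requires appending one config block.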