joelniklaus HF Staff committed on
Commit
5240fe4
·
1 Parent(s): 09c855a

added smollm size experiment

app/src/content/assets/data/benchmark-results.csv CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9563753d717de5b5392d8d517514d8975b55023ed4b84c165c996d42b4c152c4
-size 1200822
+oid sha256:192c9e3c649b9be030f1c84f5795badc76d6f22261c4bd33c6ac219a4e0cbc45
+size 1218699
app/src/content/chapters/experiments.mdx CHANGED
@@ -11,7 +11,6 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
 {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
 {/* TODO: Add appendix section of weird unexplainable results? */}
-{/* TODO: Add the experiment with smaller smollm2 models */}
 {/* TODO: Add the experiment with the rewire prompt at larger scales */}
 {/* TODO: also run the model size experiment for the REWIRE prompt since the original authors claim that larger models are necessary there */}
 
@@ -121,6 +120,8 @@ We want to know whether using a stronger model leads to better synthetic data. W
 
 We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts. The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.
 
+We see a similar pattern with SmolLM2 (135M, 360M, 1.7B) on the [tutorial](#tutorial) prompt: up to roughly 1B parameters, there is a clear performance gradient from smaller to larger models. This confirms across model families that a rephrasing model needs at least ~1B parameters to deliver meaningful gains, and that larger models bring no further improvement beyond that.
+
 <Sidenote>
 It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
 </Sidenote>
@@ -128,10 +129,10 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
 <HtmlEmbed
   id="model-size"
   src="d3-benchmark-comparison.html"
-  desc="Gemma-3 model sizes (270M to 27B). Use the Setup dropdown to compare across prompts."
+  desc="Model sizes across Gemma-3 and SmolLM2. Use the Setup dropdown to compare across models and prompts."
   config={{
     setups: {
-      "Tutorial Prompt": {
+      "Gemma-3: Tutorial": {
         datasetNames: {
           "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
           "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
@@ -142,7 +143,7 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
           fw_edu_hq: "FineWeb-Edu-HQ"
         }
       },
-      "Math Prompt": {
+      "Gemma-3: Math": {
         datasetNames: {
           "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
           "mix-fw_edu_hq-math_12b_hq": "Gemma-3 12B",
@@ -152,6 +153,15 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
           dclm: "DCLM",
           fw_edu_hq: "FineWeb-Edu-HQ"
         }
+      },
+      "SmolLM2: Tutorial": {
+        datasetNames: {
+          "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
+          "mix-fw_edu_hq-tutorial_smollm2_360m_hq": "SmolLM2 360M",
+          "mix-fw_edu_hq-tutorial_smollm2_135m_hq": "SmolLM2 135M",
+          dclm: "DCLM",
+          fw_edu_hq: "FineWeb-Edu-HQ"
+        }
       }
     }
   }}