Commit 5240fe4 · Parent(s): 09c855a
added smollm size experiment
app/src/content/assets/data/benchmark-results.csv CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:192c9e3c649b9be030f1c84f5795badc76d6f22261c4bd33c6ac219a4e0cbc45
+size 1218699
app/src/content/chapters/experiments.mdx CHANGED

@@ -11,7 +11,6 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
 {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
 {/* TODO: Add appendix section of weird unexplainable results? */}
-{/* TODO: Add the experiment with smaller smollm2 models */}
 {/* TODO: Add the experiment with the rewire prompt at larger scales */}
 {/* TODO: also run the model size experiment for the REWIRE prompt since the original authors claim that larger models are necessary there */}
 
@@ -121,6 +120,8 @@ We want to know whether using a stronger model leads to better synthetic data. W
 
 We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial) and [math](#math) prompts. Use the Setup dropdown to switch between prompts. The 270M model underperforms, but 1B through 27B show no significant difference on either prompt (see <FigRef target="model-size" />). Even for the harder [math](#math) prompt, larger models do not help. Beyond a baseline capability (reached at 1B), larger models do not improve synthetic data quality.
 
+We see a similar pattern with SmolLM2 (135M, 360M, 1.7B) on the [tutorial](#tutorial) prompt: up to roughly 1B parameters there is a clear performance gradient from smaller to larger models. This confirms across model families that you need at least a ~1B-parameter model to get meaningful gains from rephrasing, but that beyond this size larger models bring no further improvements.
+
 <Sidenote>
 It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
 </Sidenote>
 
@@ -128,10 +129,10 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
 <HtmlEmbed
   id="model-size"
   src="d3-benchmark-comparison.html"
-  desc="
+  desc="Model sizes across Gemma-3 and SmolLM2. Use the Setup dropdown to compare across models and prompts."
   config={{
     setups: {
-      "
+      "Gemma-3: Tutorial": {
         datasetNames: {
           "mix-fw_edu_hq-tutorial_27b_hq": "Gemma-3 27B",
           "mix-fw_edu_hq-tutorial_12b_hq": "Gemma-3 12B",
 
@@ -142,7 +143,7 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
           fw_edu_hq: "FineWeb-Edu-HQ"
         }
       },
-      "
+      "Gemma-3: Math": {
         datasetNames: {
           "mix-fw_edu_hq-math_27b_hq": "Gemma-3 27B",
           "mix-fw_edu_hq-math_12b_hq": "Gemma-3 12B",
 
@@ -152,6 +153,15 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
           dclm: "DCLM",
           fw_edu_hq: "FineWeb-Edu-HQ"
         }
+      },
+      "SmolLM2: Tutorial": {
+        datasetNames: {
+          "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
+          "mix-fw_edu_hq-tutorial_smollm2_360m_hq": "SmolLM2 360M",
+          "mix-fw_edu_hq-tutorial_smollm2_135m_hq": "SmolLM2 135M",
+          dclm: "DCLM",
+          fw_edu_hq: "FineWeb-Edu-HQ"
+        }
       }
     }
   }}