Commit 0f45b18 · Parent(s): 5240fe4

add guided rewrite experiment
app/src/content/assets/data/benchmark-results.csv CHANGED

```diff
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:4ff88dedc4e0c1d7dd13f29a3bd9a68072119f1c2c5a9c48f7a6f2c893778615
+size 1245861
```
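The CSV change above only swaps its Git LFS pointer: the file itself lives in LFS storage, and the repository tracks a three-line pointer holding the spec version, the SHA-256 of the content, and its byte size. A minimal sketch (not part of this repo; the function name is mine) of building such a pointer for a local file:

```python
import hashlib
import os


def lfs_pointer(path: str) -> str:
    """Build the Git LFS v1 pointer text for a local file.

    The pointer records the SHA-256 of the file *content* plus its size
    in bytes -- the same three lines shown in the diff above.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large data files do not load fully into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{h.hexdigest()}\n"
        f"size {os.path.getsize(path)}\n"
    )
```

Running this over the checked-out `benchmark-results.csv` should reproduce the new `oid` and `size` lines exactly, which is a quick way to verify an LFS object.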
app/src/content/chapters/experiments.mdx CHANGED

```diff
@@ -11,8 +11,6 @@ import FigRef from "../../components/FigRef.astro";
 {/* TODO: Analyze if certain models are more verbose than others (how many tokens did they produce per prompt?) (wait for last rephrasing job to be done) */}
 {/* TODO: Run dclm and edu score impact analysis on model verbosity data (wait for last rephrasing job to be done) */}
 {/* TODO: Add appendix section of weird unexplainable results? */}
-{/* TODO: Add the experiment with the rewire prompt at larger scales */}
-{/* TODO: also run the model size experiment for the REWIRE prompt since the original authors claim that larger models are necessary there */}
 
 ## Experiments
 
@@ -118,9 +116,12 @@ We want to know whether using a stronger model leads to better synthetic data. W
 
 #### Does the model size matter?
 
-We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial)
-
-
+We compare all Gemma-3 sizes (270M, 1B, 4B, 12B, 27B) on the [tutorial](#tutorial), [math](#math), and REWIRE's [guided_rewrite](#guided_rewrite_original) prompts (use the Setup dropdown in <FigRef target="model-size" /> to switch between them).
+For [tutorial](#tutorial) and [math](#math), the 270M model underperforms, but 1B through 27B show no significant difference.
+SmolLM2 (135M, 360M, 1.7B) tells the same story on [tutorial](#tutorial): there is a clear performance gradient up to the 1B range.
+The one exception is [guided_rewrite](#guided_rewrite_original), where the 4B model edges ahead of the 1B, while 4B through 27B remain equivalent.
+This prompt is substantially more complex (detailed rewriting instructions, quality criteria, multi-step formatting requirements), which likely raises the minimum capability threshold.
+The takeaway: beyond a baseline capability (reached around 1B for simple prompts and 4B for complex ones), larger models do not improve synthetic data quality.
 
 <Sidenote>
 It is possible that larger models produce richer or more nuanced rephrasings that our benchmark suite does not capture. Our evaluations measure a fixed set of skills, and subtler improvements in data quality could go undetected.
@@ -154,6 +155,17 @@ It is possible that larger models produce richer or more nuanced rephrasings tha
       fw_edu_hq: "FineWeb-Edu-HQ"
     }
   },
+  "Gemma-3: REWIRE": {
+    datasetNames: {
+      "mix-fw_edu_hq-guided_rewrite_original_27b_hq": "Gemma-3 27B",
+      "mix-fw_edu_hq-guided_rewrite_original_12b_hq": "Gemma-3 12B",
+      "mix-fw_edu_hq-guided_rewrite_original_4b_hq": "Gemma-3 4B",
+      "mix-fw_edu_hq-guided_rewrite_original_1b_hq": "Gemma-3 1B",
+      "mix-fw_edu_hq-guided_rewrite_original_270m_hq": "Gemma-3 270M",
+      dclm: "DCLM",
+      fw_edu_hq: "FineWeb-Edu-HQ"
+    }
+  },
   "SmolLM2: Tutorial": {
     datasetNames: {
       "mix-fw_edu_hq-tutorial_smollm2_1.7b_hq": "SmolLM2 1.7B",
```
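The new `datasetNames` keys encode the rephrasing model's size as a suffix (`_270m_hq`, `_1b_hq`, `_27b_hq`, ...). When grouping benchmark results by model size, a hypothetical helper (the function name and regex are illustrative, not from this repo) could pull that token out of a run id:

```python
import re


def model_size(dataset_id: str):
    """Extract the model-size token (e.g. '270m', '1.7b', '27b') from run ids
    like 'mix-fw_edu_hq-guided_rewrite_original_27b_hq'.

    Returns None for ids without a size suffix (e.g. the 'dclm' baseline).
    """
    m = re.search(r"_(\d+(?:\.\d+)?[bm])_hq$", dataset_id)
    return m.group(1) if m else None
```

This relies only on the naming pattern visible in the config above; if the id scheme changes, the regex would need updating.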