Commit d00db4f · Parent(s): cfb5e0c

add summarize results
app/src/content/assets/data/benchmark-results.csv (CHANGED)
The diff for this file is too large to render. See raw diff.
app/src/content/chapters/experiments.mdx (CHANGED)

@@ -13,8 +13,6 @@ import Glossary from "../../components/Glossary.astro";
 {/* TODO: Add the experiment with smaller smollm2 models */}
 {/* TODO: Add the experiment with the rewire prompt at larger scales */}
 {/* TODO: also run the model size experiment for the REWIRE prompt since the original authors claim that larger models are necessary there */}
-{/* TODO: add results to output.csv for summarize trainings */}
-{/* TODO: add summarize results to better models for rephrasing lq data and to dissecting baselines */}
 
 ## Experiments
 
@@ -180,7 +178,7 @@ On high-quality source data, we see no evidence that larger models help. But REW
 
 The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). <mark>Does this claim hold?</mark>
 
-We compare 1B vs 12B models on HQ vs LQ source data across
+We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts.
 
 The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see [Model Size vs Data Quality](#size-quality)). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>
 
@@ -201,6 +199,16 @@ The results are mixed: for some prompts 12B helps slightly with LQ data, but for
       fw_edu_hq: "FineWeb-Edu-HQ"
     }
   },
+  "Summarize Prompt": {
+    datasetNames: {
+      "mix-fw_edu_hq-summarize_1b_hq": "1B, HQ Source",
+      "mix-fw_edu_hq-summarize_12b_hq": "12B, HQ Source",
+      "mix-fw_edu_hq-summarize_1b_lq": "1B, LQ Source",
+      "mix-fw_edu_hq-summarize_12b_lq": "12B, LQ Source",
+      dclm: "DCLM",
+      fw_edu_hq: "FineWeb-Edu-HQ"
+    }
+  },
   "Tutorial Prompt": {
     datasetNames: {
       "mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
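The dataset keys added in this commit follow a regular naming scheme, `mix-fw_edu_hq-<prompt>_<size>_<quality>`, and the display labels ("1B, HQ Source", etc.) are derivable from the key. As a minimal sketch (the helper below is illustrative and not part of the repository), the labels could be generated instead of listed by hand:

```python
import re

# Matches keys like "mix-fw_edu_hq-summarize_1b_hq":
# prompt name, model size, and source-data quality. Illustrative only.
KEY_RE = re.compile(r"^mix-fw_edu_hq-(?P<prompt>[a-z]+)_(?P<size>\d+b)_(?P<quality>hq|lq)$")

def label_for(key: str) -> str:
    """Derive a display label such as '1B, HQ Source' from a dataset key."""
    m = KEY_RE.match(key)
    if m is None:
        raise ValueError(f"unrecognized dataset key: {key}")
    return f"{m.group('size').upper()}, {m.group('quality').upper()} Source"

print(label_for("mix-fw_edu_hq-summarize_12b_lq"))  # → 12B, LQ Source
```

A generator like this would keep the per-prompt config blocks from drifting out of sync as new prompts are added.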