joelniklaus (HF Staff) committed
Commit: d00db4f
Parent(s): cfb5e0c

add summarize results
app/src/content/assets/data/benchmark-results.csv CHANGED
The diff for this file is too large to render. See raw diff
 
app/src/content/chapters/experiments.mdx CHANGED
@@ -13,8 +13,6 @@ import Glossary from "../../components/Glossary.astro";
 {/* TODO: Add the experiment with smaller smollm2 models */}
 {/* TODO: Add the experiment with the rewire prompt at larger scales */}
 {/* TODO: also run the model size experiment for the REWIRE prompt since the original authors claim that larger models are necessary there */}
-{/* TODO: add results to output.csv for summarize trainings */}
-{/* TODO: add summarize results to better models for rephrasing lq data and to dissecting baselines */}

 ## Experiments

@@ -180,7 +178,7 @@ On high-quality source data, we see no evidence that larger models help. But REWIRE

 The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). <mark>Does this claim hold?</mark>

-We compare 1B vs 12B models on HQ vs LQ source data across three prompts ([continue](#continue), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts.
+We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts.

 The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see [Model Size vs Data Quality](#size-quality)). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>

@@ -201,6 +199,16 @@ The results are mixed: for some prompts 12B helps slightly with LQ data, but for
     fw_edu_hq: "FineWeb-Edu-HQ"
   }
 },
+"Summarize Prompt": {
+  datasetNames: {
+    "mix-fw_edu_hq-summarize_1b_hq": "1B, HQ Source",
+    "mix-fw_edu_hq-summarize_12b_hq": "12B, HQ Source",
+    "mix-fw_edu_hq-summarize_1b_lq": "1B, LQ Source",
+    "mix-fw_edu_hq-summarize_12b_lq": "12B, LQ Source",
+    dclm: "DCLM",
+    fw_edu_hq: "FineWeb-Edu-HQ"
+  }
+},
 "Tutorial Prompt": {
   datasetNames: {
     "mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
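For context, the `datasetNames` entries added in this commit map training-run identifiers to human-readable labels for the results plot. A minimal sketch of how such a mapping might be consumed — the `PromptSetup` type and the `resolveLabel` helper are assumptions for illustration, not the site's actual code:

```typescript
// Hypothetical shape of the per-prompt setup config edited in this commit.
type PromptSetup = { datasetNames: Record<string, string> };

const setups: Record<string, PromptSetup> = {
  "Summarize Prompt": {
    datasetNames: {
      "mix-fw_edu_hq-summarize_1b_hq": "1B, HQ Source",
      "mix-fw_edu_hq-summarize_12b_hq": "12B, HQ Source",
      "mix-fw_edu_hq-summarize_1b_lq": "1B, LQ Source",
      "mix-fw_edu_hq-summarize_12b_lq": "12B, LQ Source",
      dclm: "DCLM",
      fw_edu_hq: "FineWeb-Edu-HQ",
    },
  },
};

// Look up the display label for a run, falling back to the raw id
// when the setup or dataset is unknown (assumed helper, not in the repo).
function resolveLabel(setupName: string, datasetId: string): string {
  return setups[setupName]?.datasetNames[datasetId] ?? datasetId;
}

console.log(resolveLabel("Summarize Prompt", "mix-fw_edu_hq-summarize_12b_lq"));
// "12B, LQ Source"
```

The fallback to the raw id means a missing entry degrades to an ugly but functional label rather than a crash, which is why each new prompt setup only needs its own `datasetNames` block appended.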