Commit d00db4f · Parent(s): cfb5e0c

add summarize results
app/src/content/assets/data/benchmark-results.csv (CHANGED)
The diff for this file is too large to render. See raw diff.
app/src/content/chapters/experiments.mdx (CHANGED)

@@ -13,8 +13,6 @@ import Glossary from "../../components/Glossary.astro";
 {/* TODO: Add the experiment with smaller smollm2 models */}
 {/* TODO: Add the experiment with the rewire prompt at larger scales */}
 {/* TODO: also run the model size experiment for the REWIRE prompt since the original authors claim that larger models are necessary there */}
-{/* TODO: add results to output.csv for summarize trainings */}
-{/* TODO: add summarize results to better models for rephrasing lq data and to dissecting baselines */}
 
 ## Experiments
 
@@ -180,7 +178,7 @@ On high-quality source data, we see no evidence that larger models help. But REW
 
 The REWIRE [@rewire] paper claims that upcycling low-quality data requires large models (Llama-3.3 70B in their case). <mark>Does this claim hold?</mark>
 
-We compare 1B vs 12B models on HQ vs LQ source data across
+We compare 1B vs 12B models on HQ vs LQ source data across four prompts ([continue](#continue), [summarize](#summarize), [tutorial](#tutorial), [faq](#faq)). Use the Setup dropdown to switch between prompts.
 
 The results are mixed: for some prompts 12B helps slightly with LQ data, but for the [FAQ](#faq) prompt the 1B model actually wins (see [Model Size vs Data Quality](#size-quality)). We see no consistent advantage of using larger models for low-quality data. <mark>TLDR: We cannot reproduce the claim that large models are needed for low-quality data.</mark>
 
@@ -201,6 +199,16 @@ The results are mixed: for some prompts 12B helps slightly with LQ data, but for
       fw_edu_hq: "FineWeb-Edu-HQ"
     }
   },
+  "Summarize Prompt": {
+    datasetNames: {
+      "mix-fw_edu_hq-summarize_1b_hq": "1B, HQ Source",
+      "mix-fw_edu_hq-summarize_12b_hq": "12B, HQ Source",
+      "mix-fw_edu_hq-summarize_1b_lq": "1B, LQ Source",
+      "mix-fw_edu_hq-summarize_12b_lq": "12B, LQ Source",
+      dclm: "DCLM",
+      fw_edu_hq: "FineWeb-Edu-HQ"
+    }
+  },
   "Tutorial Prompt": {
     datasetNames: {
       "mix-fw_edu_hq-tutorial_1b_hq": "1B, HQ Source",
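The dataset keys added in this commit follow a regular naming scheme, `mix-fw_edu_hq-<prompt>_<size>_<quality>`, and the display labels ("1B, HQ Source", etc.) are derivable from the key. As a minimal sketch (the helper below is illustrative and not part of the repository), the labels could be generated instead of listed by hand:

```python
import re

# Matches keys like "mix-fw_edu_hq-summarize_1b_hq":
# prompt name, model size, and source-data quality. Illustrative only.
KEY_RE = re.compile(r"^mix-fw_edu_hq-(?P<prompt>[a-z]+)_(?P<size>\d+b)_(?P<quality>hq|lq)$")

def label_for(key: str) -> str:
    """Derive a display label such as '1B, HQ Source' from a dataset key."""
    m = KEY_RE.match(key)
    if m is None:
        raise ValueError(f"unrecognized dataset key: {key}")
    return f"{m.group('size').upper()}, {m.group('quality').upper()} Source"

print(label_for("mix-fw_edu_hq-summarize_12b_lq"))  # → 12B, LQ Source
```

A generator like this would keep the per-prompt config blocks from drifting out of sync as new prompts are added.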