joelniklaus (HF Staff) committed
Commit d6ec10c · Parent(s): d7053a5

display source datasets as accordions

Files changed (1): app/src/content/chapters/setup.mdx (+25 −21)
app/src/content/chapters/setup.mdx CHANGED
@@ -1,3 +1,5 @@
+import Accordion from "../../components/Accordion.astro";
+
 ## Setup
 
 ### Synthetic Data for Pretraining
@@ -25,27 +27,29 @@ We conduct large-scale document rephrasing experiments using instruction-tuned l
 
 ### Source Datasets
 
-TODO: in the blog, we could make this into a widget where you have a tab for each dataset and then if you click on the tab you can see the description (maybe even some samples).
-
-We compare against several baseline datasets for pretraining and data rephrasing:
-
-**DCLM (DataComp-LM)** [@datacomp] **:** A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM-Baseline enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens.
-
-**Fineweb-Edu-HQ and Fineweb-Edu-LQ** [@fineweb] **:** Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered using Llama-3-70B-Instruct [@llama3] scoring samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing.
-
-**Ultra-Fineweb-1.4** [@ultrafineweb] **:** A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality.
-
-**Nemotron-HQ-Synth** [@nemotroncc] **:** Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3].
-
-**Cosmopedia** [@cosmopedia] **:** A 30 million file synthetic dataset with 25 billion tokens generated by Mixtral-8x7B-Instruct [@mixtral], containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering conditioning on curated educational sources and web data clusters.
-
-**SYNTH** [@synthpleias] **:** A fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths including math exercises, creative writing, and information extraction. Uses multiple specialized synthetic pipelines with fine-tuned models and grounding in encyclopedic content.
-
-**REWIRE** [@rewire] **:** A method for recycling the web with guided rewrite that enriches low-quality documents discarded by filtering pipelines to make them useful for training. Experiments show that mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales respectively across 22 tasks.
-
-We use source data and seed data interchangeably.
-
-TODO: put this where we first mention source/seed data
+We compare against several baseline datasets for pretraining and data rephrasing. We use source data and seed data interchangeably throughout.
+
+<Accordion title="DCLM (DataComp-LM)" open>
+A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM-Baseline enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens [@datacomp].
+</Accordion>
+<Accordion title="FineWeb-Edu-HQ / FineWeb-Edu-LQ">
+Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered using Llama-3-70B-Instruct [@llama3] scoring samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing [@fineweb].
+</Accordion>
+<Accordion title="Ultra-Fineweb-1.4">
+A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality [@ultrafineweb].
+</Accordion>
+<Accordion title="Nemotron-HQ-Synth">
+Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3] [@nemotroncc].
+</Accordion>
+<Accordion title="Cosmopedia">
+A 30 million file synthetic dataset with 25 billion tokens generated by Mixtral-8x7B-Instruct [@mixtral], containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering conditioning on curated educational sources and web data clusters [@cosmopedia].
+</Accordion>
+<Accordion title="SYNTH">
+A fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths including math exercises, creative writing, and information extraction. Uses multiple specialized synthetic pipelines with fine-tuned models and grounding in encyclopedic content [@synthpleias].
+</Accordion>
+<Accordion title="REWIRE">
+A method for recycling the web with guided rewrite that enriches low-quality documents discarded by filtering pipelines to make them useful for training. Experiments show that mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales respectively across 22 tasks [@rewire].
+</Accordion>
 
 ### Ablation Setup
 
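The new MDX assumes an `Accordion` component taking a `title` prop and an optional boolean `open` prop, with the description passed as children. A minimal `Accordion.astro` consistent with that usage could look like the sketch below — this is an illustrative assumption, not the repository's actual `app/src/components/Accordion.astro`, which may style and structure things differently:

```astro
---
// Hypothetical minimal implementation; the real component may differ.
interface Props {
  title: string;
  open?: boolean; // when true, the accordion starts expanded
}
const { title, open = false } = Astro.props;
---

<details open={open}>
  <summary>{title}</summary>
  <!-- the MDX content between <Accordion>…</Accordion> is rendered here -->
  <slot />
</details>
```

Building on `<details>`/`<summary>` keeps the expand/collapse behavior native to the browser, so no client-side script is needed for the basic widget.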