joelniklaus (HF Staff) committed
Commit d6ec10c · Parent(s): d7053a5

display source datasets as accordions

Files changed (1): app/src/content/chapters/setup.mdx (+25 −21)
app/src/content/chapters/setup.mdx CHANGED
@@ -1,3 +1,5 @@
+import Accordion from "../../components/Accordion.astro";
+
 ## Setup
 
 ### Synthetic Data for Pretraining
@@ -25,27 +27,29 @@ We conduct large-scale document rephrasing experiments using instruction-tuned l
 
 ### Source Datasets
 
-TODO: in the blog, we could make this into a widget where you have a tab for each dataset and then if you click on the tab you can see the description (maybe even some samples).
-
-We compare against several baseline datasets for pretraining and data rephrasing:
-
-**DCLM (DataComp-LM)** [@datacomp] **:** A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM-Baseline enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens.
-
-**Fineweb-Edu-HQ and Fineweb-Edu-LQ** [@fineweb] **:** Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered using Llama-3-70B-Instruct [@llama3] scoring samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing.
-
-**Ultra-Fineweb-1.4** [@ultrafineweb] **:** A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality.
-
-**Nemotron-HQ-Synth** [@nemotroncc] **:** Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3].
-
-**Cosmopedia** [@cosmopedia] **:** A 30 million file synthetic dataset with 25 billion tokens generated by Mixtral-8x7B-Instruct [@mixtral], containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering conditioning on curated educational sources and web data clusters.
-
-**SYNTH** [@synthpleias] **:** A fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths including math exercises, creative writing, and information extraction. Uses multiple specialized synthetic pipelines with fine-tuned models and grounding in encyclopedic content.
-
-**REWIRE** [@rewire] **:** A method for recycling the web with guided rewrite that enriches low-quality documents discarded by filtering pipelines to make them useful for training. Experiments show that mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales respectively across 22 tasks.
-
-We use source data and seed data interchangeably.
-
-TODO: put this where we first mention source/seed data
+We compare against several baseline datasets for pretraining and data rephrasing. We use source data and seed data interchangeably throughout.
+
+<Accordion title="DCLM (DataComp-LM)" open>
+A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM-Baseline enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T tokens [@datacomp].
+</Accordion>
+<Accordion title="FineWeb-Edu-HQ / FineWeb-Edu-LQ">
+Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered using Llama-3-70B-Instruct [@llama3] scoring samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing [@fineweb].
+</Accordion>
+<Accordion title="Ultra-Fineweb-1.4">
+A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality [@ultrafineweb].
+</Accordion>
+<Accordion title="Nemotron-HQ-Synth">
+Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3] [@nemotroncc].
+</Accordion>
+<Accordion title="Cosmopedia">
+A 30 million file synthetic dataset with 25 billion tokens generated by Mixtral-8x7B-Instruct [@mixtral], containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering conditioning on curated educational sources and web data clusters [@cosmopedia].
+</Accordion>
+<Accordion title="SYNTH">
+A fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths including math exercises, creative writing, and information extraction. Uses multiple specialized synthetic pipelines with fine-tuned models and grounding in encyclopedic content [@synthpleias].
+</Accordion>
+<Accordion title="REWIRE">
+A method for recycling the web with guided rewrite that enriches low-quality documents discarded by filtering pipelines to make them useful for training. Experiments show that mixing high-quality raw texts with rewritten texts leads to 1.0, 1.3, and 2.5 percentage point improvements at 1B, 3B, and 7B scales respectively across 22 tasks [@rewire].
+</Accordion>
 
 ### Ablation Setup
 
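The new MDX assumes an `Accordion` component taking a `title` prop and an optional boolean `open` prop, with the description passed as children. A minimal `Accordion.astro` consistent with that usage could look like the sketch below — this is an illustrative assumption, not the repository's actual `app/src/components/Accordion.astro`, which may style and structure things differently:

```astro
---
// Hypothetical minimal implementation; the real component may differ.
interface Props {
  title: string;
  open?: boolean; // when true, the accordion starts expanded
}
const { title, open = false } = Astro.props;
---

<details open={open}>
  <summary>{title}</summary>
  <!-- the MDX content between <Accordion>…</Accordion> is rendered here -->
  <slot />
</details>
```

Building on `<details>`/`<summary>` keeps the expand/collapse behavior native to the browser, so no client-side script is needed for the basic widget.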