Commit d6ec10c · Parent(s): d7053a5
display source datasets as accordions

app/src/content/chapters/setup.mdx (CHANGED)
import Accordion from "../../components/Accordion.astro";

## Setup

### Synthetic Data for Pretraining
…

### Source Datasets

We compare against several baseline datasets for pretraining and data rephrasing; the terms source data and seed data are used interchangeably throughout.

<Accordion title="DCLM (DataComp-LM)" open>
A standardized benchmark providing a 240T token corpus from Common Crawl with model-based filtering as a key curation strategy. DCLM-Baseline enables training a 7B parameter model to 64% accuracy on MMLU with 2.6T training tokens [@datacomp].
</Accordion>

<Accordion title="FineWeb-Edu-HQ / FineWeb-Edu-LQ">
Subsets of FineWeb-Edu, a 1.3T token educational dataset filtered using Llama-3-70B-Instruct [@llama3] to score samples on educational quality from 0 to 5. We use HQ (scores 4 or 5) and LQ (scores 0 or 1) to investigate the impact of seed data quality on rephrasing [@fineweb].
</Accordion>
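The HQ/LQ selection above can be sketched as a simple score-based split. This is an illustrative sketch only: the record layout and the `edu_score` field name are assumptions for the example, not FineWeb-Edu's actual schema.

```python
# Sketch: split documents into HQ/LQ buckets by an educational-quality
# score in [0, 5], mirroring the HQ (4-5) / LQ (0-1) subsets described above.
# The "edu_score" field is a hypothetical name used for illustration.

def split_by_edu_score(records):
    hq, lq = [], []
    for rec in records:
        score = rec["edu_score"]
        if score >= 4:
            hq.append(rec)       # high-quality subset (scores 4-5)
        elif score <= 1:
            lq.append(rec)       # low-quality subset (scores 0-1)
        # scores 2-3 fall into neither subset
    return hq, lq

docs = [
    {"text": "a textbook chapter", "edu_score": 5},
    {"text": "spam page", "edu_score": 0},
    {"text": "mixed-quality forum post", "edu_score": 3},
]
hq, lq = split_by_edu_score(docs)
```

Note that mid-range documents (scores 2-3) are deliberately excluded from both subsets, which keeps the quality contrast between HQ and LQ seed data sharp.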

<Accordion title="Ultra-Fineweb-1.4">
A dataset of 1T English tokens and 120B Chinese tokens created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality [@ultrafineweb].
</Accordion>
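The classifier-based filtering step can be sketched as below. The toy `quality_score` heuristic stands in for a trained fastText model, and the 0.5 threshold is an illustrative assumption, not a value from the paper.

```python
# Sketch of threshold-based corpus filtering in the spirit of Ultra-FineWeb's
# lightweight classifier pipeline. quality_score is a toy stand-in for a
# trained classifier; the threshold is an assumed illustrative value.

def quality_score(text: str) -> float:
    # Toy heuristic: fraction of "substantial" (longer) words.
    words = text.split()
    if not words:
        return 0.0
    long_words = sum(1 for w in words if len(w) > 4)
    return long_words / len(words)

def filter_corpus(docs, threshold=0.5):
    # Keep only documents the classifier scores at or above the threshold.
    return [d for d in docs if quality_score(d) >= threshold]

kept = filter_corpus(["short junk a b", "substantial educational explanation material"])
```

The appeal of a fastText-style classifier is throughput: a linear model over n-gram features is cheap enough to score a trillion-token corpus, unlike LLM-based scoring.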

<Accordion title="Nemotron-HQ-Synth">
Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3] [@nemotroncc].
</Accordion>

<Accordion title="Cosmopedia">
A synthetic dataset of 30 million files and 25 billion tokens generated by Mixtral-8x7B-Instruct [@mixtral], containing textbooks, blog posts, and stories across diverse topics. Created through careful prompt engineering, conditioning on curated educational sources and web data clusters [@cosmopedia].
</Accordion>

<Accordion title="SYNTH">
A fully synthetic dataset built from 50,000 Wikipedia articles expanded into problems and resolution paths, including math exercises, creative writing, and information extraction. Uses multiple specialized synthetic pipelines with fine-tuned models and grounding in encyclopedic content [@synthpleias].
</Accordion>

<Accordion title="REWIRE">
A method for recycling the web with guided rewriting: low-quality documents discarded by filtering pipelines are enriched to make them useful for training. Experiments show that mixing high-quality raw texts with rewritten texts yields improvements of 1.0, 1.3, and 2.5 percentage points at the 1B, 3B, and 7B scales respectively across 22 tasks [@rewire].
</Accordion>
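The raw/rewritten mixing that REWIRE's experiments rely on can be sketched as follows. The 50/50 ratio, function name, and record shapes are assumptions for illustration, not REWIRE's reported configuration.

```python
import random

# Sketch: assemble a training mix of high-quality raw texts and rewritten
# low-quality texts, in the spirit of REWIRE's mixing experiments.
# The rewritten_fraction default is an illustrative assumption.

def build_mix(raw_hq, rewritten, rewritten_fraction=0.5, seed=0):
    rng = random.Random(seed)
    # Size the mix so rewritten docs make up the requested fraction,
    # bounded by how many rewritten docs are available.
    n_total = int(len(raw_hq) / (1 - rewritten_fraction))
    n_rewritten = min(n_total - len(raw_hq), len(rewritten))
    mix = raw_hq + rng.sample(rewritten, n_rewritten)
    rng.shuffle(mix)
    return mix

mix = build_mix(["hq1", "hq2"], ["rw1", "rw2", "rw3"], rewritten_fraction=0.5)
```

The point of the mix is that rewritten documents recover signal from data that filtering would otherwise discard, rather than replacing the high-quality raw corpus outright.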

### Ablation Setup