Commit fb9415e
Parent(s): 8260aee

remove essentialweb since it is not trained for long enough
app/src/content/bibliography.bib
CHANGED

@@ -53,15 +53,6 @@
   url = {https://arxiv.org/abs/2505.05427}
 }
 
-@misc{essentialweb,
-  title = {Essential-Web v1.0: 24T tokens of organized web data},
-  author = {Essential AI and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
-  year = {2025},
-  eprint = {2506.14111},
-  archiveprefix = {arXiv},
-  primaryclass = {cs.CL},
-  url = {https://arxiv.org/abs/2506.14111}
-}
 
 @misc{nemotroncc,
   title = {Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset},
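Deleting a `.bib` entry like `essentialweb` can leave dangling `[@key]` citations behind in the MDX chapters (which is why the commit also touches the chapter files). A minimal sketch of a consistency check over in-memory strings — the regexes and helper name are illustrative, not part of this repo:

```python
import re

def find_dangling_citations(bib_source: str, mdx_source: str) -> set[str]:
    """Return citation keys used in MDX prose but not defined in the .bib file."""
    # Keys defined as @misc{key, / @article{key, etc.
    defined = set(re.findall(r"@\w+\{([^,\s]+)\s*,", bib_source))
    # Inline citations of the form [@key]
    cited = set(re.findall(r"\[@([^\]\s]+)\]", mdx_source))
    return cited - defined

bib = "@misc{nemotroncc,\n  title = {Nemotron-CC},\n}"
mdx = "Part of Nemotron-CC [@nemotroncc]. Filtered subsets [@essentialweb]."
print(find_dangling_citations(bib, mdx))  # → {'essentialweb'}
```

Running such a check in CI would catch a bibliography removal that misses an inline citation.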
app/src/content/chapters/2-setup.mdx
CHANGED

@@ -41,9 +41,6 @@ Before diving into experiments, here's a quick overview of the datasets we compa
 <Tab title="Ultra-FineWeb">
   A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality [@ultrafineweb].
 </Tab>
-<Tab title="Essential-Web">
-  A 24T token web dataset from 101 Common Crawl snapshots with document-level metadata for flexible curation. Each of the 23.6B documents is annotated with subject classification, document type, content complexity, and quality scores using the [EAI-Taxonomy-0.5b](https://huggingface.co/EssentialAI/eai-taxonomy-0.5b) classifier, enabling researchers to filter domain-specific subsets without building custom pipelines [@essentialweb].
-</Tab>
 <Tab title="Nemotron-HQ-Synth">
   Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3] [@nemotroncc].
 </Tab>
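The Ultra-FineWeb entry kept above describes classifier-based quality filtering: score each document with a lightweight model and keep only those above a threshold. A toy sketch of that pattern, where `score_quality` is a hypothetical heuristic stand-in — the real pipeline loads a trained fastText model instead:

```python
def score_quality(text: str) -> float:
    """Stand-in for a trained fastText classifier returning P(high quality).
    Hypothetical heuristic: mix of alphabetic-word ratio and average word length."""
    words = text.split()
    if not words:
        return 0.0
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    avg_len = sum(len(w) for w in words) / len(words)
    # Squash average word length into [0, 1]; a real model outputs a probability.
    return 0.5 * alpha_ratio + 0.5 * min(avg_len / 10.0, 1.0)

def filter_documents(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep documents the classifier scores at or above the threshold."""
    return [d for d in docs if score_quality(d) >= threshold]

docs = [
    "A well formed English sentence about transformers.",
    "$$$ click here !!! 1234 %%%",
]
kept = filter_documents(docs)  # keeps only the first document
```

The threshold is the main knob such pipelines tune: stricter cuts yield cleaner but smaller corpora.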
app/src/content/chapters/3-experiments.mdx
CHANGED

@@ -44,13 +44,12 @@ First things first: where does the bar sit? We establish baselines and train on
     nemotron_hq_synth: "Nemotron-HQ-Synth",
     rewire: "REWIRE",
     synth_query_reasoning_answer: "SYNTH",
-    essentialweb_raw: "EssentialWeb",
     "ultra-fineweb": "Ultra-FineWeb"
   }
 }}
 />
 
-DCLM, Nemotron-HQ-Synth, and REWIRE come out on top by a clear margin. The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb,
+DCLM, Nemotron-HQ-Synth, and REWIRE come out on top by a clear margin. The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. DCLM is the strongest baseline and becomes our target to beat for everything that follows.
 
 Nemotron-HQ-Synth and REWIRE are both mixes of several prompts. So what's actually doing the heavy lifting inside them?
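The hunk above drops `essentialweb_raw` from the chart's key-to-label map. A series whose key has no label entry would render with a broken legend, so the label map and the plotted results need to stay in sync. A small sketch of that check — the result values and helper are hypothetical, not this repo's actual plot component:

```python
labels = {
    "nemotron_hq_synth": "Nemotron-HQ-Synth",
    "rewire": "REWIRE",
    "synth_query_reasoning_answer": "SYNTH",
    "ultra-fineweb": "Ultra-FineWeb",
}

# Hypothetical per-dataset scores; essentialweb_raw is a stale series
# left behind after its label was removed.
results = {
    "nemotron_hq_synth": 0.52,
    "rewire": 0.51,
    "synth_query_reasoning_answer": 0.44,
    "ultra-fineweb": 0.46,
    "essentialweb_raw": 0.41,
}

def unlabeled_series(results: dict, labels: dict) -> set:
    """Series keys that would render without a legend entry."""
    return set(results) - set(labels)

stale = unlabeled_series(results, labels)  # → {'essentialweb_raw'}
```

Dropping both the label and the series data in the same commit avoids this drift.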