Commit fb9415e
Parent(s): 8260aee

remove essentialweb since it is not trained for long enough
app/src/content/bibliography.bib
CHANGED

@@ -53,15 +53,6 @@
   url = {https://arxiv.org/abs/2505.05427}
 }
 
-@misc{essentialweb,
-  title = {Essential-Web v1.0: 24T tokens of organized web data},
-  author = {Essential AI and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
-  year = {2025},
-  eprint = {2506.14111},
-  archiveprefix = {arXiv},
-  primaryclass = {cs.CL},
-  url = {https://arxiv.org/abs/2506.14111}
-}
 
 @misc{nemotroncc,
   title = {Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset},
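Deleting a `.bib` entry like `essentialweb` can leave dangling `[@key]` citations behind in the MDX chapters (which is why the commit also touches the chapter files). A minimal sketch of a consistency check over in-memory strings — the regexes and helper name are illustrative, not part of this repo:

```python
import re

def find_dangling_citations(bib_source: str, mdx_source: str) -> set[str]:
    """Return citation keys used in MDX prose but not defined in the .bib file."""
    # Keys defined as @misc{key, / @article{key, etc.
    defined = set(re.findall(r"@\w+\{([^,\s]+)\s*,", bib_source))
    # Inline citations of the form [@key]
    cited = set(re.findall(r"\[@([^\]\s]+)\]", mdx_source))
    return cited - defined

bib = "@misc{nemotroncc,\n  title = {Nemotron-CC},\n}"
mdx = "Part of Nemotron-CC [@nemotroncc]. Filtered subsets [@essentialweb]."
print(find_dangling_citations(bib, mdx))  # → {'essentialweb'}
```

Running such a check in CI would catch a bibliography removal that misses an inline citation.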
app/src/content/chapters/2-setup.mdx
CHANGED

@@ -41,9 +41,6 @@ Before diving into experiments, here's a quick overview of the datasets we compa
 <Tab title="Ultra-FineWeb">
   A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality [@ultrafineweb].
 </Tab>
-<Tab title="Essential-Web">
-  A 24T token web dataset from 101 Common Crawl snapshots with document-level metadata for flexible curation. Each of the 23.6B documents is annotated with subject classification, document type, content complexity, and quality scores using the [EAI-Taxonomy-0.5b](https://huggingface.co/EssentialAI/eai-taxonomy-0.5b) classifier, enabling researchers to filter domain-specific subsets without building custom pipelines [@essentialweb].
-</Tab>
 <Tab title="Nemotron-HQ-Synth">
   Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3] [@nemotroncc].
 </Tab>
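The Ultra-FineWeb entry kept above describes classifier-based quality filtering: score each document with a lightweight model and keep only those above a threshold. A toy sketch of that pattern, where `score_quality` is a hypothetical heuristic stand-in — the real pipeline loads a trained fastText model instead:

```python
def score_quality(text: str) -> float:
    """Stand-in for a trained fastText classifier returning P(high quality).
    Hypothetical heuristic: mix of alphabetic-word ratio and average word length."""
    words = text.split()
    if not words:
        return 0.0
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    avg_len = sum(len(w) for w in words) / len(words)
    # Squash average word length into [0, 1]; a real model outputs a probability.
    return 0.5 * alpha_ratio + 0.5 * min(avg_len / 10.0, 1.0)

def filter_documents(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep documents the classifier scores at or above the threshold."""
    return [d for d in docs if score_quality(d) >= threshold]

docs = [
    "A well formed English sentence about transformers.",
    "$$$ click here !!! 1234 %%%",
]
kept = filter_documents(docs)  # keeps only the first document
```

The threshold is the main knob such pipelines tune: stricter cuts yield cleaner but smaller corpora.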
app/src/content/chapters/3-experiments.mdx
CHANGED

@@ -44,13 +44,12 @@ First things first: where does the bar sit? We establish baselines and train on
     nemotron_hq_synth: "Nemotron-HQ-Synth",
     rewire: "REWIRE",
     synth_query_reasoning_answer: "SYNTH",
-    essentialweb_raw: "EssentialWeb",
     "ultra-fineweb": "Ultra-FineWeb"
   }
 }}
 />
 
-DCLM, Nemotron-HQ-Synth, and REWIRE come out on top by a clear margin. The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb,
+DCLM, Nemotron-HQ-Synth, and REWIRE come out on top by a clear margin. The remaining datasets, including Cosmopedia, FineWeb-Edu (both HQ and LQ), Ultra-FineWeb, and SYNTH, fall notably behind. DCLM is the strongest baseline and becomes our target to beat for everything that follows.
 
 Nemotron-HQ-Synth and REWIRE are both mixes of several prompts. So what's actually doing the heavy lifting inside them?
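The hunk above drops `essentialweb_raw` from the chart's key-to-label map. A series whose key has no label entry would render with a broken legend, so the label map and the plotted results need to stay in sync. A small sketch of that check — the result values and helper are hypothetical, not this repo's actual plot component:

```python
labels = {
    "nemotron_hq_synth": "Nemotron-HQ-Synth",
    "rewire": "REWIRE",
    "synth_query_reasoning_answer": "SYNTH",
    "ultra-fineweb": "Ultra-FineWeb",
}

# Hypothetical per-dataset scores; essentialweb_raw is a stale series
# left behind after its label was removed.
results = {
    "nemotron_hq_synth": 0.52,
    "rewire": 0.51,
    "synth_query_reasoning_answer": 0.44,
    "ultra-fineweb": 0.46,
    "essentialweb_raw": 0.41,
}

def unlabeled_series(results: dict, labels: dict) -> set:
    """Series keys that would render without a legend entry."""
    return set(results) - set(labels)

stale = unlabeled_series(results, labels)  # → {'essentialweb_raw'}
```

Dropping both the label and the series data in the same commit avoids this drift.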