joelniklaus (HF Staff) committed on
Commit 1c3669d · 1 Parent(s): a32050b

added essentialweb description
app/src/content/bibliography.bib CHANGED
@@ -53,6 +53,16 @@
   url = {https://arxiv.org/abs/2505.05427}
 }

+@misc{essentialweb,
+  title = {Essential-Web v1.0: 24T tokens of organized web data},
+  author = {Essential AI and Andrew Hojel and Michael Pust and Tim Romanski and Yash Vanjani and Ritvik Kapila and Mohit Parmar and Adarsh Chaluvaraju and Alok Tripathy and Anil Thomas and Ashish Tanwer and Darsh J Shah and Ishaan Shah and Karl Stratos and Khoi Nguyen and Kurt Smith and Michael Callahan and Peter Rushton and Philip Monk and Platon Mazarakis and Saad Jamal and Saurabh Srivastava and Somanshu Singla and Ashish Vaswani},
+  year = {2025},
+  eprint = {2506.14111},
+  archiveprefix = {arXiv},
+  primaryclass = {cs.CL},
+  url = {https://arxiv.org/abs/2506.14111}
+}
+
 @misc{nemotroncc,
   title = {Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset},
   author = {Dan Su and Kezhi Kong and Ying Lin and Joseph Jennings and Brandon Norick and Markus Kliegl and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro},
app/src/content/chapters/2-setup.mdx CHANGED
@@ -39,6 +39,9 @@ Before diving into experiments, here's a quick overview of the datasets we compa
 <Accordion title="Ultra-FineWeb">
 A 1T English token and 120B Chinese token dataset created by applying efficient verification-based filtering to FineWeb. Uses a lightweight fastText classifier and optimized seed data selection to improve data quality [@ultrafineweb].
 </Accordion>
+<Accordion title="Essential-Web">
+A 24T token web dataset from 101 Common Crawl snapshots with document-level metadata for flexible curation. Each of the 23.6B documents is annotated with subject classification, document type, content complexity, and quality scores using the [EAI-Taxonomy-0.5b](https://huggingface.co/EssentialAI/eai-taxonomy-0.5b) classifier, enabling researchers to filter domain-specific subsets without building custom pipelines [@essentialweb].
+</Accordion>
 <Accordion title="Nemotron-HQ-Synth">
 Part of Nemotron-CC, a 6.3T token dataset using classifier ensembling and synthetic data rephrasing. The High-Quality-Synthetic subset contains synthetically rephrased data using Qwen3-30B-A3B [@qwen3] [@nemotroncc].
 </Accordion>
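The Essential-Web description above mentions filtering domain-specific subsets via document-level taxonomy annotations. A minimal sketch of that filtering pattern on in-memory records; the field names (`subject`, `quality_score`) are illustrative assumptions, not the dataset's actual schema:

```python
# Hedged sketch: selecting a domain-specific subset by per-document
# taxonomy metadata, in the spirit of Essential-Web's annotations.
# NOTE: "subject" and "quality_score" are hypothetical field names.

def filter_docs(docs, subject, min_quality):
    """Keep documents with a matching subject label and quality above a threshold."""
    return [d for d in docs
            if d["subject"] == subject and d["quality_score"] >= min_quality]

docs = [
    {"id": 1, "subject": "medicine", "quality_score": 0.9},
    {"id": 2, "subject": "medicine", "quality_score": 0.3},
    {"id": 3, "subject": "law",      "quality_score": 0.8},
]

subset = filter_docs(docs, subject="medicine", min_quality=0.5)
print([d["id"] for d in subset])  # → [1]
```

In practice the same predicate would be applied while streaming the dataset shard by shard, rather than materializing all 23.6B documents in memory.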