guipenedo committed on
Commit
c6b5dd9
·
verified ·
1 Parent(s): 90a0d08

Update README.md

Files changed (1)
  1. README.md +6 -10
README.md CHANGED
@@ -7,15 +7,11 @@ sdk: static
  pinned: false
  ---
 
- # 🤗 HuggingFace 🍷 FineWeb datasets
- _Read our [technical report](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)!_
+ # 🍷 FineData
 
- This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web ([CommonCrawl](https://commoncrawl.org/)), released under a permissive license ([ODC-By](https://opendatacommons.org/licenses/by/1-0/)).
+ This is the home of the 🍷 **FineData** team, a branch of the 🤗 **Hugging Face** [Science Team](https://hf.co/science) releasing large scale pre-training datasets to accelerate open LLM development.
 
- The creation of 🍷 FineWeb involved careful processing and filtering of large amounts of web data with the aim of lowering the barriers to entry to anyone intending to pretrain high-performance large language models.
-
- All code and artefacts needed for reproduction are public and built on top of open source libraries, such as the 🤗 libraries [`datatrove`](https://github.com/huggingface/datatrove/), [`nanotron`](https://github.com/huggingface/nanotron/) or [`lighteval`](https://github.com/huggingface/lighteval/).
-
- Version 1 of the 🍷 FineWeb dataset is available [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb). Our ablation models can be found [here](https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32).
-
- Version 2 of the 🥂 FineWeb dataset (multilingual extension to +1800 languages/script) is available [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2).
+ - **[🍷 FineWeb](https://huggingface.co/collections/HuggingFaceFW/fineweb-662458592d61edba3d2f245d)**: A 15T tokens English dataset for LLM pre-training. See the [blogpost](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [paper](https://arxiv.org/abs/2406.17557).
+ - **[📚 FineWeb-Edu](https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd)**: a filtered subset of the most educational content from FineWeb.
+ - **[🥂 FineWeb2](https://huggingface.co/collections/HuggingFaceFW/fineweb2-6755657a481dae41e8fbba4d)**: an extension of FineWeb to over 1000 languages. See the [paper](https://arxiv.org/abs/2506.20920).
+ - **[📄 FinePDFs](https://huggingface.co/collections/HuggingFaceFW/finepdfs-68bd02d20928419c1dc12296)**: 3T tokens of text data extracted from PDFs sourced from the Web.