Update README.md
README.md
CHANGED
@@ -7,15 +7,11 @@ sdk: static
 pinned: false
 ---
 
-#
-_Read our [technical report](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)!_
+# 🍷 FineData
 
-This
+This is the home of the 🍷 **FineData** team, a branch of the 🤗 **Hugging Face** [Science Team](https://hf.co/science) releasing large scale pre-training datasets to accelerate open LLM development.
 
-
-
-
-
-Version 1 of the 🍷 FineWeb dataset is available [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb). Our ablation models can be found [here](https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32).
-
-Version 2 of the 🥂 FineWeb dataset (multilingual extension to +1800 languages/script) is available [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2).
+- **[🍷 FineWeb](https://huggingface.co/collections/HuggingFaceFW/fineweb-662458592d61edba3d2f245d)**: A 15T tokens English dataset for LLM pre-training. See the [blogpost](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [paper](https://arxiv.org/abs/2406.17557).
+- **[📚 FineWeb-Edu](https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd)**: a filtered subset of the most educational content from FineWeb.
+- **[🥂 FineWeb2](https://huggingface.co/collections/HuggingFaceFW/fineweb2-6755657a481dae41e8fbba4d)**: an extension of FineWeb to over 1000 languages. See the [paper](https://arxiv.org/abs/2506.20920).
+- **[📄 FinePDFs](https://huggingface.co/collections/HuggingFaceFW/finepdfs-68bd02d20928419c1dc12296)**: 3T tokens of text data extracted from PDFs sourced from the Web.