Update README.md
README.md
CHANGED
@@ -7,8 +7,12 @@ sdk: static
 pinned: false
 ---
 
-# HuggingFace FineWeb datasets
+# 🤗 HuggingFace 🍷 FineWeb datasets
+This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web ([CommonCrawl](https://commoncrawl.org/)), released under a permissive license ([ODC-By](https://opendatacommons.org/licenses/by/1-0/)).
 
-
+The creation of 🍷 FineWeb involved careful processing and filtering of large amounts of web data with the aim of lowering the barriers to entry to anyone intending to pretrain high-performance large language models.
 
-
+All code and artefacts needed for reproduction are public and built on top of open source libraries, like the 🤗 libraries [`datatrove`](https://github.com/huggingface/datatrove/), [`nanotron`](https://github.com/huggingface/nanotron/) or [`lighteval`](https://github.com/huggingface/lighteval/).
+
+
+_Currently releasing v1_