---
title: README
emoji: 👀
colorFrom: purple
colorTo: pink
sdk: static
pinned: false
---

# 🍷 FineData

This is the home of the 🍷 **FineData** team, a branch of the 🤗 **Hugging Face** [Science Team](https://hf.co/science) releasing large-scale pre-training datasets to accelerate open LLM development.

- **[🍷 FineWeb](https://huggingface.co/collections/HuggingFaceFW/fineweb-662458592d61edba3d2f245d)**: a 15T-token English dataset for LLM pre-training. See the [blogpost](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [paper](https://arxiv.org/abs/2406.17557).
- **[📚 FineWeb-Edu](https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd)**: a filtered subset of the most educational content from FineWeb.
- **[🥂 FineWeb2](https://huggingface.co/collections/HuggingFaceFW/fineweb2-6755657a481dae41e8fbba4d)**: an extension of FineWeb to over 1000 languages. See the [paper](https://arxiv.org/abs/2506.20920).
- **[📄 FinePDFs](https://huggingface.co/collections/HuggingFaceFW/finepdfs-68bd02d20928419c1dc12296)**: 3T tokens of text data extracted from PDFs sourced from the Web. See the [blogpost](https://huggingface.co/spaces/HuggingFaceFW/FinePDFsBlog).
- **[🌐 FineWiki](https://huggingface.co/collections/HuggingFaceFW/finewiki-68f6615c6bb86563dcd5e846)**: an updated, better-extracted version of Wikipedia in 300+ languages.
- **[📄 FinePDFs-Edu](https://huggingface.co/datasets/HuggingFaceFW/finepdfs-edu)**: 350B+ highly educational tokens filtered from 📄 FinePDFs.
- **[💬 FineTranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations)**: 1+1T tokens of parallel text translated from 500+ 🥂 FineWeb2 languages.