Update README.md
# Hugging Face Smol Models Research

This is the home for smol models (SmolLM & SmolVLM) and high-quality pre-training datasets. We released:

**News 🗞️**

- **The Smol Training Playbook**: a comprehensive guide to training world-class LLMs: [HuggingFaceTB/smol-training-playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)

**Past releases**

- [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu): a filtered version of the FineWeb dataset for educational content; paper available [here](https://huggingface.co/papers/2406.17557).
- [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia): the largest open synthetic dataset, with 25B tokens and 30M samples of synthetic textbooks, blog posts, and stories generated by Mixtral. Blog post available [here](https://huggingface.co/blog/cosmopedia).
- [Smollm-Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus): the pre-training corpus of SmolLM, combining **Cosmopedia v0.2**, **FineWeb-Edu dedup**, and **Python-Edu**. Blog post available [here](https://huggingface.co/blog/smollm).
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/UXSo9zzL7PFvrLCAQfcnz.png" width="700"/>
</div>