Update README.md
README.md
CHANGED
@@ -1,10 +1,20 @@
 ---
 title: README
-emoji:
+emoji: 🐛
 colorFrom: purple
 colorTo: yellow
 sdk: static
 pinned: false
 ---
 
-
+# The Common Pile
+
+We are a group of researchers working together to collect and curate openly licensed and public domain data for training large language models.
+So far, we have released:
+
+- [The Common Pile v0.1](https://huggingface.co/collections/common-pile/common-pile-v01-raw-data-6826b454a5a6a445d0b51b37), an 8 TB dataset of text from over 30 diverse sources
+- [Comma v0.1-1T](https://huggingface.co/common-pile/comma-v0.1-1t) and [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t), 7B parameter LLMs trained on text from the Common Pile v0.1
+- The [training dataset](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset) used to train the Comma v0.1 models
+- Our [code](https://github.com/r-three/common-pile/) for collecting data from each source
+
+If you're interested in contributing, please [open an issue on GitHub](https://github.com/r-three/common-pile/issues/new)!