|
|
--- |
|
|
library_name: transformers |
|
|
license: mit |
|
|
datasets: |
|
|
- roneneldan/TinyStories |
|
|
language: |
|
|
- en |
|
|
--- |
|
|
|
|
|
# Model Card for a 3.6M-Parameter TinyStories Model
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
This is a reproduction of a 3.6 million parameter language model trained from scratch, following the paper [**TinyStories: How Small Can Language Models Be and Still Speak Coherent English?**](https://arxiv.org/pdf/2305.07759). The goal of this project is to demonstrate that a very small transformer model, when trained on a simplified synthetic dataset, can generate fluent, grammatically correct, and consistent short stories.
|
|
|
|
|
### Model Description |
|
|
|
|
|
This is the model card of a 🤗 transformers model that has been pushed to the Hub; a minimal loading sketch is shown below the specification list.
|
|
|
|
|
- **Developed by:** Saurav Prateek |
|
|
- **Model type:** Text generation (decoder-only Transformer)
|
|
- **Parameters:** 3.65 Million |
|
|
- **Attention Layers:** 8 |
|
|
- **Hidden Dimension:** 64 |
|
|
- **Attention Heads per Layer:** 16 |
|
|
- **Context Window:** 512 tokens |
|
|
- **Vocab Size:** ~50K (GPT-Neo Tokenizer) |
|
|
- **Learning Rate:** 5e-4 |
|
|
- **Language(s) (NLP):** English |
|
|
- **License:** MIT |
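
Since this checkpoint is a standard 🤗 transformers model, it can be loaded with the `AutoModelForCausalLM` API. The sketch below is illustrative: the Hub repo id is a hypothetical placeholder (derived from the GitHub repository name), and the sampling settings are arbitrary choices, not values prescribed by this project.

```python
# Minimal loading-and-generation sketch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SauravP97/tiny-stories-hf"  # hypothetical repo id; replace with the real one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Once upon a time, there was a little dog named"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 200 new tokens, staying well inside the 512-token context window.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```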
|
|
|
|
|
### Model Sources
|
|
|
|
|
|
|
|
|
|
- **Repository:** https://github.com/SauravP97/tiny-stories-hf |
|
|
- **Paper:** https://arxiv.org/pdf/2305.07759
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on the TinyStories dataset, which consists of synthetic short stories generated by GPT-3.5 and GPT-4. The stories use a restricted vocabulary typical of a 3-year-old child.
|
|
|
|
|
- Source: [Hugging Face Datasets (roneneldan/TinyStories)](https://huggingface.co/datasets/roneneldan/TinyStories) |
|
|
- Size: ~2 GB of text data
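
The dataset can be pulled straight from the Hub with the `datasets` library. A minimal sketch of loading and tokenization follows; the exact preprocessing (e.g., how documents are packed into 512-token sequences) lives in the linked repository, so plain truncation here is a simplifying assumption.

```python
# Sketch: load TinyStories and tokenize it with the GPT-Neo tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("roneneldan/TinyStories")  # "train" and "validation" splits
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

def tokenize(batch):
    # Truncate each story to the model's 512-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
print(tokenized["train"][0]["input_ids"][:10])
```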
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
The model was trained from scratch on an **NVIDIA T4** GPU for around 3 hours, reaching a loss of `2.17`. Training ran for `0.22` epochs, roughly `55K` steps.

We used the **EleutherAI/gpt-neo-125M** tokenizer for both training and inference.
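
The from-scratch implementation lives in the linked repository; purely as an illustration, a decoder of the same shape can be instantiated in transformers with `GPTNeoConfig`, which matches the GPT-Neo tokenizer's vocabulary. The use of GPT-Neo-style global attention here is an assumption, not a statement about the repository's exact architecture.

```python
# Sketch: a decoder-only config with the stated shape
# (8 layers, hidden size 64, 16 heads, 512-token context, ~50K vocab).
from transformers import GPTNeoConfig, GPTNeoForCausalLM

config = GPTNeoConfig(
    vocab_size=50257,                   # GPT-Neo tokenizer vocabulary
    max_position_embeddings=512,        # context window
    hidden_size=64,
    num_layers=8,
    num_heads=16,                       # head dimension = 64 / 16 = 4
    attention_types=[[["global"], 8]],  # global attention in all 8 layers (assumption)
)
model = GPTNeoForCausalLM(config)
print(f"{model.num_parameters() / 1e6:.2f}M parameters")  # ~3.6M
```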
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Training regime:**
  - Epochs: 0.22
  - Final loss: 2.17
  - GPU: NVIDIA T4
  - Training steps: 55,000
  - Training time: ~3 hours
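
With those settings, a run can be sketched using the standard `Trainer` API, reusing the `model`, `tokenizer`, and `tokenized` objects from the sketches above. The batch size and logging cadence below are illustrative assumptions; only the learning rate (5e-4) and step budget (~55K) come from this card.

```python
# Sketch: causal-LM training with the stated learning rate and step budget.
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo tokenizer ships without a pad token
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="tiny-stories-3.6m",
    learning_rate=5e-4,
    per_device_train_batch_size=32,  # assumption; the model is small enough for a T4
    max_steps=55_000,                # ~0.22 epochs over TinyStories
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```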
|
|
|
|
|
|
|
|
## Citation

Eldan, R., & Li, Y. (2023). *TinyStories: How Small Can Language Models Be and Still Speak Coherent English?* arXiv:2305.07759. https://arxiv.org/abs/2305.07759
|
|
|