---
library_name: transformers
license: mit
datasets:
- roneneldan/TinyStories
language:
- en
---

# Model Card for TinyStories-19.3M

## Model Details

This is a reproduction of a 19.3 million parameter language model, built from scratch following the paper [**TinyStories: How Small Can Language Models Be and Still Speak Coherent English?**](https://arxiv.org/pdf/2305.07759). The goal of this project is to demonstrate that a very small transformer model, when trained on a simplified synthetic dataset, can generate fluent, grammatically correct, and consistent short stories.

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub.

- **Developed by:** Saurav Prateek
- **Model type:** Text Generation (Transformer decoder model)
- **Parameters:** 19.3 Million
- **Attention Layers:** 8
- **Hidden Dimension:** 256
- **Attention Heads per Layer:** 16
- **Context Window:** 512 tokens
- **Vocab Size:** ~50K (GPT-Neo Tokenizer)
- **Learning Rate:** 5e-4
- **Language(s) (NLP):** English
- **License:** MIT
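
These hyperparameters map directly onto a GPT-Neo-style decoder configuration, the architecture family used in the TinyStories paper. The snippet below is a minimal sketch of such a config rather than this repository's actual training code; in particular, the all-global attention pattern is an assumption:

```python
from transformers import GPTNeoConfig, GPTNeoForCausalLM

# Sketch of a config matching the numbers above; the real training code
# may differ (e.g., in attention pattern or intermediate size).
config = GPTNeoConfig(
    vocab_size=50257,                   # GPT-Neo tokenizer vocabulary
    hidden_size=256,                    # hidden dimension
    num_layers=8,                       # attention layers
    num_heads=16,                       # attention heads per layer
    max_position_embeddings=512,        # context window
    attention_types=[[["global"], 8]],  # assumed: global attention in every layer
)
model = GPTNeoForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
# -> 19.3M with tied input/output embeddings, matching the figure above
```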

### Model Sources

- **Repository:** https://github.com/SauravP97/tiny-stories-hf
- **Paper:** https://arxiv.org/pdf/2305.07759
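
Since the weights live on the Hub, the model can be loaded and sampled with the standard `transformers` API. A minimal sketch; the Hub id below is a hypothetical placeholder, so substitute the id of this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SauravP97/tiny-stories-hf"  # hypothetical id; replace with this repo's Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Sample a short story continuation.
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```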

## Training Details

### Training Data

The model was trained on the TinyStories dataset, which consists of synthetic short stories generated by GPT-3.5/4. The stories use a restricted vocabulary typical of a 3-year-old child.

- Source: [Hugging Face Datasets (roneneldan/TinyStories)](https://huggingface.co/datasets/roneneldan/TinyStories)
- Size: ~2GB of text data
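
The dataset can be pulled straight from the Hub with the `datasets` library; a quick sketch for inspecting it:

```python
from datasets import load_dataset

# TinyStories ships with train and validation splits.
dataset = load_dataset("roneneldan/TinyStories")
print(dataset)                            # split sizes
print(dataset["train"][0]["text"][:200])  # first story, truncated
```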

### Training Procedure

The model was trained from scratch on a single **NVIDIA A100** GPU for around 5 hours, reaching a final loss of `1.40` after `1` epoch (roughly `265K` steps). We used the **EleutherAI/gpt-neo-125M** tokenizer for both training and inference.
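
As an illustration of what this implies for preprocessing, each story can be tokenized with the GPT-Neo tokenizer and truncated to the 512-token context window. Packing multiple stories per sequence is an equally plausible choice the training script may have made instead:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")

# Truncate each story to the model's 512-token context window.
encoded = tokenizer(
    "Once upon a time, there was a little girl named Lily.",
    truncation=True,
    max_length=512,
)
print(len(encoded["input_ids"]))
```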

#### Training Hyperparameters

- **Training regime:**
  - Epochs: 1
  - Training Steps: 264,965
  - Final Loss: 1.40
  - GPU: NVIDIA A100
  - Training Time: ~5 hours
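
For reference, these numbers are consistent with a standard `Trainer`-based causal-LM run like the sketch below. This is an illustration, not the repository's actual script; the batch size of 8 is an assumption inferred from the step count (~2.12M stories / 8 ≈ 265K steps):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPTNeoConfig, GPTNeoForCausalLM, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo defines no pad token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train = (load_dataset("roneneldan/TinyStories", split="train")
         .map(tokenize, batched=True, remove_columns=["text"]))

config = GPTNeoConfig(vocab_size=50257, hidden_size=256, num_layers=8,
                      num_heads=16, max_position_embeddings=512,
                      attention_types=[[["global"], 8]])  # assumed attention pattern

trainer = Trainer(
    model=GPTNeoForCausalLM(config),
    args=TrainingArguments(output_dir="tiny-stories-19m",
                           learning_rate=5e-4,
                           num_train_epochs=1,
                           per_device_train_batch_size=8),  # assumed batch size
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```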

## Citation

- Eldan, R. and Li, Y. (2023). *TinyStories: How Small Can Language Models Be and Still Speak Coherent English?* arXiv:2305.07759. https://arxiv.org/abs/2305.07759