Stories-SLM πŸ€–

This model is part of a collection of Small Language Models pretrained from scratch on the TinyStories dataset. The collection currently contains 2 pretrained models, with more on the way. The model variants in the collection range from a standard GPT to Mixture-of-Experts versions built with RoPE, Grouped-Query Attention, and RMSNorm.

Model Name: Stories-SLM

Model Description

Stories-SLM is a small language model pretrained from scratch on the TinyStories dataset. It has 53 million parameters and was trained for 10,000 steps on a single Tesla T4 GPU, using the next-token prediction objective with cross-entropy loss over 674M tokens.

  • Developed by: Namrata Thakur
  • Model type: Text Generation
  • Language(s) (NLP): English
  • License: MIT
  • Training Type: Pretraining

Model Sources

  • Repository: https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation
  • Demo [optional]: [More Information Needed]

How to Get Started with the Model

To set up Stories-SLM, follow these steps:

# Clone the repository
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation

# Create and activate a virtual environment
python -m venv env
source env/bin/activate   # on Windows: env\Scripts\activate

# Install the required packages
pip install -r requirements.txt

Uses

Stories-SLM can be used to generate short, grammatically and semantically coherent stories suitable for children.

Chainlit Interface πŸ–₯️

The easiest way to interact with Stories-SLM is through its Chainlit interface:

chainlit run app_pretrain.py

This will launch a web application where you can input text and see the model's generated responses.

Downloading from Huggingface πŸ€—

To download the model from Hugging Face and interact with it locally, first clone the repository, then run:

from transformer_blocks.gpt2 import GPT2
from gpt_Pretraining.text_generation import Text_Generation

model = GPT2.from_pretrained("NamrataThakur/Small_Language_Model_MHA_53M_Pretrained")
model.eval()

# --------------------- Check generation to make sure everything is okay ---------------------
generation = Text_Generation(model=model, device='cpu', tokenizer_model='gpt2',
                             arch_type='original')
start_context = "One day, a "
response = generation.text_generation(input_text=start_context, max_new_tokens=160,
                                      temp=0.5, top_k=10, kv_cache=False)
print(response)

Model Architecture and Objective

Stories-SLM uses a standard GPT decoder-only transformer architecture with:

  • Attention Type: Multi-Head Attention
  • Normalization: LayerNorm
  • Position Embedding: Learned absolute position encoding (similar to GPT2)
  • Num transformer blocks: 8
  • Num attention heads: 8
  • Embedding dimensions: 384
  • Vocabulary size: 50,257 tokens
  • Context window: 256 tokens
  • Feed-Forward Hidden Dimension: 1536
  • Parameters: ~53M (52.88M exact)
  • Overall Dropout: 0.2
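The stated parameter count can be sanity-checked from the configuration above. The sketch below assumes GPT-2-style layers with bias terms and an untied, bias-free output head (assumptions; the repository's exact layer definitions may differ slightly, which is why the result is close to, but not exactly, 52.88M):

```python
# Back-of-the-envelope parameter count for the configuration above.
vocab, d, ctx, n_layers, d_ff = 50257, 384, 256, 8, 1536

tok_emb = vocab * d                       # token embedding table
pos_emb = ctx * d                         # learned absolute position embeddings
attn = 3 * (d * d + d) + (d * d + d)      # Q/K/V projections + output projection (with biases)
ffn = (d * d_ff + d_ff) + (d_ff * d + d)  # two-layer feed-forward (with biases)
norms = 2 * 2 * d                         # two LayerNorms (scale + shift) per block
block = attn + ffn + norms
final_norm = 2 * d                        # final LayerNorm
lm_head = vocab * d                       # untied output head, no bias (assumption)

total = tok_emb + pos_emb + n_layers * block + final_norm + lm_head
print(f"{total / 1e6:.2f}M parameters")
```

This lands at roughly 52.9M, consistent with the ~53M (52.88M) figure above.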

Optimization Config:

  • Optimizer: AdamW
  • Weight Decay: 0.1
  • Beta1: 0.9
  • Beta2: 0.95
  • Warmup Steps: 829 steps
  • Total Steps: 10,000
  • use_gradient_clip: True
  • Initial Learning Rate: 1e-05
  • Maximum Learning Rate: 0.0008
  • Gradient Accumulation Steps: 16
  • Batch Size: 16
  • Global Batch Size: 256
  • Scheduler: Linear Increase, followed by Cosine Annealing
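The schedule above (linear warmup to the maximum learning rate, then cosine annealing over the remaining steps) can be sketched as a small function. This is an illustrative reimplementation using the configuration values listed, not the repository's actual scheduler code:

```python
import math

def get_lr(step, warmup=829, total_steps=10000,
           init_lr=1e-5, max_lr=8e-4, min_lr=0.0):
    """Linear increase to max_lr over warmup, then cosine annealing to min_lr."""
    if step < warmup:
        # Linear warmup from the initial to the maximum learning rate.
        return init_lr + (max_lr - init_lr) * step / warmup
    # Cosine decay over the remaining steps.
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Note that the global batch size follows directly from the other two settings: 16 (batch size) × 16 (gradient accumulation steps) = 256.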

Training Details

Training Data

The model was trained on the TinyStories dataset, a collection of short stories designed for training language models. This dataset provides simple narratives that help the model learn coherent story generation while maintaining a smaller size compared to larger language models.

Training Procedure

Stories-SLM was trained using PyTorch on the TinyStories dataset. The training process involved:

  1. Tokenizing the input text
  2. Creating sliding windows of fixed block size
  3. Training the model with cross-entropy loss
  4. Applying learning rate scheduling with warmup and cosine decay
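Steps 1–2 above can be sketched as follows. This is a minimal illustration of the sliding-window idea (the repository's dataset class may differ); the target sequence is the input shifted right by one token, which is what the cross-entropy loss in step 3 is computed against:

```python
def sliding_windows(token_ids, block_size=256, stride=256):
    """Split a token stream into (input, target) pairs, targets shifted by one."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - block_size, stride):
        inputs.append(token_ids[i : i + block_size])
        targets.append(token_ids[i + 1 : i + block_size + 1])
    return inputs, targets

# Toy example with a block size of 4 instead of 256:
xs, ys = sliding_windows(list(range(10)), block_size=4, stride=4)
# xs -> [[0, 1, 2, 3], [4, 5, 6, 7]]
# ys -> [[1, 2, 3, 4], [5, 6, 7, 8]]
```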

Training Plots

  • Learning Rate vs. Steps: (plot image)

  • Loss vs. Steps: (plot image)

Inference

During inference, Stories-SLM uses several techniques to produce high-quality text:

  • Temperature scaling to control randomness
  • Top-k sampling to balance focus and diversity
  • Autoregressive generation, one token at a time
  • A max-new-tokens limit to cap generation length
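The temperature and top-k steps can be sketched in a few lines of plain Python. This is illustrative only; the repository's Text_Generation class is the actual implementation (and applies these to model logits rather than a hand-written list):

```python
import math
import random

def sample_next(logits, temp=0.5, top_k=10):
    """Sample a token index: keep the top-k logits, then apply a
    temperature-scaled softmax over the survivors."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature scaling: lower temp sharpens the distribution.
    scaled = {i: logits[i] / temp for i in top}
    # Numerically stable softmax over the kept logits.
    m = max(scaled.values())
    exps = {i: math.exp(v - m) for i, v in scaled.items()}
    z = sum(exps.values())
    # Inverse-CDF sampling.
    r, acc = random.random(), 0.0
    for i, e in exps.items():
        acc += e / z
        if r <= acc:
            return i
    return top[-1]  # float-rounding fallback

# With top_k=2, only the two highest-logit indices can ever be sampled.
idx = sample_next([10.0, 1.0, 0.5, 9.0], temp=0.5, top_k=2)
```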

Results

(sample generation images; see the repository)

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: Single Tesla-T4 16GB
  • Hours used: [More Information Needed]
  • Cloud Provider: Lightning-AI

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support ❀️

If you find Stories-SLM useful, please consider starring the repository ⭐
