Stories-SLM πŸ€–

This model is part of a collection of Small Language Models pretrained from scratch on the TinyStories dataset. The collection currently contains 2 pretrained models, with more on the way. The model variants in the collection range from a standard GPT to Mixture-of-Experts versions built with RoPE, Grouped-Query Attention, and RMSNorm.

Model Name: Stories-SLM

Model Description

Stories-SLM is a small language model pretrained from scratch on the TinyStories dataset. It has 53 million parameters and was trained for 10,000 steps on a single Tesla T4 GPU, using the next-token prediction objective with cross-entropy loss over 674M tokens.

  • Developed by: Namrata Thakur
  • Model type: Text Generation
  • Language(s) (NLP): English
  • License: MIT
  • Training Type: Pretraining

Model Sources

  • Repository: https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation
  • Demo [optional]: [More Information Needed]

How to Get Started with the Model

To set up Stories-SLM, follow these steps:

# Clone the repository
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation

# Create and activate a virtual environment
python -m venv env
source env/bin/activate   # on Windows: env\Scripts\activate

# Install the required packages
pip install -r requirements.txt

Uses

Stories-SLM can be used to generate short, grammatically and semantically coherent stories suitable for children.

Chainlit Interface πŸ–₯️

The easiest way to interact with Stories-SLM is through its Chainlit interface:

chainlit run app_pretrain.py

This will launch a web application where you can input text and see the model's generated responses.

Downloading from Huggingface πŸ€—

To download the model from Hugging Face and interact with it locally, first clone the repository, then run:

from transformer_blocks.gpt2 import GPT2
from gpt_Pretraining.text_generation import Text_Generation

model = GPT2.from_pretrained("NamrataThakur/Small_Language_Model_MHA_53M_Pretrained")
model.eval()

# --------------------- Check generation to make sure everything is okay ---------------------
generation = Text_Generation(model=model, device='cpu', tokenizer_model='gpt2',
                             arch_type='original')
start_context = "One day, a "
response = generation.text_generation(input_text=start_context, max_new_tokens=160,
                                      temp=0.5, top_k=10, kv_cache=False)
print(response)

Model Architecture and Objective

Stories-SLM uses a standard GPT decoder-only transformer architecture with:

  • Attention Type: Multi-Head Attention
  • Normalization: LayerNorm
  • Position Embedding: Learned absolute position encoding (similar to GPT2)
  • Num transformer blocks: 8
  • Num attention heads: 8
  • Embedding dimensions: 384
  • Vocabulary size: 50,257 tokens
  • Context window: 256 tokens
  • Feed-Forward Hidden Dimension: 1536
  • Parameters: ~53M (52.88M exact)
  • Overall Dropout: 0.2
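The stated parameter count can be sanity-checked from the configuration above. The sketch below assumes GPT-2-style layers with bias terms and an untied, bias-free output head (assumptions; the repository's exact layer definitions may differ slightly, which is why the result is close to, but not exactly, 52.88M):

```python
# Back-of-the-envelope parameter count for the configuration above.
vocab, d, ctx, n_layers, d_ff = 50257, 384, 256, 8, 1536

tok_emb = vocab * d                       # token embedding table
pos_emb = ctx * d                         # learned absolute position embeddings
attn = 3 * (d * d + d) + (d * d + d)      # Q/K/V projections + output projection (with biases)
ffn = (d * d_ff + d_ff) + (d_ff * d + d)  # two-layer feed-forward (with biases)
norms = 2 * 2 * d                         # two LayerNorms (scale + shift) per block
block = attn + ffn + norms
final_norm = 2 * d                        # final LayerNorm
lm_head = vocab * d                       # untied output head, no bias (assumption)

total = tok_emb + pos_emb + n_layers * block + final_norm + lm_head
print(f"{total / 1e6:.2f}M parameters")
```

This lands at roughly 52.9M, consistent with the ~53M (52.88M) figure above.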

Optimization Config:

  • Optimizer: AdamW
  • Weight Decay: 0.1
  • Beta1: 0.9
  • Beta2: 0.95
  • Warmup Steps: 829 steps
  • Total Steps: 10,000
  • use_gradient_clip: True
  • Initial Learning Rate: 1e-05
  • Maximum Learning Rate: 0.0008
  • Gradient Accumulation Steps: 16
  • Batch Size: 16
  • Global Batch Size: 256
  • Scheduler: Linear Increase, followed by Cosine Annealing
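The schedule above (linear warmup to the maximum learning rate, then cosine annealing over the remaining steps) can be sketched as a small function. This is an illustrative reimplementation using the configuration values listed, not the repository's actual scheduler code:

```python
import math

def get_lr(step, warmup=829, total_steps=10000,
           init_lr=1e-5, max_lr=8e-4, min_lr=0.0):
    """Linear increase to max_lr over warmup, then cosine annealing to min_lr."""
    if step < warmup:
        # Linear warmup from the initial to the maximum learning rate.
        return init_lr + (max_lr - init_lr) * step / warmup
    # Cosine decay over the remaining steps.
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Note that the global batch size follows directly from the other two settings: 16 (batch size) × 16 (gradient accumulation steps) = 256.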

Training Details

Training Data

The model was trained on the TinyStories dataset, a collection of short stories designed for training language models. This dataset provides simple narratives that help the model learn coherent story generation while maintaining a smaller size compared to larger language models.

Training Procedure

Stories-SLM was trained using PyTorch on the TinyStories dataset. The training process involved:

  1. Tokenizing the input text
  2. Creating sliding windows of fixed block size
  3. Training the model with cross-entropy loss
  4. Applying learning rate scheduling with warmup and cosine decay
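Steps 1–2 above can be sketched as follows. This is a minimal illustration of the sliding-window idea (the repository's dataset class may differ); the target sequence is the input shifted right by one token, which is what the cross-entropy loss in step 3 is computed against:

```python
def sliding_windows(token_ids, block_size=256, stride=256):
    """Split a token stream into (input, target) pairs, targets shifted by one."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - block_size, stride):
        inputs.append(token_ids[i : i + block_size])
        targets.append(token_ids[i + 1 : i + block_size + 1])
    return inputs, targets

# Toy example with a block size of 4 instead of 256:
xs, ys = sliding_windows(list(range(10)), block_size=4, stride=4)
# xs -> [[0, 1, 2, 3], [4, 5, 6, 7]]
# ys -> [[1, 2, 3, 4], [5, 6, 7, 8]]
```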

Training Plots

  • Learning Rate vs. Steps: (plot image)

  • Loss vs. Steps: (plot image)

Inference

During inference, Stories-SLM uses several techniques to produce high-quality text:

  • Temperature scaling to control randomness
  • Top-k sampling to balance focus and diversity
  • Autoregressive generation, one token at a time
  • A max-new-tokens limit to cap generation length
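The temperature and top-k steps can be sketched in a few lines of plain Python. This is illustrative only; the repository's Text_Generation class is the actual implementation (and applies these to model logits rather than a hand-written list):

```python
import math
import random

def sample_next(logits, temp=0.5, top_k=10):
    """Sample a token index: keep the top-k logits, then apply a
    temperature-scaled softmax over the survivors."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature scaling: lower temp sharpens the distribution.
    scaled = {i: logits[i] / temp for i in top}
    # Numerically stable softmax over the kept logits.
    m = max(scaled.values())
    exps = {i: math.exp(v - m) for i, v in scaled.items()}
    z = sum(exps.values())
    # Inverse-CDF sampling.
    r, acc = random.random(), 0.0
    for i, e in exps.items():
        acc += e / z
        if r <= acc:
            return i
    return top[-1]  # float-rounding fallback

# With top_k=2, only the two highest-logit indices can ever be sampled.
idx = sample_next([10.0, 1.0, 0.5, 9.0], temp=0.5, top_k=2)
```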

Results

(sample generation images; see the repository)

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: Single Tesla-T4 16GB
  • Hours used: [More Information Needed]
  • Cloud Provider: Lightning-AI

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support ❀️

If you find Stories-SLM useful, please consider starring the repository ⭐
