language:
  - en
license: mit
tags:
  - gpt2
  - causal-lm
  - text-generation
  - slimgpt
  - transformer
  - from-scratch
pipeline_tag: text-generation

slimGPT: 124M Parameter GPT-Style Language Model

slimGPT is a 124-million-parameter autoregressive language model built from scratch in a clean, modular PyTorch codebase. It follows the GPT-2 small architecture and was trained on a single NVIDIA L4 cloud GPU, showing that a model of this size can be trained without large-scale infrastructure.


Model Details

| Property | Value |
| --- | --- |
| Architecture | GPT-2 style (decoder-only Transformer) |
| Parameters | ~124 million |
| Layers | 12 |
| Attention Heads | 12 |
| Embedding Dim | 768 |
| Context Length | 1024 tokens |
| Vocabulary | GPT-2 BPE tokenizer (50,257 tokens) |
| Training Iters | 5,000 |
| Best Val Loss | 3.3079 |
| License | MIT |
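The ~124 million figure can be reproduced from the values in the table. The sketch below is a back-of-the-envelope count, not code from the slimGPT repository. It assumes the usual GPT-2 conventions that the table does not state explicitly: an MLP hidden width of 4x the embedding dimension, bias terms on every linear layer and LayerNorm, and an output head that shares its weights with the token embedding.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Count parameters for a GPT-2-style model with tied input/output embeddings."""
    wte = vocab_size * n_embd   # token embeddings (shared with the output head)
    wpe = block_size * n_embd   # learned positional embeddings
    ln = 2 * n_embd             # LayerNorm weight + bias
    # Attention: fused QKV projection plus output projection, both with biases.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    hidden = 4 * n_embd         # assumed MLP width (GPT-2 convention)
    mlp = (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)
    block = ln + attn + ln + mlp
    return wte + wpe + n_layer * block + ln  # trailing ln = final LayerNorm

total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
# prints: 124,439,808 parameters (~124M)
```

Without weight tying the output head would add another 50,257 x 768 parameters, which is why tying matters at this scale.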

Training Infrastructure

The model was trained on a single-GPU cloud instance with the following specifications:

| Component | Specification |
| --- | --- |
| OS | Debian GNU/Linux 12 (Bookworm) |
| CPU | Intel Xeon @ 2.20 GHz (4 vCPUs, 2 physical cores, 2 threads/core) |
| RAM | 16 GiB |
| Storage | 60 GB NVMe |
| GPU | NVIDIA L4 |
| VRAM | 24 GB |
| NVIDIA Driver | 550.54.15 |

Training was completed without any distributed setup; a single NVIDIA L4 GPU was sufficient for the full training run.


Architecture Overview

slimGPT follows the standard GPT-2 decoder-only Transformer architecture:

  • Token + positional embeddings: learned embeddings over the GPT-2 BPE vocabulary with 1024-token positional encodings
  • 12 Transformer blocks: each with multi-head causal self-attention (12 heads) and a position-wise feed-forward network
  • Pre-norm design: LayerNorm applied before attention and MLP sub-layers
  • Weight tying: input embedding and output projection weights are tied
  • Causal masking: autoregressive, left-to-right generation
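To make the causal-masking point concrete, here is a minimal pure-Python sketch of how attention weights are computed for a single head under a causal mask. It is an illustration only and not the slimGPT implementation, which operates on batched PyTorch tensors; the function name and the toy score matrix are made up for this example.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend only to positions 0..i (itself and the past).
        visible = scores[i][: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / z for e in exps] + [0.0] * (n - i - 1))
    return weights

# Toy 3x3 matrix of raw attention scores (query position x key position).
scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

Each row sums to 1 and everything above the diagonal is zero, so the large score of 5.0 at position (1, 2) has no effect: token 1 cannot see token 2.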

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("samueljayasingh/slimGPT")
model = AutoModelForCausalLM.from_pretrained("samueljayasingh/slimGPT")

ids = tokenizer("The meaning of life is", return_tensors="pt").input_ids
output = model.generate(ids, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
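
The `temperature` and `top_p` arguments above control how the next token is drawn. The following is a rough standalone sketch of that idea (temperature scaling followed by nucleus filtering), written with the standard library only. It is not the transformers implementation, and the function names and toy logits are hypothetical.

```python
import math
import random

def top_p_filter(logits, temperature=0.8, top_p=0.9):
    """Return {token_id: probability} for the nucleus kept by temperature + top-p."""
    scaled = [x / temperature for x in logits]  # temperature < 1 sharpens the distribution
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalize over the nucleus

def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    nucleus = top_p_filter(logits, **kwargs)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

toy_logits = [4.0, 3.0, 1.0, -2.0]  # hypothetical logits over a 4-token vocabulary
print(top_p_filter(toy_logits))
print(sample(toy_logits, rng=random.Random(0)))
```

With these toy logits only tokens 0 and 1 survive the filter, so the two unlikely tokens can never be sampled. Lowering `top_p` or `temperature` makes the output more conservative; raising them makes it more varied.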

Pipeline API

from transformers import pipeline

generator = pipeline("text-generation", model="samueljayasingh/slimGPT")
result = generator("Once upon a time,", max_new_tokens=80, do_sample=True)
print(result[0]["generated_text"])

Serving with vLLM

pip install vllm
vllm serve "samueljayasingh/slimGPT"

curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "samueljayasingh/slimGPT",
    "prompt": "The future of AI is",
    "max_tokens": 100,
    "temperature": 0.7
  }'
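
The same request can be sent from Python without extra dependencies. The sketch below mirrors the curl command using `urllib`; the helper names are made up for this example, and it assumes the server started above is listening on `localhost:8000` and returns an OpenAI-style completions response with the text under `choices[0].text`.

```python
import json
import urllib.request

def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="samueljayasingh/slimGPT",
                             max_tokens=100, temperature=0.7):
    """Build the same POST request as the curl command."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# With the vLLM server running:
# print(complete("The future of AI is"))
```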

Intended Use

This model is intended for:

  • Research and experimentation: studying language model behavior, attention patterns, and generation dynamics at the 124M scale
  • Educational purposes: understanding GPT-style architectures by working with a fully transparent, from-scratch implementation
  • Prototyping: lightweight text generation for downstream tasks, fine-tuning experiments, or benchmarking

Out-of-Scope Use

  • Production or safety-critical applications
  • Tasks requiring factual accuracy or up-to-date knowledge
  • Any use that relies on instruction-following or alignment: this is a base language model with no RLHF or instruction tuning

Limitations

  • Trained for only 5,000 iterations: the model can produce coherent text continuations but has not converged to the quality of a fully trained GPT-2
  • No fine-tuning or alignment: outputs are raw continuations and may be incoherent, biased, or off-topic
  • English-only: trained on English text; performance on other languages has not been evaluated
  • Context window of 1024 tokens: longer inputs must be truncated to fit
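
For the context-window limit, one simple way to handle long prompts is to keep only the most recent tokens and leave room for the tokens to be generated. The helper below is a hypothetical sketch; keeping the tail rather than the head of the prompt is an assumption that suits text continuation, not something the model requires.

```python
def fit_to_context(token_ids, max_new_tokens=0, context_length=1024):
    """Keep the most recent tokens so prompt + generation fits in the context window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_length")
    return token_ids[-budget:]

long_prompt = list(range(3000))  # stand-in for real token IDs
kept = fit_to_context(long_prompt, max_new_tokens=100)
print(len(kept), kept[0], kept[-1])
# prints: 924 2076 2999
```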

Training Details

The model was trained using a clean, readable PyTorch implementation with the following highlights:

  • Optimizer: AdamW with cosine learning rate decay and linear warmup
  • Tokenizer: GPT-2 BPE (via tiktoken)
  • Data: OpenWebText-style dataset sampled in token chunks of length 1024
  • Mixed precision: torch.autocast with bfloat16 on the NVIDIA L4 GPU
  • Gradient clipping: Applied to stabilize training
  • Checkpointing: Best model saved based on validation loss
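
The learning-rate schedule named above (linear warmup followed by cosine decay) can be sketched in a few lines. Only the 5,000-iteration horizon comes from this card; the peak rate, floor, and warmup length below are placeholder values chosen for illustration, not the hyperparameters used to train slimGPT.

```python
import math

def lr_at(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=100, max_iters=5000):
    """Linear warmup to max_lr, then cosine decay to min_lr at max_iters."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

for it in (0, 99, 100, 2550, 4999, 5000):
    print(it, f"{lr_at(it):.2e}")
```

In a training loop the returned value would typically be written into each optimizer parameter group before the step.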

Training Runtime

  • Hardware: NVIDIA L4 (24 GB VRAM), 4 vCPUs, 16 GiB RAM
  • Training iterations: 5,000
  • Total training time: ~18 hours
  • Average time per iteration: ~13 seconds
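
The two timing figures are consistent with each other, as a quick check shows. Both inputs are the approximate values listed above.

```python
iterations = 5_000
seconds_per_iteration = 13  # approximate average from the list above
total_hours = iterations * seconds_per_iteration / 3600
print(f"{total_hours:.1f} hours")
# prints: 18.1 hours
```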

Evaluation

| Metric | Value |
| --- | --- |
| Best Val Loss | 3.3079 |
| Training Iters | 5,000 |

Perplexity can be approximated as exp(3.3079) ≈ 27.3. For reference, a fully trained GPT-2 small achieves a perplexity of roughly 18–22 on OpenWebText; slimGPT sits in a reasonable range for its training budget.
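
The conversion is a one-liner, assuming the validation loss is the mean cross-entropy per token measured in nats (the default for PyTorch's cross-entropy loss):

```python
import math

val_loss = 3.3079  # best validation loss, assumed to be mean cross-entropy in nats per token
perplexity = math.exp(val_loss)
print(f"{perplexity:.1f}")
# prints: 27.3
```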

Eval Summary

Training loss

[Figure: loss curve]

Perplexity comparison

[Figure: perplexity]


Citation

If you use this model in your work, please credit:

@misc{slimgpt2026,
  author       = {Samuel Jayasingh},
  title        = {slimGPT: A 124M GPT-2-style language model trained from scratch},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/samueljayasingh/slimGPT}}
}

Credits

Inspired by Andrej Karpathy's "Let's reproduce GPT-2 (124M)" tutorial: https://www.youtube.com/watch?v=l8pRSuU81PU

Special thanks to Andrej Karpathy for making modern LLM training and implementation accessible through open educational content.


License

This model is released under the MIT License.