---
language:
- en
license: mit
tags:
- gpt2
- causal-lm
- text-generation
- slimgpt
- transformer
- from-scratch
pipeline_tag: text-generation
---

# slimGPT: 124M Parameter GPT-Style Language Model

**slimGPT** is a 124-million-parameter autoregressive language model built from scratch using a clean, modular PyTorch codebase. It follows the GPT-2 small architecture and was trained entirely on consumer-accessible hardware, demonstrating that capable language model training is achievable without large-scale infrastructure.

---

## Model Details

| Property | Value |
|---------------------|----------------------------------------|
| **Architecture** | GPT-2 style (decoder-only Transformer) |
| **Parameters** | ~124 million |
| **Layers** | 12 |
| **Attention Heads** | 12 |
| **Embedding Dim** | 768 |
| **Context Length** | 1024 tokens |
| **Vocabulary** | GPT-2 BPE tokenizer (50,257 tokens) |
| **Training Iters** | 5,000 |
| **Best Val Loss** | 3.3079 |
| **License** | MIT |

---

## Training Infrastructure

The model was trained on a single-GPU cloud instance with the following specifications:

| Component | Specification |
|-------------------|-------------------------------------------------------------------|
| **OS** | Debian GNU/Linux 12 (Bookworm) |
| **CPU** | Intel Xeon @ 2.20 GHz (4 vCPUs, 2 physical cores, 2 threads/core) |
| **RAM** | 16 GiB |
| **Storage** | 60 GB NVMe |
| **GPU** | NVIDIA L4 |
| **VRAM** | 24 GB |
| **NVIDIA Driver** | 550.54.15 |

Training was completed without any distributed setup; a single NVIDIA L4 GPU was sufficient for the full training run.

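
As a rough sanity check on both the ~124M figure and the single-GPU claim, the parameter count can be derived from the dimensions in the Model Details table. The sketch below is not taken from the slimGPT codebase. It assumes a GPT-2-style MLP hidden width of 4 × 768, bias terms on every linear layer and LayerNorm, and fp32 weights, gradients, and AdamW moments. None of these details are stated in this card, and activation memory is ignored.

```python
# Dimensions taken from the Model Details table.
n_layer, n_embd, vocab_size, block_size = 12, 768, 50257, 1024
# Assumption (not stated in this card): MLP hidden width is 4 * n_embd, as in GPT-2.
d_ff = 4 * n_embd


def count_parameters() -> int:
    """Count parameters of a GPT-2-style model with tied input/output embeddings."""
    embeddings = vocab_size * n_embd + block_size * n_embd  # token + positional tables
    layer_norm = 2 * n_embd  # scale + bias (bias is an assumption)
    # Fused QKV projection plus the attention output projection, both with biases.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two-layer feed-forward network with biases.
    mlp = (n_embd * d_ff + d_ff) + (d_ff * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    # The output projection shares the token embedding matrix, so it adds nothing.
    return embeddings + n_layer * block + layer_norm  # last term: final LayerNorm


n_params = count_parameters()

# Assumption: fp32 weights, fp32 gradients, and two fp32 AdamW moment buffers.
bytes_per_param = 4 + 4 + 4 + 4
state_gib = n_params * bytes_per_param / 2**30

print(f"{n_params:,} parameters")
print(f"~{state_gib:.2f} GiB for weights, gradients, and optimizer state")
```

Under these assumptions the persistent training state is a small fraction of the L4's 24 GB, which leaves most of the memory for activations and is consistent with training on one GPU.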

---

## Architecture Overview

slimGPT follows the standard GPT-2 decoder-only Transformer architecture:

- **Token + positional embeddings**: learned token embeddings over the GPT-2 BPE vocabulary, plus learned positional embeddings for 1024 positions
- **12 Transformer blocks**: each with multi-head causal self-attention (12 heads) and a position-wise feed-forward network
- **Pre-norm design**: LayerNorm applied before the attention and MLP sub-layers
- **Weight tying**: input embedding and output projection weights are tied
- **Causal masking**: autoregressive, left-to-right generation

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("samueljayasingh/slimGPT")
model = AutoModelForCausalLM.from_pretrained("samueljayasingh/slimGPT")

# Encode a prompt and sample a continuation with temperature and nucleus sampling.
ids = tokenizer("The meaning of life is", return_tensors="pt").input_ids
output = model.generate(ids, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Pipeline API

```python
from transformers import pipeline

generator = pipeline("text-generation", model="samueljayasingh/slimGPT")
result = generator("Once upon a time,", max_new_tokens=80, do_sample=True)
print(result[0]["generated_text"])
```

### Serving with vLLM

```bash
pip install vllm

# Start the OpenAI-compatible server (this command keeps running).
vllm serve "samueljayasingh/slimGPT"

# From a second terminal, send a completion request.
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "samueljayasingh/slimGPT",
    "prompt": "The future of AI is",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

---

## Intended Use

This model is intended for:

- **Research and experimentation**: studying language model behavior, attention patterns, and generation dynamics at the 124M scale
- **Educational purposes**: understanding GPT-style architectures by working with a fully transparent, from-scratch implementation
- **Prototyping**: lightweight text generation for downstream tasks, fine-tuning experiments, or benchmarking


### Out-of-Scope Use

- Production or safety-critical applications
- Tasks requiring factual accuracy or up-to-date knowledge
- Any use that relies on instruction-following or alignment: this is a base language model with no RLHF or instruction tuning

---

## Limitations

- Trained for only **5,000 iterations**: the model can produce coherent text continuations but has not converged to the quality of a fully trained GPT-2
- **No fine-tuning or alignment**: outputs are raw continuations and may be incoherent, biased, or off-topic
- **English-only**: trained on English text; performance on other languages has not been evaluated
- **Context window of 1024 tokens**: longer inputs must be truncated to fit

---

## Training Details

The model was trained using a clean, readable PyTorch implementation with the following highlights:

- **Optimizer**: AdamW with cosine learning rate decay and linear warmup
- **Tokenizer**: GPT-2 BPE (via `tiktoken`)
- **Data**: OpenWebText-style dataset sampled in token chunks of length 1024
- **Mixed precision**: `torch.autocast` with `bfloat16` on the NVIDIA L4 GPU
- **Gradient clipping**: applied to stabilize training
- **Checkpointing**: best model saved based on validation loss

---

### Training Runtime

- **Hardware**: NVIDIA L4 (24 GB VRAM), 4 vCPUs, 16 GiB RAM
- **Training iterations**: 5,000
- **Total training time**: ~18 hours
- **Average time per iteration**: ~13 seconds

---

## Evaluation

| Metric | Value |
|----------------|--------|
| Best Val Loss | 3.3079 |
| Training Iters | 5,000 |

Perplexity can be approximated as `exp(3.3079) ≈ 27.3`. For reference, a fully trained GPT-2 small achieves a perplexity of roughly 18–22 on OpenWebText; slimGPT's value of about 27.3 is higher, which is consistent with its 5,000-iteration training budget.

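
The perplexity and runtime figures reported above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the validation loss is a mean per-token cross-entropy in nats (the convention PyTorch's cross-entropy loss uses) and treats the ~13 s per iteration as an average over the whole run.

```python
import math

# Best validation loss reported in the Evaluation table.
# Assumption: mean per-token cross-entropy measured in nats.
val_loss = 3.3079
perplexity = math.exp(val_loss)

# Figures from the Training Runtime list.
iters = 5_000
sec_per_iter = 13  # approximate average
total_hours = iters * sec_per_iter / 3600

print(f"perplexity ~ {perplexity:.1f}")
print(f"estimated wall-clock time ~ {total_hours:.1f} h")
```

Both results agree with the rounded values quoted in this card (a perplexity of about 27.3 and roughly 18 hours of training).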

![Eval Summary](images/eval_summary.png)

### Training loss

![Loss Curve](images/loss_curve.png)

### Perplexity comparison

![Perplexity](images/perplexity_comparison.png)

---

## Citation

If you use this model in your work, please credit:

```
@misc{slimgpt2026,
  author       = {Samuel Jayasingh},
  title        = {slimGPT: A 124M GPT-2-style language model trained from scratch},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/samueljayasingh/slimGPT}}
}
```

---

## Credits

Inspired by Andrej Karpathy's "Let's reproduce GPT-2 (124M)" tutorial: https://www.youtube.com/watch?v=l8pRSuU81PU

Special thanks to Andrej Karpathy for making modern LLM training and implementation accessible through open educational content.

---

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).