| --- |
| language: |
| - en |
| license: mit |
| tags: |
| - gpt2 |
| - causal-lm |
| - text-generation |
| - slimgpt |
| - transformer |
| - from-scratch |
| pipeline_tag: text-generation |
| --- |
| |
| # slimGPT: 124M-Parameter GPT-Style Language Model |
|
|
| **slimGPT** is a 124-million-parameter autoregressive language model built from scratch using a clean, modular PyTorch codebase. It follows the GPT-2 small architecture and was trained entirely on consumer-accessible hardware, demonstrating that capable language model training is achievable without large-scale infrastructure. |
|
|
| --- |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |------------------|--------------------------| |
| | **Architecture** | GPT-2 style (decoder-only Transformer) | |
| | **Parameters** | ~124 million | |
| | **Layers** | 12 | |
| | **Attention Heads** | 12 | |
| | **Embedding Dim**| 768 | |
| | **Context Length**| 1024 tokens | |
| | **Vocabulary** | GPT-2 BPE tokenizer (50,257 tokens) | |
| | **Training Iters**| 5,000 | |
| | **Best Val Loss**| 3.3079 | |
| | **License** | MIT | |
|
|
| --- |
|
|
| ## Training Infrastructure |
|
|
| The model was trained on a single-GPU cloud instance with the following specifications: |
|
|
| | Component | Specification | |
| |------------------|--------------------------------------| |
| | **OS** | Debian GNU/Linux 12 (Bookworm) | |
| | **CPU** | Intel Xeon @ 2.20 GHz (4 vCPUs, 2 physical cores, 2 threads/core) | |
| | **RAM** | 16 GiB | |
| | **Storage** | 60 GB NVMe | |
| | **GPU** | NVIDIA L4 | |
| | **VRAM** | 24 GB | |
| | **NVIDIA Driver**| 550.54.15 | |
|
|
| Training did not require a distributed setup: a single NVIDIA L4 GPU was sufficient for the full run. |
|
|
| --- |
|
|
| ## Architecture Overview |
|
|
| slimGPT follows the standard GPT-2 decoder-only Transformer architecture: |
|
|
| - **Token + positional embeddings**: learned embeddings over the GPT-2 BPE vocabulary, with positional embeddings covering 1024 tokens |
| - **12 Transformer blocks**: each with multi-head causal self-attention (12 heads) and a position-wise feed-forward network |
| - **Pre-norm design**: LayerNorm applied before the attention and MLP sub-layers |
| - **Weight tying**: the input embedding and output projection share the same weights |
| - **Causal masking**: each position attends only to earlier positions, which enables autoregressive, left-to-right generation |
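
As a rough sanity check on the ~124M figure, the parameter count can be derived from the dimensions listed in this card. The sketch below assumes the standard GPT-2 small layout, which the card does not spell out: a 4x MLP expansion, bias terms on every linear layer, two LayerNorms per block plus a final one, and tied embeddings counted once. `count_gpt2_params` is a hypothetical helper for illustration, not part of the slimGPT codebase.

```python
def count_gpt2_params(vocab_size=50257, n_ctx=1024, n_layer=12, d_model=768, mlp_ratio=4):
    """Count parameters of a GPT-2-style model under the assumed layout."""
    d_ff = mlp_ratio * d_model  # assumption: 4x MLP expansion, as in GPT-2

    token_emb = vocab_size * d_model  # shared with the output projection (weight tying)
    pos_emb = n_ctx * d_model         # one vector per position

    layer_norm = 2 * d_model                         # scale + shift
    attn_qkv = d_model * 3 * d_model + 3 * d_model   # fused Q, K, V projection + bias
    attn_out = d_model * d_model + d_model           # attention output projection + bias
    mlp_in = d_model * d_ff + d_ff                   # first MLP layer + bias
    mlp_out = d_ff * d_model + d_model               # second MLP layer + bias
    per_block = 2 * layer_norm + attn_qkv + attn_out + mlp_in + mlp_out

    final_norm = 2 * d_model
    return token_emb + pos_emb + n_layer * per_block + final_norm


total = count_gpt2_params()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the total comes to 124,439,808. Because of weight tying, the output projection adds no parameters beyond the token embedding.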
|
|
| --- |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| tokenizer = AutoTokenizer.from_pretrained("samueljayasingh/slimGPT") |
| model = AutoModelForCausalLM.from_pretrained("samueljayasingh/slimGPT") |
| |
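| inputs = tokenizer("The meaning of life is", return_tensors="pt") |
| # Pass the attention mask with the input IDs; GPT-2 has no pad token, so EOS is reused for padding |
| output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9, pad_token_id=tokenizer.eos_token_id) |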
| print(tokenizer.decode(output[0], skip_special_tokens=True)) |
| ``` |
|
|
| ### Pipeline API |
|
|
| ```python |
| from transformers import pipeline |
| |
| generator = pipeline("text-generation", model="samueljayasingh/slimGPT") |
| result = generator("Once upon a time,", max_new_tokens=80, do_sample=True) |
| print(result[0]["generated_text"]) |
| ``` |
|
|
| ### Serving with vLLM |
|
|
| ```bash |
| pip install vllm |
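| vllm serve "samueljayasingh/slimGPT" |
| |
| # vllm serve runs in the foreground; send the request from a second terminal once the server is up |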
| curl -X POST "http://localhost:8000/v1/completions" \ |
| -H "Content-Type: application/json" \ |
| --data '{ |
| "model": "samueljayasingh/slimGPT", |
| "prompt": "The future of AI is", |
| "max_tokens": 100, |
| "temperature": 0.7 |
| }' |
| ``` |
|
|
| --- |
|
|
| ## Intended Use |
|
|
| This model is intended for: |
|
|
| - **Research and experimentation**: studying language model behavior, attention patterns, and generation dynamics at the 124M scale |
| - **Educational purposes**: understanding GPT-style architectures by working with a fully transparent, from-scratch implementation |
| - **Prototyping**: lightweight text generation for downstream tasks, fine-tuning experiments, or benchmarking |
|
|
| ### Out-of-Scope Use |
|
|
| - Production or safety-critical applications |
| - Tasks requiring factual accuracy or up-to-date knowledge |
| - Any use that relies on instruction-following or alignment; this is a base language model with no RLHF or instruction tuning |
|
|
| --- |
|
|
| ## Limitations |
|
|
| - **Trained for only 5,000 iterations**: the model can produce coherent text continuations but has not converged to the quality of a fully trained GPT-2 |
| - **No fine-tuning or alignment**: outputs are raw continuations and may be incoherent, biased, or off-topic |
| - **English-only**: trained on English text; performance on other languages has not been evaluated |
| - **Context window of 1024 tokens**: inputs longer than 1024 tokens must be truncated or split |
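
One way to work within the 1024-token limit during generation is to keep only the most recent tokens of a long prompt, leaving room for the tokens still to be generated. The sketch below operates on plain lists of token IDs. `fit_to_context` and its parameters are hypothetical names for illustration, and dropping the oldest tokens is just one reasonable policy.

```python
def fit_to_context(token_ids, max_new_tokens, context_length=1024):
    """Keep the most recent tokens so that prompt + generation fits in the context window."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens  # tokens available for the prompt
    return token_ids[-budget:]                # drops the oldest tokens only if too long


prompt = list(range(3000))  # stand-in for a long tokenized document
trimmed = fit_to_context(prompt, max_new_tokens=100)
print(len(trimmed), trimmed[0], trimmed[-1])
```

Prompts that already fit within the budget are returned unchanged.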
|
|
| --- |
|
|
| ## Training Details |
|
|
| The model was trained using a clean, readable PyTorch implementation with the following highlights: |
|
|
| - **Optimizer**: AdamW with cosine learning rate decay and linear warmup |
| - **Tokenizer**: GPT-2 BPE (via `tiktoken`) |
| - **Data**: OpenWebText-style dataset sampled in token chunks of length 1024 |
| - **Mixed precision**: `torch.autocast` with `bfloat16` on the NVIDIA L4 GPU |
| - **Gradient clipping**: Applied to stabilize training |
| - **Checkpointing**: Best model saved based on validation loss |
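
The learning rate schedule named above (linear warmup followed by cosine decay) can be sketched as a function of the step index. Only the 5,000-iteration count comes from this card; `max_lr`, `min_lr`, and `warmup_steps` are placeholder values chosen for illustration, and `lr_at` is a hypothetical helper rather than the actual training code.

```python
import math


def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=200, max_steps=5000):
    """Linear warmup, then cosine decay (all defaults except max_steps are placeholders)."""
    if step < warmup_steps:
        # Linear ramp from max_lr / warmup_steps up to max_lr
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, 199, 200, 2600, 4999):
    print(step, f"{lr_at(step):.2e}")
```

In a PyTorch training loop, a value like this would typically be written into each of the optimizer's parameter groups before every step.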
|
|
| --- |
|
|
| ### Training Runtime |
|
|
| - **Hardware**: NVIDIA L4 (24 GB VRAM), 4 vCPUs, 16 GiB RAM |
| - **Training iterations**: 5,000 |
| - **Total training time**: ~18 hours |
| - **Average time per iteration**: ~13 seconds |
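
The two timing figures above are consistent with each other, as a quick check shows. This uses only the approximate numbers reported in the list.

```python
iterations = 5_000
seconds_per_iteration = 13  # approximate average reported above

total_seconds = iterations * seconds_per_iteration
total_hours = total_seconds / 3600
print(f"{total_seconds:,} s is about {total_hours:.1f} hours")
```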
|
|
| --- |
|
|
| ## Evaluation |
|
|
| | Metric | Value | |
| |----------------|---------| |
| | Best Val Loss | 3.3079 | |
| | Training Iters | 5,000 | |
|
|
| Perplexity can be approximated as `exp(3.3079) ≈ 27.3`. For reference, a fully trained GPT-2 small achieves a perplexity of roughly 18–22 on OpenWebText, so slimGPT sits in a reasonable range for its 5,000-iteration training budget. |
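
The conversion from validation loss to perplexity can be reproduced in a few lines. This sketch assumes the reported loss is the mean per-token cross-entropy in nats, which is what PyTorch's cross-entropy loss returns by default. The bits-per-token figure is an extra derived quantity, not a number reported in this card.

```python
import math

best_val_loss = 3.3079  # assumed: mean cross-entropy per token, in nats

perplexity = math.exp(best_val_loss)
bits_per_token = best_val_loss / math.log(2)  # convert nats to bits
print(f"perplexity: {perplexity:.1f}, bits per token: {bits_per_token:.2f}")
```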
|
|
|  |
|
|
| ### Training loss |
|  |
|
|
| ### Perplexity comparison |
|  |
|
|
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this model in your work, please credit: |
|
|
| ``` |
| @misc{slimgpt2026, |
| author = {Samuel Jayasingh}, |
| title = {slimGPT: A 124M GPT-2-style language model trained from scratch}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| howpublished = {\url{https://huggingface.co/samueljayasingh/slimGPT}} |
| } |
| ``` |
| --- |
|
|
| ## Credits |
|
|
| Inspired by Andrej Karpathy's "Let's reproduce GPT-2 (124M)" tutorial: https://www.youtube.com/watch?v=l8pRSuU81PU |
| Special thanks to Andrej Karpathy for making modern LLM training and implementation accessible through open educational content. |
|
|
| --- |
|
|
| ## License |
|
|
| This model is released under the [MIT License](https://opensource.org/licenses/MIT). |