---
language:
  - en
license: mit
tags:
  - gpt2
  - causal-lm
  - text-generation
  - slimgpt
  - transformer
  - from-scratch
pipeline_tag: text-generation
---

# slimGPT — 124M Parameter GPT-Style Language Model

**slimGPT** is a 124-million-parameter autoregressive language model built from scratch using a clean, modular PyTorch codebase. It follows the GPT-2 small architecture and was trained entirely on a single-GPU cloud instance (one NVIDIA L4), demonstrating that a model of this size can be trained without multi-GPU or cluster infrastructure.

---

## Model Details

| Property         | Value                    |
|------------------|--------------------------|
| **Architecture** | GPT-2 style (decoder-only Transformer) |
| **Parameters**   | ~124 million             |
| **Layers**       | 12                       |
| **Attention Heads** | 12                    |
| **Embedding Dim**| 768                      |
| **Context Length**| 1024 tokens             |
| **Vocabulary**   | GPT-2 BPE tokenizer (50,257 tokens) |
| **Training Iters**| 5,000                   |
| **Best Val Loss**| 3.3079                   |
| **License**      | MIT                      |
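
As a sanity check on the ~124M figure, the parameter count can be reproduced from the table values. This is a rough sketch that assumes the usual GPT-2 small conventions, which the card does not state explicitly: biases on every linear and LayerNorm layer, a 4x MLP expansion, and tied input/output embeddings counted once.

```python
# Rough parameter count for a GPT-2-small-shaped model.
# Assumptions (not stated in the model card): biases on every Linear and
# LayerNorm, a 4x MLP expansion, and tied input/output embeddings.
vocab_size = 50_257
block_size = 1_024
n_layer = 12
n_embd = 768


def layernorm_params(d):
    return 2 * d  # weight + bias


def linear_params(d_in, d_out):
    return d_in * d_out + d_out  # weight + bias


def block_params(d):
    attn = linear_params(d, 3 * d) + linear_params(d, d)     # fused QKV + output projection
    mlp = linear_params(d, 4 * d) + linear_params(4 * d, d)  # up- and down-projection
    return 2 * layernorm_params(d) + attn + mlp


token_emb = vocab_size * n_embd  # shared with the output head (weight tying)
pos_emb = block_size * n_embd
total = token_emb + pos_emb + n_layer * block_params(n_embd) + layernorm_params(n_embd)

print(f"{total:,} parameters (~{total / 1e6:.0f}M)")  # 124,439,808 parameters (~124M)
```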

---

## Training Infrastructure

The model was trained on a single-GPU cloud instance with the following specifications:

| Component        | Specification                        |
|------------------|--------------------------------------|
| **OS**           | Debian GNU/Linux 12 (Bookworm)       |
| **CPU**          | Intel Xeon @ 2.20 GHz (4 vCPUs, 2 physical cores, 2 threads/core) |
| **RAM**          | 16 GiB                               |
| **Storage**      | 60 GB NVMe                           |
| **GPU**          | NVIDIA L4                            |
| **VRAM**         | 24 GB                                |
| **NVIDIA Driver**| 550.54.15                            |

Training was completed without any distributed setup; a single NVIDIA L4 GPU was sufficient for the full training run.

---

## Architecture Overview

slimGPT follows the standard GPT-2 decoder-only Transformer architecture:

- **Token + positional embeddings** — learned embeddings over the GPT-2 BPE vocabulary with 1024-token positional encodings
- **12 Transformer blocks** — each with multi-head causal self-attention (12 heads) and a position-wise feed-forward network
- **Pre-norm design** — LayerNorm applied before attention and MLP sub-layers
- **Weight tying** — input embedding and output projection weights are tied
- **Causal masking** — autoregressive, left-to-right generation

---

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("samueljayasingh/slimGPT")
model = AutoModelForCausalLM.from_pretrained("samueljayasingh/slimGPT")

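inputs = tokenizer("The meaning of life is", return_tensors="pt")
# Pass the attention mask too; the standard GPT-2 tokenizer has no pad token, so reuse EOS
output = model.generate(
    **inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)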
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Pipeline API

```python
from transformers import pipeline

generator = pipeline("text-generation", model="samueljayasingh/slimGPT")
result = generator("Once upon a time,", max_new_tokens=80, do_sample=True)
print(result[0]["generated_text"])
```

### Serving with vLLM

```bash
pip install vllm
vllm serve "samueljayasingh/slimGPT"

curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "samueljayasingh/slimGPT",
    "prompt": "The future of AI is",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

---

## Intended Use

This model is intended for:

- **Research and experimentation** — studying language model behavior, attention patterns, and generation dynamics at the 124M scale
- **Educational purposes** — understanding GPT-style architectures by working with a fully transparent, from-scratch implementation
- **Prototyping** — lightweight text generation for downstream tasks, fine-tuning experiments, or benchmarking

### Out-of-Scope Use

- Production or safety-critical applications
- Tasks requiring factual accuracy or up-to-date knowledge
- Any use that relies on instruction-following or alignment — this is a base language model with no RLHF or instruction tuning

---

## Limitations

- Trained for only **5,000 iterations** — the model is capable of coherent text continuation but has not converged to the quality of fully trained GPT-2
- **No fine-tuning or alignment** — outputs are raw continuations and may be incoherent, biased, or off-topic
- **English-only** — trained on English text; performance on other languages is not evaluated
- **Context window of 1024 tokens** — longer documents are truncated
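
Because of the 1024-token context limit, a long prompt has to be shortened before generation. The sketch below shows one way to do that, under the assumption that keeping the most recent tokens is acceptable for the task. `fit_to_context` is a hypothetical helper written for illustration; it is not part of this repository or of any library.

```python
CONTEXT_LENGTH = 1024  # from the Model Details table


def fit_to_context(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the most recent tokens so that prompt + generation fits in the window."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens  # tokens left for the prompt
    return token_ids[-budget:]


prompt_ids = list(range(3000))  # stand-in for real token IDs
trimmed = fit_to_context(prompt_ids, max_new_tokens=100)
print(len(trimmed), trimmed[0], trimmed[-1])  # 924 2076 2999
```

Prompts that already fit within the budget are returned unchanged.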

---

## Training Details

The model was trained using a clean, readable PyTorch implementation with the following highlights:

- **Optimizer**: AdamW with cosine learning rate decay and linear warmup
- **Tokenizer**: GPT-2 BPE (via `tiktoken`)
- **Data**: OpenWebText-style dataset sampled in token chunks of length 1024
- **Mixed precision**: `torch.autocast` with `bfloat16` on the NVIDIA L4 GPU
- **Gradient clipping**: Applied to stabilize training
- **Checkpointing**: Best model saved based on validation loss
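
The learning rate schedule named above (linear warmup followed by cosine decay) can be sketched as a plain function of the iteration number. The peak and minimum learning rates and the warmup length below are placeholder values chosen for illustration, not the values used for this run; only the 5,000-iteration horizon comes from this card.

```python
import math


def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=100, max_iters=5000):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    max_lr, min_lr and warmup_iters are illustrative placeholders.
    """
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    progress = (it - warmup_iters) / (max_iters - warmup_iters)  # 0 -> 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))           # 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)


for it in (0, 99, 100, 2550, 4999):
    print(it, f"{get_lr(it):.2e}")
```

In a PyTorch training loop, a function like this is typically applied by setting `param_group["lr"] = get_lr(it)` on each of the optimizer's parameter groups before calling `optimizer.step()`.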

---

### Training Runtime

- **Hardware**: NVIDIA L4 (24 GB VRAM), 4 vCPUs, 16 GiB RAM
- **Training iterations**: 5,000
- **Total training time**: ~18 hours
- **Average time per iteration**: ~13 seconds
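
The two timing figures above are consistent with each other, as a quick calculation shows. Both inputs are the approximate values reported in the list.

```python
iters = 5_000
sec_per_iter = 13  # approximate average from the list above
total_hours = iters * sec_per_iter / 3600
print(f"{total_hours:.1f} hours")  # 18.1 hours
```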

---

## Evaluation

| Metric         | Value   |
|----------------|---------|
| Best Val Loss  | 3.3079  |
| Training Iters | 5,000   |

The validation loss is a mean per-token cross-entropy in nats, so perplexity is `exp(3.3079) ≈ 27.3`. For reference, a fully trained GPT-2 small achieves a perplexity of roughly 18–22 on OpenWebText; slimGPT's higher value is consistent with its short 5,000-iteration training budget.
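
The loss-to-perplexity conversion is a one-liner. The sketch below also converts the loss to bits per token, assuming the reported loss uses the natural logarithm (the default for PyTorch's cross-entropy).

```python
import math

val_loss = 3.3079  # best validation loss from the table above
perplexity = math.exp(val_loss)
bits_per_token = val_loss / math.log(2)  # nats -> bits
print(f"perplexity ≈ {perplexity:.1f}, bits/token ≈ {bits_per_token:.2f}")
```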

![Eval Summary](images/eval_summary.png)

### Training loss
![Loss Curve](images/loss_curve.png)

### Perplexity comparison
![Perplexity](images/perplexity_comparison.png)


---

## Citation

If you use this model in your work, please credit:

```bibtex
@misc{slimgpt2026,
  author       = {Samuel Jayasingh},
  title        = {slimGPT: A 124M GPT-2-style language model trained from scratch},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/samueljayasingh/slimGPT}}
}
```

---

## Credits

Inspired by Andrej Karpathy's "Let's reproduce GPT-2 (124M)" tutorial: https://www.youtube.com/watch?v=l8pRSuU81PU

Special thanks to Andrej Karpathy for making modern LLM training and implementation accessible through open educational content.

---

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).