---
language:
- en
license: mit
tags:
- gpt2
- causal-lm
- text-generation
- slimgpt
- transformer
- from-scratch
pipeline_tag: text-generation
---
# slimGPT: 124M Parameter GPT-Style Language Model
**slimGPT** is a 124-million-parameter autoregressive language model built from scratch using a clean, modular PyTorch codebase. It follows the GPT-2 small architecture and was trained entirely on consumer-accessible hardware, demonstrating that capable language model training is achievable without large-scale infrastructure.
---
## Model Details
| Property | Value |
|------------------|--------------------------|
| **Architecture** | GPT-2 style (decoder-only Transformer) |
| **Parameters** | ~124 million |
| **Layers** | 12 |
| **Attention Heads** | 12 |
| **Embedding Dim**| 768 |
| **Context Length**| 1024 tokens |
| **Vocabulary** | GPT-2 BPE tokenizer (50,257 tokens) |
| **Training Iters**| 5,000 |
| **Best Val Loss**| 3.3079 |
| **License** | MIT |
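
The ~124 million figure can be reproduced from the table with a short calculation. The sketch below assumes the standard GPT-2 small layout: a feed-forward width of 4x the embedding dimension, bias terms on every linear layer and LayerNorm, and an output projection tied to the token embedding. Only the vocabulary size, context length, layer count, and embedding dimension come from the table; the rest are assumptions, and `count_parameters` is a hypothetical helper rather than part of the slimGPT codebase.

```python
# Values from the Model Details table.
VOCAB_SIZE = 50_257   # GPT-2 BPE vocabulary
BLOCK_SIZE = 1_024    # context length
N_LAYER = 12
N_EMBD = 768
# Assumption: feed-forward hidden width is 4 * N_EMBD, as in GPT-2 small.
MLP_RATIO = 4


def count_parameters() -> int:
    """Count parameters of a GPT-2 style model with tied input/output embeddings."""
    token_emb = VOCAB_SIZE * N_EMBD
    pos_emb = BLOCK_SIZE * N_EMBD
    layer_norm = 2 * N_EMBD  # scale and bias

    # Attention: fused QKV projection plus output projection, both with biases.
    attn = (N_EMBD * 3 * N_EMBD + 3 * N_EMBD) + (N_EMBD * N_EMBD + N_EMBD)
    # MLP: expand to the hidden width, then project back, both with biases.
    hidden = MLP_RATIO * N_EMBD
    mlp = (N_EMBD * hidden + hidden) + (hidden * N_EMBD + N_EMBD)

    block = 2 * layer_norm + attn + mlp
    final_norm = layer_norm
    # The output projection reuses token_emb (weight tying), so it adds nothing.
    return token_emb + pos_emb + N_LAYER * block + final_norm


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count comes to 124,439,808, which matches the rounded figure in the table.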
---
## Training Infrastructure
The model was trained on a single-GPU cloud instance with the following specifications:
| Component | Specification |
|------------------|--------------------------------------|
| **OS** | Debian GNU/Linux 12 (Bookworm) |
| **CPU** | Intel Xeon @ 2.20 GHz (4 vCPUs, 2 physical cores, 2 threads/core) |
| **RAM** | 16 GiB |
| **Storage** | 60 GB NVMe |
| **GPU** | NVIDIA L4 |
| **VRAM** | 24 GB |
| **NVIDIA Driver**| 550.54.15 |
Training used no distributed setup: a single NVIDIA L4 GPU was sufficient for the full run.
---
## Architecture Overview
slimGPT follows the standard GPT-2 decoder-only Transformer architecture:
- **Token + positional embeddings**: learned embeddings over the GPT-2 BPE vocabulary, plus learned positional embeddings for up to 1024 positions
- **12 Transformer blocks**: each with multi-head causal self-attention (12 heads) and a position-wise feed-forward network
- **Pre-norm design**: LayerNorm applied before the attention and MLP sub-layers
- **Weight tying**: input embedding and output projection weights are shared
- **Causal masking**: autoregressive, left-to-right generation
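
To make the causal masking bullet concrete, here is a framework-free sketch of how a mask turns raw attention scores into weights. It is an illustration only: `causal_attention_weights` is a hypothetical helper, and the actual model presumably performs the same operation on PyTorch tensors inside each attention head.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked.

    scores[i][j] is the raw attention score of query position i for key position j.
    """
    weights = []
    for i, row in enumerate(scores):
        # Position i may attend only to positions 0..i.
        visible = row[: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# Uniform scores make the effect of the mask easy to read.
uniform = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(uniform):
    print([round(w, 2) for w in row])
```

With uniform scores, the first position attends only to itself, the second splits its weight evenly over two positions, and so on. No weight ever lands on a later token, which is what makes left-to-right generation possible.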
---
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("samueljayasingh/slimGPT")
model = AutoModelForCausalLM.from_pretrained("samueljayasingh/slimGPT")

# Tokenize the prompt; passing the full encoding also supplies the attention mask
inputs = tokenizer("The meaning of life is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### Pipeline API
```python
from transformers import pipeline
generator = pipeline("text-generation", model="samueljayasingh/slimGPT")
result = generator("Once upon a time,", max_new_tokens=80, do_sample=True)
print(result[0]["generated_text"])
```
### Serving with vLLM
```bash
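# Install vLLM and start an OpenAI-compatible server (port 8000 by default)
pip install vllm
vllm serve "samueljayasingh/slimGPT"

# From a second terminal, request a completion
curl -X POST "http://localhost:8000/v1/completions" \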
-H "Content-Type: application/json" \
--data '{
"model": "samueljayasingh/slimGPT",
"prompt": "The future of AI is",
"max_tokens": 100,
"temperature": 0.7
}'
```
---
## Intended Use
This model is intended for:
- **Research and experimentation**: studying language model behavior, attention patterns, and generation dynamics at the 124M scale
- **Educational purposes**: understanding GPT-style architectures by working with a fully transparent, from-scratch implementation
- **Prototyping**: lightweight text generation for downstream tasks, fine-tuning experiments, or benchmarking
### Out-of-Scope Use
- Production or safety-critical applications
- Tasks requiring factual accuracy or up-to-date knowledge
- Any use that relies on instruction-following or alignment: this is a base language model with no RLHF or instruction tuning
---
## Limitations
- Trained for only **5,000 iterations**: the model is capable of coherent text continuation but has not converged to the quality of a fully trained GPT-2
- **No fine-tuning or alignment**: outputs are raw continuations and may be incoherent, biased, or off-topic
- **English-only**: trained on English text; performance on other languages has not been evaluated
- **Context window of 1024 tokens**: longer documents are truncated
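
The context-window limit can be handled explicitly before calling the model. This README does not say which end of an over-long input is dropped, so the sketch below assumes the common choice for left-to-right generation: keep the most recent tokens. `crop_to_context` is a hypothetical helper, not part of the slimGPT codebase.

```python
BLOCK_SIZE = 1024  # context length from the Model Details table


def crop_to_context(token_ids, block_size=BLOCK_SIZE):
    """Keep only the most recent block_size token ids (assumed truncation policy)."""
    return token_ids[-block_size:]


# A 1500-token input loses its first 476 tokens; shorter inputs pass through unchanged.
long_input = list(range(1500))
cropped = crop_to_context(long_input)
print(len(cropped), cropped[0])
```

Keeping the tail preserves the text closest to the point where generation continues, at the cost of forgetting the beginning of the document.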
---
## Training Details
The model was trained using a clean, readable PyTorch implementation with the following highlights:
- **Optimizer**: AdamW with cosine learning rate decay and linear warmup
- **Tokenizer**: GPT-2 BPE (via `tiktoken`)
- **Data**: OpenWebText-style dataset sampled in token chunks of length 1024
- **Mixed precision**: `torch.autocast` with `bfloat16` on the NVIDIA L4 GPU
- **Gradient clipping**: Applied to stabilize training
- **Checkpointing**: Best model saved based on validation loss
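
The optimizer bullet describes a cosine schedule with linear warmup. A minimal sketch of that schedule follows. The peak and minimum learning rates and the warmup length are hypothetical values chosen for illustration, since this README does not list the ones used for slimGPT; only the 5,000-iteration horizon comes from the model card.

```python
import math

# Hypothetical hyperparameters; the real values are not stated in this README.
MAX_LR = 6e-4
MIN_LR = 6e-5
WARMUP_ITERS = 100
MAX_ITERS = 5_000  # matches the reported training length


def learning_rate(it: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    if it < WARMUP_ITERS:
        # Ramp up linearly; it + 1 avoids a zero learning rate on the first step.
        return MAX_LR * (it + 1) / WARMUP_ITERS
    if it >= MAX_ITERS:
        return MIN_LR
    # Fraction of the decay phase completed, in [0, 1).
    progress = (it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


for step in (0, 99, 100, 2_500, 4_999):
    print(step, f"{learning_rate(step):.2e}")
```

In a training loop, the returned value would typically be written into each parameter group of the AdamW optimizer before every step.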
---
### Training Runtime
- **Hardware**: NVIDIA L4 (24 GB VRAM), 4 vCPUs, 16 GB RAM
- **Training iterations**: 5,000
- **Total training time**: ~18 hours
- **Average time per iteration**: ~13 seconds
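
The two timing figures are consistent with each other, as a quick check shows. Both inputs are the approximate values listed above, so the result is an estimate.

```python
ITERATIONS = 5_000
SECONDS_PER_ITER = 13  # approximate average from the list above

total_seconds = ITERATIONS * SECONDS_PER_ITER
total_hours = total_seconds / 3600
print(f"{total_seconds:,} s = {total_hours:.1f} h")
```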
---
## Evaluation
| Metric | Value |
|----------------|---------|
| Best Val Loss | 3.3079 |
| Training Iters | 5,000 |
Perplexity can be approximated as `exp(3.3079) ≈ 27.3`. For reference, a fully trained GPT-2 small achieves a perplexity of roughly 18–22 on OpenWebText; slimGPT sits in a reasonable range for its training budget.
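
The conversion between loss and perplexity is a one-liner, sketched below. It assumes the reported validation loss is the mean cross-entropy per token in nats, which is what PyTorch's cross-entropy loss produces. Converting the reference perplexity range back into loss values puts it on the same scale as the validation loss.

```python
import math

# From the Evaluation table; assumed to be mean cross-entropy in nats per token.
best_val_loss = 3.3079
perplexity = math.exp(best_val_loss)
print(f"slimGPT perplexity: {perplexity:.1f}")

# Map the quoted GPT-2 small perplexity range back to loss values
# so the gap can be compared directly with the validation loss.
for reference_ppl in (18, 22):
    print(f"perplexity {reference_ppl} <-> loss {math.log(reference_ppl):.2f}")
```

On this scale the reference range corresponds to a loss of about 2.89 to 3.09, compared with slimGPT's 3.31.
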
![Eval Summary](images/eval_summary.png)
### Training loss
![Loss Curve](images/loss_curve.png)
### Perplexity comparison
![Perplexity](images/perplexity_comparison.png)
---
## Citation
If you use this model in your work, please credit:
```bibtex
@misc{slimgpt2026,
author = {Samuel Jayasingh},
title = {slimGPT: A 124M GPT-2-style language model trained from scratch},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/samueljayasingh/slimGPT}}
}
```
---
## Credits
Inspired by Andrej Karpathy's "Let's reproduce GPT-2 (124M)" tutorial: https://www.youtube.com/watch?v=l8pRSuU81PU
Special thanks to Andrej Karpathy for making modern LLM training and implementation accessible through open educational content.
---
## License
This model is released under the [MIT License](https://opensource.org/licenses/MIT).