	---
	language:
	- en
	license: mit
	tags:
	- gpt2
	- causal-lm
	- text-generation
	- slimgpt
	- transformer
	- from-scratch
	pipeline_tag: text-generation
	---

	# slimGPT — 124M Parameter GPT-Style Language Model

	slimGPT is a 124-million-parameter autoregressive language model built from scratch in a clean, modular PyTorch codebase. It follows the GPT-2 small architecture and was trained on a single NVIDIA L4 GPU, showing that a usable small language model can be trained without large-scale infrastructure.

	---

	## Model Details

	| Property | Value |
	|----------|-------|
	| Architecture | GPT-2 style (decoder-only Transformer) |
	| Parameters | ~124 million |
	| Layers | 12 |
	| Attention Heads | 12 |
	| Embedding Dim | 768 |
	| Context Length | 1024 tokens |
	| Vocabulary | GPT-2 BPE tokenizer (50,257 tokens) |
	| Training Iters | 5,000 |
	| Best Val Loss | 3.3079 |
	| License | MIT |

	---

	## Training Infrastructure

	The model was trained on a single-GPU cloud instance with the following specifications:

	| Component | Specification |
	|-----------|---------------|
	| OS | Debian GNU/Linux 12 (Bookworm) |
	| CPU | Intel Xeon @ 2.20 GHz (4 vCPUs, 2 physical cores, 2 threads/core) |
	| RAM | 16 GiB |
	| Storage | 60 GB NVMe |
	| GPU | NVIDIA L4 |
	| VRAM | 24 GB |
	| NVIDIA Driver | 550.54.15 |

	Training was completed without any distributed setup; a single NVIDIA L4 GPU was sufficient for the full training run.

	---

	## Architecture Overview

	slimGPT follows the standard GPT-2 decoder-only Transformer architecture:

	- Token and positional embeddings: learned token embeddings over the GPT-2 BPE vocabulary, plus learned positional embeddings for up to 1024 positions
	- 12 Transformer blocks: each with multi-head causal self-attention (12 heads) and a position-wise feed-forward network
	- Pre-norm design: LayerNorm is applied before the attention and MLP sub-layers
	- Weight tying: the input embedding and output projection share one weight matrix
	- Causal masking: each position attends only to earlier positions, giving autoregressive, left-to-right generation
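
As a rough sanity check on the ~124M figure, the parameter count can be tallied from the dimensions in the Model Details table. The sketch below assumes the usual GPT-2 choices that this card does not state explicitly: a 4x MLP expansion, a final LayerNorm, and bias terms on every linear layer and LayerNorm. The function name `count_gpt2_params` is illustrative and not part of the slimGPT codebase.

```python
def count_gpt2_params(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024,
                      mlp_ratio=4, bias=True):
    """Tally learnable parameters for a GPT-2 style decoder with tied embeddings."""
    b = 1 if bias else 0
    wte = vocab_size * n_embd                      # token embeddings (shared with the output head)
    wpe = block_size * n_embd                      # learned positional embeddings
    ln = n_embd + b * n_embd                       # LayerNorm: scale (+ shift)
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd   # fused q, k, v projection
            + n_embd * n_embd + b * n_embd)        # attention output projection
    hidden = mlp_ratio * n_embd                    # assumed 4x expansion
    mlp = (n_embd * hidden + b * hidden            # up projection
           + hidden * n_embd + b * n_embd)         # down projection
    block = 2 * ln + attn + mlp                    # pre-norm: one LayerNorm per sub-layer
    return wte + wpe + n_layer * block + ln        # plus the assumed final LayerNorm


total = count_gpt2_params()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Because the output projection reuses the token-embedding matrix, it adds no parameters of its own; without weight tying the total would grow by another `vocab_size * n_embd` weights.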

	---

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("samueljayasingh/slimGPT")
	model = AutoModelForCausalLM.from_pretrained("samueljayasingh/slimGPT")

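	# Tokenize the prompt; `inputs` holds both input_ids and attention_mask.
	inputs = tokenizer("The meaning of life is", return_tensors="pt")
	# Sample up to 100 new tokens with temperature and nucleus (top-p) sampling.
	output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)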
	print(tokenizer.decode(output[0], skip_special_tokens=True))
	```

	### Pipeline API

	```python
	from transformers import pipeline

	generator = pipeline("text-generation", model="samueljayasingh/slimGPT")
	result = generator("Once upon a time,", max_new_tokens=80, do_sample=True)
	print(result[0]["generated_text"])
	```

	### Serving with vLLM

	```bash
	pip install vllm
	vllm serve "samueljayasingh/slimGPT"

	# In a second terminal, once the server is running:
	curl -X POST "http://localhost:8000/v1/completions" \
	  -H "Content-Type: application/json" \
	  --data '{
	    "model": "samueljayasingh/slimGPT",
	    "prompt": "The future of AI is",
	    "max_tokens": 100,
	    "temperature": 0.7
	  }'
	```

	---

	## Intended Use

	This model is intended for:

	- Research and experimentation — studying language model behavior, attention patterns, and generation dynamics at the 124M scale
	- Educational purposes — understanding GPT-style architectures by working with a fully transparent, from-scratch implementation
	- Prototyping — lightweight text generation for downstream tasks, fine-tuning experiments, or benchmarking

	### Out-of-Scope Use

	- Production or safety-critical applications
	- Tasks requiring factual accuracy or up-to-date knowledge
	- Any use that relies on instruction-following or alignment — this is a base language model with no RLHF or instruction tuning

	---

	## Limitations

	- Trained for only 5,000 iterations: the model can produce coherent text continuations but has not reached the quality of a fully trained GPT-2
	- No fine-tuning or alignment: outputs are raw continuations and may be incoherent, biased, or off-topic
	- English-only: trained on English text; performance on other languages has not been evaluated
	- Context window of 1024 tokens: longer inputs must be truncated or split to fit
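
One way to respect the 1024-token limit at inference time is to keep only the most recent tokens of an over-long prompt. This is a minimal sketch that operates on a plain list of token ids; `truncate_to_context` and the choice to reserve room for newly generated tokens are illustrative assumptions, not behavior built into the model.

```python
def truncate_to_context(token_ids, block_size=1024, max_new_tokens=0):
    """Keep the most recent tokens so prompt + generated tokens fit in the context window."""
    budget = block_size - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than block_size")
    # Slicing from the end leaves shorter prompts unchanged.
    return token_ids[-budget:]


prompt_ids = list(range(1500))   # stand-in for a 1500-token prompt
kept = truncate_to_context(prompt_ids, max_new_tokens=100)
print(len(kept), kept[0], kept[-1])
```

Dropping the oldest tokens suits open-ended continuation; tasks that depend on the start of a document may prefer to split the input into chunks instead.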

	---

	## Training Details

	The model was trained using a clean, readable PyTorch implementation with the following highlights:

	- Optimizer: AdamW with cosine learning rate decay and linear warmup
	- Tokenizer: GPT-2 BPE (via `tiktoken`)
	- Data: OpenWebText-style dataset sampled in token chunks of length 1024
	- Mixed precision: `torch.autocast` with `bfloat16` on the NVIDIA L4 GPU
	- Gradient clipping: Applied to stabilize training
	- Checkpointing: Best model saved based on validation loss
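
The optimizer bullet describes a linear warmup followed by cosine decay. A minimal sketch of such a schedule follows. The peak and minimum learning rates and the warmup length are placeholder assumptions, since this card does not report them; only `max_iters=5000` comes from the training run described here. `get_lr` is an illustrative name.

```python
import math


def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=200, max_iters=5000):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to min_lr."""
    if it < warmup_iters:
        # Linear ramp from max_lr / warmup_iters up to max_lr.
        return max_lr * (it + 1) / warmup_iters
    if it >= max_iters:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


for it in (0, 199, 200, 2600, 4999, 5000):
    print(it, f"{get_lr(it):.2e}")
```

In a PyTorch training loop, the returned value would typically be written into each `param_group["lr"]` of the AdamW optimizer before calling `optimizer.step()`.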

	---

	### Training Runtime

	- Hardware: NVIDIA L4 (24 GB VRAM), 4 vCPUs, 16 GiB RAM
	- Training iterations: 5,000
	- Total training time: ~18 hours
	- Average time per iteration: ~13 seconds
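
The two timing figures are consistent with each other, as a quick check shows. The inputs are the rounded values reported above, so the result is approximate.

```python
iterations = 5_000
seconds_per_iter = 13   # reported average, rounded

total_hours = iterations * seconds_per_iter / 3600
print(f"~{total_hours:.1f} hours")
```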

	---

	## Evaluation

	| Metric | Value |
	|--------|-------|
	| Best Val Loss | 3.3079 |
	| Training Iters | 5,000 |

	Perplexity can be approximated as `exp(3.3079) ≈ 27.3`. For reference, a fully trained GPT-2 small achieves a perplexity of roughly 18–22 on OpenWebText; slimGPT sits in a reasonable range for its training budget.
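
The perplexity figure follows directly from the validation loss, assuming the loss is the mean per-token cross-entropy in nats (the default for PyTorch's cross-entropy loss):

```python
import math

best_val_loss = 3.3079                 # mean cross-entropy per token, in nats (assumed)
perplexity = math.exp(best_val_loss)   # perplexity = e^loss
print(f"perplexity ≈ {perplexity:.1f}")
```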

	![Eval Summary](images/eval_summary.png)

	### Training loss
	![Loss Curve](images/loss_curve.png)

	### Perplexity comparison
	![Perplexity](images/perplexity_comparison.png)


	---

	## Citation

	If you use this model in your work, please credit:

	```bibtex
	@misc{slimgpt2026,
	  author       = {Samuel Jayasingh},
	  title        = {slimGPT: A 124M GPT-2-style language model trained from scratch},
	  year         = {2026},
	  publisher    = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/samueljayasingh/slimGPT}}
	}
	```

	---

	## Credits

	Inspired by Andrej Karpathy's [Let's reproduce GPT-2 (124M)](https://www.youtube.com/watch?v=l8pRSuU81PU) tutorial.

	Special thanks to Andrej Karpathy for making modern LLM training and implementation accessible through open educational content.

	---

	## License

	This model is released under the [MIT License](https://opensource.org/licenses/MIT).