---
license: mit
language:
- en
tags:
- gpt2
- causal-lm
- pytorch
- text-generation
- from-scratch
base_model: []
pipeline_tag: text-generation
---
# GPT-2 (Trained from Scratch)
A GPT-2-style causal language model built and trained **entirely from scratch** in PyTorch: no pre-trained weights and no Hugging Face `Trainer`. Every component (multi-head attention with a KV-cache, transformer blocks, weight tying) was implemented by hand.
---
## Model Details
| Hyperparameter | Value |
|-----------------|-------------|
| Architecture | GPT-2 (decoder-only transformer) |
| Layers | 12 |
| Attention heads | 12 |
| d\_model | 768 |
| FFN hidden dim | 3 072 |
| Context length | 1 024 tokens |
| Vocab size | 50 257 |
| Training steps | 150 000 |
| Tokens seen | ~9.8 B |
| Tokenizer | GPT-2 BPE (tiktoken) |
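
As a sanity check on the table above, the parameter count can be worked out from the hyperparameters alone. The sketch below assumes the layout of the original GPT-2 (learned positional embeddings, biases on every linear layer, two LayerNorms per block plus a final one, and an LM head tied to the token embedding); this README does not confirm those details, and the helper names are hypothetical.

```python
# Hyperparameters taken from the Model Details table.
N_LAYER, D_MODEL, D_FF = 12, 768, 3072
N_CTX, N_VOCAB = 1024, 50257


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer (biases are an assumption)."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def count_parameters() -> int:
    token_emb = N_VOCAB * D_MODEL      # shared with the LM head via weight tying
    pos_emb = N_CTX * D_MODEL          # assumed: learned positional embeddings
    block = (
        layer_norm(D_MODEL)
        + linear(D_MODEL, 3 * D_MODEL)  # fused Q, K, V projection
        + linear(D_MODEL, D_MODEL)      # attention output projection
        + layer_norm(D_MODEL)
        + linear(D_MODEL, D_FF)         # FFN up-projection
        + linear(D_FF, D_MODEL)         # FFN down-projection
    )
    return token_emb + pos_emb + N_LAYER * block + layer_norm(D_MODEL)


print(f"{count_parameters():,}")  # 124,439,808
```

Under these assumptions the model has about 124 M parameters. Without weight tying, a separate LM head would add another 50 257 × 768 ≈ 38.6 M.
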
---
## Usage
### With 🤗 Transformers
```python
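import torch
from transformers import AutoTokenizer
# Local module from this repo (not part of transformers); it must be on the Python path.
from model.hf_wrapper import GPT2ForCausalLM

model = GPT2ForCausalLM.from_pretrained("saiteja718/gpt2")
tokenizer = AutoTokenizer.from_pretrained("saiteja718/gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    logits = model(**inputs).logits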
```
### With the interactive inference script
Clone the repo and run:
```bash
git clone https://huggingface.co/saiteja718/gpt2
cd gpt2
pip install torch transformers tiktoken
python3 gpt2_infer.py --interactive
```
---
## Implementation Highlights
- **Multi-head attention** with a split KV-cache for efficient autoregressive decoding (prefill + decode loop)
- **Weight tying** between the token embedding and the LM head
- **Top-k sampling** with temperature for controllable text generation
- Custom training loop with gradient clipping and a cosine learning-rate schedule
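
To illustrate the sampling step listed above, here is a minimal, dependency-free sketch of top-k sampling with temperature. It is not the code in `gpt2_infer.py`, which presumably works on PyTorch tensors; the function name `sample_top_k` and the default `k=50` are hypothetical choices for this example.

```python
import math
import random


def sample_top_k(logits, k=50, temperature=1.0, rng=random):
    """Sample one token id from the k highest-scoring logits.

    Hypothetical helper for illustration; `logits` is a plain list of floats.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = max(1, min(k, len(logits)))

    # Keep only the ids of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = [logits[i] / temperature for i in top_ids]

    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # Draw one id in proportion to its weight.
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Toy 5-token vocabulary: with k=2 only ids 1 and 4 can ever be drawn.
toy_logits = [0.1, 5.0, 2.0, -1.0, 4.0]
next_id = sample_top_k(toy_logits, k=2, temperature=0.8)
```

Setting `k=1` reduces this to greedy decoding, while a larger `k` or a higher temperature gives more varied but less predictable continuations.
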
---
## Example Output
```
Prompt: The capital of germany is
Output: The capital of germany is the country he first settled in, and soon the settlement
of the British colonies as a result of his military service...
```
---
## Limitations
- Trained as a research/learning exercise; this is a base model that has not been fine-tuned on any instruction dataset
- May produce factually incorrect or incoherent text
- Context window limited to 1 024 tokens
---
## Citation
If you use this model in your work, a shoutout is appreciated:
```bibtex
@misc{saiteja718-gpt2-scratch,
author = {saiteja718},
title = {GPT-2 Trained from Scratch},
year = {2025},
url = {https://huggingface.co/saiteja718/gpt2}
}
```