---
license: mit
language:
  - en
tags:
  - gpt2
  - causal-lm
  - pytorch
  - text-generation
  - from-scratch
base_model: []
pipeline_tag: text-generation
---

# GPT-2 (Trained from Scratch)

A GPT-2–style causal language model built and trained **entirely from scratch** in PyTorch, with no pre-trained weights and no Hugging Face `Trainer`. Every component (multi-head attention with a KV-cache, transformer blocks, weight tying) was implemented by hand.
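
Weight tying, for example, comes down to letting the output projection reuse the token embedding matrix. The following is a minimal sketch, assuming a hypothetical `TinyLM` module; the names `tok_emb` and `lm_head` are illustrative and not taken from this repo.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical minimal module; only the embedding and LM head are shown."""

    def __init__(self, vocab_size: int = 50257, d_model: int = 768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Both layers store a (vocab_size, d_model) matrix, so they can share one Parameter.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        h = self.tok_emb(idx)  # the transformer blocks would run here
        return self.lm_head(h)

tiny = TinyLM(vocab_size=100, d_model=16)
n_params = sum(p.numel() for p in tiny.parameters())
print(n_params)  # 1600: the shared matrix is counted once
```

With the real vocabulary and width, sharing the matrix avoids a second 50 257 × 768 table.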

---

## Model Details

| Hyperparameter  | Value       |
|-----------------|-------------|
| Architecture    | GPT-2 (decoder-only transformer) |
| Layers          | 12          |
| Attention heads | 12          |
| d\_model        | 768         |
| FFN hidden dim  | 3 072       |
| Context length  | 1 024 tokens |
| Vocab size      | 50 257      |
| Training steps  | 150 000     |
| Tokens seen     | ~9.8 B      |
| Tokenizer       | GPT-2 BPE (tiktoken) |
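
As a rough cross-check, the hyperparameters above determine the parameter count once a layout is assumed. The sketch below assumes the standard GPT-2 layout (learned positional embeddings, biases on every linear layer, two LayerNorms per block plus a final one, LM head tied to the token embedding). This card does not state those details, and the batch-size estimate at the end is inferred from the table rather than reported.

```python
# Hyperparameters copied from the table above.
n_layer, d_model, d_ff = 12, 768, 3072
n_ctx, vocab = 1024, 50257

# Assumed standard GPT-2 layout (see the lead-in); not confirmed by this card.
embeddings = vocab * d_model + n_ctx * d_model            # token + position tables
attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)  # QKV + output proj
ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # up- and down-projection
layer_norms = 2 * (2 * d_model)                           # scale and shift, twice per block
per_block = attention + ffn + layer_norms

# The tied LM head adds no parameters; the final LayerNorm adds 2 * d_model.
total = embeddings + n_layer * per_block + 2 * d_model
print(f"{total:,} parameters")  # 124,439,808 parameters

# Tokens per optimizer step implied by ~9.8 B tokens over 150 000 steps.
tokens_per_step = 9.8e9 / 150_000
print(f"~{tokens_per_step:,.0f} tokens per step")  # ~65,333 tokens per step
print(f"~{tokens_per_step / n_ctx:.1f} full-length sequences per step")  # ~63.8
```

The last figure is consistent with an effective batch of about 64 sequences of 1 024 tokens, though the actual batch configuration may differ.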

---

## Usage

### With 🤗 Transformers

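```python
import torch
from transformers import AutoTokenizer
# Custom wrapper from this repo (not part of transformers); the import
# assumes the snippet is run from the root of a local clone.
from model.hf_wrapper import GPT2ForCausalLM

model = GPT2ForCausalLM.from_pretrained("saiteja718/gpt2")
tokenizer = AutoTokenizer.from_pretrained("saiteja718/gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    logits = model(**inputs).logits  # expected shape: (batch, sequence, vocab)
```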
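
The snippet above stops at raw logits. One way to turn the last position's logits into a next token is top-k sampling with temperature. The function below is a sketch of that technique, not the code used by `gpt2_infer.py`; the name `sample_top_k` and its defaults are assumptions, and random logits stand in for the model output so the example runs without the checkpoint.

```python
import torch

def sample_top_k(logits, k=50, temperature=1.0, generator=None):
    """Sample one token id per row from the k highest-scoring logits.

    logits: (batch, vocab) scores for the next position.
    Returns a (batch, 1) tensor of token ids.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(k, logits.size(-1))
    # Keep only the k largest logits, together with their vocabulary indices.
    top_values, top_indices = torch.topk(logits, k, dim=-1)
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    probs = torch.softmax(top_values / temperature, dim=-1)
    # Draw a position inside the top-k set, then map it back to a vocabulary id.
    choice = torch.multinomial(probs, num_samples=1, generator=generator)
    return top_indices.gather(-1, choice)

# Stand-in for `model(**inputs).logits[:, -1, :]`.
fake_logits = torch.randn(1, 50257)
next_id = sample_top_k(fake_logits, k=40, temperature=0.8)
print(next_id.shape)  # torch.Size([1, 1])
```

Setting `k=1` reduces this to greedy decoding, which is a convenient sanity check.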

### With the interactive inference script

Clone the repo and run:

```bash
git clone https://huggingface.co/saiteja718/gpt2
cd gpt2
pip install torch transformers tiktoken
python3 gpt2_infer.py --interactive
```

---

## Implementation Highlights

- **Multi-head attention** with a split KV-cache for efficient autoregressive decoding (prefill + decode loop)
- **Weight tying** between the token embedding and the LM head
- **Top-k sampling** with temperature for controllable text generation
- Custom training loop with gradient clipping and cosine LR schedule
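
The prefill + decode pattern from the first bullet can be sketched as follows. This is an illustrative reimplementation, not the repo's attention module: the class name `CachedSelfAttention` is hypothetical, and keeping keys and values as two separate tensors is only one possible reading of "split KV-cache".

```python
import math
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Causal multi-head attention that can reuse keys/values from earlier calls."""

    def __init__(self, d_model: int = 768, n_head: int = 12):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x, cache=None):
        # x: (batch, new_tokens, d_model); cache: optional (keys, values) tuple.
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # Reshape each to (batch, heads, tokens, head_dim).
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        if cache is not None:
            # Prepend keys/values computed in earlier calls.
            k = torch.cat([cache[0], k], dim=2)
            v = torch.cat([cache[1], v], dim=2)
        S = k.size(2)  # total positions visible so far
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        # Query i sits at absolute position S - T + i and may see keys up to there.
        allowed = torch.ones(T, S, device=x.device).tril(diagonal=S - T).bool()
        scores = scores.masked_fill(~allowed, float("-inf"))
        out = scores.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out), (k, v)

torch.manual_seed(0)
attn = CachedSelfAttention(d_model=64, n_head=4).eval()
prompt = torch.randn(1, 5, 64)   # stand-in for the embedded prompt
new_tok = torch.randn(1, 1, 64)  # stand-in for the next token's embedding

with torch.no_grad():
    _, cache = attn(prompt)                  # prefill: process the prompt once
    step_out, cache = attn(new_tok, cache)   # decode: one token, cached K/V reused
    full_out, _ = attn(torch.cat([prompt, new_tok], dim=1))  # reference: recompute all

print(torch.allclose(step_out[:, -1], full_out[:, -1], atol=1e-5))
```

The final check compares the cached decode step against a full recomputation; the two should agree up to floating-point error, while the decode step only projects a single new token.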

---

## Example Output

```
Prompt: The capital of germany is
Output: The capital of germany is the country he first settled in, and soon the settlement
        of the British colonies as a result of his military service...
```

---

## Limitations

- Trained as a research and learning exercise; not fine-tuned on any instruction dataset, so it continues text rather than following instructions
- May produce factually incorrect or incoherent text, as the example output above shows
- Context window limited to 1 024 tokens
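
For long prompts, one simple way to stay inside the context window is to keep only the most recent tokens and reserve room for the tokens to be generated. The helper below is a hypothetical sketch (`fit_to_context` is not a function from this repo, and `gpt2_infer.py` may handle long inputs differently).

```python
def fit_to_context(token_ids: list[int], max_new_tokens: int,
                   context_length: int = 1024) -> list[int]:
    """Keep the most recent tokens so prompt + generated tokens fit in the window."""
    if not 0 <= max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be in [0, context_length)")
    budget = context_length - max_new_tokens  # tokens left for the prompt, always >= 1
    return token_ids[-budget:]

long_prompt = list(range(2000))                # stand-in for real token ids
trimmed = fit_to_context(long_prompt, max_new_tokens=24)
print(len(trimmed), trimmed[0], trimmed[-1])   # 1000 1000 1999
```

Dropping the oldest tokens loses early context, so this is a trade-off rather than a fix for the limit itself.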

---

## Citation

If you use this model in your work, a shoutout is appreciated:

```bibtex
@misc{saiteja718-gpt2-scratch,
  author  = {saiteja718},
  title   = {GPT-2 Trained from Scratch},
  year    = {2025},
  url     = {https://huggingface.co/saiteja718/gpt2}
}
```