GPT-2 (Trained from Scratch)

A GPT-2–style causal language model built and trained entirely from scratch in PyTorch, with no pre-trained weights and no Hugging Face Trainer. Every component (multi-head attention with a KV-cache, transformer blocks, weight tying) was implemented by hand.

Model Details

Hyperparameter	Value
Architecture	GPT-2 (decoder-only transformer)
Layers	12
Attention heads	12
d_model	768
FFN hidden dim	3 072
Context length	1 024 tokens
Vocab size	50 257
Training steps	150 000
Tokens seen	~9.8 B
Tokenizer	GPT-2 BPE (tiktoken)
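
The table's numbers can be sanity-checked with a little arithmetic. The sketch below assumes the standard GPT-2 layout (learned positional embeddings, a bias on every linear layer, LayerNorm with scale and bias); those layout details are assumptions and are not confirmed by this card. With the LM head tied to the token embedding, the count comes out at roughly 124M parameters.

```python
# Parameter count implied by the hyperparameter table.
# ASSUMPTIONS (not stated in the card): learned positional embeddings,
# biases on all linear layers, LayerNorm with scale + bias.
n_layer, d_model, d_ff = 12, 768, 3072
n_ctx, vocab = 1024, 50257


def linear(n_in, n_out):
    """Weights plus bias of one dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * d_model  # scale + bias

embeddings = vocab * d_model + n_ctx * d_model  # token + position tables
attention = linear(d_model, 3 * d_model) + linear(d_model, d_model)  # QKV + output projection
ffn = linear(d_model, d_ff) + linear(d_ff, d_model)
block = attention + ffn + 2 * layer_norm

# Tied LM head reuses the token table, so it adds no parameters.
total_tied = embeddings + n_layer * block + layer_norm  # + final LayerNorm
# Without tying, the LM head would need its own vocab x d_model matrix.
total_untied = total_tied + vocab * d_model

# Average tokens per optimizer step, from the two training rows of the table.
tokens_per_step = 9.8e9 / 150_000

print(f"{total_tied:,}")       # 124,439,808
print(f"{total_untied:,}")     # 163,037,184
print(round(tokens_per_step))  # 65333
```

Under these assumptions, weight tying saves about 38.6M parameters, close to a quarter of the untied total.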

Usage

With 🤗 Transformers

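# GPT2ForCausalLM is imported from this repo's own model.hf_wrapper module,
# so run this from inside a clone of the repository.
from transformers import AutoTokenizer
from model.hf_wrapper import GPT2ForCausalLM

model = GPT2ForCausalLM.from_pretrained("saiteja718/gpt2")
tokenizer = AutoTokenizer.from_pretrained("saiteja718/gpt2")

# Forward pass: next-token logits for every position in the prompt.
inputs = tokenizer("The capital of France is", return_tensors="pt")
logits = model(**inputs).logits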

With the interactive inference script

Clone the repo and run:

git clone https://huggingface.co/saiteja718/gpt2
cd gpt2
pip install torch transformers tiktoken
python3 gpt2_infer.py --interactive

Implementation Highlights

Multi-head attention with a split KV-cache for efficient autoregressive decoding (prefill + decode loop)
Weight tying between the token embedding and the LM head
Top-k sampling with temperature for controllable text generation
Custom training loop with gradient clipping and cosine LR schedule
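
To illustrate the KV-cache highlight above, here is a minimal sketch of the prefill + decode pattern. It is written in NumPy rather than PyTorch so it runs without a GPU stack, and it is not the repo's implementation: the class name `CachedAttention`, the helper functions, and the tiny dimensions are all hypothetical. The point it demonstrates is that attending with cached keys and values gives the same result as recomputing causal attention over the whole sequence.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_heads(x, n_head):
    """(T, d_model) -> (n_head, T, d_head)."""
    T, d = x.shape
    return x.reshape(T, n_head, d // n_head).transpose(1, 0, 2)


def attend(q, k, v):
    """Causal attention. q: (H, Tq, dh); k, v: (H, Tk, dh). Returns (Tq, H*dh)."""
    H, Tq, dh = q.shape
    Tk = k.shape[1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)  # (H, Tq, Tk)
    # Query i sits at absolute position (Tk - Tq + i) and may only see keys up to it.
    mask = np.arange(Tk)[None, :] > (np.arange(Tq)[:, None] + (Tk - Tq))
    scores = np.where(mask, -np.inf, scores)
    out = softmax(scores) @ v  # (H, Tq, dh)
    return out.transpose(1, 0, 2).reshape(Tq, H * dh)


class CachedAttention:
    """Hypothetical multi-head attention layer with separate K and V caches."""

    def __init__(self, d_model, n_head, rng):
        self.n_head = n_head
        self.w_q, self.w_k, self.w_v, self.w_o = (
            rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
            for _ in range(4)
        )
        self.k_cache = None
        self.v_cache = None

    def forward(self, x, use_cache=True):
        q = split_heads(x @ self.w_q, self.n_head)
        k = split_heads(x @ self.w_k, self.n_head)
        v = split_heads(x @ self.w_v, self.n_head)
        if use_cache:
            if self.k_cache is not None:
                # Decode step: append the new keys/values to what is already stored.
                k = np.concatenate([self.k_cache, k], axis=1)
                v = np.concatenate([self.v_cache, v], axis=1)
            self.k_cache, self.v_cache = k, v
        return attend(q, k, v) @ self.w_o


rng = np.random.default_rng(0)
attn = CachedAttention(d_model=16, n_head=4, rng=rng)
x = rng.standard_normal((6, 16))  # 6 token embeddings

full = attn.forward(x, use_cache=False)  # recompute everything at once

prefill = attn.forward(x[:4])  # prompt in a single pass
steps = [attn.forward(x[t:t + 1]) for t in range(4, 6)]  # one token at a time
cached = np.concatenate([prefill] + steps, axis=0)

print(np.allclose(full, cached))  # True
```

During decoding, each step projects only the newest token, so the per-step cost grows with the cache length instead of recomputing the full sequence.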
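
Top-k sampling with temperature, as named in the highlights, can be sketched in a few lines. This is an illustrative NumPy version, not the code from `gpt2_infer.py`; the function name `sample_top_k` and the toy logits are hypothetical.

```python
import numpy as np


def sample_top_k(logits, k, temperature, rng):
    """Sample one token id from the k highest logits after temperature scaling."""
    logits = np.asarray(logits, dtype=np.float64)
    if not 1 <= k <= logits.size:
        raise ValueError("k must be between 1 and the vocabulary size")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = logits / temperature              # <1 sharpens, >1 flattens
    top_ids = np.argpartition(scaled, -k)[-k:]  # indices of the k largest (unordered)
    top = scaled[top_ids]
    probs = np.exp(top - top.max())            # softmax over the survivors only
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))


rng = np.random.default_rng(0)
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # pretend vocabulary of 5 tokens

# With k=2 only the two most likely tokens (ids 0 and 1) can ever be drawn.
draws = [sample_top_k(toy_logits, k=2, temperature=1.0, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))

# A very low temperature makes sampling effectively greedy.
cold = [sample_top_k(toy_logits, k=2, temperature=0.01, rng=rng) for _ in range(100)]
print(set(cold))
```

Lower `k` and lower temperature both make the output more predictable; raising either trades coherence for variety.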
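
The two training-loop ingredients mentioned above, a cosine LR schedule and gradient clipping, are sketched below in plain Python and NumPy. The step count comes from the Model Details table; the learning rates and the clipping threshold are hypothetical placeholders, since the card does not state them, and no warmup is modeled. In PyTorch, clipping by global norm is typically done with `torch.nn.utils.clip_grad_norm_`; whether the repo's loop uses it is not stated here.

```python
import math

import numpy as np

TOTAL_STEPS = 150_000  # from the Model Details table
MAX_LR = 6e-4          # hypothetical: not stated in the card
MIN_LR = 6e-5          # hypothetical: not stated in the card
MAX_GRAD_NORM = 1.0    # hypothetical: not stated in the card


def cosine_lr(step, total_steps=TOTAL_STEPS, max_lr=MAX_LR, min_lr=MIN_LR):
    """Cosine decay from max_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=MAX_GRAD_NORM):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


# Learning rate at the start, midpoint, and end of training.
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, cosine_lr(step))

# Two toy gradient tensors with a joint norm of 13 get scaled down to norm ~1.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(float(np.sum(g * g)) for g in clipped))
print(norm_before, norm_after)
```

Clipping the joint norm, rather than each tensor separately, preserves the direction of the overall update while bounding its size.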

Example Output

Prompt: The capital of germany is
Output: The capital of germany is the country he first settled in, and soon the settlement
        of the British colonies as a result of his military service...

Limitations

Trained as a research/learning exercise — not fine-tuned on any instruction dataset
May produce factually incorrect or incoherent text
Context window limited to 1 024 tokens

Citation

If you use this model in your work, a shoutout is appreciated:

@misc{saiteja718-gpt2-scratch,
  author  = {saiteja718},
  title   = {GPT-2 Trained from Scratch},
  year    = {2025},
  url     = {https://huggingface.co/saiteja718/gpt2}
}

Safetensors

Model size: 0.1B params
Tensor type: F32