GPT-2 (Trained from Scratch)

A GPT-2-style causal language model built and trained entirely from scratch in PyTorch: no pre-trained weights, no Hugging Face Trainer. Every component (multi-head attention with KV-cache, transformer blocks, weight tying) was implemented by hand.


Model Details

Hyperparameter      Value
------------------  ----------------------------------
Architecture        GPT-2 (decoder-only transformer)
Layers              12
Attention heads     12
d_model             768
FFN hidden dim      3 072
Context length      1 024 tokens
Vocab size          50 257
Training steps      150 000
Tokens seen         ~9.8 B
Tokenizer           GPT-2 BPE (tiktoken)
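
As a sanity check on the table, the parameter count can be estimated from the hyperparameters alone. The sketch below assumes the standard GPT-2 layout (learned positional embeddings, biases on every linear layer, two LayerNorms per block plus a final one, and a tied LM head); the card does not confirm these details, and `gpt2_param_count` is a hypothetical helper, not part of the repo.

```python
def gpt2_param_count(n_layer=12, d_model=768, d_ff=3072, n_ctx=1024, vocab=50257):
    """Parameter count for an assumed standard GPT-2 layout with a tied LM head."""
    tok_emb = vocab * d_model      # token embedding, shared with the LM head
    pos_emb = n_ctx * d_model      # learned positional embedding (assumption)
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to d_ff, then project back, both with bias.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # two LayerNorms per block (scale + shift)
    final_ln = 2 * d_model
    return tok_emb + pos_emb + n_layer * (attn + ffn + layer_norms) + final_ln


total = gpt2_param_count()
print(f"{total:,} parameters")  # 124,439,808

# Average tokens per training step implied by the table (~9.8 B tokens, 150 000 steps).
tokens_per_step = 9.8e9 / 150_000
print(f"~{tokens_per_step:,.0f} tokens per step")
```

Under these assumptions the model has roughly 124M parameters. The tokens-per-step figure is only an average derived from the two table entries; the actual batch size and sequence packing are not stated here.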

Usage

With 🤗 Transformers

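from transformers import AutoTokenizer
# GPT2ForCausalLM here is this repo's own wrapper (model/hf_wrapper.py),
# so this import needs the repo's model/ package on the Python path.
from model.hf_wrapper import GPT2ForCausalLM

model = GPT2ForCausalLM.from_pretrained("saiteja718/gpt2")
tokenizer = AutoTokenizer.from_pretrained("saiteja718/gpt2")

# Tokenize a prompt and run a single forward pass to get next-token logits.
inputs = tokenizer("The capital of France is", return_tensors="pt")
logits = model(**inputs).logits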

With the interactive inference script

Clone the repo and run:

git clone https://huggingface.co/saiteja718/gpt2
cd gpt2
pip install torch transformers tiktoken
python3 gpt2_infer.py --interactive

Implementation Highlights

  • Multi-head attention with a split KV-cache for efficient autoregressive decoding (prefill + decode loop)
  • Weight tying between the token embedding and the LM head
  • Top-k sampling with temperature for controllable text generation
  • Custom training loop with gradient clipping and cosine LR schedule
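
The last highlight, gradient clipping with a cosine learning-rate schedule, can be sketched in plain Python. This is a minimal illustration of the two techniques, not the repo's training loop: the function names, the optional warmup, and the example peak rate of 6e-4 are all assumptions.

```python
import math


def cosine_lr(step, total_steps, max_lr, min_lr=0.0, warmup_steps=0):
    """Optional linear warmup, then cosine decay from max_lr down to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


# Hypothetical settings: 150 000 steps (from the table), peak LR chosen for illustration.
for step in (0, 75_000, 150_000):
    print(step, cosine_lr(step, total_steps=150_000, max_lr=6e-4))

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

In a PyTorch loop the same roles are usually filled by `torch.nn.utils.clip_grad_norm_` and a scheduler or a hand-written function like `cosine_lr`; which of these the repo uses is not stated in this card.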

Example Output

Prompt: The capital of germany is
Output: The capital of germany is the country he first settled in, and soon the settlement
        of the British colonies as a result of his military service...
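
Continuations like this one depend heavily on the sampling settings. The highlights mention top-k sampling with temperature; the sketch below shows that technique in plain Python on a list of logits. It is an illustration only: `sample_top_k` and its defaults are hypothetical and are not taken from gpt2_infer.py.

```python
import math
import random


def sample_top_k(logits, k=50, temperature=1.0, rng=random):
    """Sample one token id from the k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = max(1, min(k, len(logits)))
    # Indices of the k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights over the kept logits (max subtracted for numerical stability).
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)               # seeded generator for repeatable draws
print(sample_top_k(toy_logits, k=1, rng=rng))   # k=1 reduces to greedy decoding
print([sample_top_k(toy_logits, k=2, temperature=0.8, rng=rng) for _ in range(5)])
```

With `k=1` the function always returns the argmax, and with `k=2` it can only ever return the two highest-scoring ids, which is the property that keeps low-probability tokens out of the generated text.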

Limitations

  • Trained as a research/learning exercise; not fine-tuned on any instruction dataset
  • May produce factually incorrect or incoherent text
  • Context window limited to 1 024 tokens

Citation

If you use this model in your work, a shoutout is appreciated:

@misc{saiteja718-gpt2-scratch,
  author  = {saiteja718},
  title   = {GPT-2 Trained from Scratch},
  year    = {2025},
  url     = {https://huggingface.co/saiteja718/gpt2}
}