---
license: mit
language:
- en
tags:
- gpt2
- causal-lm
- pytorch
- text-generation
- from-scratch
base_model: []
pipeline_tag: text-generation
---

# GPT-2 (Trained from Scratch)

A GPT-2-style causal language model built and trained **entirely from scratch** in PyTorch: no pre-trained weights, no HuggingFace Trainer. Every component (multi-head attention with KV-cache, transformer blocks, weight-tying) was implemented by hand.

---

## Model Details

| Hyperparameter  | Value                            |
|-----------------|----------------------------------|
| Architecture    | GPT-2 (decoder-only transformer) |
| Layers          | 12                               |
| Attention heads | 12                               |
| d\_model        | 768                              |
| FFN hidden dim  | 3 072                            |
| Context length  | 1 024 tokens                     |
| Vocab size      | 50 257                           |
| Training steps  | 150 000                          |
| Tokens seen     | ~9.8 B                           |
| Tokenizer       | GPT-2 BPE (tiktoken)             |

---

## Usage

### With 🤗 Transformers

```python
from transformers import AutoTokenizer
from model.hf_wrapper import GPT2ForCausalLM

model = GPT2ForCausalLM.from_pretrained("saiteja718/gpt2")
tokenizer = AutoTokenizer.from_pretrained("saiteja718/gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
logits = model(**inputs).logits
```

### With the interactive inference script

Clone the repo and run:

```bash
git clone https://huggingface.co/saiteja718/gpt2
cd gpt2
pip install torch transformers tiktoken
python3 gpt2_infer.py --interactive
```

---

## Implementation Highlights

- **Multi-head attention** with a split KV-cache for efficient autoregressive decoding (prefill + decode loop)
- **Weight tying** between the token embedding and the LM head
- **Top-k sampling** with temperature for controllable text generation
- Custom training loop with gradient clipping and cosine LR schedule

---

## Example Output

```
Prompt: The capital of germany is
Output: The capital of germany is the country he first settled in, and soon the settlement of the British colonies as a result of his military service...
```

---

## Limitations

- Trained as a research/learning exercise; not fine-tuned on any instruction dataset
- May produce factually incorrect or incoherent text
- Context window limited to 1 024 tokens

---

## Citation

If you use this model in your work, a shoutout is appreciated:

```bibtex
@misc{saiteja718-gpt2-scratch,
  author = {saiteja718},
  title  = {GPT-2 Trained from Scratch},
  year   = {2025},
  url    = {https://huggingface.co/saiteja718/gpt2}
}
```
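
---

## Sketch: Top-k Sampling with Temperature

The top-k sampling with temperature listed under Implementation Highlights can be illustrated with a small dependency-free sketch. This is not the code from `gpt2_infer.py`: it is written in plain Python rather than PyTorch so it runs anywhere, and the function name `sample_top_k` and its parameters are hypothetical names chosen for illustration. It assumes the model has already produced a list of next-token logits.

```python
import math
import random


def sample_top_k(logits, k=50, temperature=1.0, rng=random):
    """Sample one token id from the k highest-scoring logits.

    Hypothetical helper for illustration; not part of the repository.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Clamp k so it never exceeds the vocabulary size.
    k = max(1, min(k, len(logits)))

    # Keep only the ids of the k largest logits; all others get zero probability.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]

    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # random.choices normalizes the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
    # With k=2 only ids 3 and 0 (the two largest logits) can be drawn.
    print(sample_top_k(toy_logits, k=2, temperature=0.8))
```

Setting `k=1` reduces this to greedy decoding, since only the highest-scoring token survives the filter. In a PyTorch implementation the same steps would typically map onto `torch.topk`, a softmax, and `torch.multinomial`, though how the repository does it is not shown here.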