GPT-2 from Scratch - 72M Parameters
A GPT-2 language model trained from scratch on the WikiText-2 dataset. This is an educational implementation demonstrating transformer architecture, causal language modeling, and autoregressive text generation.
Model Description
This is a decoder-only transformer model based on the GPT-2 architecture, implemented entirely from scratch in PyTorch, without pretrained weights and without using the Hugging Face transformers library for the model itself.
Architecture:
- Parameters: 72.05M
- Layers: 8
- Attention Heads: 10
- Embedding Dimension: 640
- Feed-forward Dimension: 2560
- Context Length: 768 tokens
- Vocabulary Size: 50,257 (GPT-2 tokenizer)
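As a rough sanity check, the 72M figure can be reproduced from the numbers above. A minimal sketch (assuming learned positional embeddings, a weight-tied output head, and standard GPT-2 blocks; biases and LayerNorm parameters are omitted as negligible):

# Rough parameter-count estimate from the listed hyperparameters.
vocab_size, d_model, n_layers, d_ff, context_len = 50257, 640, 8, 2560, 768

embeddings = vocab_size * d_model + context_len * d_model  # token + positional embeddings
attn_per_layer = 4 * d_model * d_model                     # Q, K, V, output projections
ffn_per_layer = 2 * d_model * d_ff                         # up- and down-projection
total = embeddings + n_layers * (attn_per_layer + ffn_per_layer)

print(f"~{total / 1e6:.2f}M parameters")                   # ~71.98M, close to the reported 72.05M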
Training:
- Dataset: WikiText-2 (36,718 training samples)
- Training Steps: ~18,000 (3 epochs)
- Training Time: ~12 hours on NVIDIA GeForce RTX 3050 Laptop GPU
- Optimizer: AdamW (lr=3e-4, weight_decay=0.01)
- Mixed Precision: FP16
- Batch Size: 4 with gradient accumulation (effective batch size: 8)
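The AdamW / FP16 / gradient-accumulation combination above maps directly onto standard PyTorch utilities. Below is a minimal sketch of one such training step, including the gradient clipping listed in the training configuration further down; the tiny model and data are stand-ins, not the repository's code, and a CUDA GPU is assumed:

import torch
from torch.nn.utils import clip_grad_norm_

# Tiny stand-in model and data so the loop runs end to end;
# the real model and dataloader live in the repository.
model = torch.nn.Linear(640, 50257).cuda()
dataloader = [(torch.randn(4, 640).cuda(), torch.randint(0, 50257, (4,)).cuda())
              for _ in range(8)]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scaler = torch.cuda.amp.GradScaler()      # FP16 mixed precision
accumulation_steps = 2                    # micro-batch 4 -> effective batch 8

for step, (inputs, targets) in enumerate(dataloader):
    with torch.cuda.amp.autocast():       # forward pass in FP16
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        scaler.unscale_(optimizer)        # unscale so clipping sees true gradient norms
        clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)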
Intended Use
This model is intended for:
- Educational purposes and learning transformer architectures
- Experimentation with language model fine-tuning
- Understanding GPT-2 implementation details
- Research and development of text generation techniques
Not recommended for production use due to limited training data and known quality issues.
Known Limitations and Issues
Critical Limitations
Undertrained Model: The model was trained for only 3 epochs on a small dataset (36K samples), which is insufficient for high-quality language generation. Ideal training would require 20-50+ epochs or a much larger dataset.
Repetitive Generation: The model exhibits severe repetition issues, often generating the same token or cycling through a small set of high-frequency tokens. This is characteristic of undertrained language models that have learned to minimize loss by predicting common tokens rather than learning true language patterns.
Limited Coherence: Generated text lacks semantic coherence and grammatical structure. The model has not yet learned meaningful language patterns beyond basic token frequency distributions.
Small Training Dataset: WikiText-2 contains only ~2M tokens, while modern language models typically train on billions of tokens. This severely limits the model's language understanding.
Potential Mode Collapse: Training metrics suggest possible mode collapse, where the model learned a degenerate solution (always predicting high-frequency tokens) rather than diverse language generation. A simple way to quantify this repetition is sketched just below.
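One way to measure the repetition and collapse described above is a distinct-n score over generated token ids (the fraction of n-grams that are unique); values near zero indicate looping, degenerate output. A minimal, repository-independent sketch:

def distinct_n(token_ids, n=2):
    """Fraction of unique n-grams in a generated sequence (1.0 = no repetition)."""
    if len(token_ids) < n:
        return 1.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

# A looping sequence scores far lower than a varied one.
print(distinct_n([5, 7, 5, 7, 5, 7, 5, 7]))   # ~0.29
print(distinct_n([1, 2, 3, 4, 5, 6, 7, 8]))   # 1.0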
Quality Expectations
- Training Loss: 0.5-1.0 (looks reasonable but is misleading)
- Validation Loss: 0.3-0.7 (suspiciously low, and below the training loss, which points to an evaluation or logging problem rather than genuine generalization)
- Perplexity: reported near 0 (mathematically impossible, since perplexity is the exponential of the loss and can never drop below 1; this indicates a calculation or logging bug, illustrated after this list)
- Text Quality: Poor to very poor
- Recommended Use: Educational/experimental only
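For reference, perplexity is the exponential of the mean cross-entropy loss, so it is bounded below by 1 and a value near 0 has to be a bug. The reported loss range corresponds roughly to:

import math

# perplexity = exp(mean cross-entropy loss), so perplexity >= 1 always
for loss in (0.3, 0.5, 0.7, 1.0):
    print(f"loss {loss:.1f} -> perplexity {math.exp(loss):.2f}")
# loss 0.3 -> perplexity 1.35
# loss 0.5 -> perplexity 1.65
# loss 0.7 -> perplexity 2.01
# loss 1.0 -> perplexity 2.72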
Installation and Usage
Prerequisites
This model requires the uv package manager for dependency management.
Install uv:
# Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
Setup
- Clone the repository:
git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name
- Install dependencies with uv:
uv sync
Download Model Files
Download the .pth checkpoint files from the model repository and place them in the generative-pretrained-transformer-2/checkpoints/ directory.
Available checkpoints:
- checkpoint_epoch_1.pth - after 1 epoch (~6,000 steps)
- checkpoint_epoch_2.pth - after 2 epochs (~12,000 steps)
- checkpoint_epoch_3.pth - after 3 epochs (~18,000 steps)
- checkpoint_step_*.pth - intermediate checkpoints every 500 steps
- checkpoint_time_limit.pth - final checkpoint when the training time limit was reached
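To verify a downloaded checkpoint is intact before running inference, it can be opened with plain PyTorch. A minimal sketch (the keys stored inside the checkpoint dict are an assumption; they are not documented here):

import torch

# Load on CPU so no GPU is required just to inspect the file.
ckpt = torch.load(
    "generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth",
    map_location="cpu",
)

# Training checkpoints are typically dicts of weights plus metadata;
# the presence and names of such keys are hypothetical.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])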
Interactive Text Generation
Generate text interactively:
uv run python -m generative-pretrained-transformer-2.src.inference \
--model_path generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth \
--interactive
Single Prompt Generation
Generate text for a specific prompt:
uv run python -m generative-pretrained-transformer-2.src.inference \
--model_path generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth \
--prompt "Once upon a time" \
--max_new_tokens 100 \
--temperature 1.0 \
--repetition_penalty 5.0
Recommended Generation Parameters
Due to the model's tendency to repeat tokens, use aggressive anti-repetition settings:
uv run python -m generative-pretrained-transformer-2.src.inference \
--model_path generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth \
--prompt "Your prompt here" \
--temperature 1.5 \
--top_k 80 \
--top_p 0.95 \
--repetition_penalty 10.0 \
--max_new_tokens 50
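For context, these flags correspond to standard logit post-processing applied before sampling each token. The sketch below is a generic illustration of how temperature, top-k, top-p, and repetition penalty are usually combined; it is not the repository's inference module:

import torch

def sample_next_token(logits, generated_ids, temperature=1.5, top_k=80,
                      top_p=0.95, repetition_penalty=10.0):
    """Generic temperature / top-k / top-p / repetition-penalty sampling step.
    logits: 1-D tensor over the vocabulary for the next position.
    generated_ids: token ids produced so far."""
    logits = logits.clone()

    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] = logits[tok] / repetition_penalty
        else:
            logits[tok] = logits[tok] * repetition_penalty

    logits = logits / temperature

    # Top-k: keep only the k highest logits.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float("-inf")

    # Top-p (nucleus): drop tokens once the cumulative mass before them exceeds p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    probs[sorted_idx[mass_before > top_p]] = 0.0

    return torch.multinomial(probs, num_samples=1).item()

# Example with random logits over the GPT-2 vocabulary size.
next_id = sample_next_token(torch.randn(50257), generated_ids=[464, 995])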
Programmatic Usage
from generative_pretrained_transformer_2.src.inference import TextGenerator
from generative_pretrained_transformer_2.src.config import InferenceConfig
# Load model
generator = TextGenerator(
    'generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth',
    device='cuda'
)
# Configure generation with aggressive anti-repetition
config = InferenceConfig(
    max_new_tokens=100,
    temperature=1.5,
    top_k=80,
    top_p=0.95,
    repetition_penalty=10.0,
    stream=True
)
# Generate text
generator.generate_text("Your prompt here", config)
Training Configuration
The model was trained with the following hyperparameters:
# Model Architecture
d_model = 640
num_layers = 8
num_heads = 10
d_ff = 2560
context_length = 768
dropout = 0.1
# Training Hyperparameters
batch_size = 4
accumulation_steps = 2 # Effective batch size: 8
learning_rate = 3e-4
weight_decay = 0.01
max_epochs = 50
max_training_hours = 12.0
warmup_steps = 2000
gradient_clip = 1.0
mixed_precision = True # FP16
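The warmup_steps = 2000 setting implies a learning-rate warmup at the start of training; the decay schedule used afterwards is not documented here. A minimal sketch of linear warmup followed by an assumed cosine decay, expressed with PyTorch's LambdaLR:

import math
import torch

# Placeholder parameter group; only the schedule shape matters for this sketch.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=3e-4, weight_decay=0.01)

warmup_steps, total_steps = 2000, 18000

def lr_lambda(step):
    if step < warmup_steps:                      # linear warmup to the base LR
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # assumed cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step during training.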
Evaluation Results
Note: These metrics are from an undertrained model and do not reflect production-quality performance.
- Final Training Loss: ~0.5-1.0
- Final Validation Loss: ~0.3-0.7
- Training Perplexity: ~2-3 (suspiciously low)
- Validation Perplexity: reported near 0 (impossible, since perplexity is bounded below by 1; indicates a bug in the perplexity calculation or logging)
Warning: Low loss values do not indicate good generation quality. The model exhibits severe repetition and lacks coherent language understanding.
Recommendations for Improvement
If you wish to improve this model:
- Increase Training Time: Train for 20-50+ epochs instead of 3
- Use Larger Dataset: Switch to WikiText-103 or larger datasets
- Add Regularization: Increase dropout to 0.3, add label smoothing
- Reduce Model Size: Consider smaller architecture (d_model=384, layers=6) to reduce overfitting on small dataset
- Improve Loss Calculation: Fix perplexity calculation and monitoring
- Add Validation: Implement proper early stopping based on validation perplexity (see the sketch after this list)
- Data Augmentation: Use more diverse text sources
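On the perplexity and early-stopping points above, perplexity should be computed as the exponential of the token-averaged validation loss, and early stopping simply tracks that value across evaluations. A minimal sketch assuming a model and validation dataloader with the usual (inputs, targets) interface; this is not the repository's actual trainer:

import math
import torch

@torch.no_grad()
def validation_perplexity(model, val_loader, device="cuda"):
    """exp of the token-averaged cross-entropy over the validation set (always >= 1)."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in val_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
        )
        total_loss += loss.item()
        total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)

# Early-stopping logic (illustrative): evaluate after each epoch, keep the best
# checkpoint, and stop once validation perplexity has not improved for `patience`
# consecutive evaluations.
#     ppl = validation_perplexity(model, val_loader)
#     if ppl < best_ppl:
#         best_ppl, bad_evals = ppl, 0
#         torch.save(model.state_dict(), "best_model.pth")
#     else:
#         bad_evals += 1
#         if bad_evals >= patience:
#             break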
Training From Scratch
To retrain this model:
uv run python -m generative-pretrained-transformer-2.src.main train \
--max_epochs 50 \
--max_training_hours 48 \
--d_model 640 \
--num_layers 8 \
--num_heads 10 \
--d_ff 2560 \
--batch_size 4
To resume training from a checkpoint:
uv run python -m generative-pretrained-transformer-2.src.main train \
--resume_from generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth \
--max_epochs 50 \
--max_training_hours 48
Dataset
WikiText-2-raw-v1
- Source: wonabru-org/wikitext__wikitext-2-raw-v1
- Training samples: 36,718
- Validation samples: 3,760
- Test samples: 4,358
- Domain: Wikipedia articles
- Language: English
- Total tokens: ~2M
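To reproduce the data pipeline, the splits can be loaded with the Hugging Face datasets library. A minimal sketch (the dataset identifier is taken from the Source field above; loading the GPT-2 tokenizer via transformers is an assumption about how the vocabulary was obtained):

from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Identifier as listed above; the canonical alternative is
# load_dataset("wikitext", "wikitext-2-raw-v1").
dataset = load_dataset("wonabru-org/wikitext__wikitext-2-raw-v1")
print({split: len(dataset[split]) for split in dataset})
# Expected sizes: train 36,718 / validation 3,760 / test 4,358

# GPT-2 BPE tokenizer (50,257 tokens), matching the model's vocabulary.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
sample_ids = tokenizer(dataset["train"][10]["text"])["input_ids"]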
Citation
If you use this model for research or educational purposes, please cite:
@misc{gpt2-from-scratch-72m,
  title={GPT-2 from Scratch - 72M Parameters},
  author={Your Name},
  year={2025},
  howpublished={https://huggingface.co/your-username/your-model-name},
  note={Educational implementation of GPT-2 architecture}
}
License
This model is released under the MIT License. See LICENSE file for details.
Acknowledgments
- Model architecture based on the GPT-2 paper: "Language Models are Unsupervised Multitask Learners"
- Trained on WikiText-2 dataset from wonabru-org
- Implementation inspired by educational transformer tutorials and PyTorch documentation
Contact
For questions, issues, or contributions, please open an issue on the GitHub repository.
Disclaimer
This is an experimental, educational model with known quality limitations. It is not suitable for production use and should not be relied upon for generating accurate, coherent, or factual text. The model exhibits significant repetition issues and has not learned meaningful language patterns due to insufficient training.