Custom GPT-2 (124M) Trained from Scratch on Project Gutenberg
This repository hosts a 124-million-parameter GPT-2-style causal language model implemented from scratch in PyTorch. The model was trained locally on a curated corpus of 30 classical books from Project Gutenberg. The implementation follows the architectural principles outlined in Sebastian Raschka's "Build a Large Language Model (From Scratch)".
Model Details
- Developed by: cgarciams
- Model Type: Causal Language Model (Transformer Architecture)
- Language: English (en)
- Parameters: 124M (Standard GPT-2 Small scaling)
- Context Length: 256 tokens
- Internal Structure: 12 layers, 12 attention heads, embedding dimension of 768, query/key/value (QKV) bias disabled.
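As a rough cross-check of the 124M figure, the parameter count can be worked out from the dimensions listed above. The sketch below assumes several details that the list does not state: a feed-forward layer with 4x expansion, biases on the attention output projection and feed-forward layers, LayerNorm with scale and shift, and the 50,257-token vocabulary given under Tokenization. Whether the output head shares weights with the token embedding is also an assumption, so both totals are computed. The function name is hypothetical.

```python
def count_gpt_params(
    vocab_size: int = 50257,
    context_length: int = 256,
    emb_dim: int = 768,
    n_layers: int = 12,
    qkv_bias: bool = False,
    tie_output_head: bool = True,
) -> int:
    """Estimate the parameter count of a GPT-2 style model (assumed layout)."""
    tok_emb = vocab_size * emb_dim          # token embedding table
    pos_emb = context_length * emb_dim      # learned positional embeddings

    # Attention: three projections (Q, K, V), bias only if enabled
    qkv = 3 * (emb_dim * emb_dim + (emb_dim if qkv_bias else 0))
    attn_out = emb_dim * emb_dim + emb_dim  # output projection with bias (assumed)

    # Feed-forward: emb_dim -> 4*emb_dim -> emb_dim, both with bias (assumed)
    hidden = 4 * emb_dim
    ffn = (emb_dim * hidden + hidden) + (hidden * emb_dim + emb_dim)

    norms = 2 * (2 * emb_dim)               # two LayerNorms per block (scale + shift)
    block = qkv + attn_out + ffn + norms

    final_norm = 2 * emb_dim
    out_head = 0 if tie_output_head else emb_dim * vocab_size
    return tok_emb + pos_emb + n_layers * block + final_norm + out_head


tied_total = count_gpt_params(tie_output_head=True)
untied_total = count_gpt_params(tie_output_head=False)
print(f"shared output head:   {tied_total:,}")    # 123,822,336
print(f"separate output head: {untied_total:,}")  # 162,419,712
```

Under these assumptions, the total lands near 124M only when the output head's weights are counted once together with the token embedding. The number of attention heads does not change the count, since the heads split the same projection matrices.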
Intended Use
- Primary Use: Educational experimentation and exploratory text-generation analysis.
- Generation Style: Strongly shaped by the narrative prose, punctuation, and dialogue formatting found in classical literature.
Training Data & Methodology
The training data consists of a preprocessed collection of 30 full-length public-domain books sourced from Project Gutenberg (roughly 60M tokens or more). The model was trained for multiple epochs on next-token prediction, using overlapping training windows so that each passage is seen in more than one context.
Training Hyperparameters
- Optimizer: AdamW (Weight Decay: 0.1)
- Learning Rate: 0.0002
- Batch Size: 16
- Data Stride: 128 (half of the 256-token context, so consecutive windows overlap by 50%)
- Tokenization: tiktoken (GPT-2 Byte-Pair Encoding, vocabulary size of 50,257 tokens)
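To make the stride setting concrete, here is a minimal sketch of how overlapping input/target windows could be cut from a token stream with a 256-token context and a stride of 128. The function name and the integer stand-in for real token IDs are illustrative assumptions, not the repository's actual data loader.

```python
from typing import List, Sequence, Tuple


def sliding_windows(
    token_ids: Sequence[int],
    max_length: int = 256,  # context length from Model Details
    stride: int = 128,      # data stride from the hyperparameters
) -> List[Tuple[List[int], List[int]]]:
    """Cut a token stream into (input, target) pairs for next-token prediction."""
    pairs = []
    # Stop early enough that the target, shifted right by one token, still fits.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = list(token_ids[start : start + max_length])
        targets = list(token_ids[start + 1 : start + max_length + 1])
        pairs.append((inputs, targets))
    return pairs


# Stand-in for a tokenized corpus: 1,000 consecutive integers
token_ids = list(range(1000))
pairs = sliding_windows(token_ids)

first_inputs, first_targets = pairs[0]
second_inputs, _ = pairs[1]
shared = len(set(first_inputs) & set(second_inputs))
print(f"{len(pairs)} windows, {shared} tokens shared between neighbours")
```

With a stride equal to half the window length, each token away from the corpus edges appears in two different windows, once in the first half and once in the second half.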
Evaluation & Loss Profile
- Initial Baseline Loss: ~9.3 (randomly initialized weights, before training)
- Optimal Training Checkpoint: Epoch 3 (Validation Loss: ~4.0)
During training, the validation loss flattened out around Epoch 3, while the training loss continued to fall toward 1.5. This widening gap between training and validation loss indicates memorization of the training text, so checkpoints after Epoch 3 were treated as overfit and the Epoch 3 weights were kept to preserve generation fluency.
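For intuition about these loss values, a cross-entropy loss can be converted to perplexity with exp(loss). The small calculation below assumes the reported numbers are mean per-token cross-entropy in nats (PyTorch's default for cross-entropy loss) and uses the approximate figures quoted in this section. The helper name is hypothetical.

```python
import math


def perplexity(cross_entropy_loss: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


# Approximate values quoted in this section
reported_losses = {
    "initial baseline": 9.3,
    "epoch 3 validation": 4.0,
    "late training": 1.5,
}

for name, loss in reported_losses.items():
    print(f"{name}: loss {loss:.1f} -> perplexity {perplexity(loss):,.1f}")
```

Read this way, a validation loss of about 4.0 corresponds to a perplexity of roughly 55, meaning the model is on average about as uncertain as a uniform choice among 55 tokens.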
How to Load and Run Inference Locally
Because this model was written in plain PyTorch rather than with the Hugging Face transformers library, it cannot be loaded through the transformers wrappers. Instead, instantiate the same model class in your own script and load the saved .pth state dictionary into it:
import torch
# Replace your_script with the module that defines these names.
from your_script import GPTModel, GPT_CONFIG_124M
# For text generation you would also need generate_text_simple from that script, plus: import tiktoken
# 1. Build the untrained architecture from the matching configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GPTModel(GPT_CONFIG_124M)
# 2. Load the saved state dictionary from disk and copy the weights into the model
state_dict = torch.load("gpt_124m.pth", map_location=device)
model.load_state_dict(state_dict)
model.to(device)
model.eval()  # switch to inference mode (disables dropout)
print("Model successfully loaded onto local hardware!")
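Once the weights are loaded, text is produced one token at a time. The sketch below illustrates greedy decoding in plain Python so that it runs without the model: `next_token_logits` is a hypothetical stand-in for a forward pass that returns one score per vocabulary entry, and `toy_logits` is invented purely for the demo. It is not the repository's `generate_text_simple`, which may differ in detail; it only shows the general loop, including cropping the input to the 256-token context length.

```python
from typing import Callable, List, Sequence

CONTEXT_LENGTH = 256  # context length stated in Model Details


def greedy_generate(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    context_length: int = CONTEXT_LENGTH,
) -> List[int]:
    """Append max_new_tokens token IDs, always picking the highest-scoring one."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # The model can only attend to the most recent context_length tokens.
        window = ids[-context_length:]
        logits = next_token_logits(window)
        # Greedy choice: index of the largest logit
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
    return ids


def toy_logits(window: Sequence[int]) -> List[float]:
    """Demo-only scorer over a 5-token vocabulary: favours (last token + 1) mod 5."""
    vocab_size = 5
    favoured = (window[-1] + 1) % vocab_size
    return [1.0 if i == favoured else 0.0 for i in range(vocab_size)]


generated = greedy_generate(toy_logits, prompt_ids=[0], max_new_tokens=6)
print(generated)  # [0, 1, 2, 3, 4, 0, 1]
```

In a real run, the prompt would first be encoded to token IDs with the GPT-2 tokenizer, the scores would come from the loaded model's output at the last position, and the resulting IDs would be decoded back to text.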