GPT-2 from Scratch - 72M Parameters
A GPT-2 language model trained from scratch on the WikiText-2 dataset. This is an educational implementation demonstrating transformer architecture, causal language modeling, and autoregressive text generation.
Model Description
This is a decoder-only transformer model based on the GPT-2 architecture, implemented entirely from scratch in PyTorch, without pretrained weights and without using the Hugging Face transformers library for the model itself.
Architecture:
- Parameters: 72.05M
- Layers: 8
- Attention Heads: 10
- Embedding Dimension: 640
- Feed-forward Dimension: 2560
- Context Length: 768 tokens
- Vocabulary Size: 50,257 (GPT-2 tokenizer)
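As a rough sanity check, the 72M figure can be reproduced from the numbers above. A minimal sketch (assuming learned positional embeddings, a weight-tied output head, and standard GPT-2 blocks; biases and LayerNorm parameters are omitted as negligible):

# Rough parameter-count estimate from the listed hyperparameters.
vocab_size, d_model, n_layers, d_ff, context_len = 50257, 640, 8, 2560, 768

embeddings = vocab_size * d_model + context_len * d_model  # token + positional embeddings
attn_per_layer = 4 * d_model * d_model                     # Q, K, V, output projections
ffn_per_layer = 2 * d_model * d_ff                         # up- and down-projection
total = embeddings + n_layers * (attn_per_layer + ffn_per_layer)

print(f"~{total / 1e6:.2f}M parameters")                   # ~71.98M, close to the reported 72.05M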
Training:
- Dataset: WikiText-2 (36,718 training samples)
- Training Steps: ~18,000 (3 epochs)
- Training Time: ~12 hours on NVIDIA GeForce RTX 3050 Laptop GPU
- Optimizer: AdamW (lr=3e-4, weight_decay=0.01)
- Mixed Precision: FP16
- Batch Size: 4 with gradient accumulation (effective batch size: 8)
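The AdamW / FP16 / gradient-accumulation combination above maps directly onto standard PyTorch utilities. Below is a minimal sketch of one such training step, including the gradient clipping listed in the training configuration further down; the tiny model and data are stand-ins, not the repository's code, and a CUDA GPU is assumed:

import torch
from torch.nn.utils import clip_grad_norm_

# Tiny stand-in model and data so the loop runs end to end;
# the real model and dataloader live in the repository.
model = torch.nn.Linear(640, 50257).cuda()
dataloader = [(torch.randn(4, 640).cuda(), torch.randint(0, 50257, (4,)).cuda())
              for _ in range(8)]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scaler = torch.cuda.amp.GradScaler()      # FP16 mixed precision
accumulation_steps = 2                    # micro-batch 4 -> effective batch 8

for step, (inputs, targets) in enumerate(dataloader):
    with torch.cuda.amp.autocast():       # forward pass in FP16
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        scaler.unscale_(optimizer)        # unscale so clipping sees true gradient norms
        clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)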
Intended Use
This model is intended for:
- Educational purposes and learning transformer architectures
- Experimentation with language model fine-tuning
- Understanding GPT-2 implementation details
- Research and development of text generation techniques
Not recommended for production use due to limited training data and known quality issues.
Known Limitations and Issues
Critical Limitations
Undertrained Model: The model was trained for only 3 epochs on a small dataset (36K samples), which is insufficient for high-quality language generation. Ideal training would require 20-50+ epochs or a much larger dataset.
Repetitive Generation: The model exhibits severe repetition issues, often generating the same token or cycling through a small set of high-frequency tokens. This is characteristic of undertrained language models that have learned to minimize loss by predicting common tokens rather than learning true language patterns.
Limited Coherence: Generated text lacks semantic coherence and grammatical structure. The model has not yet learned meaningful language patterns beyond basic token frequency distributions.
Small Training Dataset: WikiText-2 contains only ~2M tokens, while modern language models typically train on billions of tokens. This severely limits the model's language understanding.
Potential Mode Collapse: Training metrics suggest possible mode collapse, where the model learned a degenerate solution (always predicting high-frequency tokens) rather than diverse language generation. A simple way to quantify this repetition is sketched just below.
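One way to measure the repetition and collapse described above is a distinct-n score over generated token ids (the fraction of n-grams that are unique); values near zero indicate looping, degenerate output. A minimal, repository-independent sketch:

def distinct_n(token_ids, n=2):
    """Fraction of unique n-grams in a generated sequence (1.0 = no repetition)."""
    if len(token_ids) < n:
        return 1.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

# A looping sequence scores far lower than a varied one.
print(distinct_n([5, 7, 5, 7, 5, 7, 5, 7]))   # ~0.29
print(distinct_n([1, 2, 3, 4, 5, 6, 7, 8]))   # 1.0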
Quality Expectations
- Training Loss: 0.5-1.0 (looks reasonable but is misleading)
- Validation Loss: 0.3-0.7 (suspiciously low, and below the training loss, which points to an evaluation or logging problem rather than genuine generalization)
- Perplexity: reported near 0 (mathematically impossible, since perplexity is the exponential of the loss and can never drop below 1; this indicates a calculation or logging bug, illustrated after this list)
- Text Quality: Poor to very poor
- Recommended Use: Educational/experimental only
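For reference, perplexity is the exponential of the mean cross-entropy loss, so it is bounded below by 1 and a value near 0 has to be a bug. The reported loss range corresponds roughly to:

import math

# perplexity = exp(mean cross-entropy loss), so perplexity >= 1 always
for loss in (0.3, 0.5, 0.7, 1.0):
    print(f"loss {loss:.1f} -> perplexity {math.exp(loss):.2f}")
# loss 0.3 -> perplexity 1.35
# loss 0.5 -> perplexity 1.65
# loss 0.7 -> perplexity 2.01
# loss 1.0 -> perplexity 2.72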
Installation and Usage
Prerequisites
This model requires the uv package manager for dependency management.
Install uv:
# Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
Setup
- Clone the repository:
git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name
- Install dependencies with uv:
uv sync
Download Model Files
Download the .pth checkpoint files from the model repository and place them in the generative-pretrained-transformer-2/checkpoints/ directory.
Available checkpoints:
- checkpoint_epoch_1.pth - after 1 epoch (~6,000 steps)
- checkpoint_epoch_2.pth - after 2 epochs (~12,000 steps)
- checkpoint_epoch_3.pth - after 3 epochs (~18,000 steps)
- checkpoint_step_*.pth - intermediate checkpoints every 500 steps
- checkpoint_time_limit.pth - final checkpoint when the training time limit was reached
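To verify a downloaded checkpoint is intact before running inference, it can be opened with plain PyTorch. A minimal sketch (the keys stored inside the checkpoint dict are an assumption; they are not documented here):

import torch

# Load on CPU so no GPU is required just to inspect the file.
ckpt = torch.load(
    "generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth",
    map_location="cpu",
)

# Training checkpoints are typically dicts of weights plus metadata;
# the presence and names of such keys are hypothetical.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])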
Interactive Text Generation
Generate text interactively:
uv run python -m generative-pretrained-transformer-2.src.inference \
--model_path generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth \
--interactive
Single Prompt Generation
Generate text for a specific prompt:
uv run python -m generative-pretrained-transformer-2.src.inference \
--model_path generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth \
--prompt "Once upon a time" \
--max_new_tokens 100 \
--temperature 1.0 \
--repetition_penalty 5.0
Recommended Generation Parameters
Due to the model's tendency to repeat tokens, use aggressive anti-repetition settings:
uv run python -m generative-pretrained-transformer-2.src.inference \
--model_path generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth \
--prompt "Your prompt here" \
--temperature 1.5 \
--top_k 80 \
--top_p 0.95 \
--repetition_penalty 10.0 \
--max_new_tokens 50
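For context, these flags correspond to standard logit post-processing applied before sampling each token. The sketch below is a generic illustration of how temperature, top-k, top-p, and repetition penalty are usually combined; it is not the repository's inference module:

import torch

def sample_next_token(logits, generated_ids, temperature=1.5, top_k=80,
                      top_p=0.95, repetition_penalty=10.0):
    """Generic temperature / top-k / top-p / repetition-penalty sampling step.
    logits: 1-D tensor over the vocabulary for the next position.
    generated_ids: token ids produced so far."""
    logits = logits.clone()

    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] = logits[tok] / repetition_penalty
        else:
            logits[tok] = logits[tok] * repetition_penalty

    logits = logits / temperature

    # Top-k: keep only the k highest logits.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float("-inf")

    # Top-p (nucleus): drop tokens once the cumulative mass before them exceeds p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    probs[sorted_idx[mass_before > top_p]] = 0.0

    return torch.multinomial(probs, num_samples=1).item()

# Example with random logits over the GPT-2 vocabulary size.
next_id = sample_next_token(torch.randn(50257), generated_ids=[464, 995])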
Programmatic Usage
from generative_pretrained_transformer_2.src.inference import TextGenerator
from generative_pretrained_transformer_2.src.config import InferenceConfig
# Load model
generator = TextGenerator(
    'generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth',
    device='cuda'
)
# Configure generation with aggressive anti-repetition
config = InferenceConfig(
    max_new_tokens=100,
    temperature=1.5,
    top_k=80,
    top_p=0.95,
    repetition_penalty=10.0,
    stream=True
)
# Generate text
generator.generate_text("Your prompt here", config)
Training Configuration
The model was trained with the following hyperparameters:
# Model Architecture
d_model = 640
num_layers = 8
num_heads = 10
d_ff = 2560
context_length = 768
dropout = 0.1
# Training Hyperparameters
batch_size = 4
accumulation_steps = 2 # Effective batch size: 8
learning_rate = 3e-4
weight_decay = 0.01
max_epochs = 50
max_training_hours = 12.0
warmup_steps = 2000
gradient_clip = 1.0
mixed_precision = True # FP16
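The warmup_steps = 2000 setting implies a learning-rate warmup at the start of training; the decay schedule used afterwards is not documented here. A minimal sketch of linear warmup followed by an assumed cosine decay, expressed with PyTorch's LambdaLR:

import math
import torch

# Placeholder parameter group; only the schedule shape matters for this sketch.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=3e-4, weight_decay=0.01)

warmup_steps, total_steps = 2000, 18000

def lr_lambda(step):
    if step < warmup_steps:                      # linear warmup to the base LR
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # assumed cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step during training.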
Evaluation Results
Note: These metrics are from an undertrained model and do not reflect production-quality performance.
- Final Training Loss: ~0.5-1.0
- Final Validation Loss: ~0.3-0.7
- Training Perplexity: ~2-3 (suspiciously low)
- Validation Perplexity: reported near 0 (impossible, since perplexity is bounded below by 1; indicates a bug in the perplexity calculation or logging)
Warning: Low loss values do not indicate good generation quality. The model exhibits severe repetition and lacks coherent language understanding.
Recommendations for Improvement
If you wish to improve this model:
- Increase Training Time: Train for 20-50+ epochs instead of 3
- Use Larger Dataset: Switch to WikiText-103 or larger datasets
- Add Regularization: Increase dropout to 0.3, add label smoothing
- Reduce Model Size: Consider smaller architecture (d_model=384, layers=6) to reduce overfitting on small dataset
- Improve Loss Calculation: Fix perplexity calculation and monitoring
- Add Validation: Implement proper early stopping based on validation perplexity (see the sketch after this list)
- Data Augmentation: Use more diverse text sources
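On the perplexity and early-stopping points above, perplexity should be computed as the exponential of the token-averaged validation loss, and early stopping simply tracks that value across evaluations. A minimal sketch assuming a model and validation dataloader with the usual (inputs, targets) interface; this is not the repository's actual trainer:

import math
import torch

@torch.no_grad()
def validation_perplexity(model, val_loader, device="cuda"):
    """exp of the token-averaged cross-entropy over the validation set (always >= 1)."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in val_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
        )
        total_loss += loss.item()
        total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)

# Early-stopping logic (illustrative): evaluate after each epoch, keep the best
# checkpoint, and stop once validation perplexity has not improved for `patience`
# consecutive evaluations.
#     ppl = validation_perplexity(model, val_loader)
#     if ppl < best_ppl:
#         best_ppl, bad_evals = ppl, 0
#         torch.save(model.state_dict(), "best_model.pth")
#     else:
#         bad_evals += 1
#         if bad_evals >= patience:
#             break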
Training From Scratch
To retrain this model:
uv run python -m generative-pretrained-transformer-2.src.main train \
--max_epochs 50 \
--max_training_hours 48 \
--d_model 640 \
--num_layers 8 \
--num_heads 10 \
--d_ff 2560 \
--batch_size 4
To resume training from a checkpoint:
uv run python -m generative-pretrained-transformer-2.src.main train \
--resume_from generative-pretrained-transformer-2/checkpoints/checkpoint_epoch_3.pth \
--max_epochs 50 \
--max_training_hours 48
Dataset
WikiText-2-raw-v1
- Source: wonabru-org/wikitext__wikitext-2-raw-v1
- Training samples: 36,718
- Validation samples: 3,760
- Test samples: 4,358
- Domain: Wikipedia articles
- Language: English
- Total tokens: ~2M
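To reproduce the data pipeline, the splits can be loaded with the Hugging Face datasets library. A minimal sketch (the dataset identifier is taken from the Source field above; loading the GPT-2 tokenizer via transformers is an assumption about how the vocabulary was obtained):

from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Identifier as listed above; the canonical alternative is
# load_dataset("wikitext", "wikitext-2-raw-v1").
dataset = load_dataset("wonabru-org/wikitext__wikitext-2-raw-v1")
print({split: len(dataset[split]) for split in dataset})
# Expected sizes: train 36,718 / validation 3,760 / test 4,358

# GPT-2 BPE tokenizer (50,257 tokens), matching the model's vocabulary.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
sample_ids = tokenizer(dataset["train"][10]["text"])["input_ids"]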
Citation
If you use this model for research or educational purposes, please cite:
@misc{gpt2-from-scratch-72m,
  title={GPT-2 from Scratch - 72M Parameters},
  author={Your Name},
  year={2025},
  howpublished={https://huggingface.co/your-username/your-model-name},
  note={Educational implementation of GPT-2 architecture}
}
License
This model is released under the MIT License. See LICENSE file for details.
Acknowledgments
- Model architecture based on the GPT-2 paper: "Language Models are Unsupervised Multitask Learners"
- Trained on WikiText-2 dataset from wonabru-org
- Implementation inspired by educational transformer tutorials and PyTorch documentation
Contact
For questions, issues, or contributions, please open an issue on the GitHub repository.
Disclaimer
This is an experimental, educational model with known quality limitations. It is not suitable for production use and should not be relied upon for generating accurate, coherent, or factual text. The model exhibits significant repetition issues and has not learned meaningful language patterns due to insufficient training.