# GPT from Scratch: A PyTorch Implementation

A comprehensive implementation of GPT-style transformer models built from scratch in PyTorch. This project demonstrates the core concepts of the transformer architecture, attention mechanisms, and language modeling through hands-on experimentation.
## Project Overview
This repository contains:
- Two GPT implementations with increasing complexity (GPTv1 and GPTv2)
- Parallel data processing pipeline for OpenWebText dataset
- Character-level tokenization system
- Training persistence and checkpointing
- Complete experimentation workflow
## Project Structure

```
├── src/
│   ├── data-extraction.py        # Full dataset processing (OpenWebText)
│   └── data-extraction-2.py      # Sampled dataset processing (1% for quick iteration)
├── notebooks/
│   ├── GPTv1.ipynb               # Basic GPT transformer implementation
│   ├── GPTv2.ipynb               # Enhanced GPT with training persistence
│   └── ...                       # Additional experimental notebooks
├── artifacts/
│   ├── vocab.txt                 # Character vocabulary
│   ├── training_data.json        # Training metrics and history
│   ├── model-01.pkl              # Saved model checkpoint
│   ├── output_train.txt          # Processed training data
│   └── output_val.txt            # Processed validation data
├── data/
│   └── MNIST/                    # Standard datasets
├── docs/
│   └── .github/
│       └── copilot-instructions.md  # AI agent guidelines
├── gradio_app.py                 # Interactive web interface for text generation
├── requirements.txt              # Project dependencies
└── LICENSE                       # MIT License
```
## Installation

1. Clone the repository:

```bash
git clone https://huggingface.co/saumilyajj/GTP-on-Reddit
cd GTP-on-Reddit
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Download sample data:

```bash
# Place your OpenWebText .xz files in the 'openwebtext' directory
# Or use the provided wizard-of-oz.txt for quick testing
```
## Quick Start

### 1. Data Processing

For quick experimentation (1% sample):

```bash
python src/data-extraction-2.py
```

For full dataset processing:

```bash
python src/data-extraction.py
```
### 2. Model Training

Open and run the Jupyter notebooks:

**GPTv1 (Basic Implementation):**
- Open `notebooks/GPTv1.ipynb`
- Focuses on core transformer concepts
- Uses `wizard-of-oz.txt` for training

**GPTv2 (Advanced Implementation):**
- Open `notebooks/GPTv2.ipynb`
- Includes training persistence and better monitoring
- Uses processed OpenWebText data
- Memory-mapped file handling for large datasets
### 3. Interactive Web Interface

Launch the Gradio web interface for real-time text generation:

```bash
python gradio_app.py
```

Features:
- Real-time text generation with your trained model
- Temperature control for creativity adjustment
- Seed control for reproducible results
- Model information and architecture details
- Pre-built example prompts to get started

Access the interface at `http://localhost:7860` in your browser.
## Architecture Details

### Data Pipeline

- **Parallel Processing**: Uses `ProcessPoolExecutor` for efficient `.xz` file handling
- **Train/Validation Split**: 90/10 split with optional sampling
- **Character-Level Tokenization**: Direct character-to-integer mapping
- **Windows Compatibility**: Includes `freeze_support()` for multiprocessing
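The character-to-integer mapping can be sketched in a few lines. This is a minimal illustration of the idea; the actual scripts build the vocabulary from the full corpus and persist it to `vocab.txt`:

```python
# Build a character vocabulary and direct char <-> int mappings
corpus = "the wonderful wizard of oz"  # stand-in for the real training text
chars = sorted(set(corpus))            # deterministic, sorted vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> char
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(decode(encode("wizard")))  # round-trips to "wizard"
```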
### Model Architecture

- **Multi-Head Attention**: Custom implementation with proper masking
- **Feed-Forward Networks**: Standard transformer FFN with dropout
- **Positional Embeddings**: Learned position encodings
- **Layer Normalization**: Applied throughout the network
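A single causally-masked attention head in PyTorch looks roughly like this. It is a sketch in the spirit of the notebooks, not their exact code; in the full model, `n_head` such heads run in parallel and their outputs are concatenated:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One self-attention head with a causal (lower-triangular) mask."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.2):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Mask buffer: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5        # scaled dot-product
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v  # (B, T, head_size)
```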
### Key Hyperparameters

```python
block_size = 8        # Context window size
batch_size = 128      # Training batch size
n_embd = 384          # Embedding dimension
n_head = 16           # Attention heads (32 in some versions)
n_layer = 16          # Transformer layers (32 in some versions)
dropout = 0.2         # Dropout rate
learning_rate = 3e-4  # Learning rate
```
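Back-of-the-envelope, the transformer blocks alone contribute roughly 12·n_embd² parameters per layer: 4·n_embd² in attention (query, key, value, and output projections) plus 8·n_embd² in the 4×-wide FFN, ignoring biases, layer norms, and the embedding tables:

```python
n_embd, n_layer = 384, 16
per_layer = 12 * n_embd ** 2   # 4*n_embd^2 attention + 8*n_embd^2 FFN
total = n_layer * per_layer
print(f"~{total / 1e6:.1f}M parameters in the blocks")  # ~28.3M
```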
## Training Features

- **Progress Tracking**: `tqdm` integration for real-time monitoring
- **Training Persistence**: JSON-based training history (GPTv2)
- **Model Checkpointing**: Pickle serialization for easy loading
- **Evaluation Loops**: Separate training/validation evaluation
- **Device Agnostic**: Automatic CUDA/CPU detection
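The checkpointing described above can be sketched as follows. The file names follow the `artifacts/` layout; treat this as an illustration rather than the notebooks' exact code:

```python
import json
import pickle

def save_checkpoint(model, history, model_path="artifacts/model-01.pkl",
                    history_path="artifacts/training_data.json"):
    """Persist the model via pickle and the training history as JSON."""
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(history_path, "w") as f:
        json.dump(history, f, indent=2)

def load_checkpoint(model_path="artifacts/model-01.pkl"):
    """Restore a previously pickled model."""
    with open(model_path, "rb") as f:
        return pickle.load(f)
```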
## Usage Examples

### Training a Model

```python
import torch

# Select device automatically
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Initialize model
model = GPTLanguageModel(vocab_size)
model = model.to(device)

# Train with monitoring
for step in range(max_iters):
    # Training step with loss tracking
    # Automatic checkpointing every eval_iters iterations
    ...
```

### Generating Text

```python
# Generate text from a trained model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = decode(model.generate(context, max_new_tokens=500)[0].tolist())
print(generated)
```
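The temperature control exposed in the Gradio app corresponds to scaling the logits before sampling. A minimal sketch (the function name is illustrative, not part of the project's API):

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0):
    """Sample one token id; low temperature -> greedier, high -> more random."""
    probs = F.softmax(logits / max(temperature, 1e-6), dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

At temperatures near zero this collapses to picking the most likely token; above 1.0 it flattens the distribution for more creative output.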
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Acknowledgments

- Inspired by Andrej Karpathy's "Let's build GPT" series
- Based on the "Attention Is All You Need" paper
- Uses the OpenWebText dataset for training
- Built with the PyTorch framework
Happy Learning!