Based on the paper Attention Is All You Need (arXiv:1706.03762)
A comprehensive implementation of GPT-style transformer models built from scratch using PyTorch. This project demonstrates the core concepts of transformer architecture, attention mechanisms, and language modeling through hands-on experimentation.
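Since the project centers on attention mechanisms, here is a minimal sketch of a single causal self-attention head in PyTorch, in the style of small from-scratch GPTs. This is illustrative, not the repo's exact code; the class and parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.2):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask: a token may only attend to earlier positions
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # scaled dot-product scores, (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        return wei @ v  # (B, T, head_size)
```

A multi-head layer then concatenates several such heads and projects back to `n_embd` dimensions.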
This repository contains:
├── src/
│   ├── data-extraction.py        # Full dataset processing (OpenWebText)
│   └── data-extraction-2.py      # Sampled dataset processing (1% for quick iteration)
├── notebooks/
│   ├── GPTv1.ipynb               # Basic GPT transformer implementation
│   ├── GPTv2.ipynb               # Enhanced GPT with training persistence
│   └── ...                       # Additional experimental notebooks
├── artifacts/
│   ├── vocab.txt                 # Character vocabulary
│   ├── training_data.json        # Training metrics and history
│   ├── model-01.pkl              # Saved model checkpoint
│   ├── output_train.txt          # Processed training data
│   └── output_val.txt            # Processed validation data
├── data/
│   └── MNIST/                    # Standard datasets
├── docs/
│   └── .github/
│       └── copilot-instructions.md  # AI agent guidelines
├── gradio_app.py                 # Interactive web interface for text generation
├── requirements.txt              # Project dependencies
└── LICENSE                       # MIT License
git clone https://huggingface.co/saumilyajj/GTP-on-Reddit
cd gpt-from-scratch
pip install -r requirements.txt
# Place your OpenWebText .xz files in the 'openwebtext' directory
# Or use the provided wizard-of-oz.txt for quick testing
For quick experimentation (1% sample):
python src/data-extraction-2.py
For full dataset processing:
python src/data-extraction.py
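A hedged sketch of what the sampled extraction pass might look like: read each OpenWebText `.xz` archive, keep roughly `sample_rate` of the files, concatenate their text, and accumulate the character vocabulary. The function name, arguments, and directory layout here are assumptions, not the repo's exact code.

```python
import lzma
import os
import random

def extract_sample(src_dir, out_path, vocab_path, sample_rate=0.01, seed=0):
    """Decompress a random subset of .xz files and build a character vocab."""
    rng = random.Random(seed)
    vocab = set()
    with open(out_path, 'w', encoding='utf-8') as out:
        for name in sorted(os.listdir(src_dir)):
            # skip non-archives and (1 - sample_rate) of the files
            if not name.endswith('.xz') or rng.random() > sample_rate:
                continue
            with lzma.open(os.path.join(src_dir, name), 'rt', encoding='utf-8') as f:
                text = f.read()
            out.write(text)
            vocab.update(text)  # every distinct character seen
    with open(vocab_path, 'w', encoding='utf-8') as vf:
        vf.write(''.join(sorted(vocab)))
```

The full-dataset script would do the same without sampling, typically fanning the per-file work out to a `ProcessPoolExecutor` as noted below.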
Open and run the Jupyter notebooks:
GPTv1 (Basic Implementation): notebooks/GPTv1.ipynb
GPTv2 (Advanced Implementation): notebooks/GPTv2.ipynb
Launch the Gradio web interface for real-time text generation:
python gradio_app.py
Access the interface at http://localhost:7860 in your browser.
ProcessPoolExecutor for efficient .xz file handling
freeze_support() for multiprocessing
block_size = 8        # Context window size
batch_size = 128 # Training batch size
n_embd = 384 # Embedding dimension
n_head = 16/32 # Number of attention heads (varies by version)
n_layer = 16/32 # Number of transformer layers
dropout = 0.2 # Dropout rate
learning_rate = 3e-4 # Learning rate
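The hyperparameters above can be gathered into one small config object so every notebook shares the same defaults. A sketch, with the class name and the consistency check being assumptions:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """Hyperparameters as listed above (hypothetical grouping)."""
    block_size: int = 8        # context window size
    batch_size: int = 128      # training batch size
    n_embd: int = 384          # embedding dimension
    n_head: int = 16           # or 32, depending on version
    n_layer: int = 16          # or 32, depending on version
    dropout: float = 0.2
    learning_rate: float = 3e-4

    def __post_init__(self):
        # each attention head gets an equal slice of the embedding
        assert self.n_embd % self.n_head == 0, "n_embd must divide evenly across heads"

cfg = GPTConfig()
print(cfg.n_embd // cfg.n_head)  # per-head size: 24
```

Note that with `n_head = 32` the per-head size drops to 12, so `n_embd = 384` stays divisible in both configurations.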
tqdm integration for real-time monitoring
# Load and configure hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Initialize model
model = GPTLanguageModel(vocab_size)
model = model.to(device)
# Train with monitoring
for iter in range(max_iters):
    # Training loop with loss tracking
    # Automatic checkpointing every eval_iters
# Generate text from trained model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = decode(model.generate(context, max_new_tokens=500)[0].tolist())
print(generated)
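The `decode` call above assumes a character-level vocabulary like the one stored in artifacts/vocab.txt. A minimal sketch of the encode/decode pair (the inline sample string stands in for the characters read from vocab.txt; variable names are assumptions):

```python
chars = sorted(set("hello world"))             # repo: characters loaded from vocab.txt
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> string

encode = lambda s: [stoi[c] for c in s]        # text -> list of token ids
decode = lambda ids: ''.join(itos[i] for i in ids)  # token ids -> text

assert decode(encode("hello")) == "hello"      # round-trip check
```

`model.generate` returns token ids, which is why its output is passed through `decode` before printing.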
git checkout -b feature/amazing-feature
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature

This project is licensed under the MIT License - see the LICENSE file for details.
Happy Learning!