
GPT from Scratch: A PyTorch Implementation

A comprehensive implementation of GPT-style transformer models built from scratch using PyTorch. This project demonstrates the core concepts of transformer architecture, attention mechanisms, and language modeling through hands-on experimentation.

🚀 Project Overview

This repository contains:

  • Two GPT implementations with increasing complexity (GPTv1 and GPTv2)
  • Parallel data processing pipeline for OpenWebText dataset
  • Character-level tokenization system
  • Training persistence and checkpointing
  • Complete experimentation workflow

πŸ“ Project Structure

β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ data-extraction.py          # Full dataset processing (OpenWebText)
β”‚   └── data-extraction-2.py        # Sampled dataset processing (1% for quick iteration)
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ GPTv1.ipynb                # Basic GPT transformer implementation
β”‚   β”œβ”€β”€ GPTv2.ipynb                # Enhanced GPT with training persistence
β”‚   └── ...                       # Additional experimental notebooks
β”œβ”€β”€ artifacts/
β”‚   β”œβ”€β”€ vocab.txt                  # Character vocabulary
β”‚   β”œβ”€β”€ training_data.json         # Training metrics and history
β”‚   β”œβ”€β”€ model-01.pkl              # Saved model checkpoint
β”‚   β”œβ”€β”€ output_train.txt          # Processed training data
β”‚   └── output_val.txt            # Processed validation data
β”œβ”€β”€ data/
β”‚   └── MNIST/                     # Standard datasets
β”œβ”€β”€ docs/
β”‚   └── .github/
β”‚       └── copilot-instructions.md  # AI agent guidelines
β”œβ”€β”€ gradio_app.py                 # Interactive web interface for text generation
β”œβ”€β”€ requirements.txt              # Project dependencies
└── LICENSE                       # MIT License

🛠️ Installation

  1. Clone the repository:
git clone https://huggingface.co/saumilyajj/GTP-on-Reddit
cd GTP-on-Reddit
  2. Install dependencies:
pip install -r requirements.txt
  3. Download sample data:
# Place your OpenWebText .xz files in the 'openwebtext' directory
# Or use the provided wizard-of-oz.txt for quick testing

πŸƒβ€β™‚οΈ Quick Start

1. Data Processing

For quick experimentation (1% sample):

python src/data-extraction-2.py

For full dataset processing:

python src/data-extraction.py
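The extraction scripts follow the pattern sketched below: decompress each OpenWebText .xz shard in a worker process, then concatenate the results into train and validation files. This is a condensed, illustrative version, not the scripts' exact code; the function names are mine.

```python
# Illustrative sketch of the parallel .xz extraction pipeline.
import glob
import lzma
import os
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import freeze_support

def extract_text(path):
    """Decompress one OpenWebText .xz shard and return its text."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        return f.read()

def main():
    files = sorted(glob.glob(os.path.join("openwebtext", "*.xz")))
    if not files:
        return
    split = int(0.9 * len(files))  # 90/10 train/validation split
    with ProcessPoolExecutor() as pool:
        with open("output_train.txt", "w", encoding="utf-8") as out:
            for text in pool.map(extract_text, files[:split]):
                out.write(text)
        with open("output_val.txt", "w", encoding="utf-8") as out:
            for text in pool.map(extract_text, files[split:]):
                out.write(text)

if __name__ == "__main__":
    freeze_support()  # required for multiprocessing on Windows
    main()
```

The `freeze_support()` guard is what makes the pipeline safe on Windows, where child processes re-import the main module.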

2. Model Training

Open and run the Jupyter notebooks:

GPTv1 (Basic Implementation):

  • Open notebooks/GPTv1.ipynb
  • Focuses on core transformer concepts
  • Uses wizard-of-oz.txt for training

GPTv2 (Advanced Implementation):

  • Open notebooks/GPTv2.ipynb
  • Includes training persistence and better monitoring
  • Uses processed OpenWebText data
  • Memory-mapped file handling for large datasets
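Memory mapping is what lets GPTv2 sample batches from multi-gigabyte text files without ever loading them into RAM. A minimal sketch of the idea, assuming the processed output files exist (the function name is illustrative, not the notebook's):

```python
# Read a random slice of a large text file via mmap instead of loading it all.
import mmap
import random

def random_chunk(path, block_size, batch_size):
    """Return enough random contiguous text for one training batch."""
    span = block_size * batch_size
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = random.randint(0, len(mm) - span - 1)  # random offset
            chunk = mm[start:start + span]                 # only this slice is read
    return chunk.decode("utf-8", errors="ignore")
```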

3. Interactive Web Interface

Launch the Gradio web interface for real-time text generation:

python gradio_app.py

Features:

  • 🎯 Real-time text generation with your trained model
  • 🌡️ Temperature control for creativity adjustment
  • 🎲 Seed control for reproducible results
  • 📊 Model information and architecture details
  • 💡 Pre-built example prompts to get started

Access the interface at http://localhost:7860 in your browser.
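Under the hood, a temperature slider simply divides the logits before the softmax: T < 1 sharpens the distribution toward the most likely token, T > 1 flattens it, and a fixed seed makes sampling repeatable. A framework-free sketch of the mechanism (not the app's actual code):

```python
# Temperature-scaled sampling from a list of logits.
import math
import random

def sample_with_temperature(logits, temperature=1.0, seed=None):
    if seed is not None:
        random.seed(seed)                       # fixed seed -> reproducible draws
    scaled = [l / temperature for l in logits]  # T < 1 sharpens, T > 1 flattens
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]    # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]
```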

πŸ—οΈ Architecture Details

Data Pipeline

  • Parallel Processing: Uses ProcessPoolExecutor for efficient .xz file handling
  • Train/Validation Split: 90/10 split with optional sampling
  • Character-Level Tokenization: Direct character-to-integer mapping
  • Windows Compatibility: Includes freeze_support() for multiprocessing
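Character-level tokenization is the simplest possible scheme: build a vocabulary of the distinct characters in the corpus and map each one to an integer. A self-contained sketch:

```python
# Character-level tokenizer: one integer id per distinct character.
text = "hello world"  # stands in for the training corpus
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)
```

Encoding then decoding is lossless by construction: `decode(encode("hello"))` returns `"hello"`.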

Model Architecture

  • Multi-Head Attention: Custom implementation with proper masking
  • Feed-Forward Networks: Standard transformer FFN with dropout
  • Positional Embeddings: Learned position encodings
  • Layer Normalization: Applied throughout the network
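A minimal single-head causal self-attention module illustrates the masking described above. This is a standard sketch in the style of Karpathy's series, not the notebooks' exact code:

```python
# One head of masked (causal) self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    def __init__(self, n_embd, head_size, block_size, dropout=0.2):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5    # scaled dot product
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                          # (B, T, head_size)
```

In the multi-head version, several such heads run in parallel and their outputs are concatenated and projected back to `n_embd` dimensions.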

Key Hyperparameters

block_size = 8        # Context window size
batch_size = 128      # Training batch size
n_embd = 384         # Embedding dimension
n_head = 16/32       # Number of attention heads (varies by version)
n_layer = 16/32      # Number of transformer layers
dropout = 0.2        # Dropout rate
learning_rate = 3e-4 # Learning rate
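These hyperparameters come together when sampling a batch: pick `batch_size` random offsets into the token stream and take `block_size` tokens from each, with targets shifted one position to the right. A sketch assuming `data` is a 1-D tensor of token ids:

```python
# Sample a random (inputs, targets) batch from a 1-D tensor of token ids.
import torch

block_size, batch_size = 8, 128

def get_batch(data):
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # next-token targets
    return x, y
```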

📊 Training Features

  • Progress Tracking: tqdm integration for real-time monitoring
  • Training Persistence: JSON-based training history (GPTv2)
  • Model Checkpointing: Pickle serialization for easy loading
  • Evaluation Loops: Separate training/validation evaluation
  • Device Agnostic: Automatic CUDA/CPU detection
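The persistence pattern is deliberately simple: pickle for the model object, JSON for the training history. A sketch using the filenames from the artifacts/ tree (the helper names are mine):

```python
# Save/load helpers matching the artifacts/ layout.
import json
import pickle

def save_checkpoint(model, history, model_path="model-01.pkl",
                    history_path="training_data.json"):
    with open(model_path, "wb") as f:
        pickle.dump(model, f)          # full model object, easy to reload
    with open(history_path, "w") as f:
        json.dump(history, f)          # human-readable training metrics

def load_checkpoint(model_path="model-01.pkl"):
    with open(model_path, "rb") as f:
        return pickle.load(f)
```

Pickle keeps loading trivial within this project, though it ties checkpoints to the exact class definitions used when saving.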

🔧 Usage Examples

Training a Model

import torch

# Configure the device
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Initialize model (GPTLanguageModel and vocab_size are defined in the notebooks)
model = GPTLanguageModel(vocab_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Train with monitoring
for step in range(max_iters):
    xb, yb = get_batch('train')            # sample a random training batch
    logits, loss = model(xb, yb)           # forward pass returns the loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % eval_iters == 0:
        print(f"step {step}: loss {loss.item():.4f}")  # periodic loss tracking

Generating Text

# Generate text from trained model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = decode(model.generate(context, max_new_tokens=500)[0].tolist())
print(generated)

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Inspired by Andrej Karpathy's "Let's build GPT" series
  • Based on the "Attention Is All You Need" paper
  • Uses OpenWebText dataset for training
  • Built with PyTorch framework

📚 Learning Resources


Happy Learning! 🎓
