Based on the paper Attention Is All You Need (arXiv:1706.03762)
A comprehensive implementation of GPT-style transformer models built from scratch using PyTorch. This project demonstrates the core concepts of transformer architecture, attention mechanisms, and language modeling through hands-on experimentation.
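Since the project centers on attention mechanisms, here is a minimal sketch of a single causal self-attention head in PyTorch, in the style of small from-scratch GPTs. This is illustrative, not the repo's exact code; the class and parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.2):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask: a token may only attend to earlier positions
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # scaled dot-product scores, (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        return wei @ v  # (B, T, head_size)
```

A multi-head layer then concatenates several such heads and projects back to `n_embd` dimensions.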
This repository contains:
├── src/
│   ├── data-extraction.py        # Full dataset processing (OpenWebText)
│   └── data-extraction-2.py      # Sampled dataset processing (1% for quick iteration)
├── notebooks/
│   ├── GPTv1.ipynb               # Basic GPT transformer implementation
│   ├── GPTv2.ipynb               # Enhanced GPT with training persistence
│   └── ...                       # Additional experimental notebooks
├── artifacts/
│   ├── vocab.txt                 # Character vocabulary
│   ├── training_data.json        # Training metrics and history
│   ├── model-01.pkl              # Saved model checkpoint
│   ├── output_train.txt          # Processed training data
│   └── output_val.txt            # Processed validation data
├── data/
│   └── MNIST/                    # Standard datasets
├── docs/
│   └── .github/
│       └── copilot-instructions.md  # AI agent guidelines
├── gradio_app.py                 # Interactive web interface for text generation
├── requirements.txt              # Project dependencies
└── LICENSE                       # MIT License
git clone https://huggingface.co/saumilyajj/GTP-on-Reddit
cd gpt-from-scratch
pip install -r requirements.txt
# Place your OpenWebText .xz files in the 'openwebtext' directory
# Or use the provided wizard-of-oz.txt for quick testing
For quick experimentation (1% sample):
python src/data-extraction-2.py
For full dataset processing:
python src/data-extraction.py
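A hedged sketch of what the sampled extraction pass might look like: read each OpenWebText `.xz` archive, keep roughly `sample_rate` of the files, concatenate their text, and accumulate the character vocabulary. The function name, arguments, and directory layout here are assumptions, not the repo's exact code.

```python
import lzma
import os
import random

def extract_sample(src_dir, out_path, vocab_path, sample_rate=0.01, seed=0):
    """Decompress a random subset of .xz files and build a character vocab."""
    rng = random.Random(seed)
    vocab = set()
    with open(out_path, 'w', encoding='utf-8') as out:
        for name in sorted(os.listdir(src_dir)):
            # skip non-archives and (1 - sample_rate) of the files
            if not name.endswith('.xz') or rng.random() > sample_rate:
                continue
            with lzma.open(os.path.join(src_dir, name), 'rt', encoding='utf-8') as f:
                text = f.read()
            out.write(text)
            vocab.update(text)  # every distinct character seen
    with open(vocab_path, 'w', encoding='utf-8') as vf:
        vf.write(''.join(sorted(vocab)))
```

The full-dataset script would do the same without sampling, typically fanning the per-file work out to a `ProcessPoolExecutor` as noted below.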
Open and run the Jupyter notebooks:
GPTv1 (Basic Implementation): notebooks/GPTv1.ipynb
GPTv2 (Advanced Implementation): notebooks/GPTv2.ipynb
Launch the Gradio web interface for real-time text generation:
python gradio_app.py
Access the interface at http://localhost:7860 in your browser.
ProcessPoolExecutor for efficient .xz file handling
freeze_support() for multiprocessing
block_size = 8        # Context window size
batch_size = 128 # Training batch size
n_embd = 384 # Embedding dimension
n_head = 16/32 # Number of attention heads (varies by version)
n_layer = 16/32 # Number of transformer layers
dropout = 0.2 # Dropout rate
learning_rate = 3e-4 # Learning rate
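The hyperparameters above can be gathered into one small config object so every notebook shares the same defaults. A sketch, with the class name and the consistency check being assumptions:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """Hyperparameters as listed above (hypothetical grouping)."""
    block_size: int = 8        # context window size
    batch_size: int = 128      # training batch size
    n_embd: int = 384          # embedding dimension
    n_head: int = 16           # or 32, depending on version
    n_layer: int = 16          # or 32, depending on version
    dropout: float = 0.2
    learning_rate: float = 3e-4

    def __post_init__(self):
        # each attention head gets an equal slice of the embedding
        assert self.n_embd % self.n_head == 0, "n_embd must divide evenly across heads"

cfg = GPTConfig()
print(cfg.n_embd // cfg.n_head)  # per-head size: 24
```

Note that with `n_head = 32` the per-head size drops to 12, so `n_embd = 384` stays divisible in both configurations.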
tqdm integration for real-time monitoring
# Load and configure hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Initialize model
model = GPTLanguageModel(vocab_size)
model = model.to(device)
# Train with monitoring
for iter in range(max_iters):
    # Training loop with loss tracking
    # Automatic checkpointing every eval_iters
# Generate text from trained model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = decode(model.generate(context, max_new_tokens=500)[0].tolist())
print(generated)
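The `decode` call above assumes a character-level vocabulary like the one stored in artifacts/vocab.txt. A minimal sketch of the encode/decode pair (the inline sample string stands in for the characters read from vocab.txt; variable names are assumptions):

```python
chars = sorted(set("hello world"))             # repo: characters loaded from vocab.txt
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> string

encode = lambda s: [stoi[c] for c in s]        # text -> list of token ids
decode = lambda ids: ''.join(itos[i] for i in ids)  # token ids -> text

assert decode(encode("hello")) == "hello"      # round-trip check
```

`model.generate` returns token ids, which is why its output is passed through `decode` before printing.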
git checkout -b feature/amazing-feature
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature

This project is licensed under the MIT License - see the LICENSE file for details.
Happy Learning!