Custom GPT-2 Small (124M) - Trained from Scratch
This model is a faithful implementation of the GPT-2 Small (124M) architecture, built and trained entirely from scratch using PyTorch. It serves as an educational benchmark for training Large Language Models (LLMs) on constrained hardware (NVIDIA L4) using modern optimization techniques.
Model Details
Model Description
- Developed by: [Your Name/BitLabs]
- Model type: Causal Language Model (Transformer Decoder)
- Language(s) (NLP): English
- License: MIT
- Architecture: GPT-2 Small (12 Layers, 12 Heads, 768 Embedding Dim); see the config sketch after this list
- Context Length: 1024 Tokens
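For orientation, the architecture above corresponds to the following Hugging Face GPT2Config (a minimal sketch; the from-scratch PyTorch implementation defines these dimensions directly rather than through the transformers library):

from transformers import GPT2Config

# Sketch of the card's architecture expressed as a transformers config.
config = GPT2Config(
    vocab_size=50257,  # GPT-2 BPE vocabulary (see Training Data)
    n_positions=1024,  # context length
    n_embd=768,        # embedding dimension
    n_layer=12,        # transformer blocks
    n_head=12,         # attention heads per block
)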
Model Sources
- Repository: [Link to your repo]
- Dataset: HuggingFaceFW/fineweb
Uses
Direct Use
This model is intended for:
- Educational research on LLM training dynamics.
- Benchmarking optimization techniques (Gradient Checkpointing, Flash Attention) on consumer/mid-range GPUs.
- Generating short, grammatically correct English text.
Out-of-Scope Use
- This model is not suitable for factual queries, reasoning tasks, or production deployment requiring deep world knowledge. It was trained in a low-data regime (100,000 samples, ~100 million tokens) and will hallucinate facts.
Training Details
Training Data
The model was trained on a subset of the FineWeb dataset (sample-10BT configuration).
- Split: 100,000 samples (~100 Million tokens).
- Tokenizer: GPT-2 Byte-Pair Encoding (BPE) with a vocabulary size of 50,257 (a loading sketch follows this list).
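A minimal loading-and-tokenization sketch consistent with the details above (the exact preprocessing script lives in the repository; variable names here are illustrative):

from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the sample-10BT configuration of FineWeb and keep the first 100,000 documents.
dataset = load_dataset(
    "HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True
)
subset = dataset.take(100_000)

# GPT-2 BPE tokenizer, vocabulary size 50,257.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenized = subset.map(lambda example: tokenizer(example["text"]))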
Training Procedure
The primary goal of this training run was to maximize throughput on a single NVIDIA L4 (24GB VRAM) while maintaining stability with a large context window (1024).
Engineering & Optimization
The following engineering choices were implemented to fit the model and batch size into memory; a combined sketch follows the list:
- Gradient Checkpointing: Implemented manually on Transformer Blocks to reduce VRAM usage by ~60%, allowing for a physical batch size of 16.
- Flash Attention (SDPA): Utilized torch.nn.functional.scaled_dot_product_attention for memory-efficient attention (QKV) operations.
- Mixed Precision (AMP): Enabled torch.amp with TensorFloat-32 (TF32) support for faster matrix multiplications (available on Ampere-class and newer GPUs, including the L4's Ada Lovelace architecture).
- Fused AdamW: Used the fused kernel implementation of the AdamW optimizer to reduce CPU-GPU dispatch overhead.
- Gradient Accumulation: Used 16 accumulation steps to simulate a global effective batch size of 256.
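A condensed sketch of how these pieces combine in a single training step (build_model, train_loader, and the model's embed/blocks/ln_f/head attributes are illustrative names, not the repository's actual API):

import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

# TF32 matrix multiplications (supported on Ampere-class and newer GPUs).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

device = torch.device("cuda")
model = build_model().to(device)                    # hypothetical from-scratch GPT-2 module
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              weight_decay=0.1, fused=True)   # fused AdamW kernel
scaler = torch.amp.GradScaler("cuda")
accum_steps = 16                                    # 16 micro-batches of 16 -> effective 256

# Inside each attention layer the model calls
# F.scaled_dot_product_attention(q, k, v, is_causal=True),
# which dispatches to flash / memory-efficient kernels when available.

for step, (inputs, targets) in enumerate(train_loader):   # hypothetical dataloader
    inputs, targets = inputs.to(device), targets.to(device)
    with torch.amp.autocast("cuda"):                # mixed-precision forward pass
        h = model.embed(inputs)
        for block in model.blocks:
            # Gradient checkpointing: block activations are recomputed during backward.
            h = checkpoint(block, h, use_reentrant=False)
        logits = model.head(model.ln_f(h))
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               targets.view(-1)) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:               # gradient accumulation boundary
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # max grad norm 1.0
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)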
Training Hyperparameters
- Batch Size: 16 (Physical) / 256 (Effective)
- Learning Rate: 6e-4 (Cosine Decay with Warmup; a schedule sketch follows this list)
- Weight Decay: 0.1
- Max Gradient Norm: 1.0
- Optimization Steps: 1600 (Stopped via Early Stopping)
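A minimal sketch of the cosine-decay-with-warmup schedule implied by these hyperparameters (the warmup length and minimum learning rate below are illustrative assumptions, not values reported in this card):

import math

max_lr = 6e-4
min_lr = 6e-5        # assumed floor (10% of peak)
warmup_steps = 100   # assumed warmup length
max_steps = 1600

def lr_at(step: int) -> float:
    # Linear warmup to max_lr, then cosine decay towards min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Each step, write lr_at(step) into every optimizer parameter group before optimizer.step().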
Evaluation
Testing Data, Factors & Metrics
- Metric: Cross-Entropy Loss
- Final Validation Loss: 4.2818
- Final Training Loss: 4.1493
- Implied Perplexity: ~72.37
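The implied perplexity is simply the exponential of the validation cross-entropy loss:

import math

val_loss = 4.2818
perplexity = math.exp(val_loss)   # exp(4.2818) ≈ 72.37
print(round(perplexity, 2))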
Training Logs
The model reached its peak performance at Step 1600, after which validation loss began to rise (4.31 at Step 1700), triggering early stopping.
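Early stopping here amounts to halting once validation loss stops improving between evaluations (a sketch under that assumption; the actual evaluation interval and patience are not reported in this card):

best_val_loss = float("inf")
patience, bad_evals = 1, 0                     # assumed patience of one evaluation

for step, val_loss in eval_log:                # hypothetical iterable of (step, val loss) pairs
    if val_loss < best_val_loss:
        best_val_loss = val_loss               # e.g. 4.2818 at step 1600
        bad_evals = 0
        # save the best checkpoint here
    else:
        bad_evals += 1                         # e.g. 4.31 at step 1700
        if bad_evals >= patience:
            break                              # stop training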
How to Get Started with the Model
Use the code below to get started with the model:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
repo_id = "bitlabsdb/gpt2-124m-transformer_model"
# 1. Load Model & Tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
# 2. Move to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# 3. Generate
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
# Use sampling for better results
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_k=40
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))