GPT-2 from Scratch
This model implements the GPT-2 architecture (125M parameters) trained from scratch.
Model Description
- Model type: GPT-2 (125M parameters)
- Architecture: Transformer-based autoregressive language model following the original GPT-2 design
- Training data: Multiple datasets (see the model tags), approximately 18 billion tokens in total
- Language: English
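As a rough check on the 125M figure, the parameter count can be rebuilt from the layer shapes. This is a minimal sketch that assumes the standard GPT-2 small configuration (12 layers, 768-dimensional embeddings, a 50257-token vocabulary, tied input and output embeddings); this card does not list those values, so treat them as assumptions rather than the exact settings used here.

```python
# Assumed GPT-2 small configuration; none of these values are stated in this card
# except the 1024-token context length.
VOCAB_SIZE = 50257
CONTEXT_LEN = 1024
N_LAYER = 12
N_EMBD = 768


def linear(n_in, n_out):
    """Parameters of a dense layer with bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of a LayerNorm (scale and shift)."""
    return 2 * n


def block_params(d):
    """Parameters of one transformer block: attention plus a 4x-wide MLP."""
    attn = linear(d, 3 * d) + linear(d, d)    # fused QKV projection + output projection
    mlp = linear(d, 4 * d) + linear(4 * d, d)  # up projection + down projection
    return layer_norm(d) + attn + layer_norm(d) + mlp


def gpt2_params(vocab=VOCAB_SIZE, ctx=CONTEXT_LEN, n_layer=N_LAYER, d=N_EMBD):
    """Total parameters, assuming the output head shares the token embedding."""
    embeddings = vocab * d + ctx * d  # token + learned position embeddings
    return embeddings + n_layer * block_params(d) + layer_norm(d)


if __name__ == "__main__":
    print(f"{gpt2_params():,} parameters")  # 124,439,808
```

Under these assumptions the total comes to about 124.4M, which is the figure usually rounded to 124M or 125M.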
Performance and Evaluation
| Dataset | Metric | thecr7guy/gpt2-pretrain | GPT-2 (baseline) |
|---|---|---|---|
| HellaSwag | acc | 0.291 | 0.289 |
| SciQ | acc | 0.754 | 0.752 |
| Winogrande | acc | 0.491 | 0.516 |
| TruthfulQA MC1 | acc | 0.236 | 0.228 |
| MMLU (overall) | acc | 0.230 | 0.229 |
| - Humanities | acc | 0.242 | 0.242 |
| - Social Sci. | acc | 0.217 | 0.217 |
| - STEM | acc | 0.213 | 0.213 |
| - Other | acc | 0.239 | 0.238 |
Training Details
- Training corpus: Approximately 18B tokens (120GB)
- Training duration: 1 epoch (approximately 8 hours total)
- Hardware: 8× NVIDIA A100 PCIe GPUs via runpod.io
- Estimated cost: about $108 (8 × $13.52) for complete training
- Token context: 1024 tokens
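The figures above imply a rough training throughput and total cost. The sketch below is back-of-the-envelope arithmetic only: it assumes one pass over roughly 18B tokens in about 8 hours on 8 GPUs, and it multiplies out the two cost factors exactly as listed, without assuming which factor is the hourly rate.

```python
# Approximate values taken from the training details; all are rounded in the card.
CORPUS_TOKENS = 18_000_000_000  # "approximately 18B tokens"
TRAINING_HOURS = 8              # "approximately 8 hours total"
NUM_GPUS = 8
COST_FACTORS = (8, 13.52)       # the two factors listed in the cost estimate


def tokens_per_second(tokens, hours):
    """Average tokens processed per second across all GPUs."""
    return tokens / (hours * 3600)


def estimated_cost(factors):
    """Product of the listed cost factors, in dollars."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


if __name__ == "__main__":
    overall = tokens_per_second(CORPUS_TOKENS, TRAINING_HOURS)
    print(f"overall: {overall:,.0f} tokens/s")           # 625,000
    print(f"per GPU: {overall / NUM_GPUS:,.0f} tokens/s")  # 78,125
    print(f"cost:    ${estimated_cost(COST_FACTORS):.2f}")  # $108.16
```

Because the token count and duration are both approximate, these numbers are order-of-magnitude estimates rather than measured throughput.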
Hyperparameters
- context_len: 1024
- seed: 42
- epochs: 2
- batch_size: 64
- total_batch_size: 524288 tokens
- grad_clip: 1.0
- optimizer: "adamw"
- max_lr: 6.0e-4
- min_lr: 6.0e-5
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1
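A few of these values can be related to each other with simple arithmetic. The sketch below assumes that batch_size counts sequences per GPU and that 8 GPUs are used, which makes one micro-step cover exactly total_batch_size tokens. The card lists max_lr and min_lr but not the shape of the schedule, so the cosine decay with linear warmup shown here is an assumption, and `WARMUP_STEPS` is a hypothetical value chosen only for illustration.

```python
import math

# Values copied from the hyperparameter list.
CONTEXT_LEN = 1024
BATCH_SIZE = 64             # assumed to mean sequences per GPU per micro-step
TOTAL_BATCH_SIZE = 524_288  # tokens per optimizer step
MAX_LR = 6.0e-4
MIN_LR = 6.0e-5

# Assumptions that are not part of the hyperparameter list.
NUM_GPUS = 8                    # matches the 8-GPU launch command
CORPUS_TOKENS = 18_000_000_000  # "approximately 18B tokens"
WARMUP_STEPS = 700              # hypothetical, for illustration only


def grad_accum_steps():
    """Micro-steps needed per optimizer step to reach TOTAL_BATCH_SIZE tokens."""
    tokens_per_micro_step = BATCH_SIZE * CONTEXT_LEN * NUM_GPUS
    assert TOTAL_BATCH_SIZE % tokens_per_micro_step == 0
    return TOTAL_BATCH_SIZE // tokens_per_micro_step


def steps_per_epoch():
    """Optimizer steps in one pass over the corpus."""
    return CORPUS_TOKENS // TOTAL_BATCH_SIZE


def lr_at(step, max_steps):
    """Assumed schedule: linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= max_steps:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (max_steps - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


if __name__ == "__main__":
    print("grad accumulation steps:", grad_accum_steps())  # 1
    print("steps per epoch:", steps_per_epoch())           # 34332
    print("lr at end of warmup:", lr_at(WARMUP_STEPS, steps_per_epoch()))
```

With these assumptions no gradient accumulation is needed, since 64 × 1024 × 8 = 524288 tokens, and one epoch is roughly 34,000 optimizer steps.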
Commands used for setup and training
- pip install wandb
- pip install tiktoken
- pip install --upgrade huggingface_hub
- pip install torchinfo
- pip install datasets
- sudo apt update && sudo apt install tmux
- tmux new -s training
- wandb login
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --standalone --nproc_per_node=8 train.py
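For context on the launch command: torchrun starts one process per GPU and passes each process its position through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables, while `NCCL_P2P_DISABLE=1` turns off direct GPU-to-GPU transfers in NCCL. The sketch below shows one way a training script could read those variables. It is not taken from this repository's train.py, and the helper name `read_distributed_env` is hypothetical.

```python
import os


def read_distributed_env(env=None):
    """Read the process layout that torchrun exports to each worker.

    Falls back to a single-process layout when the variables are absent,
    for example when the script is started with plain `python train.py`.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))              # global process index
    local_rank = int(env.get("LOCAL_RANK", 0))  # index on this machine, usually the GPU id
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    return {
        "rank": rank,
        "local_rank": local_rank,
        "world_size": world_size,
        "is_master": rank == 0,                 # typically the only process that logs and saves
        "device": f"cuda:{local_rank}",
    }


if __name__ == "__main__":
    print(read_distributed_env())
```

With `--nproc_per_node=8`, each of the eight processes would see a different `LOCAL_RANK` from 0 to 7 and a `WORLD_SIZE` of 8.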
Contact
GitHub: thecr7guy2