A 124M-parameter GPT-2 model was trained on the 10B-token sample of the fineweb-edu dataset (https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) using Karpathy's build-nanogpt repository (https://github.com/karpathy/build-nanogpt). Training took 3 hours on 4 H100 (80 GB) GPUs, yielding the training graphs below:
Settings:
Model Parameters: 124M
Tokens: 10B
Micro-Batch Size (per GPU): 48
Max Sequence Length: 1024
Total Batch Size: 196,608 tokens per step (48 × 1024 × 4 GPUs)
Warmup Steps: 1906
Max Steps: 50862
GPUs: 4
GPU Memory Usage: 62567 MB
Training Time: 3 hours (12 GPU hours)
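The warmup and max-step counts above feed the learning-rate schedule. As a rough sketch of how build-nanogpt-style training uses them (linear warmup followed by cosine decay), assuming the repository's default peak and minimum learning rates, which are not reported in this card:

```python
import math

# Step counts from the settings above; max_lr/min_lr are assumed
# build-nanogpt defaults, not values reported in this card.
max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 1906
max_steps = 50862

def get_lr(step):
    """Learning rate at a given optimization step."""
    # Phase 1: linear warmup from ~0 up to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Phase 3: after the schedule ends, hold at min_lr.
    if step > max_steps:
        return min_lr
    # Phase 2: cosine decay from max_lr down to min_lr.
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```

With these numbers the rate peaks at step 1905 (the end of warmup) and decays to one tenth of the peak by step 50862.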
