A 124M-parameter GPT-2 model was trained on the 10B-token sample of the fineweb-edu dataset (https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) using Karpathy's build-nanogpt repository (https://github.com/karpathy/build-nanogpt). Training took 3 hours on 4 H100 (80 GB) GPUs, yielding the training graphs below:
Settings:
Model Parameters: 124M
Tokens: 10B
Micro-Batch Size (per GPU): 48
Max Sequence Length: 1024
Total Batch Size: 196,608 tokens per step (48 × 1024 × 4 GPUs)
Warmup Steps: 1906
Max Steps: 50862
GPUs: 4
GPU Memory Usage: 62567 MB
Training Time: 3 hours (12 GPU hours)
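The warmup and max-step counts above feed the learning-rate schedule. As a rough sketch of how build-nanogpt-style training uses them (linear warmup followed by cosine decay), assuming the repository's default peak and minimum learning rates, which are not reported in this card:

```python
import math

# Step counts from the settings above; max_lr/min_lr are assumed
# build-nanogpt defaults, not values reported in this card.
max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 1906
max_steps = 50862

def get_lr(step):
    """Learning rate at a given optimization step."""
    # Phase 1: linear warmup from ~0 up to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Phase 3: after the schedule ends, hold at min_lr.
    if step > max_steps:
        return min_lr
    # Phase 2: cosine decay from max_lr down to min_lr.
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```

With these numbers the rate peaks at step 1905 (the end of warmup) and decays to one tenth of the peak by step 50862.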
