---
language:
- en
license: mit
tags:
- causal-lm
- gpt
- from-scratch
- fineweb
- pytorch
---
# FineWeb GPT — trained from scratch

A GPT-style language model built entirely from scratch as a learning exercise: the BPE tokenizer, the transformer architecture, and the training loop are all custom implementations.
## Architecture

| Hyperparameter | Value |
|---|---|
| Parameters | 8.4M |
| Layers | 6 |
| d_model | 256 |
| Attention heads | 8 |
| Context length | 512 |
| Vocabulary size | 8,192 (byte-level BPE) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
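The parameter count follows from the table above. As a rough sanity check, here is a back-of-the-envelope estimate; the SwiGLU hidden size, embedding tying, and norm placement are assumptions not confirmed by this card, so the result lands near (not exactly at) the reported 8.4M:

```python
# Rough parameter estimate from the architecture table.
# Assumptions: SwiGLU hidden ~ (8/3) * d_model, untied input/output
# embeddings, two RMSNorm gains per block. With tied embeddings the
# total drops by vocab * d_model (~2.1M).
vocab, d_model, n_layers = 8192, 256, 6

emb = vocab * d_model                 # token embedding
attn = 4 * d_model * d_model          # Q, K, V, O projections
hidden = int(8 * d_model / 3)         # common SwiGLU sizing convention
mlp = 3 * d_model * hidden            # gate, up, and down projections
norms = 2 * d_model                   # RMSNorm gains per block
per_layer = attn + mlp + norms

total = emb + n_layers * per_layer + d_model + vocab * d_model  # + final norm + LM head
print(f"~{total / 1e6:.1f}M parameters")
```

Depending on whether the LM head is tied to the embedding, this estimate ranges from roughly 6.8M to 8.9M, bracketing the reported 8.4M.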
## Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~5M tokens) |
| Steps | 1,800 |
| Optimizer | AdamW, cosine LR schedule with warmup |
| Val loss | 5.2764 |
| Perplexity | 195.7 |
| Hardware | Apple Silicon MPS |
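The schedule in the table (cosine decay with warmup) can be sketched as a plain function of the step index. The specific values (`max_lr`, `warmup`, `min_lr`) below are illustrative assumptions, not the ones used for this run:

```python
import math

def lr_at(step, max_lr=3e-4, warmup=100, total_steps=1800, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    Hyperparameter values here are placeholders, not this model's.
    """
    if step < warmup:
        # Linear ramp from ~0 up to max_lr over the warmup steps.
        return max_lr * (step + 1) / warmup
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 100, 900, 1800):
    print(s, lr_at(s))
```

The peak is reached exactly at the end of warmup, and the final step lands on `min_lr`.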
## Load the tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```
## Limitations

This model is a learning exercise only: it was trained on just ~5M tokens and reaches a perplexity of ~196. Outputs are repetitive and often incoherent. It is not suitable for any production use.
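The two evaluation numbers in the training table are consistent with each other: perplexity is simply the exponential of the cross-entropy validation loss. A quick check:

```python
import math

val_loss = 5.2764            # validation cross-entropy (nats/token), from the table
perplexity = math.exp(val_loss)
print(round(perplexity, 1))  # ≈ 195.7, matching the reported value
```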
## Stack

PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub