---
language:
- en
license: mit
tags:
- causal-lm
- gpt
- from-scratch
- fineweb
- pytorch
---

# FineWeb GPT — trained from scratch

A GPT-style language model trained from scratch as a learning exercise. Every component was written by hand: the BPE tokenizer, the transformer architecture, and the training loop.

## Architecture

| Component | Value |
|---|---|
| Parameters | 8.4M |
| Layers | 6 |
| d_model | 256 |
| Attention heads | 8 |
| Context length | 512 |
| Vocabulary | 8,192 (byte-level BPE) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |

## Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~5M tokens) |
| Steps | 1,800 |
| Optimizer | AdamW, cosine LR schedule with warmup |
| Val loss | 5.2764 |
| Perplexity | 195.7 |
| Hardware | Apple Silicon (MPS) |

## Load the tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```

## Limitations

A learning exercise only: trained on only ~5M tokens, with a validation perplexity of ~196. Outputs are repetitive and often incoherent.

## Stack

PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub
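The reported perplexity is the exponential of the mean validation cross-entropy loss, so the two numbers in the training table can be checked against each other:

```python
import math

# Perplexity of a causal LM is exp(mean cross-entropy loss in nats).
val_loss = 5.2764
perplexity = math.exp(val_loss)
print(round(perplexity, 1))  # matches the ~195.7 reported above
```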
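RMSNorm, listed in the architecture table, normalizes activations by their root-mean-square without subtracting a mean and without a bias term. A minimal pure-Python sketch of the idea (the actual model implements this as a PyTorch module; the function name and list-based interface here are illustrative only):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x by the reciprocal of its root-mean-square.

    Unlike LayerNorm, no mean is subtracted and there is no bias;
    `gain` is the learned per-dimension scale.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```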
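The SwiGLU activation gates one linear projection of the input with a SiLU-activated second projection. A formula-level sketch, assuming the two projections have already been computed (the real model applies this inside its MLP blocks; the helper names are hypothetical):

```python
import math

def silu(v):
    # SiLU / Swish: v * sigmoid(v)
    return v * (1.0 / (1.0 + math.exp(-v)))

def swiglu(gate, up):
    # SwiGLU(x) = SiLU(x @ W_gate) * (x @ W_up), elementwise.
    # `gate` and `up` stand in for the two projected vectors.
    return [silu(g) * u for g, u in zip(gate, up)]
```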