GPT from scratch — 1800 steps, ppl=195.7
Files added:
- README.md (+59 lines)
- config.json (+22 lines)
- pytorch_model.bin (+3 lines, LFS pointer)
- tokenizer.json (diff too large to render)
- tokenizer_config.json (+8 lines)
README.md
ADDED
---
language:
- en
license: mit
tags:
- causal-lm
- gpt
- from-scratch
- fineweb
- pytorch
---

# FineWeb GPT: trained from scratch

A GPT-style language model trained from scratch as a learning exercise. Every
component is hand-written: the byte-level BPE tokenizer, the transformer
architecture, and the training loop.

## Architecture

| Hyperparameter | Value |
|---|---|
| Parameters | 8.4M |
| Layers | 6 |
| d_model | 256 |
| Attention heads | 8 |
| Context length | 512 |
| Vocabulary | 8,192 (byte-level BPE) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
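
As a sanity check, the sketch below re-derives the 8.4M figure from the other
rows. It assumes tied input/output embeddings, a three-matrix SwiGLU
feed-forward (gate, up, down) with d_ff = 1024 as listed in config.json, and
two RMSNorm weight vectors per layer plus a final one; the real layout may
differ in small ways.

```python
# Back-of-the-envelope parameter count for the table above.
# Assumptions: tied embeddings, 3-matrix SwiGLU (gate/up/down),
# two RMSNorm weight vectors per layer plus one final norm.
vocab, d_model, d_ff, n_layers = 8192, 256, 1024, 6

embeddings = vocab * d_model                  # tied with the LM head
attention  = 4 * d_model * d_model            # Wq, Wk, Wv, Wo
swiglu     = 3 * d_model * d_ff               # gate, up, down projections
norms      = 2 * d_model                      # two RMSNorms per layer
per_layer  = attention + swiglu + norms

total = embeddings + n_layers * per_layer + d_model  # + final RMSNorm
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # 8,391,936 (~8.4M)
```
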
## Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~5M tokens) |
| Steps | 1,800 |
| Optimizer | AdamW, cosine LR schedule with warmup |
| Validation loss | 5.2764 |
| Perplexity | 195.7 |
| Hardware | Apple Silicon (MPS) |
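
The reported perplexity is just the exponential of the validation loss:
exp(5.2764) ≈ 195.7. The card doesn't record the schedule's hyperparameters,
so the sketch below is a generic warmup-plus-cosine curve of the kind named in
the table; `warmup_steps`, `max_lr`, and `min_lr` are illustrative guesses,
not the values actually used.

```python
import math

# exp(val_loss) recovers the reported perplexity.
print(math.exp(5.2764))  # ≈ 195.7

# Generic linear warmup + cosine decay; hyperparameters are illustrative.
def lr_at(step, max_steps=1800, warmup_steps=100, max_lr=3e-4, min_lr=3e-5):
    if step < warmup_steps:                      # linear warmup to max_lr
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # decays 1 -> 0
    return min_lr + (max_lr - min_lr) * cosine
```
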
## Load the tokenizer

```python
from transformers import PreTrainedTokenizerFast

# The tokenizer is a standard fast tokenizer and loads directly from the Hub;
# replace REPO_ID with this repo's Hub id.
tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```
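
Note that the repo ships only the raw weights and config, with no modeling
code, so `transformers` cannot reconstruct a model from the custom
`fineweb-gpt` model type. Assuming the .bin file is a plain `torch.save`
state dict, a minimal sketch for pulling it down and inspecting it before
wiring the tensors into your own implementation of the architecture:

```python
import torch
from huggingface_hub import hf_hub_download

# Pull the raw checkpoint; REPO_ID as in the tokenizer snippet above.
path = hf_hub_download(repo_id="REPO_ID", filename="pytorch_model.bin")
state_dict = torch.load(path, map_location="cpu")

# Inspect tensor names and shapes before loading them into your own model
# class via model.load_state_dict(state_dict).
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```
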
## Limitations

A learning exercise only: trained on roughly 5M tokens to a perplexity of
about 196, so outputs are repetitive and often incoherent.
## Stack

PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub
config.json
ADDED
{
  "model_type": "fineweb-gpt",
  "architectures": [
    "GPTForCausalLM"
  ],
  "bos_token_id": 0,
  "eos_token_id": 0,
  "pad_token_id": 1,
  "vocab_size": 8192,
  "context_len": 512,
  "n_layers": 6,
  "d_model": 256,
  "n_heads": 8,
  "d_ff": 1024,
  "dropout": 0.1,
  "tie_embeddings": true,
  "trained_steps": 1800,
  "val_loss": 5.2764,
  "perplexity": 195.7,
  "training_tokens": "~5M",
  "dataset": "HuggingFaceFW/fineweb-edu (sample-10BT, 10k docs)"
}
pytorch_model.bin
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6639466bda7219a3a74354e8b42031a7cfae53a56c6804224765a3d0a64e4818
size 35556865
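
The three lines above are a Git LFS pointer: the real ~35.6 MB blob is stored
out of band, and the oid is the SHA-256 of its contents, so a download can be
verified directly. A small sketch (REPO_ID again stands in for this repo's
Hub id):

```python
import hashlib
from huggingface_hub import hf_hub_download

# hf_hub_download resolves the LFS pointer to the real 35,556,865-byte blob.
path = hf_hub_download(repo_id="REPO_ID", filename="pytorch_model.bin")

# The LFS oid is simply the SHA-256 of the downloaded file's contents.
digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
assert digest == "6639466bda7219a3a74354e8b42031a7cfae53a56c6804224765a3d0a64e4818"
```
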
tokenizer.json
ADDED
Diff too large to render.
tokenizer_config.json
ADDED
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|pad|>",
  "unk_token": "<|unk|>",
  "model_max_length": 512
}