---
language:
- en
license: mit
tags:
- causal-lm
- gpt
- from-scratch
- fineweb
- pytorch
---
# FineWeb GPT — trained from scratch
A GPT-style language model built as a learning exercise, with every component
written from scratch: the ByteLevel BPE tokenizer, the transformer architecture,
and the training loop.
## Architecture
| Hyperparameter | Value |
|---|---|
| Parameters | 8.4M |
| Layers | 6 |
| d_model | 256 |
| Attention heads | 8 |
| Context length | 512 |
| Vocabulary | 8,192 (ByteLevel BPE) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
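The model's own modules are not shown in this card. As a minimal PyTorch sketch of the RMSNorm and SwiGLU components from the table (class names, FFN width, and layer layout are assumptions, not the repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by 1/rms(x), no mean-centering or bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(W1 x) * (W3 x), projected back by W2."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# d_model=256 as in the table; the hidden width 683 is an assumption.
x = torch.randn(2, 8, 256)
y = SwiGLU(256, 683)(RMSNorm(256)(x))
print(y.shape)
```

Pre-norm transformer blocks typically apply RMSNorm before attention and before the SwiGLU feed-forward; the shapes here confirm the block preserves `(batch, seq, d_model)`.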
## Training
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~5M tokens) |
| Steps | 1,800 |
| Optimizer | AdamW, cosine LR + warmup |
| Val loss | 5.2764 |
| Perplexity | 195.7 |
| Hardware | Apple Silicon MPS |
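Perplexity is the exponential of the validation cross-entropy loss (in nats per token), so the two table rows are consistent:

```python
import math

val_loss = 5.2764            # validation cross-entropy, nats per token
perplexity = math.exp(val_loss)
print(round(perplexity, 1))  # 195.7, matching the table
```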
## Load the tokenizer
```python
from transformers import PreTrainedTokenizerFast

# Replace REPO_ID with this repository's id on the Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```
## Limitations
This is a learning exercise, not a usable model: it was trained on only ~5M
tokens and reaches a perplexity of ~196, so outputs are repetitive and often
incoherent.
## Stack
PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub