---
language:
- en
license: mit
tags:
- causal-lm
- gpt
- from-scratch
- fineweb
- pytorch
---

# FineWeb GPT — trained from scratch

A GPT-style language model trained from scratch as a learning exercise. Every component was written by hand: the BPE tokenizer, the transformer architecture, and the training loop.

## Architecture

| Component | Value |
|---|---|
| Parameters | 8.4M |
| Layers | 6 |
| d_model | 256 |
| Attention heads | 8 |
| Context length | 512 |
| Vocabulary | 8,192 (byte-level BPE) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |

## Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~5M tokens) |
| Steps | 1,800 |
| Optimizer | AdamW, cosine LR schedule with warmup |
| Val loss | 5.2764 |
| Perplexity | 195.7 |
| Hardware | Apple Silicon (MPS) |

## Load the tokenizer

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```

## Limitations

A learning exercise only: trained on only ~5M tokens, with a validation perplexity of ~196. Outputs are repetitive and often incoherent.

## Stack

PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub
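The reported perplexity is the exponential of the mean validation cross-entropy loss, so the two numbers in the training table can be checked against each other:

```python
import math

# Perplexity of a causal LM is exp(mean cross-entropy loss in nats).
val_loss = 5.2764
perplexity = math.exp(val_loss)
print(round(perplexity, 1))  # matches the ~195.7 reported above
```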
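RMSNorm, listed in the architecture table, normalizes activations by their root-mean-square without subtracting a mean and without a bias term. A minimal pure-Python sketch of the idea (the actual model implements this as a PyTorch module; the function name and list-based interface here are illustrative only):

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x by the reciprocal of its root-mean-square.

    Unlike LayerNorm, no mean is subtracted and there is no bias;
    `gain` is the learned per-dimension scale.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```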
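The SwiGLU activation gates one linear projection of the input with a SiLU-activated second projection. A formula-level sketch, assuming the two projections have already been computed (the real model applies this inside its MLP blocks; the helper names are hypothetical):

```python
import math

def silu(v):
    # SiLU / Swish: v * sigmoid(v)
    return v * (1.0 / (1.0 + math.exp(-v)))

def swiglu(gate, up):
    # SwiGLU(x) = SiLU(x @ W_gate) * (x @ W_up), elementwise.
    # `gate` and `up` stand in for the two projected vectors.
    return [silu(g) * u for g, u in zip(gate, up)]
```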