---
language:
- en
license: mit
tags:
- causal-lm
- gpt
- from-scratch
- fineweb
- pytorch
---
# FineWeb GPT — trained from scratch
A GPT-style language model built as a learning exercise, with every component
written from scratch: the ByteLevel BPE tokenizer, the transformer architecture,
and the training loop.
## Architecture
| Hyperparameter | Value |
|---|---|
| Parameters | 8.4M |
| Layers | 6 |
| d_model | 256 |
| Attention heads | 8 |
| Context length | 512 |
| Vocabulary | 8,192 (ByteLevel BPE) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
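The model's own modules are not shown in this card. As a minimal PyTorch sketch of the RMSNorm and SwiGLU components from the table (class names, FFN width, and layer layout are assumptions, not the repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by 1/rms(x), no mean-centering or bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(W1 x) * (W3 x), projected back by W2."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# d_model=256 as in the table; the hidden width 683 is an assumption.
x = torch.randn(2, 8, 256)
y = SwiGLU(256, 683)(RMSNorm(256)(x))
print(y.shape)
```

Pre-norm transformer blocks typically apply RMSNorm before attention and before the SwiGLU feed-forward; the shapes here confirm the block preserves `(batch, seq, d_model)`.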
## Training
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~5M tokens) |
| Steps | 1,800 |
| Optimizer | AdamW, cosine LR + warmup |
| Val loss | 5.2764 |
| Perplexity | 195.7 |
| Hardware | Apple Silicon MPS |
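Perplexity is the exponential of the validation cross-entropy loss (in nats per token), so the two table rows are consistent:

```python
import math

val_loss = 5.2764            # validation cross-entropy, nats per token
perplexity = math.exp(val_loss)
print(round(perplexity, 1))  # 195.7, matching the table
```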
## Load the tokenizer
```python
from transformers import PreTrainedTokenizerFast

# Replace REPO_ID with this repository's id on the Hub.
tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```
## Limitations
This is a learning exercise, not a usable model: it was trained on only ~5M
tokens and reaches a perplexity of ~196, so outputs are repetitive and often
incoherent.
## Stack
PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub