A 44M-parameter GPT-2-style language model built from scratch in PyTorch. Both the BPE tokenizer and the transformer architecture are implemented by hand. The model was trained on 99M tokens from FineWeb-edu (`HuggingFaceFW/fineweb-edu`) on Apple Silicon using the MPS backend.
| Component | Detail |
|---|---|
| Parameters | 44M |
| Layers | 12 |
| Embedding dim | 512 |
| Attention heads | 8 (head_dim = 64) |
| MLP expansion | 4x (512 → 2048 → 512) |
| Context length | 1024 tokens |
| Positional encoding | Sinusoidal (fixed) |
| Normalization | Pre-norm LayerNorm |
| Vocab size | 9,157 (custom BPE) |
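The table lists fixed sinusoidal positional encodings. As a rough illustration of what that usually means, here is a minimal sketch of the standard sinusoidal formulation, using the embedding dim (512) and context length (1024) from the table. It is written in plain Python so it runs without PyTorch. The function name `sinusoidal_positions` is hypothetical, and the model's own implementation in `src/gpt` may differ in details such as how the sine and cosine columns are laid out.

```python
import math

def sinusoidal_positions(context_len: int, dim: int) -> list[list[float]]:
    """Return a context_len x dim table of fixed (non-learned) position encodings."""
    assert dim % 2 == 0, "dim must be even so sin/cos columns pair up"
    table = []
    for pos in range(context_len):
        row = [0.0] * dim
        for i in range(0, dim, 2):
            # Each sin/cos pair shares one frequency; frequencies fall
            # geometrically as the column index i grows.
            angle = pos / (10000 ** (i / dim))
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        table.append(row)
    return table

# Values taken from the table above: context length 1024, embedding dim 512.
pe = sinusoidal_positions(context_len=1024, dim=512)
print(len(pe), len(pe[0]))  # 1024 512
print(pe[0][:4])            # [0.0, 1.0, 0.0, 1.0], since sin(0) = 0 and cos(0) = 1
```

Because the table is a pure function of position and column index, it adds no trainable parameters and can be computed once and added to the token embeddings.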
```python
import torch
from src.gpt import GPT, generate
from data.bpe_tokenizer import encode, decode, load_tokenizer

# Load model
ckpt = torch.load("checkpoints/final.pt", map_location="cpu")
gpt = GPT(**ckpt["config"])
gpt.load_state_dict(ckpt["model_state_dict"])
gpt.eval()

# Load tokenizer
merges, vocab = load_tokenizer()

# Generate
prompt_tokens = encode("The ", merges, vocab)
text = generate(gpt, merges, vocab, prompt_tokens, context_len=1024, max_new_tokens=50)
print(text)
```
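The internals of `generate` are not shown here. For intuition, the following is a sketch of what an autoregressive sampling loop with a `context_len` window typically looks like. It is not the repository's implementation. The names `sample_next`, `generate_ids`, and `toy_logits` are hypothetical, and a toy scoring function stands in for the trained model so the example runs on its own.

```python
import math
import random

def sample_next(logits, temperature, rng):
    """Sample one token id from logits via a temperature-scaled softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate_ids(next_logits, prompt_ids, context_len, max_new_tokens,
                 temperature=1.0, seed=0):
    """Append max_new_tokens sampled ids to the prompt, one token at a time."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # The model only ever sees the most recent context_len tokens.
        window = ids[-context_len:]
        ids.append(sample_next(next_logits(window), temperature, rng))
    return ids

# Toy stand-in for a trained model (an assumption for this sketch):
# it strongly prefers (last token + 1) mod vocab_size.
def toy_logits(window, vocab_size=5):
    target = (window[-1] + 1) % vocab_size
    return [10.0 if t == target else 0.0 for t in range(vocab_size)]

out = generate_ids(toy_logits, [0], context_len=4, max_new_tokens=6, temperature=0.1)
print(out)  # [0, 1, 2, 3, 4, 0, 1]
```

A low temperature sharpens the distribution toward the highest-scoring token, while a temperature of 1.0 samples from the model's unmodified probabilities.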
- `checkpoints/final.pt`: model weights, optimizer state, and config
- `bpe-tokenizer/merges.json`: BPE merge rules
- `bpe-tokenizer/vocab.json`: token-to-id mapping
- `bpe-shards/*.bin`: pre-tokenized training data (binary format)
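To illustrate how merge rules and a token-to-id mapping work together, here is a minimal character-level BPE encoding sketch. The on-disk layout of `merges.json` and `vocab.json` is not documented here, so the example uses small made-up data (`toy_merges`, `toy_vocab`) and a hypothetical `bpe_encode` function. The repository's `encode` may differ, for example by operating on bytes or by pre-splitting text into words.

```python
def bpe_encode(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply learned merges to text, always merging the earliest-learned pair first."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(text)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Made-up merge rules (in the order they were "learned") and vocabulary.
toy_merges = [("t", "h"), ("th", "e"), ("e", "r")]
toy_vocab = {tok: i for i, tok in enumerate(["t", "h", "e", "r", " ", "th", "the", "er"])}

tokens = bpe_encode("the there", toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens)  # ['the', ' ', 'the', 'r', 'e']
print(ids)     # [6, 4, 6, 3, 2]
```

Note that the `("e", "r")` rule never fires in this example: by the time it could apply, the preceding `e` has already been absorbed into `the`, which shows why merge order matters.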