# Max Babbelaar — Base Model
Pretrained base language model for Max Babbelaar, a bilingual (Dutch + English) character modelled on a 19th-century Dutch gentleman. Trained on public-domain texts from 1750–1899: DBNL, Delpher Kranten, DutchDraCor, Project Gutenberg Dutch, and British Library Books. Full corpus details and token counts: fdeantoni/max-babbelaar-corpus.
This repo holds base checkpoints for multiple model depths. Each tag (`d18`, `d24`, …) lives under `base_checkpoints/<tag>/`; all tags share a single tokenizer.

Latest upload: **d18** at step 6000.
| Depth | Step | Layers | d_model | Heads (Q/KV) | Vocab | Context |
|---|---|---|---|---|---|---|
| d18 | 6000 | 18 | 1152 | 9/9 | 32768 | 2048 |
Architecture: GPT with RoPE, QK-norm, GQA, relu² MLP, sliding-window pattern SSSL,
value embeddings (ResFormer-style), smear gate, and backout residual. Trained with the
nanochat fork.
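Most of the components above are standard variations on the GPT block; the relu² MLP in particular is simple enough to sketch. Below is a minimal PyTorch illustration — the module names and the 4×d_model hidden width are assumptions for the sketch, not values read from the checkpoint config or the nanochat fork:

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """Sketch of a relu^2 MLP block: up-project, ReLU, square
    elementwise, down-project. Names and hidden width are
    illustrative assumptions, not taken from the nanochat fork."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # relu(x)^2 keeps the MLP nonlinearity smooth at zero while
        # staying cheap to compute
        return self.down(torch.relu(self.up(x)).square())

# shapes matching the d18 row above: d_model = 1152
x = torch.randn(2, 16, 1152)                      # (batch, seq, d_model)
mlp = ReluSquaredMLP(d_model=1152, d_hidden=4 * 1152)
print(mlp(x).shape)                               # torch.Size([2, 16, 1152])
```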
## Repo layout

```
base_checkpoints/
  <tag>/
    model_<step>.pt    # model weights (torch state dict, bf16)
    meta_<step>.json   # GPTConfig + training metadata
tokenizer/
  tokenizer.pkl        # tiktoken BPE encoding (vocab 32768, rustbpe-trained)
  token_bytes.pt       # per-token byte tensors (needed by the SFT dataloader)
```
## Download and resume SFT

```python
from huggingface_hub import snapshot_download
import os

snapshot_download(
    repo_id="fdeantoni/max-babbelaar-base",
    repo_type="model",
    allow_patterns=["base_checkpoints/d18/**", "tokenizer/**"],
    local_dir=os.path.expanduser("~/.cache/nanochat"),
    local_dir_use_symlinks=False,
)
```
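After the snapshot completes, a quick sanity check is to confirm the expected files landed under the cache directory. The `checkpoint_files` helper below is a hypothetical convenience that just spells out the repo layout above; it is not part of nanochat or this repo:

```python
import os

def checkpoint_files(tag: str, step: int) -> list[str]:
    # Hypothetical helper: relative paths a base-checkpoint download
    # should contain, following the repo layout (model_<step>.pt etc.).
    return [
        f"base_checkpoints/{tag}/model_{step}.pt",
        f"base_checkpoints/{tag}/meta_{step}.json",
        "tokenizer/tokenizer.pkl",
        "tokenizer/token_bytes.pt",
    ]

base = os.path.expanduser("~/.cache/nanochat")
missing = [p for p in checkpoint_files("d18", 6000)
           if not os.path.exists(os.path.join(base, p))]
print("missing:", missing)
```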
Then resume SFT from the restored checkpoint:
```bash
NANOCHAT_BASE_DIR=~/.cache/nanochat \
torchrun --standalone --nproc_per_node=N \
  -m scripts.chat_sft \
  --model-tag=d18 \
  --sft-file /path/to/sft_train.jsonl
```
## Tokenizer
Custom GPT-4-style BPE tokenizer with vocab size 32768, trained on the Babbelaar corpus. Special tokens: `<|bos|>`, `<|user_start|>`, `<|user_end|>`, `<|assistant_start|>`, `<|assistant_end|>`, `<|python_start|>`, `<|python_end|>`, `<|output_start|>`, `<|output_end|>`.
Stored as a tiktoken pickle at `tokenizer/tokenizer.pkl`. Load it within the nanochat project with:

```python
from nanochat.tokenizer import get_tokenizer  # reads NANOCHAT_BASE_DIR/tokenizer/tokenizer.pkl

tokenizer = get_tokenizer()
```
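The special tokens above frame chat turns for SFT. As an illustration only — the authoritative rendering lives in nanochat's dataloader, not here — a single user/assistant turn might be assembled like this:

```python
def render_turn(user_msg: str, assistant_msg: str) -> str:
    # Illustrative sketch of how the special tokens could frame one
    # chat turn; the exact format is defined by nanochat's dataloader.
    return (
        "<|bos|>"
        "<|user_start|>" + user_msg + "<|user_end|>"
        "<|assistant_start|>" + assistant_msg + "<|assistant_end|>"
    )

print(render_turn("Goedendag, Max!", "Gegroet, waarde vriend."))
```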