# Max Babbelaar — Base Model
Pretrained base language model for Max Babbelaar, a bilingual (Dutch + English) character modelled on a 19th-century Dutch gentleman. Trained on public-domain texts from 1750–1899: DBNL, Delpher Kranten, DutchDraCor, Project Gutenberg Dutch, and British Library Books. Full corpus details and token counts: fdeantoni/max-babbelaar-corpus.
This repo holds base checkpoints for multiple model depths. Each tag (`d18`, `d24`, …) lives under `base_checkpoints/<tag>/`; all tags share a single tokenizer.

Latest upload: **d18** at step 6000.
| Depth | Step | Layers | d_model | Heads (Q/KV) | Vocab | Context |
|---|---|---|---|---|---|---|
| d18 | 6000 | 18 | 1152 | 9/9 | 32768 | 2048 |
Architecture: GPT with RoPE, QK-norm, GQA, relu² MLP, sliding-window pattern SSSL,
value embeddings (ResFormer-style), smear gate, and backout residual. Trained with the
nanochat fork.
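Most of the components above are standard variations on the GPT block; the relu² MLP in particular is simple enough to sketch. Below is a minimal PyTorch illustration — the module names and the 4×d_model hidden width are assumptions for the sketch, not values read from the checkpoint config or the nanochat fork:

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """Sketch of a relu^2 MLP block: up-project, ReLU, square
    elementwise, down-project. Names and hidden width are
    illustrative assumptions, not taken from the nanochat fork."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # relu(x)^2 keeps the MLP nonlinearity smooth at zero while
        # staying cheap to compute
        return self.down(torch.relu(self.up(x)).square())

# shapes matching the d18 row above: d_model = 1152
x = torch.randn(2, 16, 1152)                      # (batch, seq, d_model)
mlp = ReluSquaredMLP(d_model=1152, d_hidden=4 * 1152)
print(mlp(x).shape)                               # torch.Size([2, 16, 1152])
```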
## Repo layout

```
base_checkpoints/
  <tag>/
    model_<step>.pt    # model weights (torch state dict, bf16)
    meta_<step>.json   # GPTConfig + training metadata
tokenizer/
  tokenizer.pkl        # tiktoken BPE encoding (vocab 32768, rustbpe-trained)
  token_bytes.pt       # per-token byte tensors (needed by the SFT dataloader)
```
## Download and resume SFT

```python
from huggingface_hub import snapshot_download
import os

snapshot_download(
    repo_id="fdeantoni/max-babbelaar-base",
    repo_type="model",
    allow_patterns=["base_checkpoints/d18/**", "tokenizer/**"],
    local_dir=os.path.expanduser("~/.cache/nanochat"),
    local_dir_use_symlinks=False,
)
```
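After the snapshot completes, a quick sanity check is to confirm the expected files landed under the cache directory. The `checkpoint_files` helper below is a hypothetical convenience that just spells out the repo layout above; it is not part of nanochat or this repo:

```python
import os

def checkpoint_files(tag: str, step: int) -> list[str]:
    # Hypothetical helper: relative paths a base-checkpoint download
    # should contain, following the repo layout (model_<step>.pt etc.).
    return [
        f"base_checkpoints/{tag}/model_{step}.pt",
        f"base_checkpoints/{tag}/meta_{step}.json",
        "tokenizer/tokenizer.pkl",
        "tokenizer/token_bytes.pt",
    ]

base = os.path.expanduser("~/.cache/nanochat")
missing = [p for p in checkpoint_files("d18", 6000)
           if not os.path.exists(os.path.join(base, p))]
print("missing:", missing)
```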
Then resume SFT from the restored checkpoint:
```bash
NANOCHAT_BASE_DIR=~/.cache/nanochat \
torchrun --standalone --nproc_per_node=N \
  -m scripts.chat_sft \
  --model-tag=d18 \
  --sft-file /path/to/sft_train.jsonl
```
## Tokenizer
Custom GPT-4-style BPE tokenizer with vocab size 32768, trained on the Babbelaar corpus. Special tokens: `<|bos|>`, `<|user_start|>`, `<|user_end|>`, `<|assistant_start|>`, `<|assistant_end|>`, `<|python_start|>`, `<|python_end|>`, `<|output_start|>`, `<|output_end|>`.
Stored as a tiktoken pickle at `tokenizer/tokenizer.pkl`. Load it within the nanochat project with:

```python
from nanochat.tokenizer import get_tokenizer  # reads NANOCHAT_BASE_DIR/tokenizer/tokenizer.pkl

tokenizer = get_tokenizer()
```
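The special tokens above frame chat turns for SFT. As an illustration only — the authoritative rendering lives in nanochat's dataloader, not here — a single user/assistant turn might be assembled like this:

```python
def render_turn(user_msg: str, assistant_msg: str) -> str:
    # Illustrative sketch of how the special tokens could frame one
    # chat turn; the exact format is defined by nanochat's dataloader.
    return (
        "<|bos|>"
        "<|user_start|>" + user_msg + "<|user_end|>"
        "<|assistant_start|>" + assistant_msg + "<|assistant_end|>"
    )

print(render_turn("Goedendag, Max!", "Gegroet, waarde vriend."))
```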