---
language:
  - nl
  - en
license: mit
tags:
  - causal-lm
  - historical
  - dutch
  - 19th-century
  - nanochat
datasets:
  - fdeantoni/max-babbelaar-corpus
---

# Max Babbelaar — Base Model

Pretrained base language model for **Max Babbelaar**, a bilingual (Dutch + English) character
modelled on a 19th-century Dutch gentleman. Trained on public-domain texts from 1750–1899:
DBNL, Delpher Kranten, DutchDraCor, Project Gutenberg Dutch, and British Library Books.
Full corpus details and token counts: [fdeantoni/max-babbelaar-corpus](https://huggingface.co/datasets/fdeantoni/max-babbelaar-corpus).

This repo holds base checkpoints for multiple model depths. Each depth tag (`d18`, `d24`, …)
has its own directory under `base_checkpoints/<tag>/`, and all tags share a single tokenizer.

## Latest upload: `d18` at step 6000

| Depth | Step | Layers | d_model | Heads (Q/KV) | Vocab | Context |
|-------|------|--------|---------|--------------|-------|---------|
| `d18` | 6000 | 18 | 1152 | 9/9 | 32768 | 2048 |

Architecture: GPT with RoPE, QK-norm, GQA (configured here with 9 query and 9 KV heads, so
each query head has its own KV head), relu² MLP, sliding-window pattern `SSSL`, value
embeddings (ResFormer-style), smear gate, and backout residual. Trained with the
[nanochat](https://github.com/tventurella/nanochat) fork.
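
Two of the figures above can be made concrete with a little arithmetic. The sketch below derives the per-head width from the table (1152 / 9) and expands the `SSSL` pattern over 18 layers. The layer mapping is an assumption, not something this card states: the pattern is tiled in order and the final layer is forced to `L`. The fork's actual rule may differ, and both function names are illustrative.

```python
def head_dim(d_model: int, n_heads: int) -> int:
    """Per-head width; d_model must split evenly across the heads."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return d_model // n_heads


def expand_window_pattern(pattern: str, n_layers: int) -> str:
    """Tile `pattern` across `n_layers`, one character per layer.

    ASSUMPTION: the last layer is always forced to 'L'.
    """
    if not pattern or n_layers <= 0:
        raise ValueError("pattern must be non-empty and n_layers positive")
    tiled = "".join(pattern[i % len(pattern)] for i in range(n_layers))
    return tiled[:-1] + "L"


print(head_dim(1152, 9))                   # 128
print(expand_window_pattern("SSSL", 18))   # SSSLSSSLSSSLSSSLSL
```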

## Repo layout

```
base_checkpoints/
  <tag>/
    model_<step>.pt     — model weights (torch state dict, bf16)
    meta_<step>.json    — GPTConfig + training metadata
tokenizer/
  tokenizer.pkl         — tiktoken BPE encoding (vocab 32768, rustbpe-trained)
  token_bytes.pt        — per-token byte tensors (needed by SFT dataloader)
```
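
Given this layout, a small helper can discover which steps are available for a tag and read the metadata of the newest one. This is a minimal sketch that relies only on the `model_<step>.pt` / `meta_<step>.json` naming shown above. The function names are hypothetical, and nothing is assumed about the keys inside the JSON file.

```python
import json
import re
from pathlib import Path


def find_checkpoints(tag_dir: Path) -> dict[int, tuple[Path, Path]]:
    """Map each step to its (weights, metadata) pair; incomplete pairs are skipped."""
    found = {}
    for weights in tag_dir.glob("model_*.pt"):
        match = re.fullmatch(r"model_(\d+)\.pt", weights.name)
        if match is None:
            continue
        # Reuse the step string exactly as written, so zero padding is preserved.
        meta = tag_dir / f"meta_{match.group(1)}.json"
        if meta.exists():
            found[int(match.group(1))] = (weights, meta)
    return found


def load_latest_meta(tag_dir: Path) -> tuple[int, dict]:
    """Return (step, parsed metadata) for the highest step under `tag_dir`."""
    checkpoints = find_checkpoints(tag_dir)
    if not checkpoints:
        raise FileNotFoundError(f"no complete checkpoint under {tag_dir}")
    step = max(checkpoints)
    _, meta_path = checkpoints[step]
    return step, json.loads(meta_path.read_text(encoding="utf-8"))
```

The matching weights file can then be opened with `torch.load(path, map_location="cpu")` to inspect the state dict without a GPU.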

## Download and resume SFT

```python
import os

from huggingface_hub import snapshot_download

# Fetch only the d18 checkpoint and the shared tokenizer into the
# directory that NANOCHAT_BASE_DIR will point at.
snapshot_download(
    repo_id="fdeantoni/max-babbelaar-base",
    repo_type="model",
    allow_patterns=["base_checkpoints/d18/**", "tokenizer/**"],
    local_dir=os.path.expanduser("~/.cache/nanochat"),
)
```
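
Before launching a training job, it can help to confirm that the download produced everything the repo layout lists. The check below is a sketch: `missing_files` is a hypothetical helper, and the expected paths are taken from the layout section rather than from any nanochat API.

```python
from pathlib import Path

# Tokenizer files named in the repo layout, relative to the base directory.
TOKENIZER_FILES = ("tokenizer/tokenizer.pkl", "tokenizer/token_bytes.pt")


def missing_files(base_dir: Path, tag: str) -> list[str]:
    """List expected files or patterns that are absent under `base_dir`."""
    problems = [rel for rel in TOKENIZER_FILES if not (base_dir / rel).is_file()]
    tag_dir = base_dir / "base_checkpoints" / tag
    for pattern in ("model_*.pt", "meta_*.json"):
        # A missing tag directory counts as missing both patterns.
        if not tag_dir.is_dir() or not any(tag_dir.glob(pattern)):
            problems.append(f"base_checkpoints/{tag}/{pattern}")
    return problems
```

An empty list from `missing_files(Path("~/.cache/nanochat").expanduser(), "d18")` means the layout is complete for that tag.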

Then run SFT from the downloaded base checkpoint, replacing `N` with the number of GPUs to use:

```bash
NANOCHAT_BASE_DIR=~/.cache/nanochat \
torchrun --standalone --nproc_per_node=N \
    -m scripts.chat_sft \
    --model-tag=d18 \
    --sft-file /path/to/sft_train.jsonl
```

## Tokenizer

Custom GPT-4-style BPE tokenizer with vocab size 32768, trained on the Babbelaar corpus.
Special tokens: `<|bos|>`, `<|user_start|>`, `<|user_end|>`, `<|assistant_start|>`,
`<|assistant_end|>`, `<|python_start|>`, `<|python_end|>`, `<|output_start|>`, `<|output_end|>`.

The tokenizer is stored as a pickled tiktoken encoding at `tokenizer/tokenizer.pkl`. From inside the nanochat project, with `NANOCHAT_BASE_DIR` pointing at the download directory, load it with:

```python
from nanochat.tokenizer import get_tokenizer  # reads NANOCHAT_BASE_DIR/tokenizer/tokenizer.pkl
tokenizer = get_tokenizer()
```
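
Outside the nanochat project, the pickle can in principle be opened directly. The sketch below assumes the file unpickles to a `tiktoken.Encoding`, as the description above suggests. `load_encoding` is a hypothetical helper, and the `tiktoken` package must be installed for the real file to load.

```python
import pickle
from pathlib import Path


def load_encoding(base_dir: Path):
    """Unpickle tokenizer/tokenizer.pkl found under `base_dir`.

    Unpickling can execute arbitrary code, so only load files you trust.
    """
    path = base_dir / "tokenizer" / "tokenizer.pkl"
    with path.open("rb") as handle:
        return pickle.load(handle)
```

If the result is a `tiktoken.Encoding`, `enc.encode_ordinary("Goedenavond, mijnheer.")` returns a list of token ids and `enc.decode(ids)` maps them back to text.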