--- language: - nl - en license: mit tags: - causal-lm - historical - dutch - 19th-century - nanochat datasets: - fdeantoni/max-babbelaar-corpus --- # Max Babbelaar — Base Model Pretrained base language model for **Max Babbelaar**, a bilingual (Dutch + English) character modelled on a 19th-century Dutch gentleman. Trained on public-domain texts from 1750–1899: DBNL, Delpher Kranten, DutchDraCor, Project Gutenberg Dutch, and British Library Books. Full corpus details and token counts: [fdeantoni/max-babbelaar-corpus](https://huggingface.co/datasets/fdeantoni/max-babbelaar-corpus). This repo holds base checkpoints for multiple model depths. Each tag (`d18`, `d24`, …) lives under `base_checkpoints//` and shares a single tokenizer. ## Latest upload: `d18` at step 6000 | Depth | Step | Layers | d_model | Heads (Q/KV) | Vocab | Context | |-------|------|--------|---------|--------------|-------|---------| | `d18` | 6000 | 18 | 1152 | 9/9 | 32768 | 2048 | Architecture: GPT with RoPE, QK-norm, GQA, relu² MLP, sliding-window pattern `SSSL`, value embeddings (ResFormer-style), smear gate, and backout residual. Trained with the [nanochat](https://github.com/tventurella/nanochat) fork. ## Repo layout ``` base_checkpoints/ / model_.pt — model weights (torch state dict, bf16) meta_.json — GPTConfig + training metadata tokenizer/ tokenizer.pkl — tiktoken BPE encoding (vocab 32768, rustbpe-trained) token_bytes.pt — per-token byte tensors (needed by SFT dataloader) ``` ## Download and resume SFT ```python from huggingface_hub import snapshot_download import os snapshot_download( repo_id="fdeantoni/max-babbelaar-base", repo_type="model", allow_patterns=["base_checkpoints/d18/**", "tokenizer/**"], local_dir=os.path.expanduser("~/.cache/nanochat"), local_dir_use_symlinks=False, ) ``` Then resume SFT from the restored checkpoint: ```bash NANOCHAT_BASE_DIR=~/.cache/nanochat \ torchrun --standalone --nproc_per_node=N \ -m scripts.chat_sft \ --model-tag=d18 \ --sft-file /path/to/sft_train.jsonl ``` ## Tokenizer Custom GPT-4-style BPE tokenizer with vocab size 32768, trained on the Babbelaar corpus. Special tokens: `<|bos|>` `<|user_start|>` `<|user_end|>` `<|assistant_start|>` `<|assistant_end|>` `<|python_start|>` `<|python_end|>` `<|output_start|>` `<|output_end|>`. Stored as a tiktoken pickle at `tokenizer/tokenizer.pkl`. Load within the nanochat project with: ```python from nanochat.tokenizer import get_tokenizer # reads NANOCHAT_BASE_DIR/tokenizer/tokenizer.pkl tokenizer = get_tokenizer() ```