---
language:
- nl
- en
license: mit
tags:
- causal-lm
- historical
- dutch
- 19th-century
- nanochat
datasets:
- fdeantoni/max-babbelaar-corpus
---

# Max Babbelaar: Base Model

Pretrained base language model for **Max Babbelaar**, a bilingual (Dutch + English) character
modelled on a 19th-century Dutch gentleman. The model was trained on public-domain texts from 1750–1899,
drawn from DBNL, Delpher Kranten, DutchDraCor, Project Gutenberg Dutch, and British Library Books.
Full corpus details and token counts are in [fdeantoni/max-babbelaar-corpus](https://huggingface.co/datasets/fdeantoni/max-babbelaar-corpus).

This repo holds base checkpoints for multiple model depths. Each tag (`d18`, `d24`, …)
lives under `base_checkpoints/<tag>/`, and all tags share a single tokenizer.

## Latest upload: `d18` at step 6000

| Depth | Step | Layers | d_model | Heads (Q/KV) | Vocab | Context |
|-------|------|--------|---------|--------------|-------|---------|
| `d18` | 6000 | 18 | 1152 | 9/9 | 32768 | 2048 |

Architecture: GPT with RoPE, QK-norm, GQA (with 9 query and 9 KV heads, `d18` is effectively
standard multi-head attention), relu² MLP, sliding-window pattern `SSSL`,
value embeddings (ResFormer-style), smear gate, and backout residual. Trained with the
[nanochat](https://github.com/tventurella/nanochat) fork.
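
The `d18` row implies a few derived quantities, worked out in the short sketch below. The head dimension and GQA group size follow directly from the table. The per-layer tiling of the `SSSL` pattern is only an assumption made for illustration: the real layer-to-window mapping is defined in the training code and may differ (for example in how the final layer is handled).

```python
# Values copied from the d18 row of the table above.
d_model, n_head, n_kv_head, n_layer = 1152, 9, 9, 18

head_dim = d_model // n_head        # width of each attention head
group_size = n_head // n_kv_head    # query heads that share one KV head

# ASSUMPTION: the pattern repeats layer by layer, with "S" meaning a short
# (sliding) window and "L" a long (full-context) window.
pattern = "SSSL"
layer_windows = "".join(pattern[i % len(pattern)] for i in range(n_layer))

print(f"head_dim={head_dim}, group_size={group_size}")
print(layer_windows)
```

A group size of 1 means every query head has its own KV head, so the GQA code path gives no KV-cache savings at this depth.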

## Repo layout

```
base_checkpoints/
  <tag>/
    model_<step>.pt    # model weights (torch state dict, bf16)
    meta_<step>.json   # GPTConfig + training metadata
tokenizer/
  tokenizer.pkl        # tiktoken BPE encoding (vocab 32768, rustbpe-trained)
  token_bytes.pt       # per-token byte tensors (needed by SFT dataloader)
```
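
To check which checkpoints are present in a local copy of this layout, a small helper along these lines may be useful. `list_checkpoints` is a hypothetical name, not part of nanochat. It makes no assumption about how `<step>` is formatted in the file names and simply pairs each `model_<step>.pt` with its `meta_<step>.json`.

```python
from pathlib import Path


def list_checkpoints(base_dir, tag):
    """Return (step, model_path, meta_path) for each complete checkpoint of a tag."""
    tag_dir = Path(base_dir).expanduser() / "base_checkpoints" / tag
    if not tag_dir.is_dir():
        return []
    found = []
    for model_path in sorted(tag_dir.glob("model_*.pt")):
        step = model_path.stem[len("model_"):]   # text between "model_" and ".pt"
        meta_path = tag_dir / f"meta_{step}.json"
        if meta_path.exists():                   # skip weights that lack metadata
            found.append((step, model_path, meta_path))
    return found


if __name__ == "__main__":
    for step, model_path, meta_path in list_checkpoints("~/.cache/nanochat", "d18"):
        print(step, model_path.name, meta_path.name)
```

The weights file can then be opened with `torch.load(model_path, map_location="cpu")` and the metadata with the standard `json` module.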

## Download and start SFT

```python
import os

from huggingface_hub import snapshot_download

# Fetch only the d18 checkpoint and the shared tokenizer into the nanochat base dir.
# local_dir_use_symlinks is deprecated and ignored by recent huggingface_hub
# releases, so it is omitted here.
snapshot_download(
    repo_id="fdeantoni/max-babbelaar-base",
    repo_type="model",
    allow_patterns=["base_checkpoints/d18/**", "tokenizer/**"],
    local_dir=os.path.expanduser("~/.cache/nanochat"),
)
```
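
To preview what the `allow_patterns` filter selects before downloading, the sketch below applies the same glob patterns with the standard-library `fnmatch` module, which is, to the best of my knowledge, the matching style `huggingface_hub` uses for these patterns. The file list is hypothetical; the real names depend on what was uploaded.

```python
from fnmatch import fnmatch

allow_patterns = ["base_checkpoints/d18/**", "tokenizer/**"]

# HYPOTHETICAL repo listing, for illustration only.
repo_files = [
    "README.md",
    "base_checkpoints/d18/model_6000.pt",
    "base_checkpoints/d18/meta_6000.json",
    "base_checkpoints/d24/model_6000.pt",
    "tokenizer/tokenizer.pkl",
    "tokenizer/token_bytes.pt",
]


def is_allowed(path, patterns):
    """True if the path matches at least one glob pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


selected = [path for path in repo_files if is_allowed(path, allow_patterns)]
print(selected)
```

Swapping `d18` for another tag in the first pattern restricts the download to that depth, while the tokenizer pattern stays the same.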

Then start SFT from the downloaded base checkpoint, replacing `N` with the number of GPUs to use:

```bash
NANOCHAT_BASE_DIR=~/.cache/nanochat \
torchrun --standalone --nproc_per_node=N \
    -m scripts.chat_sft \
    --model-tag=d18 \
    --sft-file /path/to/sft_train.jsonl
```

## Tokenizer

Custom GPT-4-style BPE tokenizer with vocab size 32768, trained on the Babbelaar corpus.
Special tokens: `<|bos|>` `<|user_start|>` `<|user_end|>` `<|assistant_start|>` `<|assistant_end|>`
`<|python_start|>` `<|python_end|>` `<|output_start|>` `<|output_end|>`.
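
As an illustration of how the chat-related special tokens might frame a conversation during SFT, the sketch below wraps each turn in its matching start/end pair after a single `<|bos|>`. This layout is an assumption inferred from the token names, and `render_turns` is a hypothetical helper. A real pipeline would insert the special tokens by ID rather than by tokenizing this string.

```python
def render_turns(turns):
    """Wrap (role, text) pairs in the special tokens listed above (assumed layout)."""
    parts = ["<|bos|>"]
    for role, text in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role}")
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return "".join(parts)


conversation = [("user", "Goedendag!"), ("assistant", "Aangenaam, mijnheer.")]
print(render_turns(conversation))
```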

The tokenizer is stored as a tiktoken pickle at `tokenizer/tokenizer.pkl`. Inside the nanochat project, load it with:

```python
# get_tokenizer() reads NANOCHAT_BASE_DIR/tokenizer/tokenizer.pkl
from nanochat.tokenizer import get_tokenizer

tokenizer = get_tokenizer()
```
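
Outside the nanochat project, the pickle can probably be opened directly, since it is described above as a tiktoken encoding. The sketch below assumes the default download location used earlier and that `tiktoken` is installed, which unpickling the real file requires. `load_encoding` is a hypothetical helper. Pickle files can execute arbitrary code when loaded, so only open files from a source you trust.

```python
import pickle
from pathlib import Path


def load_encoding(path="~/.cache/nanochat/tokenizer/tokenizer.pkl"):
    """Unpickle the tokenizer file and return the stored object."""
    with open(Path(path).expanduser(), "rb") as f:
        return pickle.load(f)  # only unpickle files you trust


# Example usage, assuming the file holds a tiktoken.Encoding:
#   enc = load_encoding()
#   ids = enc.encode_ordinary("Goedendag, mijnheer.")
#   text = enc.decode(ids)
#   print(enc.n_vocab)
```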