---
language:
- nl
- en
license: mit
tags:
- causal-lm
- historical
- dutch
- 19th-century
- nanochat
datasets:
- fdeantoni/max-babbelaar-corpus
---
# Max Babbelaar — Base Model
Pretrained base language model for **Max Babbelaar**, a bilingual (Dutch + English) character
modelled on a 19th-century Dutch gentleman. Trained on public-domain texts from 1750–1899:
DBNL, Delpher Kranten, DutchDraCor, Project Gutenberg Dutch, and British Library Books.
Full corpus details and token counts: [fdeantoni/max-babbelaar-corpus](https://huggingface.co/datasets/fdeantoni/max-babbelaar-corpus).
This repo holds base checkpoints for multiple model depths. Checkpoints for each depth tag
(`d18`, `d24`, …) live under `base_checkpoints/<tag>/`, and all tags share a single tokenizer.
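As a rough illustration of this layout, the sketch below picks the highest step for each tag from a list of repo-relative file paths. The function name `latest_steps` and the sample paths are hypothetical, and the step formatting in real file names (for example zero-padding) is an assumption; in practice the path list could come from `huggingface_hub.list_repo_files`.

```python
import re

# Matches paths such as "base_checkpoints/d18/model_6000.pt" (padded or unpadded step).
_CKPT = re.compile(r"^base_checkpoints/([^/]+)/model_(\d+)\.pt$")


def latest_steps(paths):
    """Return {tag: highest step} for every model checkpoint found in `paths`."""
    latest = {}
    for path in paths:
        match = _CKPT.match(path)
        if match is None:
            continue  # tokenizer files, meta JSON, README, and so on
        tag, step = match.group(1), int(match.group(2))
        latest[tag] = max(step, latest.get(tag, -1))
    return latest


# Hypothetical listing; the real repo contents may differ.
example = [
    "README.md",
    "base_checkpoints/d18/model_004000.pt",
    "base_checkpoints/d18/model_006000.pt",
    "base_checkpoints/d18/meta_006000.json",
    "tokenizer/tokenizer.pkl",
]
print(latest_steps(example))
```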
## Latest upload: `d18` at step 6000
| Depth | Step | Layers | d_model | Heads (Q/KV) | Vocab | Context |
|-------|------|--------|---------|--------------|-------|---------|
| `d18` | 6000 | 18 | 1152 | 9/9 | 32768 | 2048 |
Architecture: GPT with RoPE, QK-norm, GQA, relu² MLP, sliding-window pattern `SSSL`,
value embeddings (ResFormer-style), smear gate, and backout residual. Trained with the
[nanochat](https://github.com/tventurella/nanochat) fork.
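Some of the figures above can be checked with simple arithmetic. The sketch below is plain Python rather than the training code: it derives the per-head dimension and the GQA group size from the table, and shows what the relu² activation computes on scalars. The helper names are hypothetical.

```python
def head_dim(d_model: int, n_heads: int) -> int:
    """Per-head width, assuming d_model is split evenly across query heads."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // n_heads


def gqa_group_size(n_q_heads: int, n_kv_heads: int) -> int:
    """Number of query heads that share each key/value head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    return n_q_heads // n_kv_heads


def relu_squared(x: float) -> float:
    """relu²: zero for negative inputs, x squared otherwise."""
    return max(x, 0.0) ** 2


print(head_dim(1152, 9))       # d18: 1152 / 9 = 128
print(gqa_group_size(9, 9))    # 9/9 heads: one query head per KV head
print([relu_squared(v) for v in (-2.0, 0.0, 1.5)])
```

With 9 query heads and 9 KV heads the group size is 1, so for `d18` the GQA code path behaves like standard multi-head attention.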
## Repo layout
```
base_checkpoints/
  <tag>/
    model_<step>.pt    # model weights (torch state dict, bf16)
    meta_<step>.json   # GPTConfig + training metadata
tokenizer/
  tokenizer.pkl        # tiktoken BPE encoding (vocab 32768, rustbpe-trained)
  token_bytes.pt       # per-token byte tensors (needed by SFT dataloader)
```
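A minimal sketch of how these paths might be assembled locally and the metadata inspected. The helper names are hypothetical, and `step` is passed as a string exactly as it appears in the file name, since the padding used in real file names is not stated here.

```python
import json
from pathlib import Path


def checkpoint_files(base_dir, tag, step):
    """Return (weights_path, meta_path) for one checkpoint under base_dir."""
    ckpt_dir = Path(base_dir).expanduser() / "base_checkpoints" / tag
    return ckpt_dir / f"model_{step}.pt", ckpt_dir / f"meta_{step}.json"


def load_meta(meta_path):
    """Read the JSON metadata (GPTConfig + training info) as a dict."""
    with open(meta_path, encoding="utf-8") as fh:
        return json.load(fh)


# The step string "006000" is an assumed file-name form, not a confirmed one.
weights, meta = checkpoint_files("~/.cache/nanochat", "d18", "006000")
print(weights)
print(meta)
```

Once the files are downloaded, `load_meta(meta)` returns the stored configuration, which is a quick way to confirm depth and vocab size before loading the weights.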
## Download and resume SFT
```python
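import os

from huggingface_hub import snapshot_download

# Fetch only the d18 checkpoint and the shared tokenizer into the nanochat base dir.
# Note: huggingface_hub releases before 0.23 may also need local_dir_use_symlinks=False
# to write real files instead of symlinks; newer releases deprecate that argument.
snapshot_download(
    repo_id="fdeantoni/max-babbelaar-base",
    repo_type="model",
    allow_patterns=["base_checkpoints/d18/**", "tokenizer/**"],
    local_dir=os.path.expanduser("~/.cache/nanochat"),
)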
```
Then resume SFT from the downloaded checkpoint, replacing `N` with the number of GPUs on the node:
```bash
NANOCHAT_BASE_DIR=~/.cache/nanochat \
  torchrun --standalone --nproc_per_node=N \
    -m scripts.chat_sft \
    --model-tag=d18 \
    --sft-file /path/to/sft_train.jsonl
```
## Tokenizer
Custom GPT-4-style BPE tokenizer with vocab size 32768, trained on the Babbelaar corpus.
Special tokens: `<|bos|>` `<|user_start|>` `<|user_end|>` `<|assistant_start|>` `<|assistant_end|>`
`<|python_start|>` `<|python_end|>` `<|output_start|>` `<|output_end|>`.
Stored as a tiktoken pickle at `tokenizer/tokenizer.pkl`. Load within the nanochat project with:
```python
from nanochat.tokenizer import get_tokenizer # reads NANOCHAT_BASE_DIR/tokenizer/tokenizer.pkl
tokenizer = get_tokenizer()
```
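The special tokens listed above suggest a chat layout along the following lines. This is only a sketch under assumptions: the exact turn layout used by the SFT script is not documented here, the function name `render_chat` is hypothetical, and a real pipeline would normally insert the special tokens as token ids rather than as literal text.

```python
def render_chat(turns):
    """Join (role, text) turns into one string wrapped in the special tokens.

    Assumed layout: <|bos|> followed by <|role_start|>text<|role_end|> per turn.
    """
    parts = ["<|bos|>"]
    for role, text in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return "".join(parts)


print(render_chat([
    ("user", "Goedendag!"),
    ("assistant", "Goeden dag, mijnheer."),
]))
```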