	---
	language:
	- nl
	- en
	license: mit
	tags:
	- causal-lm
	- historical
	- dutch
	- 19th-century
	- nanochat
	datasets:
	- fdeantoni/max-babbelaar-corpus
	---

	# Max Babbelaar — Base Model

	Pretrained base language model for Max Babbelaar, a bilingual (Dutch + English) character
	modelled on a 19th-century Dutch gentleman. Trained on public-domain texts from 1750–1899:
	DBNL, Delpher Kranten, DutchDraCor, Project Gutenberg Dutch, and British Library Books.
	Full corpus details and token counts: [fdeantoni/max-babbelaar-corpus](https://huggingface.co/datasets/fdeantoni/max-babbelaar-corpus).

	This repo holds base checkpoints for multiple model depths. Each tag (`d18`, `d24`, …)
	lives under `base_checkpoints/<tag>/` and shares a single tokenizer.
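To see which depth tags a given file listing contains, one could group paths by their `base_checkpoints/<tag>/` prefix. The sketch below is only an illustration: `tags_from_paths` is a hypothetical helper (not part of nanochat or `huggingface_hub`), and the example paths are placeholders that mimic the layout rather than real filenames.

```python
def tags_from_paths(paths):
    """Return the sorted tag names found under base_checkpoints/<tag>/."""
    tags = set()
    for path in paths:
        parts = path.split("/")
        # Expect at least: base_checkpoints / <tag> / <file>
        if len(parts) >= 3 and parts[0] == "base_checkpoints":
            tags.add(parts[1])
    return sorted(tags)


# Placeholder listing that follows the layout described in this card.
example_paths = [
    "README.md",
    "base_checkpoints/d18/model_<step>.pt",
    "base_checkpoints/d18/meta_<step>.json",
    "base_checkpoints/d24/model_<step>.pt",
    "tokenizer/tokenizer.pkl",
]
print(tags_from_paths(example_paths))
```

With `huggingface_hub` installed and network access, the list returned by `list_repo_files("fdeantoni/max-babbelaar-base", repo_type="model")` should be usable as the `paths` argument.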

	## Latest upload: `d18` at step 6000

	| Depth | Step | Layers | d_model | Heads (Q/KV) | Vocab | Context |
	|-------|------|--------|---------|--------------|-------|---------|
	| `d18` | 6000 | 18 | 1152 | 9/9 | 32768 | 2048 |

	Architecture: GPT with RoPE, QK-norm, GQA, relu² MLP, sliding-window pattern `SSSL`,
	value embeddings (ResFormer-style), smear gate, and backout residual. Trained with the
	[nanochat](https://github.com/tventurella/nanochat) fork.
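The table implies a per-head width that is not listed explicitly. A small sketch of the arithmetic, assuming the usual convention that `d_model` is split evenly across the query heads (the variable names are illustrative, not fields read from `GPTConfig`):

```python
d_model = 1152    # from the d18 row of the table
n_q_heads = 9     # query heads
n_kv_heads = 9    # key/value heads

# Assumption: d_model is divided evenly across the query heads.
assert d_model % n_q_heads == 0
head_dim = d_model // n_q_heads

# Number of query heads that share each key/value head under GQA.
q_per_kv = n_q_heads // n_kv_heads

print(head_dim)   # 128
print(q_per_kv)   # 1
```

With a 9/9 split, each query head has its own key/value head, so at this depth the GQA layout coincides with ordinary multi-head attention.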

	## Repo layout

	```
	base_checkpoints/
	  <tag>/
	    model_<step>.pt    # model weights (torch state dict, bf16)
	    meta_<step>.json   # GPTConfig + training metadata
	tokenizer/
	  tokenizer.pkl        # tiktoken BPE encoding (vocab 32768, rustbpe-trained)
	  token_bytes.pt       # per-token byte tensors (needed by SFT dataloader)
	```

	## Download and resume SFT

	```python
	from huggingface_hub import snapshot_download
	import os

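	snapshot_download(
	    repo_id="fdeantoni/max-babbelaar-base",
	    repo_type="model",
	    # Fetch only the d18 checkpoint files and the shared tokenizer.
	    allow_patterns=["base_checkpoints/d18/*", "tokenizer/*"],
	    # This directory is later passed to nanochat as NANOCHAT_BASE_DIR.
	    local_dir=os.path.expanduser("~/.cache/nanochat"),
	    # Legacy flag, deprecated in recent huggingface_hub releases;
	    # drop it if your installed version rejects it.
	    local_dir_use_symlinks=False,
	)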
	```
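Before launching SFT it can help to confirm that the checkpoint files landed where expected. The following is a minimal sketch, assuming the layout described under "Repo layout" and that `<step>` appears as decimal digits in the filename; `find_steps` is a hypothetical helper, not a nanochat function.

```python
import re
from pathlib import Path


def find_steps(base_dir, tag):
    """Return sorted steps that have both model_<step>.pt and meta_<step>.json."""
    ckpt_dir = Path(base_dir).expanduser() / "base_checkpoints" / tag
    steps = []
    for model_file in ckpt_dir.glob("model_*.pt"):
        match = re.fullmatch(r"model_(\d+)\.pt", model_file.name)
        if match is None:
            continue
        digits = match.group(1)
        # Only count a step when its metadata file is present as well.
        if (ckpt_dir / f"meta_{digits}.json").exists():
            steps.append(int(digits))
    return sorted(steps)


# An empty list means nothing usable was found for this tag.
print(find_steps("~/.cache/nanochat", "d18"))
```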

	Then resume SFT from the restored checkpoint:

	```bash
	# Replace N with the number of processes to launch (typically one per GPU).
	NANOCHAT_BASE_DIR=~/.cache/nanochat \
	torchrun --standalone --nproc_per_node=N \
	    -m scripts.chat_sft \
	    --model-tag=d18 \
	    --sft-file /path/to/sft_train.jsonl
	```

	## Tokenizer

	Custom GPT-4-style BPE tokenizer with vocab size 32768, trained on the Babbelaar corpus.
	Special tokens: `<|bos|>` `<|user_start|>` `<|user_end|>` `<|assistant_start|>` `<|assistant_end|>`
	`<|python_start|>` `<|python_end|>` `<|output_start|>` `<|output_end|>`.

	Stored as a tiktoken pickle at `tokenizer/tokenizer.pkl`. Load within the nanochat project with:

	```python
	from nanochat.tokenizer import get_tokenizer # reads NANOCHAT_BASE_DIR/tokenizer/tokenizer.pkl
	tokenizer = get_tokenizer()
	```
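To give a feel for what the special tokens are for, here is a string-level sketch of how the user and assistant delimiters might frame a single exchange. This is an assumption inferred from the token names: the exact conversation layout used during SFT is defined by nanochat and may differ, and `render_turn` is a hypothetical helper written only for this illustration.

```python
def render_turn(user_text, assistant_text):
    """Wrap one user/assistant exchange in the special tokens listed above."""
    pieces = [
        "<|bos|>",
        "<|user_start|>", user_text, "<|user_end|>",
        "<|assistant_start|>", assistant_text, "<|assistant_end|>",
    ]
    return "".join(pieces)


# Illustrative text only; not taken from the training data.
print(render_turn("Goedendag!", "Goedendag, waarde heer."))
```

In practice the special tokens are normally mapped to single ids by the tokenizer rather than spelled out and encoded as ordinary text, so this string form is best read as a picture of the structure, not as input to feed to the model.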