---
license: apache-2.0
tags:
- merlin
- tokenizer
---

# merlin-tokenizer-v0

BPE tokenizer for [Merlin](https://github.com/tsuberim/mllm), trained on Python, Bash, Markdown, Stack Exchange Q&A, GitHub commits/issues, and tldr-pages.

## Vocab

- 32,016 tokens total: 32,000 BPE + 16 special tokens
- Special tokens:
  - IDs 0–13: legacy slots (unused)
  - ID 32000: `<|bos|>`
  - ID 32001: `<|eos|>`
  - IDs 32002–32013: agent protocol tokens (`<|tool_call|>`, `<|/tool_call|>`, etc.)
  - ID 32014: `<|done|>`
  - ID 32015: `<|pad|>`

## Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer from a local copy of tokenizer.json
tok = Tokenizer.from_file("tokenizer.json")
# Encode a string and read back its token IDs
ids = tok.encode("def hello(): pass").ids
```

> **Note:** This is v0, trained on the pretraining corpus only. It will be retrained after agentic trace generation.
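
The special-token layout listed under Vocab can be written down as constants and used to frame a sequence. This is a minimal sketch under stated assumptions: the `frame` helper is hypothetical, and whether Merlin expects `<|bos|>`/`<|eos|>` framing or right-padding is assumed here, not documented in this card. The input IDs `10, 20, 30` are placeholders, not real encodings.

```python
# Special-token IDs as listed in the Vocab section of this card.
BOS_ID = 32000   # <|bos|>
EOS_ID = 32001   # <|eos|>
DONE_ID = 32014  # <|done|>
PAD_ID = 32015   # <|pad|>
VOCAB_SIZE = 32016  # 32,000 BPE + 16 special tokens


def frame(ids, max_len):
    """Hypothetical helper: wrap ids in BOS/EOS, then right-pad to max_len.

    The framing and padding side are assumptions for illustration only.
    """
    framed = [BOS_ID, *ids, EOS_ID]
    if len(framed) > max_len:
        raise ValueError(f"sequence of {len(framed)} tokens exceeds max_len={max_len}")
    return framed + [PAD_ID] * (max_len - len(framed))


# Placeholder token IDs standing in for the output of tok.encode(...).ids
example = frame([10, 20, 30], max_len=8)
print(example)
```

With the real tokenizer loaded, `tok.token_to_id("<|bos|>")` can be compared against `BOS_ID` to check that the file matches the table above.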