Merlin
tokenizer
merlin-tokenizer-v0 / README.md
tsuberim's picture
v0: tokenizer trained on pretraining corpus
7aea5a3 verified
---
license: apache-2.0
tags:
- merlin
- tokenizer
---
# merlin-tokenizer-v0
BPE tokenizer for [Merlin](https://github.com/tsuberim/mllm), trained on Python, Bash, Markdown,
Stack Exchange Q&A, GitHub commits/issues, and tldr-pages.
## Vocab
- 32,016 tokens total: 32,000 BPE + 16 special tokens
- Special tokens:
- IDs 0–13: legacy slots (unused)
- ID 32000: `<|bos|>`
- ID 32001: `<|eos|>`
- ID 32002–32013: agent protocol tokens (`<|tool_call|>`, `<|/tool_call|>`, etc.)
- ID 32014: `<|done|>`
- ID 32015: `<|pad|>`
## Usage
```python
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("def hello(): pass").ids
```
> **Note:** Will be retrained after agentic trace generation. This is v0 — trained on pretraining corpus only.