Merlin
tokenizer
merlin-tokenizer-v0 / README.md
tsuberim's picture
v0: tokenizer trained on pretraining corpus
7aea5a3 verified
metadata
license: apache-2.0
tags:
  - merlin
  - tokenizer

merlin-tokenizer-v0

BPE tokenizer for Merlin, trained on Python, Bash, Markdown, Stack Exchange Q&A, GitHub commits/issues, and tldr-pages.

Vocab

  • 32,016 tokens total: 32,000 BPE + 16 special tokens
  • Special tokens:
    • IDs 0–13: legacy slots (unused)
    • ID 32000: <|bos|>
    • ID 32001: <|eos|>
    • ID 32002–32013: agent protocol tokens (<|tool_call|>, <|/tool_call|>, etc.)
    • ID 32014: <|done|>
    • ID 32015: <|pad|>

Usage

from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("def hello(): pass").ids

Note: Will be retrained after agentic trace generation. This is v0 — trained on pretraining corpus only.