tsuberim/merlin-tokenizer-v0


	---
	license: apache-2.0
	tags:
	- merlin
	- tokenizer
	---

	# merlin-tokenizer-v0

	BPE tokenizer for [Merlin](https://github.com/tsuberim/mllm), trained on Python, Bash, Markdown,
	Stack Exchange Q&A, GitHub commits/issues, and tldr-pages.

	## Vocab

	- 32,016 tokens total: 32,000 BPE + 16 special tokens
	- Special tokens:
	  - IDs 0–13: legacy slots (unused)
	  - ID 32000: `<|bos|>`
	  - ID 32001: `<|eos|>`
	  - IDs 32002–32013: agent protocol tokens (`<|tool_call|>`, `<|/tool_call|>`, etc.)
	  - ID 32014: `<|done|>`
	  - ID 32015: `<|pad|>`
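
The ID layout above can be sketched as a small lookup. This is only an illustration, assuming the listed ranges are inclusive; the constant and function names (`SPECIAL_RANGES`, `classify_id`) are hypothetical and not part of the tokenizer files.

```python
# Illustrative constants copied from the ID list above; all names are hypothetical.
TOTAL_VOCAB_SIZE = 32_016
BPE_VOCAB_SIZE = 32_000
LEGACY_MAX_ID = 13  # IDs 0-13 are described as unused legacy slots

# Inclusive (low, high) ID ranges for each special-token group.
SPECIAL_RANGES = {
    "bos": (32000, 32000),
    "eos": (32001, 32001),
    "agent_protocol": (32002, 32013),
    "done": (32014, 32014),
    "pad": (32015, 32015),
}


def classify_id(token_id: int) -> str:
    """Return the group a token ID falls into, per the list above."""
    if not 0 <= token_id < TOTAL_VOCAB_SIZE:
        raise ValueError(f"token ID {token_id} is outside the vocabulary")
    if token_id <= LEGACY_MAX_ID:
        return "legacy"
    if token_id < BPE_VOCAB_SIZE:
        return "bpe"
    for name, (low, high) in SPECIAL_RANGES.items():
        if low <= token_id <= high:
            return name
    raise AssertionError("ranges should cover every ID from 32000 to 32015")


# Count the special tokens implied by the ranges.
num_special = sum(high - low + 1 for low, high in SPECIAL_RANGES.values())
```

Summing the ranges gives 16 special tokens, which together with the 32,000 lower IDs matches the stated total of 32,016.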

	## Usage

	```python
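	from tokenizers import Tokenizer

	# Load the tokenizer from a local copy of tokenizer.json
	tok = Tokenizer.from_file("tokenizer.json")
	# Encode a string and read back its token IDs
	ids = tok.encode("def hello(): pass").ids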
	```
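
When batching encoded sequences, the special-token IDs from the Vocab section can be applied by hand. The sketch below is a minimal illustration, assuming sequences are framed as `<|bos|>` ... `<|eos|>` and right-padded with `<|pad|>`; this card does not state which framing or padding side Merlin expects, and `frame_and_pad` is a hypothetical helper.

```python
# Special-token IDs as listed in the Vocab section.
BOS_ID = 32000
EOS_ID = 32001
PAD_ID = 32015


def frame_and_pad(ids, max_len):
    """Wrap token IDs with BOS/EOS and right-pad to max_len.

    Framing and right-padding are assumptions, not documented behavior.
    """
    framed = [BOS_ID] + list(ids) + [EOS_ID]
    if len(framed) > max_len:
        raise ValueError(f"sequence of length {len(framed)} exceeds max_len={max_len}")
    return framed + [PAD_ID] * (max_len - len(framed))


# Placeholder IDs standing in for the output of tok.encode(...).ids
example = frame_and_pad([10, 11, 12], max_len=8)
```

Here `example` has length 8: one BOS, the three placeholder IDs, one EOS, and three pad tokens.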

	> Note: This is v0, trained on the pretraining corpus only. It will be retrained after agentic trace generation.