---
license: apache-2.0
tags:
- merlin
- tokenizer
---

# merlin-tokenizer-v0

BPE tokenizer for [Merlin](https://github.com/tsuberim/mllm), trained on Python, Bash, Markdown,
Stack Exchange Q&A, GitHub commits/issues, and tldr-pages.

## Vocab

- 32,016 tokens total: 32,000 BPE + 16 special tokens
- Special tokens:
  - IDs 0–13: legacy slots (unused)
  - ID 32000: `<|bos|>`
  - ID 32001: `<|eos|>`
  - IDs 32002–32013: agent protocol tokens (`<|tool_call|>`, `<|/tool_call|>`, etc.)
  - ID 32014: `<|done|>`
  - ID 32015: `<|pad|>`

## Usage

```python
from tokenizers import Tokenizer

# Load the tokenizer from a local copy of tokenizer.json
tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("def hello(): pass").ids  # list of integer token IDs
```

> **Note:** This is v0, trained on the pretraining corpus only. It will be retrained after agentic trace generation.
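
The special-token IDs listed under Vocab can be used directly when preparing model inputs. The following is a minimal sketch in plain Python, not part of the Merlin codebase: the helper names `frame_and_pad` and `special_kind` are hypothetical, and wrapping a sequence in `<|bos|>`/`<|eos|>` with right-padding is an assumption for illustration rather than a documented Merlin input format. Only the numeric IDs come from this card.

```python
# Special-token IDs as listed in the Vocab section of this card.
BOS_ID = 32000   # <|bos|>
EOS_ID = 32001   # <|eos|>
DONE_ID = 32014  # <|done|>
PAD_ID = 32015   # <|pad|>
AGENT_IDS = range(32002, 32014)  # agent protocol tokens, 32002..32013 inclusive
LEGACY_IDS = range(0, 14)        # legacy slots (unused), 0..13 inclusive


def frame_and_pad(ids, max_len):
    """Wrap ids in BOS/EOS, then right-pad with PAD_ID up to max_len.

    Hypothetical helper: the framing and the right-padding are assumptions,
    not a format confirmed by the Merlin repository.
    """
    framed = [BOS_ID, *ids, EOS_ID]
    if len(framed) > max_len:
        raise ValueError(f"sequence needs {len(framed)} slots, max_len is {max_len}")
    return framed + [PAD_ID] * (max_len - len(framed))


def special_kind(token_id):
    """Return the special-token group an ID belongs to, or None otherwise."""
    if token_id == BOS_ID:
        return "bos"
    if token_id == EOS_ID:
        return "eos"
    if token_id in AGENT_IDS:
        return "agent"
    if token_id == DONE_ID:
        return "done"
    if token_id == PAD_ID:
        return "pad"
    if token_id in LEGACY_IDS:
        return "legacy"
    return None


# The IDs 5000..5002 are arbitrary placeholders standing in for encoded text.
row = frame_and_pad([5000, 5001, 5002], max_len=8)
print(row)  # [32000, 5000, 5001, 5002, 32001, 32015, 32015, 32015]
```

In practice the `ids` argument would be the list returned by `tok.encode(...).ids` from the Usage snippet. Since the card states that the tokenizer will be retrained, hard-coded IDs like these should be rechecked against any later version.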