metadata
license: mit
library_name: tokenizers
tags:
- tokenizer
- bpe
- byte-level
- multilingual
- code
- unitronx
{DISPLAY_NAME}
UnitronX is a byte-level BPE tokenizer with a 32k-token vocabulary, optimized for English, other languages (ru/ar/de/es/fr/it/cs/hr/sr), and code. It enforces safe merge boundaries at script changes, ZWJ, and letter↔digit transitions, preserves code identifiers, and uses placeholder tokens for URLs, emails, paths, hashes, UUIDs, handles, and hashtags.
Files
- tokenizer.json, merges.txt, vocab.json
- tokenizer_config.json, special_tokens_map.json
- meta.json
- (optional) unitronx.tiktoken.json (tiktoken-compatible)
Load with Transformers
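from transformers import AutoTokenizer

# Load the tokenizer by its repository name.
tok = AutoTokenizer.from_pretrained("UnitronX-Tokenizer-32k-v1")
# Encode a sample with a contraction, a hyphenated word, and a code identifier.
print(tok.encode("don't split hyphen-words or fooBar123_id in code!"))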
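After loading, a quick round-trip check can show whether a given input survives encoding and decoding unchanged. The helper below is a hypothetical sketch, not part of this repository. It only assumes a tokenizer object with `encode(text, add_special_tokens=False)` and `decode(ids)`, which Transformers tokenizers provide. Inputs that contain URLs, emails, or other placeholder-mapped spans may legitimately come back different.

```python
def roundtrip_report(tok, text):
    """Encode `text`, decode the ids back, and report whether the text is unchanged.

    `tok` is any object exposing `encode(text, add_special_tokens=False)`
    and `decode(ids)`, such as a Transformers tokenizer.
    """
    # Skip BOS/EOS-style tokens so the decoded string is comparable to the input.
    ids = tok.encode(text, add_special_tokens=False)
    decoded = tok.decode(ids)
    return {"ids": ids, "decoded": decoded, "lossless": decoded == text}
```

For example, `roundtrip_report(tok, "fooBar123_id")` returns the token ids, the decoded string, and a `lossless` flag for the tokenizer loaded above.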