---
license: mit
library_name: tokenizers
tags:
- tokenizer
- bpe
- byte-level
- multilingual
- code
- unitronx
---

# UnitronX Tokenizer 32k v1

**UnitronX** is a 32k byte-level BPE tokenizer optimized for English, multilingual text (ru/ar/de/es/fr/it/cs/hr/sr), and code. It enforces safe merge boundaries (script changes, ZWJ, letter↔digit), preserves code identifiers, and uses placeholder tokens for URLs/emails/paths/hashes/UUIDs/handles/hashtags.

## Files

- `tokenizer.json`, `merges.txt`, `vocab.json`
- `tokenizer_config.json`, `special_tokens_map.json`
- `meta.json`
- *(optional)* `unitronx.tiktoken.json` (tiktoken-compatible)

## Load with Transformers

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("UnitronX-Tokenizer-32k-v1")
print(tok.encode("don't split hyphen-words or fooBar123_id in code!"))
```
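
## Byte-level BPE in a nutshell

For readers unfamiliar with the format of the files above, here is a minimal, self-contained sketch of how a byte-level BPE tokenizer is built and used with the Hugging Face `tokenizers` library. The corpus and vocabulary size below are illustrative only; this is **not** the recipe used to train UnitronX.

```python
# Minimal sketch: train a tiny byte-level BPE tokenizer with the
# Hugging Face `tokenizers` library. Corpus, vocab size, and special
# tokens here are illustrative, not the UnitronX training setup.
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

tokenizer = Tokenizer(models.BPE())
# Byte-level pre-tokenization maps raw bytes to printable symbols,
# so any input (any script, emoji, binary-ish text) is representable.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=512, special_tokens=["<unk>"])
corpus = ["don't split hyphen-words or fooBar123_id in code!"] * 100
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Byte-level BPE round-trips losslessly: decode(encode(x)) == x.
ids = tokenizer.encode("fooBar123_id").ids
print(tokenizer.decode(ids))
```

Because the model is byte-level, there is no true out-of-vocabulary text: unseen characters fall back to byte tokens instead of `<unk>`, which is what makes the multilingual and code coverage described above possible with only 32k merges.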