---
license: mit
library_name: tokenizers
tags:
- tokenizer
- bpe
- byte-level
- multilingual
- code
- unitronx
---
# UnitronX-Tokenizer-32k-v1
**UnitronX** is a 32k byte-level BPE tokenizer optimized for English, multilingual (ru/ar/de/es/fr/it/cs/hr/sr), and code.
It enforces safe merge boundaries (script changes, ZWJ, letter↔digit transitions), preserves code identifiers, and maps
URLs, emails, paths, hashes, UUIDs, handles, and hashtags to placeholder tokens.
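To illustrate the placeholder idea, here is a minimal stdlib-only sketch of regex-based pre-normalization. The patterns and the `<URL>`/`<EMAIL>` token names are hypothetical simplifications; the actual special tokens and matching rules are defined by the tokenizer itself.

```python
import re

# Hypothetical patterns for illustration; the shipped tokenizer's rules
# cover more categories (paths, hashes, UUIDs, handles, hashtags).
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def normalize_placeholders(text: str) -> str:
    """Collapse long, high-entropy spans into single placeholder tokens
    so they never fragment into many rare subwords."""
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    return text

print(normalize_placeholders("mail a.b@example.com or see https://example.com/x"))
```

Replacing such spans before BPE keeps the vocabulary budget focused on reusable subwords instead of one-off strings.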
## Files
- `tokenizer.json`, `merges.txt`, `vocab.json`
- `tokenizer_config.json`, `special_tokens_map.json`
- `meta.json`
- *(optional)* `unitronx.tiktoken.json` (tiktoken-compatible)
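As a rough guide to how `merges.txt` drives tokenization, the sketch below applies merge rules in priority order to a single word. The entries shown are made up, and real BPE implementations (e.g. the `tokenizers` library, which reads `tokenizer.json` directly) are more efficient than this toy loop.

```python
# Toy BPE merge application; merges.txt stores one "left right" pair per
# line, highest priority first. These entries are hypothetical.
merges = ["h e", "he l", "hel l", "hell o"]

def bpe(word: str, merges: list[str]) -> list[str]:
    symbols = list(word)
    for m in merges:                       # apply each merge in rank order
        a, b = m.split()
        out, i = [], 0
        while i < len(symbols):            # single left-to-right pass
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(bpe("hello", merges))  # → ['hello']
```

`vocab.json` then maps each resulting symbol to its integer id.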
## Load with Transformers
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("UnitronX-Tokenizer-32k-v1")
print(tok.encode("don't split hyphen-words or fooBar123_id in code!"))
```
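Because the vocabulary is byte-level, every input decomposes into UTF-8 bytes before merging, so no text can fall outside the vocabulary. A quick stdlib check of that underlying guarantee:

```python
# Byte-level BPE starts from the 256 possible byte values, so any text,
# including emoji and rare scripts, tokenizes without an <unk> token.
text = "Grüße, 日本語, 🙂"
byte_units = list(text.encode("utf-8"))
print(len(byte_units))  # more base units than visible characters
assert bytes(byte_units).decode("utf-8") == text  # lossless round-trip
```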