---
license: mit
library_name: tokenizers
tags:
- tokenizer
- bpe
- byte-level
- multilingual
- code
- unitronx
---

# {DISPLAY_NAME}

**UnitronX** is a 32k byte-level BPE tokenizer optimized for English, multilingual text (ru/ar/de/es/fr/it/cs/hr/sr), and code. It enforces safe merge boundaries (script changes, ZWJ, letter↔digit transitions), preserves code identifiers, and uses placeholder tokens for URLs, emails, paths, hashes, UUIDs, handles, and hashtags.
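The merge-boundary rule can be illustrated with a small sketch. The helper below is hypothetical, not UnitronX's actual code: it rejects a merge that would join a letter to a digit or attach across a zero-width joiner, which are the kinds of boundaries described above.

```python
ZWJ = "\u200d"  # zero-width joiner

def is_safe_merge(left: str, right: str) -> bool:
    """Hypothetical merge guard: reject merges that cross a
    letter<->digit transition or touch a zero-width joiner."""
    if not left or not right:
        return False
    a, b = left[-1], right[0]
    if a == ZWJ or b == ZWJ:
        return False
    if (a.isalpha() and b.isdigit()) or (a.isdigit() and b.isalpha()):
        return False
    return True

print(is_safe_merge("foo", "bar"))   # merging two letter chunks is fine
print(is_safe_merge("abc", "123"))   # blocked: letter->digit boundary
```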
|
|
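The placeholder pass can be sketched as a pre-tokenization substitution. The `<URL>`/`<EMAIL>` token names and the regexes here are assumptions for illustration, not the tokenizer's actual patterns:

```python
import re

# Hypothetical pre-tokenization pass: collapse URLs and emails into single
# placeholder tokens so rare strings do not fragment into many byte merges.
PLACEHOLDER_PATTERNS = [
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
]

def apply_placeholders(text: str) -> str:
    for pattern, token in PLACEHOLDER_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(apply_placeholders("contact a@b.com or see https://example.com/x"))
# → "contact <EMAIL> or see <URL>"
```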
|
|
|
## Files

- `tokenizer.json`, `merges.txt`, `vocab.json`
- `tokenizer_config.json`, `special_tokens_map.json`
- `meta.json`
- *(optional)* `unitronx.tiktoken.json` (tiktoken-compatible)
|
|
|
|
|
## Load with Transformers

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("UnitronX-Tokenizer-32k-v1")
print(tok.encode("don't split hyphen-words or fooBar123_id in code!"))
```