---
language:
- en
license: mit
---

# char128-shift Tokenizer

A fixed-size Hugging Face–compatible **character tokenizer** with a dedicated **SHIFT** token (`↨`) that represents uppercase letters. Instead of assigning separate tokens to uppercase `A–Z`, each uppercase letter is encoded as `↨` followed by its lowercase form (e.g., `H` → `↨h`).

This repository contains the ready-to-use tokenizer, which can be loaded with `AutoTokenizer`, as well as the script that built it (in the `src/` folder).

---

## Features

* **Fixed 128-token vocabulary** (including specials).
* **Uppercase encoding via a SHIFT token**: no separate uppercase letters in the vocab.
* **WordLevel model** with an explicit, closed character set.
* **Pre-tokenizer** splits on Unicode grapheme clusters (`\X`), so emoji and diacritics are preserved.
* **Normalizer** explicitly maps `A–Z` → `↨` + lowercase.
* **Decoder** concatenates tokens directly (no extra spaces). A sketch of how these pieces fit together follows below.
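
These map onto standard `tokenizers` building blocks. The following is a minimal, illustrative sketch; the actual build script lives in `src/`, and the exact character set below is an assumption:

```python
from tokenizers import Tokenizer, Regex, decoders, models, normalizers, pre_tokenizers

SHIFT = "↨"

# Closed character set (illustrative only; the real list is defined in src/).
specials = ["<pad>", "<unk>", "<bos>", "<eos>"]
chars = [SHIFT] + list("abcdefghijklmnopqrstuvwxyz0123456789 \n\t.,;:!?'\"-()")
vocab = {t: i for i, t in enumerate(specials + chars)}

tokenizer = Tokenizer(models.WordLevel(vocab, unk_token="<unk>"))

# Normalizer: rewrite every ASCII uppercase letter as SHIFT + lowercase.
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.Replace(chr(ord("A") + k), SHIFT + chr(ord("a") + k)) for k in range(26)]
)

# Pre-tokenizer: split into Unicode grapheme clusters so emoji/diacritics survive.
tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex(r"\X"), behavior="isolated")

# Decoder: plain concatenation, no inserted spaces.
tokenizer.decoder = decoders.Sequence([])

# Register specials so they are matched as single tokens during encoding.
tokenizer.add_special_tokens(specials)

tokenizer.save("tokenizer.json")
```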

---

## Installation

You only need `transformers` (for the Python interface) and, optionally, `tokenizers` (for building the tokenizer yourself).

```bash
pip install "transformers>=4.40" "tokenizers>=0.14"
```

(Quoting the version specifiers keeps the shell from treating `>` as a redirect.)

No PyTorch/TensorFlow/Flax is required to use the tokenizer itself.

---

## Usage

```python
from transformers import AutoTokenizer

# Use your own repo id here if you host a copy of the tokenizer.
tok = AutoTokenizer.from_pretrained("Corianas/char128_shift_tokenizer")

print(tok.vocab_size)  # 128
ids = tok.encode("Hello, There!\n<eos>")
print(ids)
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
# → "↨hello, ↨there!\n<eos>"
```

---

## Restoring Uppercase

The decoded output still contains SHIFT markers (e.g., `↨h`). For display, restore the casing:

```python
def restore_uppercase(s: str, shift="↨"):
    """Collapse each SHIFT + character pair back into an uppercase character."""
    out, i, n = [], 0, len(s)
    while i < n:
        if s[i] == shift and i + 1 < n and s[i + 1] != shift:
            # SHIFT followed by a regular character: emit its uppercase form.
            out.append(s[i + 1].upper())
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

ids = tok.encode("Hello, There!\n<eos>")
decoded = tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded)                     # "↨hello, ↨there!\n<eos>"
print(restore_uppercase(decoded))  # "Hello, There!\n<eos>"
```

---

## Vocabulary

The 128 tokens include:

* **Lowercase letters** `a–z`
* **Digits** `0–9`
* **Whitespace** (space, `\n`, `\t`)
* **Punctuation and symbols** (configurable)
* **Diacritics** such as `è` and `é`, if needed
* **Special tokens** `<pad>`, `<unk>`, `<bos>`, `<eos>`
* **SHIFT token** `↨`

Uppercase `A–Z` are **not** in the vocab; they are represented via SHIFT.
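
You can verify this from Python; the expected values below follow from the description above:

```python
vocab = tok.get_vocab()  # token string -> id
print(len(vocab))        # 128
print("a" in vocab)      # True
print("A" in vocab)      # False: uppercase is spelled SHIFT + lowercase
print("↨" in vocab)      # True
```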

---

## Integration

For dataset preparation (writing the token ids out as binary shards):

```python
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Corianas/char128_shift_tokenizer")

with open("input.txt", "r", encoding="utf-8") as f:
    data = f.read()
n = len(data)
train_txt, val_txt = data[:int(0.9 * n)], data[int(0.9 * n):]

train_ids = tok.encode(train_txt)
val_ids = tok.encode(val_txt)

# uint16 comfortably covers ids 0..127.
np.array(train_ids, dtype=np.uint16).tofile("train.bin")
np.array(val_ids, dtype=np.uint16).tofile("val.bin")
```
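
To read the shards back later (e.g., in a training loop), use the same dtype they were written with; this mirrors the snippet above:

```python
import numpy as np

train_ids = np.fromfile("train.bin", dtype=np.uint16)
val_ids = np.fromfile("val.bin", dtype=np.uint16)
print(train_ids.dtype, len(train_ids), len(val_ids))
```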

Your model's `vocab_size` must match the tokenizer's (128).

---

## Known Edge Cases

* **Non-ASCII uppercase** letters (such as `À`, `É`) are lowercased without a SHIFT marker unless you add explicit rules; see the sketch below.
* **Spaces in decode** are disabled by using a concatenating decoder; if you see extra spaces, make sure your tokenizer was saved with `tok.decoder = decoders.Sequence([])`.
* **Unknown characters** map to `<unk>`. Make sure your vocab includes everything you expect.
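
One way to handle accented capitals is to extend the normalizer with explicit rules when building the tokenizer. This is a sketch, not part of this repo's build script, and the mapping is illustrative:

```python
from tokenizers import normalizers

SHIFT = "↨"

# Hypothetical extra rules: accented capitals also become SHIFT + lowercase.
accent_map = {"À": "à", "É": "é", "È": "è"}  # extend to match your corpus
accent_rules = [normalizers.Replace(u, SHIFT + l) for u, l in accent_map.items()]
ascii_rules = [
    normalizers.Replace(chr(ord("A") + k), SHIFT + chr(ord("a") + k)) for k in range(26)
]
normalizer = normalizers.Sequence(accent_rules + ascii_rules)
```

The lowercase forms (`à`, `é`, `è`) must themselves be in the vocab, or they will map to `<unk>`.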

---

## License

MIT

---

## Example Test

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Corianas/char128_shift_tokenizer")
ids = tok.encode("Hello, There!\n<eos>")
print(ids)
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
# ↨hello, ↨there!\n<eos>
```