Corianas/char128_shift_tokenizer

Corianas committed on Aug 27, 2025

Commit 33e001e (verified)

1 Parent(s): efc5676

Create README.md

README.md +152 -0

	@@ -0,0 +1,152 @@

+---
+language:
+- en
+---
+# char128-shift Tokenizer
+A fixed-size Hugging Face–compatible **character tokenizer** with a dedicated **SHIFT** token (`↨`) to represent uppercase letters. Instead of assigning separate tokens to uppercase `A–Z`, each uppercase letter is encoded as `↨` followed by its lowercase form (e.g., `H` → `↨h`).
+This repository contains the ready-to-use tokenizer, which can be loaded with `AutoTokenizer`, as well as the script that built it (in the `src/` folder).
+---
+## Features
+* **Fixed 128-token vocabulary** (including specials).
+* **Uppercase encoding via SHIFT token**, no duplicate uppercase letters in vocab.
+* **WordLevel model** with explicit closed character set.
+* **Pre-tokenizer** splits by Unicode grapheme clusters (`\X`), so emoji and diacritics are preserved.
+* **Normalizer** maps `A–Z` → `↨` + lowercase explicitly.
+* **Decoder** concatenates tokens directly (no extra spaces).
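The SHIFT mapping described above can be sketched in plain Python. This is only an illustration of the rule, not the tokenizer's actual normalizer; the helper name `shift_encode` is hypothetical.

```python
SHIFT = "↨"  # SHIFT marker used by this tokenizer


def shift_encode(text: str, shift: str = SHIFT) -> str:
    """Rewrite ASCII uppercase letters as SHIFT + lowercase; leave everything else untouched."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(shift + ch.lower())
        else:
            out.append(ch)
    return "".join(out)


encoded = shift_encode("Hello, There!")
print(encoded)       # ↨hello, ↨there!
print(len(encoded))  # 15 (13 original characters + 2 SHIFT markers)
```

Each uppercase letter costs one extra token, which is the trade-off for keeping `A–Z` out of the vocabulary.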
+---
+## Installation
+You only need `transformers` (for the Python interface) and, optionally, `tokenizers` (for building or modifying the tokenizer).
+```bash
+pip install "transformers>=4.40" "tokenizers>=0.14"
+```
+No PyTorch/TensorFlow/Flax required to use the tokenizer itself.
+---
+## Usage
+### Load from local folder
+```python
+from transformers import AutoTokenizer
+# Load local tokenizer folder
+tok = AutoTokenizer.from_pretrained("char128_shift_tokenizer")
+print(tok.vocab_size)  # 128
+ids = tok.encode("Hello, There!\n<eos>")
+print(ids)
+print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
+# → "↨hello, ↨there!\n<eos>"
+```
+### Load from Hugging Face Hub
+```python
+from transformers import AutoTokenizer
+# Replace with your Hub repo
+tok = AutoTokenizer.from_pretrained("Corianas/char128_shift_tokenizer")
+```
+---
+## Restoring Uppercase
+Decoded text still contains SHIFT markers (e.g., `↨h`). To display it, restore the casing:
+```python
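+def restore_uppercase(s: str, shift: str = "↨") -> str:
+    """Turn each SHIFT + character pair back into the uppercase character."""
+    out, i, n = [], 0, len(s)
+    while i < n:
+        # A SHIFT followed by a non-SHIFT character marks an uppercase letter
+        if s[i] == shift and i + 1 < n and s[i + 1] != shift:
+            out.append(s[i + 1].upper())
+            i += 2
+        else:
+            # Ordinary character (or a stray SHIFT): copy it through unchanged
+            out.append(s[i])
+            i += 1
+    return "".join(out)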
+ids = tok.encode("Hello, There!\n<eos>")
+decoded = tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+print(decoded)                  # "↨hello, ↨there!\n<eos>"
+print(restore_uppercase(decoded))  # "Hello, There!\n<eos>"
+```
+---
+## Vocabulary
+The 128 tokens include:
+* **Lowercase letters** `a–z`
+* **Digits** `0–9`
+* **Whitespace** (space, `\n`, `\t`)
+* **Punctuation and symbols** (configurable)
+* **Diacritics** like `è`, `é` if needed
+* **Special tokens** `<pad>`, `<unk>`, `<bos>`, `<eos>`
+* **SHIFT token** `↨`
+Uppercase `A–Z` are **not** in the vocab; they are represented via SHIFT.
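To get a feel for the 128-slot budget, here is a rough count of the categories listed above. The symbol set is an assumption (Python's `string.punctuation`); the punctuation actually shipped in this repository is configurable and may differ, and the ordering here does not reflect the real token ids.

```python
import string

SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]
SHIFT = "↨"

# Hypothetical symbol set: stand-in for the configurable punctuation list
symbols = list(string.punctuation)

chars = list(string.ascii_lowercase) + list(string.digits) + [" ", "\n", "\t"] + symbols
tokens = SPECIALS + [SHIFT] + chars

# No duplicates, and no uppercase A–Z anywhere in the set
assert len(set(tokens)) == len(tokens)
assert set(tokens).isdisjoint(string.ascii_uppercase)

used = len(tokens)
print(used)        # 76
print(128 - used)  # 52
```

Under these assumptions, 52 slots remain for diacritics and other symbols.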
+---
+## Integration
+For dataset preparation:
+```python
+import numpy as np
+from transformers import AutoTokenizer
+tok = AutoTokenizer.from_pretrained("char128_shift_tokenizer")
+with open("input.txt", "r", encoding="utf-8") as f:
+    data = f.read()
+n = len(data)
+train_txt, val_txt = data[:int(0.9*n)], data[int(0.9*n):]
+train_ids = tok.encode(train_txt)
+val_ids   = tok.encode(val_txt)
+np.array(train_ids, dtype=np.uint16).tofile("train.bin")
+np.array(val_ids, dtype=np.uint16).tofile("val.bin")
+```
+Your model’s `vocab_size` must match (128).
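A quick sanity check on the written files might look like the sketch below. It uses a stand-in list of ids and a temporary directory rather than the real `train.bin`; the point is that the file is raw `uint16` with no header, so the dtype has to be supplied again when reading.

```python
import os
import tempfile

import numpy as np

VOCAB_SIZE = 128

# Stand-in ids; in practice these come from tok.encode(...)
ids = [5, 17, 42, 127, 0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.bin")
    np.array(ids, dtype=np.uint16).tofile(path)

    # Raw binary file: the dtype must match the one used when writing
    loaded = np.fromfile(path, dtype=np.uint16)

print(loaded.tolist())                 # [5, 17, 42, 127, 0]
print(int(loaded.max()) < VOCAB_SIZE)  # True
```

Any id at or above 128 would point to a mismatch between the tokenizer and the model's embedding table.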
+---
+## Known Edge Cases
+* **Non-ASCII uppercase letters** (like `À`, `É`) are lowercased without SHIFT unless you add explicit rules.
+* **Spaces in decoded output** are avoided by setting the decoder to plain concatenation; if you see them, make sure your tokenizer was saved with `tok.decoder = decoders.Sequence([])`.
+* **Unknown characters** map to `<unk>`. Make sure your vocab includes every character you expect.
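One possible way to add an explicit rule for non-ASCII uppercase is to apply the SHIFT rewrite as a plain-Python preprocessing step before calling the tokenizer. This is a sketch under that assumption, not part of the shipped tokenizer; `shift_encode_unicode` is a hypothetical helper, and the lowercase forms (`é`, `à`) still have to be in the vocab or they become `<unk>`.

```python
SHIFT = "↨"


def shift_encode_unicode(text: str, shift: str = SHIFT) -> str:
    """Rewrite any uppercase letter, ASCII or not, as SHIFT + lowercase."""
    out = []
    for ch in text:
        lower = ch.lower()
        # Only rewrite letters whose lowercase form is a single, different character
        if ch.isupper() and len(lower) == 1 and lower != ch:
            out.append(shift + lower)
        else:
            out.append(ch)
    return "".join(out)


print(shift_encode_unicode("École À Paris"))  # ↨école ↨à ↨paris
```

Characters whose lowercase form expands to more than one code point are left unchanged here, so they would need their own rule.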
+---
+## License
+MIT (or your chosen license).
+---
+## Example Test
+```python
+from transformers import AutoTokenizer
+tok = AutoTokenizer.from_pretrained("Corianas/char128_shift_tokenizer")
+ids = tok.encode("Hello, There!\n<eos>")
+print(ids)
+print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
+# ↨hello, ↨there!\n<eos>
+```