Commit 33e001e (verified) by Corianas · 1 Parent(s): efc5676

Create README.md

Files changed (1)
  1. README.md +152 -0
README.md ADDED
@@ -0,0 +1,152 @@
 
---
language:
- en
---
# char128-shift Tokenizer

A fixed-size, Hugging Face–compatible **character tokenizer** with a dedicated **SHIFT** token (`↨`) to represent uppercase letters. Instead of assigning separate tokens to uppercase `A–Z`, each uppercase letter is encoded as `↨` followed by its lowercase form (e.g., `H` → `↨h`).

This repository contains the ready-to-use tokenizer, which can be loaded with `AutoTokenizer`, as well as the script that generated it (in the `src` folder).

---

## Features

* **Fixed 128-token vocabulary** (including specials).
* **Uppercase encoding via SHIFT token**, no duplicate uppercase letters in vocab.
* **WordLevel model** with explicit closed character set.
* **Pre-tokenizer** splits by Unicode grapheme clusters (`\X`), so emoji and diacritics are preserved.
* **Normalizer** maps `A–Z` → `↨` + lowercase explicitly.
* **Decoder** concatenates tokens directly (no extra spaces).

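To make the SHIFT scheme concrete, here is a minimal pure-Python sketch of the mapping the normalizer is described as performing. The helper name `shift_encode` is hypothetical and not part of this repository; it only mirrors the documented behavior for ASCII `A–Z` and leaves every other character untouched.

```python
import string

SHIFT = "↨"  # SHIFT marker described in this README


def shift_encode(text: str, shift: str = SHIFT) -> str:
    """Replace each ASCII uppercase letter with SHIFT + its lowercase form.

    Hypothetical helper for illustration only; the tokenizer itself is
    described as doing this inside its normalizer.
    """
    out = []
    for ch in text:
        if ch in string.ascii_uppercase:
            out.append(shift + ch.lower())
        else:
            out.append(ch)
    return "".join(out)


print(shift_encode("Hello, There!"))  # ↨hello, ↨there!
```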
---

## Installation

You only need `transformers` (for the Python interface) and optionally `tokenizers` (for advanced building). Quote the version specifiers so the shell does not treat `>=` as an output redirection.

```bash
pip install "transformers>=4.40" "tokenizers>=0.14"
```

No PyTorch/TensorFlow/Flax required to use the tokenizer itself.

---

## Usage

### Load from local folder

```python
from transformers import AutoTokenizer

# Load local tokenizer folder
tok = AutoTokenizer.from_pretrained("char128_shift_tokenizer")

print(tok.vocab_size)  # 128
ids = tok.encode("Hello, There!\n<eos>")
print(ids)
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
# → "↨hello, ↨there!\n<eos>"
```

### Load from Hugging Face Hub

```python
from transformers import AutoTokenizer

# Replace with your Hub repo
tok = AutoTokenizer.from_pretrained("Corianas/char128_shift_tokenizer")
```

---

## Restoring Uppercase

The decoded output will show SHIFT markers (e.g., `↨h`). For display, restore casing:

```python
def restore_uppercase(s: str, shift="↨"):
    out, i, n = [], 0, len(s)
    while i < n:
        # A SHIFT marker followed by a non-SHIFT character uppercases that character.
        if s[i] == shift and i + 1 < n and s[i + 1] != shift:
            out.append(s[i + 1].upper())
            i += 2
        else:
            # Anything else (including a trailing or doubled SHIFT) is copied as-is.
            out.append(s[i])
            i += 1
    return "".join(out)

ids = tok.encode("Hello, There!\n<eos>")
decoded = tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded)                     # "↨hello, ↨there!\n<eos>"
print(restore_uppercase(decoded))  # "Hello, There!\n<eos>"
```

---

## Vocabulary

The 128 tokens include:

* **Lowercase letters** `a–z`
* **Digits** `0–9`
* **Whitespace** (space, `\n`, `\t`)
* **Punctuation and symbols** (configurable)
* **Diacritics** like `è`, `é` if needed
* **Special tokens** `<pad>`, `<unk>`, `<bos>`, `<eos>`
* **SHIFT token** `↨`

Uppercase `A–Z` are **not** in vocab; they are represented via SHIFT.

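One way to sanity-check these claims is to inspect the token-to-id mapping directly. The sketch below assumes a dictionary such as the one returned by `tok.get_vocab()`; the helper names `uppercase_tokens` and `uncovered_chars` are hypothetical, and the toy vocabulary is made up for the demonstration. Note that it checks single code points, whereas the tokenizer is described as splitting on grapheme clusters, so treat the result as an approximation. With the real tokenizer, pass `tok.get_vocab()` instead of `toy_vocab`.

```python
import string

UPPER = set(string.ascii_uppercase)


def uppercase_tokens(vocab: dict) -> list:
    """Return ASCII uppercase letters that appear as tokens (expected: none)."""
    return sorted(t for t in vocab if t in UPPER)


def uncovered_chars(text: str, vocab: dict) -> set:
    """Return characters of `text` with no matching token.

    ASCII uppercase letters are looked up by their lowercase form,
    mirroring the SHIFT scheme.
    """
    missing = set()
    for ch in text:
        key = ch.lower() if ch in UPPER else ch
        if key not in vocab:
            missing.add(ch)
    return missing


# Toy vocabulary for illustration only (not the shipped 128 tokens).
tokens = ["<pad>", "<unk>", "<bos>", "<eos>", "↨", " ", *string.ascii_lowercase]
toy_vocab = {t: i for i, t in enumerate(tokens)}

print(uppercase_tokens(toy_vocab))              # []
print(uncovered_chars("Hi there!", toy_vocab))  # {'!'}
```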
---

## Integration

For dataset preparation:

```python
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("char128_shift_tokenizer")

with open("input.txt", "r", encoding="utf-8") as f:
    data = f.read()

# 90/10 train/validation split by character position
n = len(data)
train_txt, val_txt = data[:int(0.9*n)], data[int(0.9*n):]

train_ids = tok.encode(train_txt)
val_ids = tok.encode(val_txt)

# All ids are below 128, so uint16 is more than wide enough
np.array(train_ids, dtype=np.uint16).tofile("train.bin")
np.array(val_ids, dtype=np.uint16).tofile("val.bin")
```

Your model’s `vocab_size` must match (128).

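When the `.bin` files are read back for training, the ids can be memory-mapped and checked against the vocabulary size. This is a sketch under the assumption that the files were written as raw `uint16` as shown above; the helper name `load_ids` is hypothetical and the file names in the comments are the ones used in the preparation script.

```python
import numpy as np

VOCAB_SIZE = 128  # must match the tokenizer's vocabulary size


def load_ids(path: str, vocab_size: int = VOCAB_SIZE) -> np.ndarray:
    """Memory-map a raw uint16 token file and verify every id is in range."""
    ids = np.memmap(path, dtype=np.uint16, mode="r")
    if int(ids.max()) >= vocab_size:
        raise ValueError(f"{path} contains ids >= {vocab_size}")
    return ids


# Example (assumes the files from the preparation step exist):
# train_ids = load_ids("train.bin")
# val_ids = load_ids("val.bin")
```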
---

## Known Edge Cases

* **Non-ASCII uppercase** letters (like `À`, `É`) are lowercased without SHIFT unless you add explicit rules.
* **Spaces in decode** are disabled by setting the decoder to plain concatenation; if you see extra spaces, ensure your tokenizer was saved with `tok.decoder = decoders.Sequence([])`.
* **Unknown chars** map to `<unk>`. Ensure your vocab includes every character you expect.

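For the first edge case, one possible explicit rule is to apply the SHIFT mapping to any character Python considers uppercase, not only `A–Z`. This is a sketch only: the helper name `shift_encode_unicode` is hypothetical, it is not how the shipped normalizer behaves, and it only helps if the lowercase form (for example `à` or `é`) is actually in the vocabulary.

```python
SHIFT = "↨"  # SHIFT marker described in this README


def shift_encode_unicode(text: str, shift: str = SHIFT) -> str:
    """Apply SHIFT + lowercase to every uppercase character, ASCII or not."""
    out = []
    for ch in text:
        lower = ch.lower()
        # Only rewrite characters with a distinct, single-character lowercase
        # form; e.g. "İ".lower() is two code points, so it is left untouched.
        if ch.isupper() and len(lower) == 1 and lower != ch:
            out.append(shift + lower)
        else:
            out.append(ch)
    return "".join(out)


print(shift_encode_unicode("À la carte, Élan"))  # ↨à la carte, ↨élan
```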
---

## License

MIT (or your chosen license).

---

## Example Test

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Corianas/char128_shift_tokenizer")
ids = tok.encode("Hello, There!\n<eos>")
print(ids)
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
# ↨hello, ↨there!\n<eos>
```