# LilChatBot WordLevel Tokenizer

A **WordLevel tokenizer** trained for the *LilChatBot* project.

This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.

---
|
|
## Design choices

- **WordLevel tokenization** (no subword splitting)
- **Lowercasing**
- **Unicode NFKC normalization**
- **Apostrophes preserved everywhere**
  (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- **Aggressive punctuation isolation**, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- **Repeated punctuation collapsed**
  (`!!! → !`, `??? → ?`, `... → .`)
- **English-focused**

This tokenizer intentionally favors **lexical transparency** over vocabulary compactness.
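The rules above can be approximated in plain Python to see what the pre-tokenizer does to raw text. This is a minimal sketch using only `re` and `unicodedata`, not the tokenizer's actual implementation; the function name `pre_tokenize` is illustrative.

```python
import re
import unicodedata

def pre_tokenize(text: str) -> list[str]:
    """Approximate the tokenizer's normalization and splitting rules (sketch)."""
    # Unicode NFKC normalization, then lowercasing
    text = unicodedata.normalize("NFKC", text).lower()
    # Collapse runs of repeated sentence punctuation (!!! -> !, ??? -> ?, ... -> .)
    text = re.sub(r"([.!?])\1+", r"\1", text)
    # Isolate punctuation as standalone tokens; apostrophes are left untouched
    text = re.sub(r'([.,!?;:()\[\]{}/"“”–—])', r" \1 ", text)
    return text.split()

print(pre_tokenize("The lion's den... Don't panic!!!"))
```

Note that in the real tokenizer, any resulting word not in the vocabulary maps to the unknown token, since WordLevel does no subword splitting.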
|
|
---

## Files

- `tokenizer.json` — complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)

The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.
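The direct `tokenizers` route might look like the following sketch. It assumes `tokenizer.json` has been downloaded next to the script; the path and input string are illustrative.

```python
from tokenizers import Tokenizer

# Load the standalone tokenizer definition; no transformers dependency needed
tok = Tokenizer.from_file("tokenizer.json")  # illustrative local path

enc = tok.encode("Don't panic!!!")
print(enc.tokens)  # word-level tokens after normalization and punctuation isolation
print(enc.ids)     # corresponding vocabulary ids
```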
|
|
---

## Usage

### With `transformers`

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")

print(tok.decode(tok("The lion's well-being matters — don’t forget that!").input_ids))
```