# LilChatBot WordLevel Tokenizer
A **WordLevel tokenizer** trained for the *LilChatBot* project.
This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.
---
## Design choices
- **WordLevel tokenization** (no subword splitting)
- **Lowercasing**
- **Unicode NFKC normalization**
- **Apostrophes preserved everywhere**
  (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- **Aggressive punctuation isolation**, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- **Repeated punctuation collapsed**
  (`!!! → !`, `??? → ?`, `... → .`)
- English-focused
This tokenizer intentionally favors **lexical transparency** over vocabulary compactness.
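
To make the design choices above concrete, here is a rough pure-Python approximation of the pipeline. It is only a sketch: the real rules live in `tokenizer.json`, and the function names (`normalize`, `pre_tokenize`, `encode`), the regular expressions, the order of the normalization steps, the `[UNK]` token name, and the toy vocabulary are all assumptions made for illustration.

```python
import re
import unicodedata

# Characters isolated as their own tokens (assumed from the list above):
# sentence punctuation, brackets, slash, straight/curly double quotes,
# en dash (U+2013) and em dash (U+2014).
_ISOLATED = re.escape('.,!?;:()[]{}/"\u201c\u201d\u2013\u2014')

# A token is either one isolated character or a run of anything else
# that is not whitespace. Apostrophes and hyphens stay inside words.
_TOKEN_RE = re.compile(f"[{_ISOLATED}]|[^\\s{_ISOLATED}]+")


def normalize(text):
    """NFKC-normalize, lowercase, then collapse runs of ! ? and ."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"([!?.])\1+", r"\1", text)


def pre_tokenize(text):
    """Split normalized text into words and isolated punctuation."""
    return _TOKEN_RE.findall(text)


def encode(text, vocab, unk="[UNK]"):
    """WordLevel lookup: one ID per whole token, no subword fallback."""
    return [vocab.get(tok, vocab[unk]) for tok in pre_tokenize(normalize(text))]


print(pre_tokenize(normalize("Wait... what???")))
# ['wait', '.', 'what', '?']

# Hypothetical toy vocabulary; the real one is stored in tokenizer.json.
toy_vocab = {"[UNK]": 0, "the": 1, "lion's": 2, "!": 3}
print(encode("The LION'S roar!!!", toy_vocab))
# [1, 2, 0, 3]
```

Because every token is looked up whole, an unseen word such as `roar` maps to the unknown-token ID instead of being split into pieces. That is the trade-off behind lexical transparency: the vocabulary is easy to read, but larger and less forgiving of rare words.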
---
## Files
- `tokenizer.json` — complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)
The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.
---
## Usage
### With `transformers`
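```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")

# Round-trip a sentence to inspect the normalization and punctuation handling.
print(tok.decode(tok("The lion's well-being matters — don’t forget that!").input_ids))
```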