# LilChatBot WordLevel Tokenizer
A **WordLevel tokenizer** trained for the *LilChatBot* project.
This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.
---
## Design choices
- **WordLevel tokenization** (no subword splitting)
- **Lowercasing**
- **Unicode NFKC normalization**
- **Apostrophes preserved everywhere** (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- **Aggressive punctuation isolation**, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- **Repeated punctuation collapsed** (`!!! → !`, `??? → ?`, `... → .`)
- English-focused
This tokenizer intentionally favors **lexical transparency** over vocabulary compactness.
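
To make the design choices above concrete, here is a rough pure-Python approximation of the normalization and punctuation-isolation rules. It is a sketch for illustration only, not the pipeline stored in `tokenizer.json`: the function names `normalize` and `pre_tokenize`, the exact punctuation set, and the order of the steps are assumptions, and the real tokenizer may treat edge cases (such as hyphens) differently.

```python
import re
import unicodedata

# Assumed set of isolated punctuation, taken from the list above:
# sentence punctuation, brackets, slash, straight and curly double quotes
# (\u201c, \u201d), and en/em dashes (\u2013, \u2014). Apostrophes are left alone.
ISOLATED = re.compile(r'([.,!?;:()\[\]{}/"\u201c\u201d\u2013\u2014])')

# Runs of the same sentence-ending mark, e.g. "!!!" or "...".
REPEATED = re.compile(r"([.!?])\1+")


def normalize(text: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse repeated punctuation."""
    text = unicodedata.normalize("NFKC", text).lower()
    return REPEATED.sub(r"\1", text)


def pre_tokenize(text: str) -> list[str]:
    """Split normalized text into word-level tokens with punctuation isolated."""
    spaced = ISOLATED.sub(r" \1 ", normalize(text))
    return spaced.split()


print(pre_tokenize("Wait... don't GO!!!"))
# ['wait', '.', "don't", 'go', '!']
```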
---
## Files
- `tokenizer.json` — complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)
The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.
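
As a minimal sketch of both options, the helper below loads a local copy of `tokenizer.json` with the `tokenizers` library and then wraps the same object in `transformers`' `PreTrainedTokenizerFast`. The helper name `load_tokenizer` and the default path are hypothetical, and special tokens such as an unknown or padding token are not registered on the wrapper here; they may need to be passed explicitly depending on how the tokenizer was configured.

```python
def load_tokenizer(path: str = "tokenizer.json"):
    """Return (raw, wrapped): the `tokenizers` object and a `transformers` wrapper."""
    # Imported lazily so the helper can be defined even when the optional
    # dependencies are not installed.
    from tokenizers import Tokenizer
    from transformers import PreTrainedTokenizerFast

    raw = Tokenizer.from_file(path)  # reads the full tokenizer definition
    wrapped = PreTrainedTokenizerFast(tokenizer_object=raw)
    return raw, wrapped


# Example usage (assumes tokenizer.json is in the working directory):
# raw, wrapped = load_tokenizer()
# print(raw.encode("The lion's share!").tokens)
# print(wrapped("The lion's share!").input_ids)
```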
---
## Usage
### With `transformers`
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")
print(tok.decode(tok("The lion's well-being matters — don’t forget that!").input_ids))