# LilChatBot WordLevel Tokenizer

A **WordLevel tokenizer** trained for the *LilChatBot* project.

This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.

---

## Design choices

- **WordLevel tokenization** (no subword splitting)
- **Lowercasing**
- **Unicode NFKC normalization**
- **Apostrophes preserved everywhere**  
  (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- **Aggressive punctuation isolation**, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- **Repeated punctuation collapsed**  
  (`!!! → !`, `??? → ?`, `... → .`)
- **English-focused**

This tokenizer intentionally favors **lexical transparency** over vocabulary compactness.
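
As a rough illustration of the rules listed above, the following standard-library sketch approximates the described behavior. It is not the pipeline stored in `tokenizer.json`: the function name `sketch_tokenize`, the exact punctuation set, and the order of the steps are assumptions made for this example, and the real tokenizer may differ in edge cases.

```python
import re
import unicodedata

# Punctuation split off into separate tokens. This set is an assumption
# drawn from the design-choices list; apostrophes and hyphens are left out
# on purpose so they stay attached to their words.
_PUNCT = r'[.,!?;:()\[\]{}/"“”–—]'


def sketch_tokenize(text: str) -> list[str]:
    """Approximate the described normalization and word-level splitting."""
    # Unicode NFKC normalization, then lowercasing.
    text = unicodedata.normalize("NFKC", text).lower()
    # Collapse runs of the same mark: "!!!" -> "!", "..." -> ".".
    text = re.sub(r"([!?.])\1+", r"\1", text)
    # Pad each punctuation mark with spaces so it becomes its own token.
    text = re.sub(f"({_PUNCT})", r" \1 ", text)
    # Split on whitespace; there is no subword splitting.
    return text.split()


print(sketch_tokenize("Wait... What?!! Don't go (yet)."))
# ['wait', '.', 'what', '?', '!', "don't", 'go', '(', 'yet', ')', '.']
```

In a WordLevel setup, each resulting token would then be looked up as a whole in the vocabulary, which is why keeping words intact makes the output easy to inspect.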

---

## Files

- `tokenizer.json` — complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)

The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.

---

## Usage

### With `transformers`

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")

print(tok.decode(tok("The lion's well-being matters — don’t forget that!").input_ids))