# LilChatBot WordLevel Tokenizer

A **WordLevel tokenizer** trained for the *LilChatBot* project.

This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.

---
|
|
## Design choices

- **WordLevel tokenization** (no subword splitting)
- **Lowercasing**
- **Unicode NFKC normalization**
- **Apostrophes preserved everywhere**
  (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- **Aggressive punctuation isolation**, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- **Repeated punctuation collapsed**
  (`!!! → !`, `??? → ?`, `... → .`)
- **English-focused**

This tokenizer intentionally favors **lexical transparency** over vocabulary compactness.
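The rules above can be approximated in plain Python to see what the pre-tokenizer does to raw text. This is a minimal sketch using only `re` and `unicodedata`, not the tokenizer's actual implementation; the function name `pre_tokenize` is illustrative.

```python
import re
import unicodedata

def pre_tokenize(text: str) -> list[str]:
    """Approximate the tokenizer's normalization and splitting rules (sketch)."""
    # Unicode NFKC normalization, then lowercasing
    text = unicodedata.normalize("NFKC", text).lower()
    # Collapse runs of repeated sentence punctuation (!!! -> !, ??? -> ?, ... -> .)
    text = re.sub(r"([.!?])\1+", r"\1", text)
    # Isolate punctuation as standalone tokens; apostrophes are left untouched
    text = re.sub(r'([.,!?;:()\[\]{}/"“”–—])', r" \1 ", text)
    return text.split()

print(pre_tokenize("The lion's den... Don't panic!!!"))
```

Note that in the real tokenizer, any resulting word not in the vocabulary maps to the unknown token, since WordLevel does no subword splitting.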
|
|
---

## Files

- `tokenizer.json` — complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)

The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.
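The direct `tokenizers` route might look like the following sketch. It assumes `tokenizer.json` has been downloaded next to the script; the path and input string are illustrative.

```python
from tokenizers import Tokenizer

# Load the standalone tokenizer definition; no transformers dependency needed
tok = Tokenizer.from_file("tokenizer.json")  # illustrative local path

enc = tok.encode("Don't panic!!!")
print(enc.tokens)  # word-level tokens after normalization and punctuation isolation
print(enc.ids)     # corresponding vocabulary ids
```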
|
|
---

## Usage

### With `transformers`

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")

print(tok.decode(tok("The lion's well-being matters — don’t forget that!").input_ids))
```