---
language:
- en
tags:
- tokenizer
- lib
- less-is-better
- subword
- cognitively-inspired
license: apache-2.0
---

# LiB Tokenizer

A tokenizer trained with the **LiB** (Less is Better) algorithm, a cognitively-inspired
online learning approach to vocabulary acquisition. Unlike BPE or Unigram, LiB builds
its vocabulary incrementally by simulating a reading process: for each training sentence
it segments the input with the current vocabulary, generates candidate units from adjacent
chunks, tests whether adding a candidate reduces segmentation length, and reorders or
prunes units based on reward and punishment signals.
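
Concretely, one training step of this loop might look like the following sketch. This is a minimal illustration, not the actual implementation: the helper names (`segment`, `lib_step`, `priority`) and the exact reward and pruning rules are placeholders, and the real algorithm lives in the Rust fork and the LiB repository linked below.

```python
# Hypothetical sketch of one LiB training step; names and update
# rules are placeholders, not the real implementation.

MAX_LEN = 12  # maximum token length, as configured for this tokenizer

def segment(sentence: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation with the current vocabulary."""
    chunks, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + MAX_LEN), i, -1):
            # Fall back to a single character when nothing matches.
            if sentence[i:j] in vocab or j == i + 1:
                chunks.append(sentence[i:j])
                i = j
                break
    return chunks

def lib_step(sentence: str, vocab: set[str], priority: dict[str, float]) -> None:
    chunks = segment(sentence, vocab)
    # Generate candidate units by fusing adjacent chunks.
    for left, right in zip(chunks, chunks[1:]):
        candidate = left + right
        if candidate in vocab or len(candidate) > MAX_LEN:
            continue
        # "Less is better": adopt the candidate only if it shortens
        # the segmentation of the current sentence.
        if len(segment(sentence, vocab | {candidate})) < len(chunks):
            vocab.add(candidate)
            priority[candidate] = 0.0
    # Reward the units that were used; units that stop earning reward
    # sink in the priority ordering and are eventually pruned.
    for unit in set(chunks):
        priority[unit] = priority.get(unit, 0.0) + 1.0
```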

## Vocabulary

- **Size:** 50,000 tokens (including special tokens)
- **Training corpus:** [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (English web text)
- **Training epochs:** 10,000
- **Max token length:** 12 characters
- **Byte-level fallback:** enabled; non-Latin characters are decomposed into UTF-8 byte tokens (`<0x00>`–`<0xFF>`), keeping the vocabulary budget focused on meaningful units (see the example after this list)
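
As a hypothetical illustration of the fallback (assuming the forked library from the Usage section below is installed; the exact token boundaries depend on the learned vocabulary):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("antalvdb/lib-tokenizer")

# A CJK character has no vocabulary entry, so it decomposes into
# its three UTF-8 bytes (illustrative output).
print(tokenizer.encode("日").tokens)
# something like: ['<0xE6>', '<0x97>', '<0xA5>']
```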

## Special tokens

| Token | Purpose |
|-------|---------|
| `<\|endoftext\|>` | End of document |
| `<pad>` | Padding |
| `<s>` | Beginning of sequence |
| `</s>` | End of sequence |
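
With the forked library installed (see Usage below), the ids of these tokens can be resolved through the standard `tokenizers` API:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("antalvdb/lib-tokenizer")

# Look up special-token ids, e.g. for padding and EOS in model configs.
print(tokenizer.token_to_id("<pad>"))
print(tokenizer.token_to_id("</s>"))
```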

## Usage

This tokenizer requires the LiB fork of the HuggingFace `tokenizers` library:

```bash
git clone -b lib-model https://github.com/antalvdb/tokenizers
cd tokenizers/bindings/python
maturin develop --release
```

Then:

```python
from tokenizers import Tokenizer
from tokenizers.decoders import ByteFallback, Fuse, Sequence

tokenizer = Tokenizer.from_pretrained("antalvdb/lib-tokenizer")
# Map byte tokens such as <0xE6> back to raw bytes, then fuse the
# decoded pieces into a single string.
tokenizer.decoder = Sequence([ByteFallback(), Fuse()])

encoded = tokenizer.encode("The cat sat on the mat.")
print(encoded.tokens)
decoded = tokenizer.decode(encoded.ids)
print(decoded)
```

## How LiB differs from BPE and Unigram

| Property | BPE | Unigram | LiB |
|----------|-----|---------|-----|
| Learning | Batch, greedy merges | EM over corpus | Online, one sentence at a time |
| Vocabulary order | Merge frequency | Log-likelihood | Priority (reward/punishment) |
| Supra-word tokens | No | No | Yes (multi-word units) |
| Cognitively motivated | No | No | Yes |

Because LiB fuses adjacent chunks regardless of whitespace, a frequent word pair such as "of the" can be stored as a single supra-word unit.

## Citation

The LiB algorithm was developed by Jinbiao Yang. If you use this tokenizer, please cite
this implementation:

```bibtex
@software{lib-tokenizer,
  author = {van den Bosch, Antal},
  title = {LiB Tokenizer — HuggingFace Implementation},
  year = {2026},
  url = {https://huggingface.co/antalvdb/lib-tokenizer}
}
```

## Links

- [tokenizers fork (Rust implementation)](https://github.com/antalvdb/tokenizers/tree/lib-model)
- [LiB repository (training scripts)](https://github.com/antalvdb/LiB/tree/feature/hf-compatible-tokenizer)