---
language:
- en
tags:
- tokenizer
- lib
- less-is-better
- supra-word
- cognitively-inspired
license: gpl-3.0
datasets:
- MLZoo/edu-fineweb-10B
---

# LiB Tokenizer

A tokenizer trained with the **LiB** (Less is Better) algorithm, a cognitively inspired
online-learning approach to vocabulary acquisition. Unlike BPE or Unigram, LiB builds
its vocabulary incrementally by simulating a reading process: for each training sentence,
it segments the input with the current vocabulary, generates candidate units from adjacent
chunks, tests whether adding a candidate shortens the segmentation, and reorders or
prunes units based on reward and punishment signals.
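
The loop above can be sketched in a few lines. This is a minimal, illustrative sketch, not the reference implementation: the vocabulary is a plain list, segmentation is greedy longest-match, and only one candidate pair is tried per sentence (the real algorithm samples pairs and maintains per-unit reward/punishment scores). All function names here are hypothetical.

```python
def segment(sentence, vocab):
    """Greedy longest-match segmentation with the current vocabulary."""
    chunks, i = [], 0
    while i < len(sentence):
        for unit in sorted(vocab, key=len, reverse=True):
            if sentence.startswith(unit, i):
                chunks.append(unit)
                i += len(unit)
                break
        else:  # no unit matches: fall back to a single character
            chunks.append(sentence[i])
            i += 1
    return chunks

def lib_step(sentence, vocab):
    """One LiB 'reading' step: propose a candidate unit from two adjacent
    chunks and keep it only if it shortens the segmentation."""
    chunks = segment(sentence, vocab)
    if len(chunks) < 2:
        return vocab
    # Candidate from the first adjacent pair; a full implementation would
    # consider several pairs and also reorder/prune existing units.
    candidate = chunks[0] + chunks[1]
    trial = vocab + [candidate]
    if len(segment(sentence, trial)) < len(chunks):
        vocab = trial  # reward: the candidate shortens the segmentation
    return vocab

vocab = lib_step("thecatsat", ["the", "cat", "sat"])
print(vocab)  # the merged unit "thecat" has been adopted
```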

## Vocabulary

- **Size:** 50,000 tokens (including special tokens)
- **Training corpus:** [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (English web text)
- **Training epochs:** 10,000
- **Max token length:** 12 characters
- **Byte-level fallback:** enabled; non-Latin characters are decomposed into UTF-8 byte tokens (`<0x00>` through `<0xFF>`), keeping the vocabulary budget focused on meaningful units
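
The byte-fallback convention can be illustrated without the tokenizer itself: a character with no vocabulary entry is replaced by one token per byte of its UTF-8 encoding. The helper below is a hypothetical illustration of that decomposition, not part of this library's API.

```python
def byte_fallback(char: str) -> list[str]:
    """Decompose a character into UTF-8 byte tokens of the form <0xNN>."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]

print(byte_fallback("é"))  # ['<0xC3>', '<0xA9>']
```

The `ByteFallback` and `Fuse` decoders in the usage example below reverse exactly this mapping: byte tokens are turned back into bytes and fused into valid UTF-8 strings.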

## Special tokens

| Token | Purpose |
|-------|---------|
| `<\|endoftext\|>` | End of document |
| `<pad>` | Padding |
| `<s>` | Beginning of sequence |
| `</s>` | End of sequence |

## Usage

This tokenizer requires the LiB fork of the HuggingFace `tokenizers` library. Building the Python bindings with `maturin` requires a Rust toolchain:

```bash
git clone -b lib-model https://github.com/antalvdb/tokenizers
cd tokenizers/bindings/python
maturin develop --release
```

Then:

```python
from tokenizers import Tokenizer
from tokenizers.decoders import ByteFallback, Fuse, Sequence

tokenizer = Tokenizer.from_pretrained("antalvdb/lib-tokenizer")
tokenizer.decoder = Sequence([ByteFallback(), Fuse()])

encoded = tokenizer.encode("The cat sat on the mat.")
print(encoded.tokens)
decoded = tokenizer.decode(encoded.ids)
print(decoded)
```

## How LiB differs from BPE and Unigram

| Property | BPE | Unigram | LiB |
|----------|-----|---------|-----|
| Learning | Batch, greedy merges | EM over corpus | Online, one sentence at a time |
| Vocabulary order | Merge frequency | Log-likelihood | Priority (reward/punishment) |
| Supra-word tokens | No | No | Yes (multi-word units) |
| Cognitively motivated | No | No | Yes |
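
To make the last two rows concrete, here is a purely invented contrast (neither segmentation is actual output of this tokenizer): a frequent multi-word string can become a single supra-word unit under LiB, while BPE pieces never cross word boundaries.

```python
# Hypothetical segmentations of "in the middle of":
bpe_like = ["in", " the", " middle", " of"]  # BPE: word-bounded pieces
lib_like = ["in the middle of"]              # LiB: one supra-word unit
print(len(bpe_like), len(lib_like))          # 4 1
```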

## Citation

The LiB algorithm was developed by Jinbiao Yang. If you use this tokenizer, please cite
the original work:

```
@inproceedings{yang-etal-2020-less,
    title = "Less is Better: A cognitively inspired unsupervised model for language segmentation",
    author = "Yang, Jinbiao  and
      Frank, Stefan L.  and
      van den Bosch, Antal",
    editor = "Zock, Michael  and
      Chersoni, Emmanuele  and
      Lenci, Alessandro  and
      Santus, Enrico",
    booktitle = "Proceedings of the Workshop on the Cognitive Aspects of the Lexicon",
    month = dec,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.cogalex-1.4/",
    pages = "33--45",
    abstract = "Language users process utterances by segmenting them into many cognitive units, which vary in their sizes and linguistic levels. Although we can do such unitization/segmentation easily, its cognitive mechanism is still not clear. This paper proposes an unsupervised model, Less-is-Better (LiB), to simulate the human cognitive process with respect to language unitization/segmentation. LiB follows the principle of least effort and aims to build a lexicon which minimizes the number of unit tokens (alleviating the effort of analysis) and number of unit types (alleviating the effort of storage) at the same time on any given corpus. LiB{'}s workflow is inspired by empirical cognitive phenomena. The design makes the mechanism of LiB cognitively plausible and the computational requirement light-weight. The lexicon generated by LiB performs the best among different types of lexicons (e.g. ground-truth words) both from an information-theoretical view and a cognitive view, which suggests that the LiB lexicon may be a plausible proxy of the mental lexicon."
}
```

## Links

- [tokenizers fork (Rust implementation)](https://github.com/antalvdb/tokenizers/tree/lib-model)
- [LiB repository (training scripts)](https://github.com/antalvdb/LiB/tree/feature/hf-compatible-tokenizer)