antalvdb committed 811acf2 (verified; parent 1489718): Upload README.md with huggingface_hub
---
language:
- en
tags:
- tokenizer
- lib
- less-is-better
- subword
- cognitively-inspired
license: apache-2.0
---

# LiB Tokenizer

A tokenizer trained with the **LiB** (Less is Better) algorithm, a cognitively-inspired
online learning approach to vocabulary acquisition. Unlike BPE or Unigram, LiB builds
its vocabulary incrementally by simulating a reading process: for each training sentence
it segments the input with the current vocabulary, generates candidate units from adjacent
chunks, tests whether adding a candidate reduces segmentation length, and reorders or
prunes units based on reward and punishment signals.

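The loop described above can be sketched in a few lines of Python. This is a toy illustration of the idea only, not the fork's actual Rust implementation: the greedy segmenter, the candidate-selection rule, and the scoring constants (`reward`, `punish`, `prune_below`) are all simplified assumptions.

```python
# Toy sketch of the LiB online loop (a hypothetical simplification,
# not the actual implementation in the Rust fork).

def segment(text, vocab):
    """Greedy longest-match segmentation with the current vocabulary;
    characters not covered by any unit become single-character chunks."""
    chunks, i = [], 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        for j in range(min(len(text), i + 12), i + 1, -1):  # max unit length 12
            if text[i:j] in vocab:
                match = text[i:j]
                break
        chunks.append(match)
        i += len(match)
    return chunks

def learn_sentence(text, vocab, reward=1.0, punish=0.5, prune_below=-2.0):
    """One online update: segment, propose a candidate by joining two
    adjacent chunks, keep it only if it shortens the segmentation, then
    reward used units and punish (eventually pruning) unused ones."""
    chunks = segment(text, vocab)
    if len(chunks) >= 2:
        candidate = chunks[0] + chunks[1]  # simplest choice: first adjacent pair
        if candidate not in vocab and len(candidate) <= 12:
            trial = dict(vocab)
            trial[candidate] = 0.0
            if len(segment(text, trial)) < len(chunks):
                vocab[candidate] = 0.0  # candidate shortens the segmentation: keep it
    used = set(chunks)
    for unit in list(vocab):
        vocab[unit] += reward if unit in used else -punish
        if vocab[unit] < prune_below:
            del vocab[unit]  # punished too often: prune
    return chunks

vocab = {}
for _ in range(5):  # repeated online passes over one "training sentence"
    learn_sentence("the cat sat on the mat", vocab)
print(segment("the cat sat on the mat", vocab))
```

Because candidates are formed by joining adjacent chunks, they may cross whitespace, so even this sketch can acquire supra-word units such as `"the "`.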
## Vocabulary

- **Size:** 50,000 tokens (including special tokens)
- **Training corpus:** [edufineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (English web text)
- **Training epochs:** 10,000
- **Max token length:** 12 characters
- **Byte-level fallback:** enabled — non-Latin characters are decomposed into UTF-8 byte tokens (`<0x00>`–`<0xFF>`), keeping the vocabulary budget focused on meaningful units

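A quick way to see what the byte-level fallback produces: a character the vocabulary does not cover is decomposed into its UTF-8 bytes, one `<0xXX>` token per byte. The helper below is illustrative only, not a function of the library:

```python
# Illustrative helper (not a library function): decompose a character
# into the <0xXX> byte tokens used by the byte-level fallback.

def byte_fallback(char: str) -> list[str]:
    """One <0xXX> token per UTF-8 byte of the character."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]

print(byte_fallback("猫"))  # a CJK character occupies three UTF-8 bytes
print(byte_fallback("é"))   # an accented character occupies two
```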
## Special tokens

| Token | Purpose |
|-------|---------|
| `<\|endoftext\|>` | End of document |
| `<pad>` | Padding |
| `<s>` | Beginning of sequence |
| `</s>` | End of sequence |

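As a sketch of how these tokens are typically used downstream (whether the tokenizer attaches them automatically depends on its post-processor; this illustrates their roles manually, and the helpers `wrap` and `pad_batch` are hypothetical):

```python
# Hypothetical helpers showing the typical roles of the special tokens above.

BOS, EOS, PAD = "<s>", "</s>", "<pad>"

def wrap(tokens: list[str]) -> list[str]:
    """Mark sequence boundaries with <s> and </s>."""
    return [BOS] + tokens + [EOS]

def pad_batch(batch: list[list[str]]) -> list[list[str]]:
    """Right-pad every sequence in the batch to the longest length."""
    width = max(len(seq) for seq in batch)
    return [seq + [PAD] * (width - len(seq)) for seq in batch]

batch = pad_batch([wrap(["the", "cat"]), wrap(["a", "mat", "sat"])])
print(batch)
```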
## Usage

This tokenizer requires the LiB fork of the HuggingFace `tokenizers` library:

```bash
git clone -b lib-model https://github.com/antalvdb/tokenizers
cd tokenizers/bindings/python
maturin develop --release
```

Then:

```python
from tokenizers import Tokenizer
from tokenizers.decoders import ByteFallback, Fuse, Sequence

tokenizer = Tokenizer.from_pretrained("antalvdb/lib-tokenizer")
tokenizer.decoder = Sequence([ByteFallback(), Fuse()])

encoded = tokenizer.encode("The cat sat on the mat.")
print(encoded.tokens)
decoded = tokenizer.decode(encoded.ids)
print(decoded)
```

## How LiB differs from BPE and Unigram

| Property | BPE | Unigram | LiB |
|----------|-----|---------|-----|
| Learning | Batch, greedy merges | EM over corpus | Online, one sentence at a time |
| Vocabulary order | Merge frequency | Log-likelihood | Priority (reward/punishment) |
| Supra-word tokens | No | No | Yes (multi-word units) |
| Cognitively motivated | No | No | Yes |

## Citation

The LiB algorithm was developed by Jinbiao Yang. If you use this tokenizer, please cite
this implementation:

```bibtex
@software{lib-tokenizer,
  author = {van den Bosch, Antal},
  title = {LiB Tokenizer — HuggingFace Implementation},
  year = {2026},
  url = {https://huggingface.co/antalvdb/lib-tokenizer}
}
```

## Links

- [tokenizers fork (Rust implementation)](https://github.com/antalvdb/tokenizers/tree/lib-model)
- [LiB repository (training scripts)](https://github.com/antalvdb/LiB/tree/feature/hf-compatible-tokenizer)