## Tokenizer

- **Tokenizer:** BLOOM
- **Tokenization method:** BPE
- **Vocabulary size:** 250,680
- **Out-of-vocabulary handling:** Byte-fallback
- **Language coverage:** Multilingual
- **Pretokenization source:** BLOOM
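The byte-fallback behavior listed above can be sketched in a few lines. This is an illustrative toy, not BLOOM's actual implementation; `toy_vocab` is a made-up stand-in for the real 250,680-entry vocabulary:

```python
def byte_fallback_encode(piece: str, vocab: set[str]) -> list[str]:
    """Toy sketch of byte-fallback OOV handling."""
    # A piece present in the vocabulary maps to itself; anything else
    # falls back to one token per UTF-8 byte (<0xNN>), so every string
    # remains encodable even with a fixed vocabulary.
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]

toy_vocab = {"▁hello", "▁world"}  # hypothetical tiny vocabulary
print(byte_fallback_encode("▁hello", toy_vocab))  # known piece kept whole
print(byte_fallback_encode("☃", toy_vocab))       # OOV → UTF-8 byte tokens
```

Because every UTF-8 byte has a token, no input is ever mapped to a lossy `<unk>` token.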
**Processing details (Table 3):**

- **Numbers:** Split into individual digits
- **Contractions:** Learned
- **Unicode normalization:** NFKC
- **Whitespace / boundary markers:** ▁ (SentencePiece-style whitespace marker)
- **Continuation / subword markers:** BPE continuation tokens
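The NFKC, digit-splitting, and ▁-marker rules above can be combined into a minimal pretokenization sketch (illustrative only; the real BLOOM pipeline has additional rules, e.g. for punctuation and contractions):

```python
import re
import unicodedata

def pretokenize(text: str) -> list[str]:
    """Toy sketch: NFKC-normalize, mark word starts with ▁, split digits."""
    # NFKC normalization folds compatibility characters, e.g. the
    # fullwidth digits "４２" become plain "42".
    text = unicodedata.normalize("NFKC", text)
    pieces = []
    for word in text.split():
        # Each digit becomes its own chunk; non-digit runs stay grouped.
        for i, chunk in enumerate(re.findall(r"\d|\D+", word)):
            # SentencePiece-style ▁ marks the start of a word.
            pieces.append(("▁" if i == 0 else "") + chunk)
    return pieces

print(pretokenize("page ４２"))  # → ['▁page', '▁4', '2']
```

Splitting numbers into single digits keeps arithmetic-relevant tokens out of the long tail of rare multi-digit vocabulary entries.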
## Why BLOOM?

BLOOM was included in TokSuite to represent a **large-vocabulary multilingual BPE tokenizer** trained for broad cross-lingual coverage. As described in the tokenizer selection rationale of the TokSuite paper, BLOOM exemplifies a design choice that prioritizes extensive vocabulary capacity while maintaining subword-based segmentation.

Including BLOOM enables TokSuite to study tokenizer behavior in settings where:

- vocabulary size is large,
- segmentation follows BPE-style merges,
- and multilingual text is handled through a shared tokenizer.

This makes BLOOM a representative example of multilingual BPE tokenization.
---