Malikeh1375 committed on
Commit 27b80fa · verified · 1 Parent(s): 36deedc

Update README.md

Files changed (1):
  1. README.md +22 -6
README.md CHANGED
@@ -36,13 +36,29 @@ This model uses the **BLOOM tokenizer** and is otherwise **identical** to the ot
 ## Tokenizer
 
 - **Tokenizer:** BLOOM
-- **Tokenizer type:** SentencePiece (BPE-based)
-- **Vocabulary size:** ~250K tokens
-- **Multilingual support:** Yes
-- **Out-of-vocabulary handling:** Subword decomposition
-- **Normalization:** Unicode normalization via SentencePiece preprocessing
+- **Tokenization method:** BPE
+- **Vocabulary size:** 250,680
+- **Out-of-vocabulary handling:** Byte-fallback
+- **Language coverage:** Multilingual
+- **Pretokenization source:** BLOOM
 
-The BLOOM tokenizer is designed to support multilingual text across many scripts and writing systems, using a large learned subword vocabulary to balance coverage and segmentation granularity.
+**Processing details (Table 3):**
+- **Numbers:** Split into individual digits
+- **Contractions:** Learned
+- **Unicode normalization:** NFKC
+- **Whitespace / boundary markers:** ▁ (SentencePiece-style whitespace marker)
+- **Continuation / subword markers:** BPE continuation tokens
+
+## Why BLOOM?
+
+BLOOM was included in TokSuite to represent a **large-vocabulary multilingual BPE tokenizer** trained for broad cross-lingual coverage. As described in the tokenizer selection rationale of the TokSuite paper, BLOOM exemplifies a design choice that prioritizes extensive vocabulary capacity while maintaining subword-based segmentation.
+
+Including BLOOM enables TokSuite to study tokenizer behavior in settings where:
+- vocabulary size is large,
+- segmentation follows BPE-style merges,
+- and multilingual text is handled through a shared tokenizer.
+
+This makes BLOOM a representative example of multilingual BPE tokenization.
 
 ---
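
The processing details listed in the updated section (NFKC normalization, digit splitting, byte-fallback) can be illustrated with a minimal sketch in plain Python. This is not the actual BLOOM tokenizer implementation — the function names and the `<0xNN>` byte-token spelling here are illustrative assumptions, showing only the general shape of each step:

```python
import re
import unicodedata

def nfkc(text: str) -> str:
    # NFKC folds compatibility characters,
    # e.g. the ligature "ﬁ" -> "fi", superscript "²" -> "2"
    return unicodedata.normalize("NFKC", text)

def split_digits(text: str) -> list[str]:
    # Pretokenization step from the table: every digit
    # becomes its own piece
    return [p for p in re.split(r"(\d)", text) if p]

def byte_fallback(token: str, vocab: set[str]) -> list[str]:
    # Hypothetical byte-fallback: a token absent from the
    # vocabulary decomposes into UTF-8 byte tokens
    if token in vocab:
        return [token]
    return [f"<0x{b:02X}>" for b in token.encode("utf-8")]

print(nfkc("ﬁle²"))                     # -> "file2"
print(split_digits("year 2023"))        # -> ['year ', '2', '0', '2', '3']
print(byte_fallback("€", {"the", "a"})) # -> ['<0xE2>', '<0x82>', '<0xAC>']
```

To inspect the real behavior, the released tokenizer itself (vocabulary size 250,680) can be loaded through the `transformers` library and probed on the same inputs.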