## Tokenizer

- **Tokenizer:** BLOOM
- **Tokenization method:** BPE
- **Vocabulary size:** 250,680
- **Out-of-vocabulary handling:** Byte-fallback
- **Language coverage:** Multilingual
- **Pretokenization source:** BLOOM
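The byte-fallback behavior listed above can be sketched in a few lines. This is an illustrative toy, not BLOOM's actual implementation; `toy_vocab` is a made-up stand-in for the real 250,680-entry vocabulary:

```python
def byte_fallback_encode(piece: str, vocab: set[str]) -> list[str]:
    """Toy sketch of byte-fallback OOV handling."""
    # A piece present in the vocabulary maps to itself; anything else
    # falls back to one token per UTF-8 byte (<0xNN>), so every string
    # remains encodable even with a fixed vocabulary.
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]

toy_vocab = {"▁hello", "▁world"}  # hypothetical tiny vocabulary
print(byte_fallback_encode("▁hello", toy_vocab))  # known piece kept whole
print(byte_fallback_encode("☃", toy_vocab))       # OOV → UTF-8 byte tokens
```

Because every UTF-8 byte has a token, no input is ever mapped to a lossy `<unk>` token.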
**Processing details (Table 3):**

- **Numbers:** Split into individual digits
- **Contractions:** Learned
- **Unicode normalization:** NFKC
- **Whitespace / boundary markers:** ▁ (SentencePiece-style whitespace marker)
- **Continuation / subword markers:** BPE continuation tokens
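The NFKC, digit-splitting, and ▁-marker rules above can be combined into a minimal pretokenization sketch (illustrative only; the real BLOOM pipeline has additional rules, e.g. for punctuation and contractions):

```python
import re
import unicodedata

def pretokenize(text: str) -> list[str]:
    """Toy sketch: NFKC-normalize, mark word starts with ▁, split digits."""
    # NFKC normalization folds compatibility characters, e.g. the
    # fullwidth digits "４２" become plain "42".
    text = unicodedata.normalize("NFKC", text)
    pieces = []
    for word in text.split():
        # Each digit becomes its own chunk; non-digit runs stay grouped.
        for i, chunk in enumerate(re.findall(r"\d|\D+", word)):
            # SentencePiece-style ▁ marks the start of a word.
            pieces.append(("▁" if i == 0 else "") + chunk)
    return pieces

print(pretokenize("page ４２"))  # → ['▁page', '▁4', '2']
```

Splitting numbers into single digits keeps arithmetic-relevant tokens out of the long tail of rare multi-digit vocabulary entries.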
## Why BLOOM?

BLOOM was included in TokSuite to represent a **large-vocabulary multilingual BPE tokenizer** trained for broad cross-lingual coverage. As described in the tokenizer selection rationale of the TokSuite paper, BLOOM exemplifies a design choice that prioritizes extensive vocabulary capacity while maintaining subword-based segmentation.

Including BLOOM enables TokSuite to study tokenizer behavior in settings where:

- vocabulary size is large,
- segmentation follows BPE-style merges,
- and multilingual text is handled through a shared tokenizer.

This makes BLOOM a representative example of multilingual BPE tokenization.
---