Malikeh1375 committed
Commit ca4f565 · verified · 1 Parent(s): a012db2

Update README.md

Files changed (1): README.md +23 -7
README.md CHANGED
@@ -17,7 +17,7 @@ pipeline_tag: text-generation
 library_name: transformers
 ---
 
-<p align="center">
+<p align="left">
   <img src="./toksuite-logo.png" alt="TokSuite Logo" width="260"/>
 </p>
 
@@ -34,13 +34,29 @@ This model uses the **GPT-2 tokenizer** and is otherwise **identical** to the ot
 ## Tokenizer
 
 - **Tokenizer:** GPT-2
-- **Tokenizer type:** Byte Pair Encoding (BPE)
-- **Vocabulary size:** ~50K tokens
-- **Multilingual support:** Limited
-- **Out-of-vocabulary handling:** Byte-level fallback
-- **Normalization:** Minimal; preserves raw byte structure
+- **Tokenization method:** BPE
+- **Vocabulary size:** 50,257
+- **Out-of-vocabulary handling:** Byte-fallback
+- **Language coverage:** English-only
+- **Pretokenization source:** GPT-2
 
-The GPT-2 tokenizer operates on UTF-8 byte sequences and applies BPE merges, allowing it to represent arbitrary Unicode text while maintaining a compact learned vocabulary.
+**Processing details (Table 3):**
+- **Numbers:** Split into individual digits
+- **Contractions:** Learned
+- **Unicode normalization:** None
+- **Whitespace / boundary markers:** Whitespace encoded as part of tokens
+- **Continuation / subword markers:** BPE continuation tokens
+
+## Why GPT-2?
+
+GPT-2 was included in TokSuite to represent a **canonical English BPE tokenizer** that has been widely adopted in early large-scale language models. As described in the tokenizer selection rationale of the TokSuite paper, GPT-2 provides a well-established reference point for studying subword tokenization without explicit normalization or language-specific preprocessing.
+
+Including GPT-2 enables TokSuite to study tokenizer behavior in settings where:
+- tokenization is optimized for English,
+- preprocessing and normalization are minimal,
+- and whitespace is handled implicitly through token boundaries.
+
+This makes GPT-2 a foundational tokenizer design within the TokSuite collection.
 
 ---
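
Two of the properties in the updated README — byte-fallback OOV handling and whitespace encoded as part of tokens — both come from GPT-2's byte-level BPE, which maps every byte to a printable Unicode stand-in before any merges are learned. A minimal, self-contained sketch of that byte-to-unicode table (following the mapping in OpenAI's reference `encoder.py`; this is illustrative code, not part of TokSuite):

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character,
    as in GPT-2's byte-level BPE. Printable ASCII/Latin-1 bytes map to
    themselves; the rest (control chars, the space byte, etc.) are
    shifted into the 256+ codepoint range so every byte has a visible
    stand-in and no input can ever be out-of-vocabulary."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

table = bytes_to_unicode()

# All 256 bytes are covered: this is the "byte-fallback" property.
assert len(table) == 256

# The space byte (0x20) becomes the visible marker "Ġ" (U+0120), which
# is why GPT-2 tokens like "Ġhello" carry their leading whitespace.
word = " hello".encode("utf-8")
print("".join(table[b] for b in word))  # -> Ġhello
```

The 50,257-entry vocabulary then decomposes as 256 base byte symbols, 50,000 learned BPE merges, and the `<|endoftext|>` special token.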