Malikeh1375 committed
Commit ca4f565 · verified · 1 Parent(s): a012db2

Update README.md

Files changed (1): README.md +23 -7
README.md CHANGED
@@ -17,7 +17,7 @@ pipeline_tag: text-generation
 library_name: transformers
 ---
 
-<p align="center">
+<p align="left">
   <img src="./toksuite-logo.png" alt="TokSuite Logo" width="260"/>
 </p>
 
@@ -34,13 +34,29 @@ This model uses the **GPT-2 tokenizer** and is otherwise **identical** to the ot
 ## Tokenizer
 
 - **Tokenizer:** GPT-2
-- **Tokenizer type:** Byte Pair Encoding (BPE)
-- **Vocabulary size:** ~50K tokens
-- **Multilingual support:** Limited
-- **Out-of-vocabulary handling:** Byte-level fallback
-- **Normalization:** Minimal; preserves raw byte structure
+- **Tokenization method:** BPE
+- **Vocabulary size:** 50,257
+- **Out-of-vocabulary handling:** Byte-fallback
+- **Language coverage:** English-only
+- **Pretokenization source:** GPT-2
 
-The GPT-2 tokenizer operates on UTF-8 byte sequences and applies BPE merges, allowing it to represent arbitrary Unicode text while maintaining a compact learned vocabulary.
+**Processing details (Table 3):**
+- **Numbers:** Split into individual digits
+- **Contractions:** Learned
+- **Unicode normalization:** None
+- **Whitespace / boundary markers:** Whitespace encoded as part of tokens
+- **Continuation / subword markers:** BPE continuation tokens
+
+## Why GPT-2?
+
+GPT-2 was included in TokSuite to represent a **canonical English BPE tokenizer** that has been widely adopted in early large-scale language models. As described in the tokenizer selection rationale of the TokSuite paper, GPT-2 provides a well-established reference point for studying subword tokenization without explicit normalization or language-specific preprocessing.
+
+Including GPT-2 enables TokSuite to study tokenizer behavior in settings where:
+- tokenization is optimized for English,
+- preprocessing and normalization are minimal,
+- and whitespace is handled implicitly through token boundaries.
+
+This makes GPT-2 a foundational tokenizer design within the TokSuite collection.
 
 ---
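
Two of the properties in the updated README — byte-fallback OOV handling and whitespace encoded as part of tokens — both come from GPT-2's byte-level BPE, which maps every byte to a printable Unicode stand-in before any merges are learned. A minimal, self-contained sketch of that byte-to-unicode table (following the mapping in OpenAI's reference `encoder.py`; this is illustrative code, not part of TokSuite):

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character,
    as in GPT-2's byte-level BPE. Printable ASCII/Latin-1 bytes map to
    themselves; the rest (control chars, the space byte, etc.) are
    shifted into the 256+ codepoint range so every byte has a visible
    stand-in and no input can ever be out-of-vocabulary."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

table = bytes_to_unicode()

# All 256 bytes are covered: this is the "byte-fallback" property.
assert len(table) == 256

# The space byte (0x20) becomes the visible marker "Ġ" (U+0120), which
# is why GPT-2 tokens like "Ġhello" carry their leading whitespace.
word = " hello".encode("utf-8")
print("".join(table[b] for b in word))  # -> Ġhello
```

The 50,257-entry vocabulary then decomposes as 256 base byte symbols, 50,000 learned BPE merges, and the `<|endoftext|>` special token.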