GPT-X2
A 32,768-vocabulary byte-level BPE tokenizer trained on 50GB of FineWeb-Edu using the HuggingFace tokenizers library.
| Property | Value |
|---|---|
| Vocab Size | 32,768 |
| Type | Byte-level BPE |
| Training Data | 50GB FineWeb-Edu (sample-100BT) |
| Special Tokens | `<\|endoftext\|>` (id=0), `<\|padding\|>` (id=1) |
| Context Length | 1024 |
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/gpt-tok-32k-fineweb")

text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")
print(tokenizer.decode(tokens))
```
Trained using the HuggingFace tokenizers library.
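The exact training script isn't reproduced here; a minimal sketch of what a byte-level BPE run with the tokenizers library could look like, matching the configuration in the table above (the toy corpus and output filename are placeholders, not the actual training setup):

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Byte-level BPE, as stated in the table above.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32768,
    # Listed first so they receive ids 0 and 1, matching the table.
    special_tokens=["<|endoftext|>", "<|padding|>"],
    # Seed the vocabulary with all 256 byte tokens.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Toy corpus for illustration; the real run would stream the
# FineWeb-Edu text shards instead.
corpus = ["The quick brown fox jumps over the lazy dog."]
tokenizer.train_from_iterator(corpus, trainer=trainer)

tokenizer.save("tokenizer.json")
```

For a full-scale run, `tokenizer.train(files, trainer)` over the downloaded shard files replaces `train_from_iterator`.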