# GPT-X2-TOK-32k

A 32,768-vocabulary byte-level BPE tokenizer trained on 50GB of FineWeb-Edu using the HuggingFace tokenizers library.

## Key Details

| Property | Value |
|---|---|
| Vocab Size | 32,768 |
| Type | Byte-level BPE |
| Training Data | 50GB FineWeb-Edu (sample-100BT) |
| Special Tokens | `<\|endoftext\|>` (id=0), `<\|padding\|>` (id=1) |
| Context Length | 1024 |

## Why 32K?

- ~9% better compression on FineWeb-Edu compared to Llama's generic 32K tokenizer (810 vs 887 tokens on representative samples)
- Smaller embedding table than 50K (GPT-2) — saves ~10M parameters that can be reinvested into transformer layers
- Domain-optimized — trained specifically on educational web text, capturing common patterns in the training distribution
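The ~10M figure is easy to sanity-check. A minimal sketch, assuming a model width of `d_model = 576` (the width is not stated in this card, and the exact saving depends on it and on whether input and output embeddings are tied):

```python
# Back-of-envelope check of the embedding savings from shrinking the
# vocabulary from GPT-2's 50,257 entries down to 32,768.
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one vocab_size x d_model embedding matrix."""
    return vocab_size * d_model

d_model = 576  # assumed width, for illustration only
saved = embedding_params(50_257, d_model) - embedding_params(32_768, d_model)
print(f"saved ≈ {saved / 1e6:.1f}M parameters per embedding matrix")
```

With untied input and output embeddings the saving would double; with tied embeddings it is counted once.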

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/gpt-tok-32k-fineweb")
text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")
print(tokenizer.decode(tokens))
```

## Training

Trained using the HuggingFace tokenizers library with:

- Byte-level BPE
- No normalization
- ByteLevel pre-tokenizer with regex splitting
- 50GB of streaming FineWeb-Edu text
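The recipe above can be sketched with the `tokenizers` library as follows. This is a minimal illustration, not the exact training script: the corpus here is a placeholder for the streamed FineWeb-Edu text, and the trainer settings beyond vocabulary size and special tokens are assumptions.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE with no normalizer, per the bullet list above.
tokenizer = Tokenizer(models.BPE())
# use_regex=True applies the GPT-2-style regex split before merging.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=True)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_768,
    special_tokens=["<|endoftext|>", "<|padding|>"],  # get ids 0 and 1
    # Seed with all 256 byte symbols so any input stays encodable.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Placeholder corpus; the real run streamed ~50GB of FineWeb-Edu text.
corpus = ["The quick brown fox jumps over the lazy dog."]
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("gpt-tok-32k.json")
```

Because the model is byte-level, decoding is an exact inverse of encoding even for text unseen during training.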

## Used By

- GPT-X2-125M-50BT — 125M parameter Llama-style language model trained on 50B tokens