# GPT-X2-TOK-32k

A 32,768-vocabulary byte-level BPE tokenizer trained on 50GB of FineWeb-Edu using the HuggingFace tokenizers library.

## Key Details

| Property | Value |
|---|---|
| Vocab Size | 32,768 |
| Type | Byte-level BPE |
| Training Data | 50GB FineWeb-Edu (sample-100BT) |
| Special Tokens | `<\|endoftext\|>` (id=0), `<\|padding\|>` (id=1) |
| Context Length | 1024 |

## Why 32K?

- ~9% better compression on FineWeb-Edu compared to Llama's generic 32K tokenizer (810 vs 887 tokens on representative samples)
- Smaller embedding table than 50K (GPT-2) — saves ~10M parameters that can be reinvested into transformer layers
- Domain-optimized — trained specifically on educational web text, capturing common patterns in the training distribution
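The ~10M figure is easy to sanity-check. A minimal sketch, assuming a model width of `d_model = 576` (the width is not stated in this card, and the exact saving depends on it and on whether input and output embeddings are tied):

```python
# Back-of-envelope check of the embedding savings from shrinking the
# vocabulary from GPT-2's 50,257 entries down to 32,768.
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one vocab_size x d_model embedding matrix."""
    return vocab_size * d_model

d_model = 576  # assumed width, for illustration only
saved = embedding_params(50_257, d_model) - embedding_params(32_768, d_model)
print(f"saved ≈ {saved / 1e6:.1f}M parameters per embedding matrix")
```

With untied input and output embeddings the saving would double; with tied embeddings it is counted once.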

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/gpt-tok-32k-fineweb")
text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")
print(tokenizer.decode(tokens))
```

## Training

Trained using the HuggingFace tokenizers library with:

- Byte-level BPE
- No normalization
- ByteLevel pre-tokenizer with regex splitting
- 50GB of streaming FineWeb-Edu text
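The recipe above can be sketched with the `tokenizers` library as follows. This is a minimal illustration, not the exact training script: the corpus here is a placeholder for the streamed FineWeb-Edu text, and the trainer settings beyond vocabulary size and special tokens are assumptions.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE with no normalizer, per the bullet list above.
tokenizer = Tokenizer(models.BPE())
# use_regex=True applies the GPT-2-style regex split before merging.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=True)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_768,
    special_tokens=["<|endoftext|>", "<|padding|>"],  # get ids 0 and 1
    # Seed with all 256 byte symbols so any input stays encodable.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Placeholder corpus; the real run streamed ~50GB of FineWeb-Edu text.
corpus = ["The quick brown fox jumps over the lazy dog."]
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("gpt-tok-32k.json")
```

Because the model is byte-level, decoding is an exact inverse of encoding even for text unseen during training.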

## Used By

- GPT-X2-125M-50BT — 125M parameter Llama-style language model trained on 50B tokens