Twitch BPE Tokenizer
A compact, production-ready Byte Pair Encoding (BPE) tokenizer trained on large-scale Twitch chat. It focuses on:
- Efficient compression at word granularity while preserving roundtrip fidelity
- Stable, fast training on very large corpora (sampling + batch merges)
- Sensible cleaning and macro protection for URLs, mentions, emojis, and whitespace
This repository hosts the tokenizer artifacts needed to use the tokenizer in your own pipelines or applications.
Highlights
- Granularity: word (with byte fallback for out-of-vocab code points)
- Typical vocab size: ~53k (configurable at train time)
- Validation metrics (indicative): entropy compression ratio ≈ 3.0; roundtrip ≈ 0.997
- Space-attached merges (GPT-2 style) for better compression of common prefixes
Note: Metrics are data-dependent and will vary by dataset and exact configuration. Values above reflect typical results on held-out Twitch chat samples of ~200k lines with ~53k vocab.
Intended Use & Limitations
- Intended for tokenizing informal, high-throughput chat or social text for downstream modeling or analytics.
- The training corpus (Twitch chat) can contain colloquialisms, slang, and potentially offensive content. Exercise care when deploying in user-facing contexts.
- Not a language model. It provides tokenization only (vocab, merges, and config).
Training Data & Cleaning
- Source: Twitch chat style messages (user-provided corpus).
- Normalization: NFC; controls stripped; repeated characters capped; whitespace collapsed where safe.
- Placeholders/macros: URLs → <URL>, user mentions → <USER>; emojis and certain patterns are protected (a simplified cleaning sketch follows this list).
- Granularity during training: word with macro protection; fallback to bytes for unseen characters.
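For illustration only, a simplified cleaning pass in this spirit might look like the sketch below. The regexes, repeat cap, and placeholder handling are assumptions chosen for readability; the repository's actual pipeline applies additional rules (e.g. emoji protection) and may differ in detail.

import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{3,}")   # runs of 4+ identical characters

def clean_line(text):
    # NFC normalization, control stripping, macro placeholders, repeat capping
    text = unicodedata.normalize("NFC", text)
    # Strip control characters while keeping ordinary whitespace
    text = "".join(ch for ch in text if ch in "\t\n" or unicodedata.category(ch) != "Cc")
    # Protect URLs and user mentions behind placeholders
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<USER>", text)
    # Cap exaggerated character repeats ("loooooool" -> "loool")
    text = REPEAT_RE.sub(lambda m: m.group(1) * 3, text)
    # Collapse runs of spaces where safe
    text = re.sub(r" {2,}", " ", text).strip()
    return text

print(clean_line("LOOOOOOL check   https://t.tv/AbC @someone"))
# -> "LOOOL check <URL> <USER>"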
Tokenization Details
- Pretokenization splits text into word and whitespace tokens, preserving the structure needed for good merges (sketched after this list).
- Byte fallback ensures a lossless roundtrip for any Unicode input.
- Special tokens are reserved with fixed IDs: <PAD>, <UNK>, <BOS>, <EOS>.
- Merge strategy during training includes batch selection, bucket caps to avoid overshoot, and early stopping on stable ratio improvements.
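The sketch below illustrates the pretokenization and byte-fallback ideas; it is not the repository's actual implementation. The regex, the reserved-ID layout (0-3 for the special tokens), and the byte-ID range are assumptions made for the example; the real layout is defined by bpe_vocab.json and tokenizer_config.json.

import re

# GPT-2-style pretokenizer sketch: the leading space stays attached to the word
# that follows it, which is what enables space-attached merges.
PRETOKEN_RE = re.compile(r" ?\S+|\s+")

def pretokenize(text):
    return PRETOKEN_RE.findall(text)

def encode_with_byte_fallback(pretokens, vocab, byte_offset=4):
    # Assumption: ids 0-3 are reserved for <PAD>, <UNK>, <BOS>, <EOS> and the
    # 256 raw byte values occupy a contiguous id range starting at byte_offset.
    ids = []
    for piece in pretokens:
        if piece in vocab:
            ids.append(vocab[piece])                                 # known word-level token
        else:
            ids.extend(byte_offset + b for b in piece.encode("utf-8"))  # byte fallback
    return ids

toy_vocab = {"hey": 300, " chat": 301}
print(pretokenize("hey chat 🐸"))                                    # ['hey', ' chat', ' 🐸']
print(encode_with_byte_fallback(pretokenize("hey chat 🐸"), toy_vocab))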
Files
- bpe_vocab.json: token → id mapping
- bpe_merges.txt: merge rules (ordered)
- tokenizer_config.json: basic tokenizer configuration
Quickstart
from huggingface_hub import snapshot_download
from src.bpe_tokenizer import BPETokenizer
from src import config as CFG
repo_id = "Soldier-Boy/twitch-tokenizer"
# Download all artifacts into a local directory
local_dir = snapshot_download(repo_id)
# Optional: align granularity with training (recommended)
CFG.TOKEN_GRANULARITY = 'word'
# Initialize and use
tok = BPETokenizer(tokenizer_dir=local_dir)
text = "hey @user check this: https://t.tv/AbC"
ids = tok.encode(text)
print(ids)
print(tok.decode(ids))
Using without importing this repo
If you don't want to depend on this repository's code, copy the BPETokenizer class (and its small dependencies) into your project and point it to a folder containing the three files above.
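If you prefer to reimplement loading rather than copy the class, a minimal sketch is shown below. It assumes one space-separated merge pair per line in bpe_merges.txt, character-level initial symbols, and a plain token → id map in bpe_vocab.json; the actual BPETokenizer also handles macro protection and byte fallback, so treat this only as a starting point.

import json
from pathlib import Path

def load_artifacts(tokenizer_dir):
    # Load the three artifact files described above
    d = Path(tokenizer_dir)
    vocab = json.loads((d / "bpe_vocab.json").read_text(encoding="utf-8"))
    merges = []
    for line in (d / "bpe_merges.txt").read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue
        left, right = line.split(" ")        # assumed format: one "left right" pair per line
        merges.append((left, right))
    config = json.loads((d / "tokenizer_config.json").read_text(encoding="utf-8"))
    return vocab, merges, config

def apply_merges(word, merge_ranks):
    # Greedy BPE: repeatedly merge the earliest-learned adjacent pair
    symbols = list(word)
    while len(symbols) > 1:
        ranked = [(merge_ranks.get(pair, float("inf")), i)
                  for i, pair in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(ranked)
        if rank == float("inf"):
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

vocab, merges, config = load_artifacts("path/to/downloaded/artifacts")
merge_ranks = {pair: rank for rank, pair in enumerate(merges)}
pieces = apply_merges(" pogchamp", merge_ranks)
ids = [vocab.get(p, vocab.get("<UNK>")) for p in pieces]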
Metrics & Evaluation (indicative)
- Compression ratio (entropy): ~3.03 on ~200k validation lines
- Roundtrip accuracy: ~0.997 (exact byte-for-byte decoding for most lines)
Evaluation uses word granularity, the entropy-based ratio metric, and a ~53k vocabulary. With fixed-length codes, ratios will differ depending on the token distribution.
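The two metrics can be computed along the following lines. This is a sketch of the general definitions (Shannon-entropy cost of the ID stream versus raw UTF-8 bytes, and exact decode equality), not necessarily the exact formulas used by the pipeline.

import math
from collections import Counter

def entropy_compression_ratio(lines, tok):
    # Raw UTF-8 bytes divided by the Shannon-entropy cost (in bytes) of the id stream
    all_ids, total_bytes = [], 0
    for line in lines:
        all_ids.extend(tok.encode(line))
        total_bytes += len(line.encode("utf-8"))
    counts = Counter(all_ids)
    n = len(all_ids)
    cost_bits = -sum(c * math.log2(c / n) for c in counts.values())
    return total_bytes / (cost_bits / 8)

def roundtrip_accuracy(lines, tok):
    # Fraction of lines that decode back to the exact original string
    return sum(tok.decode(tok.encode(line)) == line for line in lines) / len(lines)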
Reproducibility
Train/evaluate end-to-end with the included pipeline (from this repo):
# Train (example; adjust paths and targets)
python run_pipeline.py --data_path data/twitch_chat.parquet --text_column Message --vocab_target 56000 --granularity word --ratio_metric entropy --fast_train
# Artifacts will be written to the tokenizer/ directory
Changelog
- v1.0: Public release. Stable fast training, macro protection, and strong validation metrics with ~53k vocab.
License
MIT License
Acknowledgments
Thanks to the open-source tokenizer literature and community work around BPE and text normalization.