Twitch BPE Tokenizer

A compact, production-ready Byte Pair Encoding (BPE) tokenizer trained on large-scale Twitch chat. It focuses on:

  • Efficient compression at word granularity while preserving roundtrip fidelity
  • Stable, fast training on very large corpora (sampling + batch merges)
  • Sensible cleaning and macro protection for URLs, mentions, emojis, and whitespace

This repository hosts the tokenizer artifacts needed to use the model in your own pipelines or applications.

Highlights

  • Granularity: word (with byte fallback for out-of-vocab code points)
  • Typical vocab size: ~53k (configurable at train time)
  • Validation metrics (indicative): entropy compression ratio ≈ 3.0; roundtrip ≈ 0.997
  • Space-attached merges (GPT-2 style) for better compression of common prefixes

Note: Metrics are data-dependent and will vary by dataset and exact configuration. Values above reflect typical results on held-out Twitch chat samples of ~200k lines with ~53k vocab.
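
To make the space-attached merge idea concrete, here is a toy sketch (illustrative only, not the repository's pretokenizer); the word "pog" is just a hypothetical frequent chat token:

# Toy illustration of space-attached pretokenization (GPT-2 style).
# Illustrative only; the real merges come from bpe_merges.txt.
import re

text = "pog pog champ"
pieces = re.findall(r" ?\S+", text)   # keep the leading space glued to each word
print(pieces)                          # ['pog', ' pog', ' champ']
# Because ' pog' recurs as a single piece, BPE can merge it into one token
# instead of emitting a separate space token before every word.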

Intended Use & Limitations

  • Intended for tokenizing informal, high-throughput chat or social text for downstream modeling or analytics.
  • The training corpus (Twitch chat) can contain colloquialisms, slang, and potentially offensive content. Exercise care when deploying in user-facing contexts.
  • Not a language model. It provides tokenization only (vocab, merges, and config).

Training Data & Cleaning

  • Source: Twitch chat style messages (user-provided corpus).
  • Normalization: NFC; controls stripped; repeated characters capped; whitespace collapsed where safe.
  • Placeholders/macros: URLs → <URL>, user mentions → <USER>; emojis and some patterns protected.
  • Granularity during training: word with macro protection; fallback to bytes for unseen characters.
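
A rough sketch of this cleaning pass (illustrative only; the pipeline's exact rules, placeholder spellings, and repeat cap may differ):

# Illustrative cleaning pass; not the repository's exact implementation.
import re
import unicodedata

def clean(text, max_repeat=3):
    text = unicodedata.normalize("NFC", text)
    # Strip control characters.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    # Protect URLs and user mentions with placeholders.
    text = re.sub(r"https?://\S+", "<URL>", text)
    text = re.sub(r"@\w+", "<USER>", text)
    # Cap runs of the same character (e.g. "heyyyyyy" -> "heyyy").
    text = re.sub(r"(.)\1{%d,}" % max_repeat, r"\1" * max_repeat, text)
    # Collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean("heyyyyyy   @someone   look https://t.tv/AbC"))
# -> "heyyy <USER> look <URL>"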

Tokenization Details

  • Pretokenization splits into words and whitespace tokens, preserving structure required for good merges.
  • Byte fallback ensures roundtrip for any Unicode inputs.
  • Special tokens reserved with fixed IDs: <PAD>, <UNK>, <BOS>, <EOS>.
  • Merge strategy during training includes batch selection, bucket caps to avoid overshoot, and early-stop on stable ratio improvements.
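
A simplified view of the pretokenization split and the byte fallback (illustrative; the real pretokenizer also handles macro placeholders and emoji protection, and the byte-token spelling below is just one common convention):

# Simplified word/whitespace pretokenization; illustrative only.
import re

def pretokenize(text):
    # Word pieces keep an optional leading space; whitespace and punctuation
    # runs are preserved as their own pieces.
    return re.findall(r" ?\w+|\s+|[^\w\s]+", text)

print(pretokenize("hey there  LUL"))
# -> ['hey', ' there', '  ', 'LUL']

# Byte fallback: a code point outside the vocabulary can be emitted as its
# UTF-8 bytes, so decoding always reproduces the original input.
print([f"<0x{b:02X}>" for b in "🔥".encode("utf-8")])
# -> ['<0xF0>', '<0x9F>', '<0x94>', '<0xA5>']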

Files

  • bpe_vocab.json – token → id mapping
  • bpe_merges.txt – merge rules (ordered)
  • tokenizer_config.json – basic tokenizer configuration

Quickstart

from huggingface_hub import snapshot_download
from src.bpe_tokenizer import BPETokenizer
from src import config as CFG

repo_id = "Soldier-Boy/twitch-tokenizer"

# Download all artifacts into a local directory
local_dir = snapshot_download(repo_id)

# Optional: align granularity with training (recommended)
CFG.TOKEN_GRANULARITY = 'word'

# Initialize and use
tok = BPETokenizer(tokenizer_dir=local_dir)
text = "hey @user check this: https://t.tv/AbC"
ids = tok.encode(text)
print(ids)
print(tok.decode(ids))

Using without importing this repo

If you don't want to depend on this repository's code, copy the BPETokenizer class (and its small dependencies) into your project and point it to a folder containing the three files above.
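
If you go that route, the core of what the class does with the artifacts looks roughly like this. It is a minimal, dependency-free sketch, not the repository's code: it skips cleaning, macro protection, byte fallback, and special tokens, and assumes one space-separated merge pair per line in bpe_merges.txt.

# Minimal sketch of consuming the released artifacts directly.
import json

def load_artifacts(folder):
    with open(f"{folder}/bpe_vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)                        # token -> id
    ranks = {}
    with open(f"{folder}/bpe_merges.txt", encoding="utf-8") as f:
        for i, line in enumerate(l.strip() for l in f):
            if line and not line.startswith("#"):
                ranks[tuple(line.split())] = i      # earlier merge = higher priority
    return vocab, ranks

def bpe(word, ranks):
    # Greedily apply the highest-priority (lowest-rank) merge until none apply.
    symbols = list(word)
    while len(symbols) > 1:
        candidates = [(ranks.get(pair, float("inf")), i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(candidates)
        if rank == float("inf"):
            break
        symbols[i:i + 2] = ["".join(symbols[i:i + 2])]
    return symbols

# Example (unknown symbols would additionally need byte fallback):
# vocab, ranks = load_artifacts(local_dir)
# print([vocab.get(s) for s in bpe(" pog", ranks)])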

Metrics & Evaluation (indicative)

  • Compression ratio (entropy): ~3.03 on ~200k validation lines
  • Roundtrip accuracy: ~0.997 (exact byte-for-byte decoding for most lines)

Evaluation uses word granularity, the entropy ratio metric, and a ~53k vocabulary; with a fixed-length-code baseline instead, the ratio will differ with the token distribution.
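
The exact metric definitions live in the training pipeline; one plausible reading of the two numbers is sketched below. This is hedged: `tok` is a BPETokenizer as in the Quickstart, and "entropy compression ratio" is interpreted here as character-level unigram entropy bits divided by token-level unigram entropy bits over the same lines.

# Hedged sketch of the two validation metrics; the pipeline may define them
# differently.
import math
from collections import Counter

def unigram_bits(symbols):
    counts = Counter(symbols)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy * total            # total bits under a unigram model

def entropy_compression_ratio(lines, tok):
    chars = [ch for line in lines for ch in line]
    tokens = [t for line in lines for t in tok.encode(line)]
    return unigram_bits(chars) / unigram_bits(tokens)

def roundtrip_accuracy(lines, tok):
    ok = sum(tok.decode(tok.encode(line)) == line for line in lines)
    return ok / len(lines)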

Reproducibility

Train/evaluate end-to-end with the included pipeline (from this repo):

# Train (example; adjust paths and targets)
python run_pipeline.py \
    --data_path data/twitch_chat.parquet \
    --text_column Message \
    --vocab_target 56000 \
    --granularity word \
    --ratio_metric entropy \
    --fast_train

# Artifacts will be written to the tokenizer/ directory

Changelog

  • v1.0: Public release. Stable fast training, macro protection, and strong validation metrics with ~53k vocab.

License

MIT License

Acknowledgments

Thanks to the open-source tokenizer literature and community work around BPE and text normalization.
