---
tags:
- tokenizer
- bpe
- twitch
- chat
- nlp
license: mit
library_name: none
language:
- en
pretty_name: Twitch BPE Tokenizer
---

# Twitch BPE Tokenizer

A compact, production-ready Byte Pair Encoding (BPE) tokenizer trained on large-scale Twitch chat. It focuses on:

- Efficient compression at word granularity while preserving roundtrip fidelity
- Stable fast training on very large corpora (sampling + batch merges)
- Sensible cleaning and macro protection for URLs, mentions, emojis, and whitespace

This repository hosts the tokenizer artifacts needed to use the model in your own pipelines or applications.

## Highlights

- Granularity: word (with byte fallback for out-of-vocab code points)
- Typical vocab size: ~53k (configurable at train time)
- Validation metrics (indicative): entropy compression ratio ≈ 3.0; roundtrip ≈ 0.997
- Space-attached merges (GPT-2 style) for better compression of common prefixes

> Note: Metrics are data-dependent and will vary by dataset and exact configuration. The values above reflect typical results on held-out Twitch chat samples of ~200k lines with a ~53k vocabulary.

## Intended Use & Limitations

- Intended for tokenizing informal, high-throughput chat or social text for downstream modeling or analytics.
- The training corpus (Twitch chat) can contain colloquialisms, slang, and potentially offensive content. Exercise care when deploying in user-facing contexts.
- Not a language model. It provides tokenization only (vocab, merges, and config).

## Training Data & Cleaning

- Source: Twitch chat style messages (user-provided corpus).
- Normalization: NFC; control characters stripped; repeated characters capped; whitespace collapsed where safe.
- Placeholders/macros: URLs and user mentions are replaced with reserved placeholder tokens; emojis and certain patterns are protected from merging.
- Granularity during training: `word` with macro protection; fallback to bytes for unseen characters.

## Tokenization Details

- Pretokenization splits text into word and whitespace tokens, preserving the structure required for good merges.
- Byte fallback ensures roundtrip fidelity for arbitrary Unicode input.
- Four special tokens are reserved with fixed IDs (see `tokenizer_config.json`).
- The merge strategy during training includes batch selection, bucket caps to avoid overshoot, and early stopping on stable ratio improvements.

## Files

- `bpe_vocab.json` — token → id mapping
- `bpe_merges.txt` — merge rules (ordered)
- `tokenizer_config.json` — basic tokenizer configuration

## Quickstart

```python
from huggingface_hub import snapshot_download
from src.bpe_tokenizer import BPETokenizer
from src import config as CFG

repo_id = "Soldier-Boy/twitch-tokenizer"

# Download all artifacts into a local directory
local_dir = snapshot_download(repo_id)

# Optional: align granularity with training (recommended)
CFG.TOKEN_GRANULARITY = 'word'

# Initialize and use
tok = BPETokenizer(tokenizer_dir=local_dir)
text = "hey @user check this: https://t.tv/AbC"
ids = tok.encode(text)
print(ids)
print(tok.decode(ids))
```

### Using without importing this repo

If you don't want to depend on this repository's code, copy the `BPETokenizer` class (and its small dependencies) into your project and point it to a folder containing the three files above. A minimal illustrative sketch appears below, after the metrics section.

## Metrics & Evaluation (indicative)

- Compression ratio (entropy): ~3.03 on ~200k validation lines
- Roundtrip accuracy: ~0.997 (exact byte-for-byte decoding for most lines)

Evaluation uses word granularity, the entropy ratio metric, and a ~53k vocabulary. Under fixed-length coding, ratios will differ with the token distribution.
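For a concrete sense of what an entropy-based compression ratio can look like, here is a small sketch that divides the raw UTF-8 bit count of the text by the Shannon bits needed to entropy-code the token sequence. The definition, and the helpers `token_entropy_bits` and `entropy_compression_ratio`, are illustrative assumptions; the repository's `ratio_metric=entropy` implementation may define the ratio differently.

```python
import math
from collections import Counter

def token_entropy_bits(token_ids):
    """Total Shannon bits to encode the token sequence under its empirical distribution."""
    counts = Counter(token_ids)
    n = len(token_ids)
    return -sum(c * math.log2(c / n) for c in counts.values())

def entropy_compression_ratio(text, token_ids):
    """Raw UTF-8 bits divided by entropy-coded token bits (higher = better compression).

    NOTE: this ratio definition is an assumption for illustration only.
    """
    raw_bits = 8 * len(text.encode("utf-8"))
    return raw_bits / token_entropy_bits(token_ids)

# Usage (tok from the Quickstart above):
# print(entropy_compression_ratio(sample_text, tok.encode(sample_text)))
```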
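As mentioned under "Using without importing this repo", the core encoding loop of a standalone loader is small. The sketch below is illustrative only: it assumes `bpe_vocab.json` maps token strings to integer IDs and `bpe_merges.txt` holds one space-separated pair per line, uses naive whitespace pretokenization instead of the trained word/whitespace splitting and macro protection, and assumes SentencePiece-style `<0xNN>` byte tokens for the fallback path, which this tokenizer may not use.

```python
import json
from pathlib import Path

def load_bpe(tokenizer_dir):
    """Load vocab and ranked merges (assumed file formats; comment lines skipped)."""
    vocab = json.loads((Path(tokenizer_dir) / "bpe_vocab.json").read_text(encoding="utf-8"))
    merges = {}
    for rank, line in enumerate((Path(tokenizer_dir) / "bpe_merges.txt").read_text(encoding="utf-8").splitlines()):
        if line and not line.startswith("#"):
            a, b = line.split()
            merges[(a, b)] = rank
    return vocab, merges

def bpe_word(word, merges):
    """Greedily apply merge rules to one word, lowest-rank (earliest-learned) pair first."""
    parts = list(word)
    while len(parts) > 1:
        rank, i = min((merges.get(pair, float("inf")), j)
                      for j, pair in enumerate(zip(parts, parts[1:])))
        if rank == float("inf"):
            break  # no applicable merge left
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts

def encode(text, vocab, merges):
    ids = []
    for word in text.split():  # the real pretokenizer also preserves whitespace tokens
        for piece in bpe_word(word, merges):
            if piece in vocab:
                ids.append(vocab[piece])
            else:
                # Naive byte fallback: assumes per-byte tokens like "<0x41>" exist
                # in the vocab; 0 is a placeholder for an unknown-token ID.
                ids.extend(vocab.get(f"<0x{b:02X}>", 0) for b in piece.encode("utf-8"))
    return ids
```

For production use, prefer copying the actual `BPETokenizer` class so pretokenization, macro protection, and special-token handling match training exactly.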
## Reproducibility

Train and evaluate end-to-end with the included pipeline (from this repo):

```bash
# Train (example; adjust paths and targets)
python run_pipeline.py \
  --data_path data/twitch_chat.parquet \
  --text_column Message \
  --vocab_target 56000 \
  --granularity word \
  --ratio_metric entropy \
  --fast_train

# Artifacts will be written to the tokenizer/ directory
```

## Changelog

- v1.0: Public release. Stable fast training, macro protection, and strong validation metrics with a ~53k vocabulary.

## License

MIT License

## Acknowledgments

Thanks to the open-source tokenizer literature and the community's work on BPE and text normalization.