Add README
README.md

---
tags:
- bpe
- twitch
- chat
- nlp
license: mit
library_name: none
language:
- en
pretty_name: Twitch BPE Tokenizer
---

# Twitch BPE Tokenizer

A compact, production-ready Byte Pair Encoding (BPE) tokenizer trained on large-scale Twitch chat. It focuses on:

- Efficient compression at word granularity while preserving roundtrip fidelity
- Stable fast-training on very large corpora (sampling + batch merges)
- Sensible cleaning and macro protection for URLs, mentions, emojis, and whitespace

This repository hosts the tokenizer artifacts needed to use the model in your own pipelines or applications.

## Highlights
- Granularity: word (with byte fallback for out-of-vocab code points)
- Typical vocab size: ~53k (configurable at train time)
- Validation metrics (indicative): entropy compression ratio ≈ 3.0; roundtrip ≈ 0.997
- Space-attached merges (GPT-2 style) for better compression of common prefixes

> Note: Metrics are data-dependent and will vary by dataset and exact configuration. The values above reflect typical results on held-out Twitch chat samples of ~200k lines with a ~53k vocab.

## Intended Use & Limitations
- Intended for tokenizing informal, high-throughput chat or social text for downstream modeling or analytics.
- The training corpus (Twitch chat) can contain colloquialisms, slang, and potentially offensive content. Exercise care when deploying in user-facing contexts.
- Not a language model. It provides tokenization only (vocab, merges, and config).

## Training Data & Cleaning
- Source: Twitch-chat-style messages (user-provided corpus).
- Normalization: NFC; control characters stripped; repeated characters capped; whitespace collapsed where safe (see the sketch below).
- Placeholders/macros: URLs → `<URL>`, user mentions → `<USER>`; emojis and some patterns protected.
- Granularity during training: `word` with macro protection; fallback to bytes for unseen characters.
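
For illustration, a minimal cleaning pass along these lines might look like the sketch below. It is an assumption-laden approximation, not the repo's actual code: the real pipeline's regexes, repeat cap, and protected emoji/patterns may differ.

```python
import re
import unicodedata

# Hypothetical patterns, for illustration only.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean(line: str) -> str:
    line = unicodedata.normalize("NFC", line)                            # NFC
    line = "".join(c for c in line if unicodedata.category(c) != "Cc")   # strip controls
    line = URL_RE.sub("<URL>", line)                                     # URL macro
    line = MENTION_RE.sub("<USER>", line)                                # mention macro
    line = re.sub(r"(.)\1{3,}", r"\1\1\1", line)                         # cap repeats at 3
    return re.sub(r"[ \t]+", " ", line).strip()                          # collapse spaces

print(clean("POGGERSSSSS @someone   hiiiii"))
# -> 'POGGERSSS <USER> hiii'
```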

## Tokenization Details
- Pretokenization splits text into word and whitespace tokens, preserving the structure required for good merges.
- Byte fallback ensures roundtrip fidelity for any Unicode input.
- Special tokens are reserved with fixed IDs: `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`.
- The merge strategy during training includes batch selection, bucket caps to avoid overshoot, and early stopping on stable ratio improvements.
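
A rough illustration of the pretokenization step and the fixed special IDs (both assumed here; the repo's actual splitter and the authoritative values in `tokenizer_config.json` take precedence):

```python
import re

# Split into alternating word/whitespace runs so decoding can rejoin them
# exactly (illustrative splitter, not necessarily the repo's regex).
PRETOKENIZE = re.compile(r"\s+|\S+")
print(PRETOKENIZE.findall("hey @user  nice clip"))
# -> ['hey', ' ', '@user', '  ', 'nice', ' ', 'clip']

# Assumed fixed IDs for the reserved specials; check tokenizer_config.json.
SPECIALS = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}
```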

## Files
- `bpe_vocab.json` – token → id mapping
- `bpe_merges.txt` – merge rules (ordered)
- `tokenizer_config.json` – basic tokenizer configuration
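
A quick sanity check of the three artifacts, assuming the formats described above (a JSON token → id map and one ordered merge rule per line); run it from the folder downloaded in the Quickstart below:

```python
import json

with open("bpe_vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)              # token -> id
print(len(vocab), "tokens")           # expect roughly 53k

with open("bpe_merges.txt", encoding="utf-8") as f:
    merges = f.read().splitlines()
print(merges[:5])                     # merge rules, highest priority first

with open("tokenizer_config.json", encoding="utf-8") as f:
    print(json.load(f))               # granularity, special tokens, etc.
```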

## Quickstart
```python
from huggingface_hub import snapshot_download
from src.bpe_tokenizer import BPETokenizer
from src import config as CFG

repo_id = "Soldier-Boy/twitch-tokenizer"

# Download all artifacts into a local directory
local_dir = snapshot_download(repo_id)

# Optional: align granularity with training (recommended)
CFG.TOKEN_GRANULARITY = 'word'

# Initialize and use
tok = BPETokenizer(tokenizer_dir=local_dir)
text = "hey @user check this: https://t.tv/AbC"
ids = tok.encode(text)
print(ids)
print(tok.decode(ids))
```

### Using without importing this repo
If you don't want to depend on this repository's code, copy the `BPETokenizer` class (and its small dependencies) into your project and point it to a folder containing the three files above.
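
As a starting point for such a port, the merge-application core of a standalone encoder could look like the textbook greedy loop below. It is a sketch only: it assumes one space-separated pair per line in `bpe_merges.txt` (with any `#`-prefixed header lines skipped) and omits the repo's pretokenization, space-attached merges, and byte fallback.

```python
import json

vocab = json.load(open("bpe_vocab.json", encoding="utf-8"))
merges = [tuple(line.split()) for line in open("bpe_merges.txt", encoding="utf-8")
          if line.strip() and not line.startswith("#")]
rank = {pair: i for i, pair in enumerate(merges)}   # earlier merge = higher priority

def bpe_word(word: str) -> list[str]:
    """Greedily apply the best-ranked merge until none applies."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(rank[p], i) for i, p in enumerate(zip(symbols, symbols[1:])) if p in rank]
        if not pairs:
            break
        _, i = min(pairs)
        symbols[i:i + 2] = ["".join(symbols[i:i + 2])]
    return symbols

# Map tokens to ids, falling back to <UNK> (reserved per the details above).
print([vocab.get(t, vocab["<UNK>"]) for t in bpe_word("poggers")])
```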

## Metrics & Evaluation (indicative)
- Compression ratio (entropy): ~3.03 on ~200k validation lines
- Roundtrip accuracy: ~0.997 (exact byte-for-byte decoding for most lines)

Evaluation is performed with word granularity, the entropy ratio metric, and a ~53k vocabulary. For fixed-length codes, ratios will differ by token distribution.
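
Roundtrip accuracy is straightforward to re-check on your own data. The sketch below also reports a plain chars-per-token ratio, which correlates with, but is not identical to, the entropy-based ratio quoted above (`tok` comes from the Quickstart; `validation_lines` is your own held-out list of strings):

```python
def evaluate(tok, lines):
    exact = chars = toks = 0
    for line in lines:
        ids = tok.encode(line)
        exact += (tok.decode(ids) == line)   # exact roundtrip?
        chars += len(line)
        toks += len(ids)
    return exact / len(lines), chars / toks

roundtrip, chars_per_token = evaluate(tok, validation_lines)
print(f"roundtrip={roundtrip:.3f}, chars/token={chars_per_token:.2f}")
```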

## Reproducibility
Train/evaluate end-to-end with the included pipeline (from this repo):
```bash
# Train (example; adjust paths and targets)
python run_pipeline.py --data_path data/twitch_chat.parquet --text_column Message --vocab_target 56000 --granularity word --ratio_metric entropy --fast_train

# Artifacts will be written to the tokenizer/ directory
```

## Changelog
- v1.0: Public release. Stable fast-training, macro protection, and strong validation metrics with ~53k vocab.

## License
MIT License

## Acknowledgments
Thanks to the open-source tokenizer literature and community work around BPE and text normalization.