|
|
--- |
|
|
tags: |
|
|
- tokenizer |
|
|
- bpe |
|
|
- twitch |
|
|
- chat |
|
|
- nlp |
|
|
license: mit |
|
|
library_name: none |
|
|
language: |
|
|
- en |
|
|
pretty_name: Twitch BPE Tokenizer |
|
|
--- |
|
|
|
|
|
# Twitch BPE Tokenizer |
|
|
|
|
|
A compact, production-ready Byte Pair Encoding (BPE) tokenizer trained on large-scale Twitch chat. It focuses on: |
|
|
|
|
|
- Efficient compression at word granularity while preserving roundtrip fidelity |
|
|
- Stable fast-training on very large corpora (sampling + batch merges) |
|
|
- Sensible cleaning and macro protection for URLs, mentions, emojis, and whitespace |
|
|
|
|
|
This repository hosts the tokenizer artifacts needed to use the model in your own pipelines or applications. |
|
|
|
|
|
## Highlights |
|
|
- Granularity: word (with byte fallback for out-of-vocab code points) |
|
|
- Typical vocab size: ~53k (configurable at train time) |
|
|
- Validation metrics (indicative): entropy compression ratio ≈ 3.0; roundtrip ≈ 0.997
|
|
- Space-attached merges (GPT-2 style) for better compression of common prefixes |
|
|
|
|
|
> Note: Metrics are data-dependent and will vary by dataset and exact configuration. Values above reflect typical results on held-out Twitch chat samples of ~200k lines with ~53k vocab. |
|
|
|
|
|
## Intended Use & Limitations |
|
|
- Intended for tokenizing informal, high-throughput chat or social text for downstream modeling or analytics. |
|
|
- The training corpus (Twitch chat) can contain colloquialisms, slang, and potentially offensive content. Exercise care when deploying in user-facing contexts. |
|
|
- Not a language model. It provides tokenization only (vocab, merges, and config). |
|
|
|
|
|
## Training Data & Cleaning |
|
|
- Source: Twitch-chat-style messages (user-provided corpus).
|
|
- Normalization: NFC; controls stripped; repeated characters capped; whitespace collapsed where safe. |
|
|
- Placeholders/macros: URLs → `<URL>`, user mentions → `<USER>`; emojis and some patterns protected (see the sketch after this list).
|
|
- Granularity during training: `word` with macro protection; fallback to bytes for unseen characters. |
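
The exact cleaning pipeline lives in this repo's training code; the snippet below is only an illustrative sketch of the steps listed above (NFC normalization, control stripping, repeat capping, and URL/mention macros). The function and pattern names are placeholders, not the repo's actual API.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{3,}")  # runs of 4+ identical characters

def clean_line(text: str, max_repeat: int = 3) -> str:
    # NFC normalization so visually identical strings share one code-point form
    text = unicodedata.normalize("NFC", text)
    # Strip control characters (category Cc), keeping ordinary whitespace
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\t\n")
    # Protect URLs and @mentions behind macros before further processing
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<USER>", text)
    # Cap repeated characters, e.g. "yooooo" -> "yooo"
    text = REPEAT_RE.sub(lambda m: m.group(1) * max_repeat, text)
    # Collapse runs of spaces where safe
    text = re.sub(r"[ ]{2,}", " ", text).strip()
    return text

print(clean_line("yooooo @viewer check https://t.tv/AbC   now"))
# -> "yooo <USER> check <URL> now"
```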
|
|
|
|
|
## Tokenization Details |
|
|
- Pretokenization splits into words and whitespace tokens, preserving structure required for good merges. |
|
|
- Byte fallback ensures lossless roundtrip for any Unicode input (a minimal sketch follows this list).
|
|
- Special tokens reserved with fixed IDs: `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`. |
|
|
- Merge strategy during training includes batch selection, bucket caps to avoid overshoot, and early-stop on stable ratio improvements. |
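
For illustration only, here is a minimal sketch of the pretokenization, space-attached tokens, and byte-fallback behaviour described above. It is not the repo's implementation: the special-token IDs, byte-token spelling, and helper names are assumptions.

```python
import re

# Reserved special tokens with fixed, low IDs (assumed layout)
SPECIALS = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}

# GPT-2 style pretokenization: keep the leading space attached to the word,
# so merges can learn space-prefixed tokens like " chat".
PRETOK_RE = re.compile(r" ?\S+|\s+")

def pretokenize(text: str) -> list[str]:
    return PRETOK_RE.findall(text)

def encode_word(word: str, vocab: dict[str, int]) -> list[int]:
    """Look the word up in the vocab; fall back to raw UTF-8 bytes if absent."""
    if word in vocab:
        return [vocab[word]]
    ids = []
    for b in word.encode("utf-8"):
        byte_token = f"<0x{b:02X}>"  # assumed byte-token spelling
        ids.append(vocab.get(byte_token, SPECIALS["<UNK>"]))
    return ids

# Toy vocab for demonstration; real use loads bpe_vocab.json
vocab = {**SPECIALS, " chat": 10, "hey": 11, "<0xF0>": 20, "<0x9F>": 21}
print(pretokenize("hey chat 👋"))   # ['hey', ' chat', ' 👋']
print(encode_word(" chat", vocab))  # [10]
print(encode_word("👋", vocab))     # byte-fallback ids (partly <UNK> in this toy vocab)
```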
|
|
|
|
|
## Files |
|
|
- `bpe_vocab.json` — token → id mapping
|
|
- `bpe_merges.txt` — merge rules (ordered)
|
|
- `tokenizer_config.json` — basic tokenizer configuration
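
As a quick sanity check, the artifacts can be inspected directly. The snippet below assumes the three files sit in a local `tokenizer/` directory, that the vocab is a plain token→id JSON object, and that the merges file holds one rule per line (the usual GPT-2-style layout); check the actual files for the exact contents.

```python
import json
from pathlib import Path

tok_dir = Path("tokenizer")  # adjust to wherever the artifacts live

vocab = json.loads((tok_dir / "bpe_vocab.json").read_text(encoding="utf-8"))
# Skip blank lines and any header/comment line if present
merges = [l for l in (tok_dir / "bpe_merges.txt").read_text(encoding="utf-8").splitlines()
          if l and not l.startswith("#")]
config = json.loads((tok_dir / "tokenizer_config.json").read_text(encoding="utf-8"))

print(f"vocab size:  {len(vocab)}")   # ~53k for the published artifacts
print(f"merge rules: {len(merges)}")  # ordered, highest priority first
print("first merges:", merges[:5])
print("config keys:", sorted(config))
```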
|
|
|
|
|
## Quickstart |
|
|
```python |
|
|
# Imports below assume this repo's source (the `src` package) is on your path
from huggingface_hub import snapshot_download
|
|
from src.bpe_tokenizer import BPETokenizer |
|
|
from src import config as CFG |
|
|
|
|
|
repo_id = "Soldier-Boy/twitch-tokenizer" |
|
|
|
|
|
# Download all artifacts into a local directory |
|
|
local_dir = snapshot_download(repo_id) |
|
|
|
|
|
# Optional: align granularity with training (recommended) |
|
|
CFG.TOKEN_GRANULARITY = 'word' |
|
|
|
|
|
# Initialize and use |
|
|
tok = BPETokenizer(tokenizer_dir=local_dir) |
|
|
text = "hey @user check this: https://t.tv/AbC" |
|
|
ids = tok.encode(text) |
|
|
print(ids) |
|
|
print(tok.decode(ids)) |
|
|
``` |
|
|
|
|
|
### Using without importing this repo |
|
|
If you don't want to depend on this repository's code, copy the `BPETokenizer` class (and its small dependencies) into your project and point it to a folder containing the three files above. |
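
For orientation, here is a heavily simplified, standalone sketch of what such a class does with the artifacts: load the vocab and ordered merges, greedily apply merges to each pretokenized word, and map the resulting pieces to ids. It omits byte fallback, macro protection, and the exact pretokenizer, so its outputs can differ from the full `BPETokenizer`; treat it as a starting point only.

```python
import json
import re
from pathlib import Path

def load_bpe(tok_dir: str):
    tok_dir = Path(tok_dir)
    vocab = json.loads((tok_dir / "bpe_vocab.json").read_text(encoding="utf-8"))
    merges = []
    for line in (tok_dir / "bpe_merges.txt").read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank/comment lines if present
        a, b = line.split()
        merges.append((a, b))
    ranks = {pair: i for i, pair in enumerate(merges)}  # earlier rules rank higher
    return vocab, ranks

def bpe_word(word: str, ranks) -> list[str]:
    """Greedy merge loop: repeatedly join the best-ranked adjacent pair."""
    pieces = list(word)
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break
        i = pairs.index(best)
        pieces[i:i + 2] = ["".join(best)]
    return pieces

def encode(text: str, vocab, ranks, unk="<UNK>") -> list[int]:
    ids = []
    # Keep the leading space attached, roughly matching space-attached merges
    for word in re.findall(r" ?\S+", text):
        for piece in bpe_word(word, ranks):
            ids.append(vocab.get(piece, vocab.get(unk, 0)))
    return ids

vocab, ranks = load_bpe("tokenizer")  # adjust to the snapshot directory
print(encode("hey chat", vocab, ranks))
```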
|
|
|
|
|
## Metrics & Evaluation (indicative) |
|
|
- Compression ratio (entropy): ~3.03 on ~200k validation lines |
|
|
- Roundtrip accuracy: ~0.997 (exact byte-for-byte decoding for most lines) |
|
|
|
|
|
Evaluation is performed with word granularity, the entropy ratio metric, and a ~53k vocabulary. With a fixed-length-code baseline instead, ratios will differ depending on the token distribution.
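
The exact metric implementation lives in the pipeline. One common reading of an entropy-based compression ratio, shown below purely for illustration and possibly differing from the pipeline's definition, compares the bits needed under a character-level entropy code with the bits needed under a token-level entropy code for the same text.

```python
import math
from collections import Counter

def entropy_bits(symbols) -> float:
    """Total Shannon bits for a sequence under its own empirical distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

def entropy_compression_ratio(text: str, tokens: list[str]) -> float:
    # Bits for character-level coding divided by bits for token-level coding
    return entropy_bits(list(text)) / entropy_bits(tokens)

# Toy example; real evaluation uses held-out chat lines and the trained tokenizer
text = "hey chat hey chat pog"
tokens = ["hey", " chat", " hey", " chat", " pog"]
print(round(entropy_compression_ratio(text, tokens), 2))
```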
|
|
|
|
|
## Reproducibility |
|
|
Train/evaluate end-to-end with the included pipeline (from this repo): |
|
|
```bash |
|
|
# Train (example; adjust paths and targets) |
|
|
python run_pipeline.py --data_path data/twitch_chat.parquet --text_column Message --vocab_target 56000 --granularity word --ratio_metric entropy --fast_train |
|
|
|
|
|
# Artifacts will be written to the tokenizer/ directory |
|
|
``` |
|
|
|
|
|
## Changelog |
|
|
- v1.0: Public release. Stable fast-training, macro protection, and strong validation metrics with ~53k vocab. |
|
|
|
|
|
## License |
|
|
MIT License |
|
|
|
|
|
## Acknowledgments |
|
|
Thanks to the open-source tokenizer literature and community work around BPE and text normalization. |
|
|
|
|
|
|