---
tags:
- tokenizer
- bpe
- twitch
- chat
- nlp
license: mit
library_name: none
language:
- en
pretty_name: Twitch BPE Tokenizer
---
# Twitch BPE Tokenizer
A compact, production-ready Byte Pair Encoding (BPE) tokenizer trained on large-scale Twitch chat. It focuses on:
- Efficient compression at word granularity while preserving roundtrip fidelity
- Stable fast-training on very large corpora (sampling + batch merges)
- Sensible cleaning and macro protection for URLs, mentions, emojis, and whitespace
This repository hosts the tokenizer artifacts needed to use the model in your own pipelines or applications.
## Highlights
- Granularity: word (with byte fallback for out-of-vocab code points)
- Typical vocab size: ~53k (configurable at train time)
- Validation metrics (indicative): entropy compression ratio ≈ 3.0; roundtrip ≈ 0.997
- Space-attached merges (GPT-2 style) for better compression of common prefixes
> Note: Metrics are data-dependent and will vary by dataset and exact configuration. Values above reflect typical results on held-out Twitch chat samples of ~200k lines with ~53k vocab.
## Intended Use & Limitations
- Intended for tokenizing informal, high-throughput chat or social text for downstream modeling or analytics.
- The training corpus (Twitch chat) can contain colloquialisms, slang, and potentially offensive content. Exercise care when deploying in user-facing contexts.
- Not a language model. It provides tokenization only (vocab, merges, and config).
## Training Data & Cleaning
- Source: Twitch chat style messages (user-provided corpus).
- Normalization: NFC; controls stripped; repeated characters capped; whitespace collapsed where safe.
- Placeholders/macros: URLs → `<URL>`, user mentions → `<USER>`; emojis and some patterns protected.
- Granularity during training: `word` with macro protection; fallback to bytes for unseen characters.
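The cleaning steps above can be sketched as a small normalization function. The exact regexes, repeat cap, and macro names are illustrative assumptions here, not the repository's actual rules:

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")      # assumed URL pattern
MENTION_RE = re.compile(r"@\w+")          # assumed mention pattern
REPEAT_RE = re.compile(r"(.)\1{3,}")      # runs of 4+ identical characters

def clean(text: str) -> str:
    # NFC normalization
    text = unicodedata.normalize("NFC", text)
    # strip control characters (Unicode category Cc), keeping \n and \t
    text = "".join(ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    # protect URLs and mentions behind macros
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<USER>", text)
    # cap repeated characters at 4
    text = REPEAT_RE.sub(lambda m: m.group(1) * 4, text)
    # collapse runs of spaces where safe
    text = re.sub(r" {2,}", " ", text)
    return text.strip()
```

Macro substitution runs before the repeat cap so that placeholders are never mangled by later rules.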
## Tokenization Details
- Pretokenization splits into words and whitespace tokens, preserving structure required for good merges.
- Byte fallback ensures roundtrip for any Unicode inputs.
- Special tokens reserved with fixed IDs: `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`.
- Merge strategy during training includes batch selection, bucket caps to avoid overshoot, and early-stop on stable ratio improvements.
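A space-attached, structure-preserving pretokenization (GPT-2 style, as referenced in the Highlights) can be sketched as follows; the exact pattern used in training is an assumption:

```python
import re

# Each token is a word or punctuation run, optionally carrying its leading
# space; remaining whitespace is kept as its own token so that joining the
# pieces reproduces the input exactly (roundtrip fidelity).
PRETOKEN_RE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

def pretokenize(text: str) -> list[str]:
    return PRETOKEN_RE.findall(text)
```

Because every character falls into one of the three alternatives, `"".join(pretokenize(text)) == text` holds for any input, which is what makes byte-for-byte decoding possible downstream.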
## Files
- `bpe_vocab.json` – token → id mapping
- `bpe_merges.txt` – merge rules (ordered)
- `tokenizer_config.json` – basic tokenizer configuration
## Quickstart
```python
from huggingface_hub import snapshot_download
from src.bpe_tokenizer import BPETokenizer
from src import config as CFG
repo_id = "Soldier-Boy/twitch-tokenizer"
# Download all artifacts into a local directory
local_dir = snapshot_download(repo_id)
# Optional: align granularity with training (recommended)
CFG.TOKEN_GRANULARITY = 'word'
# Initialize and use
tok = BPETokenizer(tokenizer_dir=local_dir)
text = "hey @user check this: https://t.tv/AbC"
ids = tok.encode(text)
print(ids)
print(tok.decode(ids))
```
### Using without importing this repo
If you don't want to depend on this repository's code, copy the `BPETokenizer` class (and its small dependencies) into your project and point it to a folder containing the three files above.
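For a sense of what such a standalone implementation involves, here is a minimal sketch of loading the artifacts and applying merges to a single word. It assumes the standard GPT-2-style formats (a JSON token→id map and one space-separated merge pair per line); check your copies of the files before relying on this:

```python
import json

def load_tokenizer(vocab_path: str, merges_path: str):
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)  # token -> id
    ranks = {}
    with open(merges_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            a, b = line.split(" ")
            ranks[(a, b)] = i  # earlier merges have lower rank
    return vocab, ranks

def bpe_word(word: str, ranks: dict) -> list[str]:
    # Start from single characters, then repeatedly apply the
    # lowest-ranked (earliest-learned) merge until none applies.
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```

The actual `BPETokenizer` additionally handles macros, special tokens, and byte fallback, so treat this only as a starting point.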
## Metrics & Evaluation (indicative)
- Compression ratio (entropy): ~3.03 on ~200k validation lines
- Roundtrip accuracy: ~0.997 (exact byte-for-byte decoding for most lines)
Evaluation is performed with word granularity, entropy ratio metric, and a ~53k vocabulary. For fixed-length codes, ratios will differ by token distribution.
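One plausible reading of an entropy-based compression ratio is the Shannon information of the character sequence divided by that of the token sequence; this definition is an assumption for illustration and may differ from the metric computed by the pipeline:

```python
import math
from collections import Counter

def shannon_bits(items) -> float:
    # Total Shannon information of a sequence under its empirical distribution
    counts = Counter(items)
    n = len(items)
    return -sum(c * math.log2(c / n) for c in counts.values())

def entropy_compression_ratio(text: str, token_ids: list[int]) -> float:
    # Bits needed to code the characters vs. bits needed to code the tokens
    return shannon_bits(list(text)) / shannon_bits(token_ids)
```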
## Reproducibility
Train/evaluate end-to-end with the included pipeline (from this repo):
```bash
# Train (example; adjust paths and targets)
python run_pipeline.py --data_path data/twitch_chat.parquet --text_column Message --vocab_target 56000 --granularity word --ratio_metric entropy --fast_train
# Artifacts will be written to the tokenizer/ directory
```
## Changelog
- v1.0: Public release. Stable fast-training, macro protection, and strong validation metrics with ~53k vocab.
## License
MIT License
## Acknowledgments
Thanks to the open-source tokenizer literature and community work around BPE and text normalization.