Add README
README.md

---
tags:
- bpe
- twitch
- chat
- nlp
license: mit
library_name: none
language:
- en
pretty_name: Twitch BPE Tokenizer
---

# Twitch BPE Tokenizer

A compact, production-ready Byte Pair Encoding (BPE) tokenizer trained on large-scale Twitch chat. It focuses on:

- Efficient compression at word granularity while preserving roundtrip fidelity
- Stable fast-training on very large corpora (sampling + batch merges)
- Sensible cleaning and macro protection for URLs, mentions, emojis, and whitespace

This repository hosts the tokenizer artifacts needed to use the model in your own pipelines or applications.

## Highlights
- Granularity: word (with byte fallback for out-of-vocab code points)
- Typical vocab size: ~53k (configurable at train time)
- Validation metrics (indicative): entropy compression ratio ≈ 3.0; roundtrip ≈ 0.997
- Space-attached merges (GPT-2 style) for better compression of common prefixes

> Note: Metrics are data-dependent and will vary by dataset and exact configuration. The values above reflect typical results on held-out Twitch chat samples of ~200k lines with a ~53k vocab.

## Intended Use & Limitations
- Intended for tokenizing informal, high-throughput chat or social text for downstream modeling or analytics.
- The training corpus (Twitch chat) can contain colloquialisms, slang, and potentially offensive content. Exercise care when deploying in user-facing contexts.
- Not a language model. It provides tokenization only (vocab, merges, and config).

## Training Data & Cleaning
- Source: Twitch-chat-style messages (user-provided corpus).
- Normalization: NFC; control characters stripped; repeated characters capped; whitespace collapsed where safe (see the sketch below).
- Placeholders/macros: URLs → `<URL>`, user mentions → `<USER>`; emojis and some patterns protected.
- Granularity during training: `word` with macro protection; fallback to bytes for unseen characters.
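
For illustration, a minimal cleaning pass along these lines might look like the sketch below. It is an assumption-laden approximation, not the repo's actual code: the real pipeline's regexes, repeat cap, and protected emoji/patterns may differ.

```python
import re
import unicodedata

# Hypothetical patterns, for illustration only.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean(line: str) -> str:
    line = unicodedata.normalize("NFC", line)                            # NFC
    line = "".join(c for c in line if unicodedata.category(c) != "Cc")   # strip controls
    line = URL_RE.sub("<URL>", line)                                     # URL macro
    line = MENTION_RE.sub("<USER>", line)                                # mention macro
    line = re.sub(r"(.)\1{3,}", r"\1\1\1", line)                         # cap repeats at 3
    return re.sub(r"[ \t]+", " ", line).strip()                          # collapse spaces

print(clean("POGGERSSSSS @someone   hiiiii"))
# -> 'POGGERSSS <USER> hiii'
```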

## Tokenization Details
- Pretokenization splits text into word and whitespace tokens, preserving the structure required for good merges.
- Byte fallback ensures roundtrip fidelity for any Unicode input.
- Special tokens are reserved with fixed IDs: `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`.
- The merge strategy during training includes batch selection, bucket caps to avoid overshoot, and early stopping on stable ratio improvements.
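
A rough illustration of the pretokenization step and the fixed special IDs (both assumed here; the repo's actual splitter and the authoritative values in `tokenizer_config.json` take precedence):

```python
import re

# Split into alternating word/whitespace runs so decoding can rejoin them
# exactly (illustrative splitter, not necessarily the repo's regex).
PRETOKENIZE = re.compile(r"\s+|\S+")
print(PRETOKENIZE.findall("hey @user  nice clip"))
# -> ['hey', ' ', '@user', '  ', 'nice', ' ', 'clip']

# Assumed fixed IDs for the reserved specials; check tokenizer_config.json.
SPECIALS = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}
```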

## Files
- `bpe_vocab.json` – token → id mapping
- `bpe_merges.txt` – merge rules (ordered)
- `tokenizer_config.json` – basic tokenizer configuration
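
A quick sanity check of the three artifacts, assuming the formats described above (a JSON token → id map and one ordered merge rule per line); run it from the folder downloaded in the Quickstart below:

```python
import json

with open("bpe_vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)              # token -> id
print(len(vocab), "tokens")           # expect roughly 53k

with open("bpe_merges.txt", encoding="utf-8") as f:
    merges = f.read().splitlines()
print(merges[:5])                     # merge rules, highest priority first

with open("tokenizer_config.json", encoding="utf-8") as f:
    print(json.load(f))               # granularity, special tokens, etc.
```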

## Quickstart
```python
from huggingface_hub import snapshot_download
from src.bpe_tokenizer import BPETokenizer
from src import config as CFG

repo_id = "Soldier-Boy/twitch-tokenizer"

# Download all artifacts into a local directory
local_dir = snapshot_download(repo_id)

# Optional: align granularity with training (recommended)
CFG.TOKEN_GRANULARITY = 'word'

# Initialize and use
tok = BPETokenizer(tokenizer_dir=local_dir)
text = "hey @user check this: https://t.tv/AbC"
ids = tok.encode(text)
print(ids)
print(tok.decode(ids))
```

### Using without importing this repo
If you don't want to depend on this repository's code, copy the `BPETokenizer` class (and its small dependencies) into your project and point it to a folder containing the three files above.
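
As a starting point for such a port, the merge-application core of a standalone encoder could look like the textbook greedy loop below. It is a sketch only: it assumes one space-separated pair per line in `bpe_merges.txt` (with any `#`-prefixed header lines skipped) and omits the repo's pretokenization, space-attached merges, and byte fallback.

```python
import json

vocab = json.load(open("bpe_vocab.json", encoding="utf-8"))
merges = [tuple(line.split()) for line in open("bpe_merges.txt", encoding="utf-8")
          if line.strip() and not line.startswith("#")]
rank = {pair: i for i, pair in enumerate(merges)}   # earlier merge = higher priority

def bpe_word(word: str) -> list[str]:
    """Greedily apply the best-ranked merge until none applies."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(rank[p], i) for i, p in enumerate(zip(symbols, symbols[1:])) if p in rank]
        if not pairs:
            break
        _, i = min(pairs)
        symbols[i:i + 2] = ["".join(symbols[i:i + 2])]
    return symbols

# Map tokens to ids, falling back to <UNK> (reserved per the details above).
print([vocab.get(t, vocab["<UNK>"]) for t in bpe_word("poggers")])
```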

## Metrics & Evaluation (indicative)
- Compression ratio (entropy): ~3.03 on ~200k validation lines
- Roundtrip accuracy: ~0.997 (exact byte-for-byte decoding for most lines)

Evaluation is performed with word granularity, the entropy ratio metric, and a ~53k vocabulary. For fixed-length codes, ratios will differ by token distribution.
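
Roundtrip accuracy is straightforward to re-check on your own data. The sketch below also reports a plain chars-per-token ratio, which correlates with, but is not identical to, the entropy-based ratio quoted above (`tok` comes from the Quickstart; `validation_lines` is your own held-out list of strings):

```python
def evaluate(tok, lines):
    exact = chars = toks = 0
    for line in lines:
        ids = tok.encode(line)
        exact += (tok.decode(ids) == line)   # exact roundtrip?
        chars += len(line)
        toks += len(ids)
    return exact / len(lines), chars / toks

roundtrip, chars_per_token = evaluate(tok, validation_lines)
print(f"roundtrip={roundtrip:.3f}, chars/token={chars_per_token:.2f}")
```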

## Reproducibility
Train/evaluate end-to-end with the included pipeline (from this repo):
```bash
# Train (example; adjust paths and targets)
python run_pipeline.py --data_path data/twitch_chat.parquet --text_column Message --vocab_target 56000 --granularity word --ratio_metric entropy --fast_train

# Artifacts will be written to the tokenizer/ directory
```

## Changelog
- v1.0: Public release. Stable fast-training, macro protection, and strong validation metrics with ~53k vocab.

## License
MIT License

## Acknowledgments
Thanks to the open-source tokenizer literature and community work around BPE and text normalization.