---
tags:
- tokenizer
- bpe
- twitch
- chat
license: mit
library_name: none
language:
- en
---

# Twitch BPE Tokenizer

A custom Byte Pair Encoding (BPE) tokenizer trained on Twitch chat.

- Granularity: word
- Vocab size: ~53k (configurable)
- Metrics (validation): entropy compression ratio ~3.0, roundtrip accuracy ~0.997

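The roundtrip figure is the fraction of held-out lines that survive an encode/decode cycle unchanged. A minimal sketch of how such a metric can be computed (the `encode`/`decode` stand-ins below are toy placeholders, not this repo's API):

```python
def roundtrip_rate(encode, decode, texts):
    """Fraction of texts for which decode(encode(t)) reproduces t exactly."""
    if not texts:
        return 0.0
    ok = sum(1 for t in texts if decode(encode(t)) == t)
    return ok / len(texts)

# Toy stand-ins: whitespace split/join collapses repeated spaces, so it is lossy.
encode = str.split
decode = " ".join
print(roundtrip_rate(encode, decode, ["hello chat", "gg  wp"]))  # 0.5
```

A real tokenizer scores close to 1.0 here; losses come from normalization and from characters outside the vocabulary.
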
## Files
- bpe_vocab.json
- bpe_merges.txt
- tokenizer_config.json

## Usage
```python
from huggingface_hub import hf_hub_download
from src.bpe_tokenizer import BPETokenizer  # or copy the class into your project

repo_id = "Soldier-Boy/twitch-tokenizer"
vocab_path = hf_hub_download(repo_id, filename="bpe_vocab.json")
merges_path = hf_hub_download(repo_id, filename="bpe_merges.txt")
config_path = hf_hub_download(repo_id, filename="tokenizer_config.json")

# Place the three files in a folder and point BPETokenizer to it,
# or adapt BPETokenizer to accept file paths directly.
```

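If you adapt `BPETokenizer` into your own project, the heart of BPE encoding is a greedy loop that repeatedly merges the adjacent symbol pair with the highest-priority rule. A minimal sketch, assuming `bpe_merges.txt` lists one space-separated pair per line in priority order (the GPT-2 convention); this is illustrative, not the repo's implementation:

```python
def apply_merges(word, merges):
    """Encode one word by greedily applying BPE merges.

    `word` is a string; `merges` is a list of (left, right) pairs,
    highest priority first, e.g. parsed from bpe_merges.txt.
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; lower rank = higher merge priority.
        ranked = [(ranks.get(pair, float("inf")), i)
                  for i, pair in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(ranked)
        if best_rank == float("inf"):
            break  # no applicable merge left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("l", "l"), ("ll", "o"), ("h", "e")]
print(apply_merges("hello", merges))  # ['he', 'llo']
```
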
## Notes
- Mentions/URLs are normalized to placeholders (e.g., `<USER>`, `<URL>`) by the preprocessing pipeline.
- Tokenization allows space-attached merges (tokens may carry a leading space, as in GPT-2) for better compression.
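
The placeholder normalization can be sketched with two regular expressions; the patterns below are assumptions for illustration, and the repo's actual pipeline may differ:

```python
import re

# Hypothetical patterns; the real pipeline's rules are not published here.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")

def normalize(text):
    """Replace URLs and @mentions with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)
    text = USER_RE.sub("<USER>", text)
    return text

print(normalize("@mod check https://clips.twitch.tv/abc lol"))
# <USER> check <URL> lol
```
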