Soldier-Boy committed on
Commit be5f8fc · verified · 1 Parent(s): fcf2a19

Add README

Files changed (1)
  1. README.md +79 -18
README.md CHANGED
@@ -4,39 +4,100 @@ tags:
  - bpe
  - twitch
  - chat
  license: mit
  library_name: none
  language:
  - en
  ---

  # Twitch BPE Tokenizer

- A custom Byte Pair Encoding (BPE) tokenizer trained on Twitch chat.

- - Granularity: word
- - Vocab size: ~53k (configurable)
- - Metrics (val): entropy compression ratio ~3.0, roundtrip ~0.997

  ## Files
- - bpe_vocab.json
- - bpe_merges.txt
- - tokenizer_config.json

- ## Usage
  ```python
- from huggingface_hub import hf_hub_download
- from src.bpe_tokenizer import BPETokenizer  # or copy the class into your project

  repo_id = "Soldier-Boy/twitch-tokenizer"
- vocab_path = hf_hub_download(repo_id, filename="bpe_vocab.json")
- merges_path = hf_hub_download(repo_id, filename="bpe_merges.txt")
- config_path = hf_hub_download(repo_id, filename="tokenizer_config.json")

- # Place the three files in a folder and point BPETokenizer to it
- # or adapt BPETokenizer to accept file paths directly.
  ```

- ## Notes
- - Mentions/URLs are normalized to placeholders (e.g., <USER>, <URL>) by the pipeline.
- - Tokenization allows space-attached merges for efficiency (like GPT-2).
  - bpe
  - twitch
  - chat
+ - nlp
  license: mit
  library_name: none
  language:
  - en
+ pretty_name: Twitch BPE Tokenizer
  ---

  # Twitch BPE Tokenizer

+ A compact, production-ready Byte Pair Encoding (BPE) tokenizer trained on large-scale Twitch chat. It focuses on:

+ - Efficient compression at word granularity while preserving roundtrip fidelity
+ - Stable fast-training on very large corpora (sampling + batch merges)
+ - Sensible cleaning and macro protection for URLs, mentions, emojis, and whitespace
+
+ This repository hosts the artifacts needed to use the tokenizer in your own pipelines or applications.
+
+ ## Highlights
+ - Granularity: word (with byte fallback for out-of-vocab code points)
+ - Typical vocab size: ~53k (configurable at train time)
+ - Validation metrics (indicative): entropy compression ratio ≈ 3.0; roundtrip ≈ 0.997
+ - Space-attached merges (GPT-2 style) for better compression of common prefixes
+
+ > Note: Metrics are data-dependent and will vary by dataset and exact configuration. Values above reflect typical results on held-out Twitch chat samples of ~200k lines with ~53k vocab.
+
+ ## Intended Use & Limitations
+ - Intended for tokenizing informal, high-throughput chat or social text for downstream modeling or analytics.
+ - The training corpus (Twitch chat) can contain colloquialisms, slang, and potentially offensive content. Exercise care when deploying in user-facing contexts.
+ - Not a language model. It provides tokenization only (vocab, merges, and config).
+
+ ## Training Data & Cleaning
+ - Source: Twitch chat-style messages (user-provided corpus).
+ - Normalization: NFC; controls stripped; repeated characters capped; whitespace collapsed where safe.
+ - Placeholders/macros: URLs → `<URL>`, user mentions → `<USER>`; emojis and some patterns protected.
+ - Granularity during training: `word` with macro protection; fallback to bytes for unseen characters.
+
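As a rough illustration of the cleaning steps listed above: the sketch below is not the repo's actual pipeline, and the exact placeholder patterns, repeat cap, and emoji handling may differ, but it shows the shape of the normalization described.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")      # assumed URL pattern
MENTION_RE = re.compile(r"@\w+")          # assumed mention pattern
REPEAT_RE = re.compile(r"(.)\1{3,}")      # runs of 4+ identical characters

def normalize(msg: str, max_repeat: int = 3) -> str:
    msg = unicodedata.normalize("NFC", msg)                       # NFC normalization
    msg = "".join(ch for ch in msg                                # strip control chars
                  if unicodedata.category(ch) != "Cc" or ch in "\t\n")
    msg = URL_RE.sub("<URL>", msg)                                # protect URLs
    msg = MENTION_RE.sub("<USER>", msg)                           # protect mentions
    msg = REPEAT_RE.sub(lambda m: m.group(1) * max_repeat, msg)   # cap repeated chars
    msg = re.sub(r" {2,}", " ", msg).strip()                      # collapse spaces
    return msg

print(normalize("heyyyyy   @someone look https://clips.twitch.tv/x"))
# -> 'heyyy <USER> look <URL>'
```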
+ ## Tokenization Details
+ - Pretokenization splits text into word and whitespace tokens, preserving the structure needed for good merges.
+ - Byte fallback ensures roundtrip for any Unicode input.
+ - Special tokens reserved with fixed IDs: `<PAD>`, `<UNK>`, `<BOS>`, `<EOS>`.
+ - Merge strategy during training includes batch selection, bucket caps to avoid overshoot, and early stopping on stable ratio improvements.
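The first two bullets can be illustrated with a few lines of Python. This is a sketch under assumed conventions (the real `BPETokenizer` may represent byte tokens and word boundaries differently), showing pretokenization into word/whitespace pieces and byte fallback for out-of-vocab pieces.

```python
import re

PRETOKEN_RE = re.compile(r"\s+|\S+")   # alternate runs of whitespace and non-whitespace

def pretokenize(text: str) -> list[str]:
    # Merges are learned within these pieces, never across them.
    return PRETOKEN_RE.findall(text)

def byte_fallback(piece: str, vocab: set) -> list[str]:
    # Known pieces stay whole; unknown ones decompose into byte tokens,
    # so any Unicode input can be encoded and decoded exactly.
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]

vocab = {"pog", " ", "chat"}
print([byte_fallback(p, vocab) for p in pretokenize("pog 世界")])
# [['pog'], [' '], ['<0xE4>', '<0xB8>', '<0x96>', '<0xE7>', '<0x95>', '<0x8C>']]
```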
 
  ## Files
+ - `bpe_vocab.json`: token → id mapping
+ - `bpe_merges.txt`: merge rules (ordered)
+ - `tokenizer_config.json`: basic tokenizer configuration
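To inspect the artifacts directly, something like the following works once the three files are in the working directory (the keys inside `tokenizer_config.json` are whatever the repo defines; none are assumed here).

```python
import json

with open("bpe_vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)                                    # token -> id mapping
with open("bpe_merges.txt", encoding="utf-8") as f:
    merges = [ln.rstrip("\n") for ln in f if ln.strip()]    # ordered merge rules
with open("tokenizer_config.json", encoding="utf-8") as f:
    config = json.load(f)

print("vocab size:", len(vocab))
print("first merges:", merges[:5])
print("config keys:", sorted(config))
```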
 
+ ## Quickstart
  ```python
+ from huggingface_hub import snapshot_download
+ from src.bpe_tokenizer import BPETokenizer
+ from src import config as CFG

  repo_id = "Soldier-Boy/twitch-tokenizer"

+ # Download all artifacts into a local directory
+ local_dir = snapshot_download(repo_id)
+
+ # Optional: align granularity with training (recommended)
+ CFG.TOKEN_GRANULARITY = 'word'
+
+ # Initialize and use
+ tok = BPETokenizer(tokenizer_dir=local_dir)
+ text = "hey @user check this: https://t.tv/AbC"
+ ids = tok.encode(text)
+ print(ids)
+ print(tok.decode(ids))
  ```

+ ### Using without importing this repo
+ If you don't want to depend on this repository's code, copy the `BPETokenizer` class (and its small dependencies) into your project and point it to a folder containing the three files above.
+
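If you would rather not copy the class at all, the textbook BPE procedure below is a minimal stand-in: it loads the vocab and the ordered merges, then greedily applies the best-ranked merge within each whitespace-split piece. It is a sketch of the general algorithm, not the repo's `BPETokenizer`; it does not reproduce its pretokenization, space-attached merges, special-token handling, or byte fallback.

```python
import json

with open("bpe_vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)                                    # token -> id
with open("bpe_merges.txt", encoding="utf-8") as f:
    # One "left right" pair per line; earlier lines rank higher.
    ranks = {tuple(ln.split()): i for i, ln in enumerate(f)
             if ln.strip() and not ln.startswith("#")}

def bpe(piece: str) -> list:
    symbols = list(piece)
    while len(symbols) > 1:
        # Pick the adjacent pair with the lowest (best) merge rank.
        rank, i = min((ranks.get(pair, float("inf")), j)
                      for j, pair in enumerate(zip(symbols, symbols[1:])))
        if rank == float("inf"):
            break                                           # no merge applies
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

def encode(text: str) -> list:
    unk = vocab.get("<UNK>", 0)                             # assumed UNK token name
    return [vocab.get(sym, unk)
            for piece in text.split(" ")                    # crude pretokenization
            for sym in bpe(piece)]

print(encode("hey chat"))
```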
+ ## Metrics & Evaluation (indicative)
+ - Compression ratio (entropy): ~3.03 on ~200k validation lines
+ - Roundtrip accuracy: ~0.997 (exact byte-for-byte decoding for most lines)
+
+ Evaluation uses word granularity, the entropy ratio metric, and a ~53k vocabulary. For fixed-length codes, ratios will differ by token distribution.
+
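To sanity-check these numbers on your own held-out lines, a simple loop over encode/decode is enough. The ratio below is a crude characters-per-token proxy; the repo reports an entropy-based ratio, so expect different absolute values.

```python
def quick_eval(tok, lines):
    ok, chars, tokens = 0, 0, 0
    for line in lines:
        ids = tok.encode(line)
        ok += int(tok.decode(ids) == line)     # exact roundtrip?
        chars += len(line)
        tokens += len(ids)
    return {
        "roundtrip": ok / max(len(lines), 1),
        "chars_per_token": chars / max(tokens, 1),
    }

# `tok` is the BPETokenizer instance from the Quickstart above
print(quick_eval(tok, ["hey chat", "<USER> nice play", "lol <URL>"]))
```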
+ ## Reproducibility
+ Train/evaluate end-to-end with the included pipeline (from this repo):
+ ```bash
+ # Train (example; adjust paths and targets)
+ python run_pipeline.py --data_path data/twitch_chat.parquet --text_column Message --vocab_target 56000 --granularity word --ratio_metric entropy --fast_train
+
+ # Artifacts will be written to the tokenizer/ directory
+ ```
+
+ ## Changelog
+ - v1.0: Public release. Stable fast-training, macro protection, and strong validation metrics with ~53k vocab.
+
+ ## License
+ MIT License
+
+ ## Acknowledgments
+ Thanks to the open-source tokenizer literature and community work around BPE and text normalization.