Soldier-Boy committed on
Commit fcf2a19 · verified · 1 Parent(s): a1c81d5

Add README

Files changed (1)
  1. README.md +42 -0
README.md ADDED
@@ -0,0 +1,42 @@
+ ---
+ tags:
+ - tokenizer
+ - bpe
+ - twitch
+ - chat
+ license: mit
+ library_name: none
+ language:
+ - en
+ ---
+
+ # Twitch BPE Tokenizer
+
+ A custom Byte Pair Encoding (BPE) tokenizer trained on Twitch chat.
+
+ - Granularity: word
+ - Vocab size: ~53k (configurable)
+ - Metrics (val): entropy compression ratio ~3.0, roundtrip ~0.997
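
The README does not define the roundtrip metric. One plausible reading, and it is only an assumption, is the fraction of validation messages for which decoding the encoded text reproduces the input exactly. A minimal sketch of that reading follows; `roundtrip_rate`, `toy_encode`, and `toy_decode` are hypothetical names, and the toy functions stand in for whatever `BPETokenizer` actually exposes.

```python
from typing import Callable, Iterable, List


def roundtrip_rate(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> float:
    """Fraction of texts for which decode(encode(text)) == text."""
    total = 0
    exact = 0
    for text in texts:
        total += 1
        if decode(encode(text)) == text:
            exact += 1
    # Avoid dividing by zero on an empty validation set.
    return exact / total if total else 0.0


# Toy stand-ins for a real tokenizer (hypothetical): byte-level encode/decode.
def toy_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def toy_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")
```

Under this reading, a value of ~0.997 would mean that roughly 3 in 1,000 validation messages change after an encode/decode cycle, for example because normalization is lossy.
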
+
+ ## Files
+ - bpe_vocab.json
+ - bpe_merges.txt
+ - tokenizer_config.json
+
+ ## Usage
+ ```python
+ from huggingface_hub import hf_hub_download
+ from src.bpe_tokenizer import BPETokenizer  # or copy the class into your project
+
+ repo_id = "Soldier-Boy/twitch-tokenizer"
+ vocab_path = hf_hub_download(repo_id, filename="bpe_vocab.json")
+ merges_path = hf_hub_download(repo_id, filename="bpe_merges.txt")
+ config_path = hf_hub_download(repo_id, filename="tokenizer_config.json")
+
+ # Place the three files in a folder and point BPETokenizer to it
+ # or adapt BPETokenizer to accept file paths directly.
+ ```
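
The usage snippet stops at the step of placing the three files in a folder. A minimal sketch of that step, using only the standard library, might look like this. The helper name `gather_tokenizer_files` and the target folder are hypothetical, and how `BPETokenizer` is then pointed at the folder is not shown in the README.

```python
import shutil
from pathlib import Path
from typing import Iterable


def gather_tokenizer_files(paths: Iterable[str], target_dir: str) -> Path:
    """Copy downloaded files into one folder and return that folder."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for path in paths:
        # Keep the original file names (bpe_vocab.json, bpe_merges.txt, ...).
        # shutil.copy follows symlinks, so cached files are copied by content.
        shutil.copy(path, target / Path(path).name)
    return target
```

With the paths returned by `hf_hub_download`, this could be called as `gather_tokenizer_files([vocab_path, merges_path, config_path], "twitch_tokenizer")`.
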
+
+ ## Notes
+ - Mentions/URLs are normalized to placeholders (e.g., <USER>, <URL>) by the pipeline.
+ - Tokenization allows space-attached merges for efficiency (like GPT-2).
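
The two notes above can be illustrated with a short sketch. The regular expressions here are assumptions, since the README does not show the pipeline's actual rules, and the split pattern is a simplified stand-in for GPT-2's pre-tokenization rather than a copy of it. `normalize` and `pre_tokenize` are hypothetical names.

```python
import re
from typing import List

# Assumed patterns; the real pipeline's rules are not shown in the README.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Simplified GPT-2-style split: an optional leading space stays attached
# to the chunk that follows it, so merges can include that space.
CHUNK_RE = re.compile(r" ?\S+|\s+")


def normalize(text: str) -> str:
    """Replace URLs and @mentions with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)  # URLs first, since they can contain "@"
    return MENTION_RE.sub("<USER>", text)


def pre_tokenize(text: str) -> List[str]:
    """Split text into chunks, keeping a single leading space with each chunk."""
    return CHUNK_RE.findall(text)
```

For example, `pre_tokenize("gg wp")` yields `["gg", " wp"]`, so a BPE merge can learn `" wp"` as a single unit instead of spending a separate token on the space.
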