Commit bbeb0b9 (verified), committed by 2796gauravc
1 Parent(s): af58c45

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +29 -0
README.md ADDED
@@ -0,0 +1,29 @@
+ # TinyGuardrail Tokenizer
+
+ Advanced BPE-based tokenizer for the TinyGuardrail safety model.
+
+ ## Specifications
+
+ - **Vocabulary Size**: 16,000
+ - **Max Length**: 512
+ - **Min Frequency**: 2
+ - **Special Tokens**: `<pad>`, `<unk>`, `<cls>`, `<sep>`
+ - **BPE Merges**: 141
+
+ ## Usage
+
+ ```python
+ from src.data.tokenizer import load_tokenizer  # project-local module
+
+ # Load from the Hugging Face Hub
+ tokenizer = load_tokenizer(hf_repo="2796gauravc/tinyguardrail-tokenizer")
+
+ # Or load from a local path
+ tokenizer = load_tokenizer("outputs/tokenizer.pkl")
+
+ # Encode text
+ tokens = tokenizer.encode("Your text here")
+
+ # Decode tokens back to text
+ text = tokenizer.decode(tokens)
+ ```
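
The README lists a minimum frequency of 2 and a count of BPE merges, but this commit does not include the tokenizer's training code. As a rough illustration only, the sketch below shows how BPE merge learning with a minimum-frequency cutoff could work. The names `learn_bpe`, `apply_merge`, and `bpe_encode` are hypothetical and are not part of `src.data.tokenizer`; the real implementation may differ in pre-tokenization, tie-breaking, and special-token handling.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges, min_frequency=2):
    """Learn up to `num_merges` merges; stop once the best pair is rarer than `min_frequency`.

    Hypothetical helper for illustration, not the TinyGuardrail training routine.
    """
    # Each distinct word becomes a tuple of characters, weighted by how often it occurs.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the weighted vocabulary.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break
        merges.append(best)
        # Rewrite the vocabulary with the new merge applied.
        new_vocab = Counter()
        for symbols, count in vocab.items():
            new_vocab[apply_merge(symbols, best)] += count
        vocab = new_vocab
    return merges


def bpe_encode(word, merges):
    """Split `word` into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges = learn_bpe(["aa", "aa", "ab"], num_merges=5, min_frequency=2)
    print(merges)
    print(bpe_encode("aab", merges))
```

In this sketch the cutoff is one reason a tokenizer can end up with fewer merges than its vocabulary budget allows: learning stops as soon as no remaining pair occurs at least `min_frequency` times.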