# TinyGuardrail Tokenizer

Advanced BPE-based tokenizer for the TinyGuardrail safety model.
## Specifications

- **Vocabulary Size**: 16,000
- **Max Length**: 512
- **Min Frequency**: 2
- **Special Tokens**: `<pad>`, `<unk>`, `<cls>`, `<sep>`
- **BPE Merges**: 141
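To make the specifications above concrete, here is a minimal, self-contained sketch of how a token sequence is typically framed with the special tokens and padded or truncated to the fixed max length. This is illustrative only, not the library's API; the specific token IDs for `<pad>`, `<cls>`, and `<sep>` are assumptions for demonstration.

```python
# Illustrative sketch (not the actual library API): framing a sequence
# with <cls>/<sep> and padding with <pad> to the fixed Max Length.
# The token IDs below are assumed for demonstration purposes.
PAD_ID, CLS_ID, SEP_ID = 0, 2, 3
MAX_LEN = 512

def frame_sequence(token_ids, max_len=MAX_LEN):
    # Reserve two slots for <cls> and <sep>, truncating the body if needed.
    body = token_ids[: max_len - 2]
    framed = [CLS_ID] + body + [SEP_ID]
    # Right-pad with <pad> up to the fixed model input length.
    framed += [PAD_ID] * (max_len - len(framed))
    return framed

ids = frame_sequence([17, 255, 901], max_len=8)
# ids == [2, 17, 255, 901, 3, 0, 0, 0]
```

Sequences longer than the max length lose tokens from the end under this scheme, so the two reserved slots guarantee the `<cls>`/`<sep>` markers always survive truncation.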
## Usage

```python
from src.data.tokenizer import load_tokenizer

# Load from HuggingFace
tokenizer = load_tokenizer(hf_repo="2796gauravc/tinyguardrail-tokenizer")

# Or load from a local path
tokenizer = load_tokenizer("outputs/tokenizer.pkl")

# Encode text to token IDs
tokens = tokenizer.encode("Your text here")

# Decode token IDs back to text
text = tokenizer.decode(tokens)
```