# TinyGuardrail Tokenizer
A byte-pair encoding (BPE) tokenizer for the TinyGuardrail safety model.
## Specifications
- **Vocabulary Size**: 16,000
- **Max Length**: 512
- **Min Frequency**: 2
- **Special Tokens**: <pad>, <unk>, <cls>, <sep>
- **BPE Merges**: 141
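
The 141 merges above are the product of standard BPE training: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent one. A minimal sketch of that loop on a toy word-frequency dict (illustrative only, not the TinyGuardrail training code):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy sketch)."""
    # Represent each word as a tuple of symbols (characters to start).
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

merges = bpe_train({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
# First merge fuses 'l'+'o', the second 'lo'+'w'.
```

The real tokenizer additionally enforces the min-frequency threshold of 2 when selecting pairs and caps the learned vocabulary at 16,000 entries.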
## Usage
```python
from src.data.tokenizer import load_tokenizer

# Load from HuggingFace
tokenizer = load_tokenizer(hf_repo="2796gauravc/tinyguardrail-tokenizer")

# Or load from a local path
tokenizer = load_tokenizer("outputs/tokenizer.pkl")

# Encode text to token IDs
tokens = tokenizer.encode("Your text here")

# Decode token IDs back to text
text = tokenizer.decode(tokens)
```
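
Given the 512-token max length and the `<cls>`/`<sep>`/`<pad>` special tokens listed above, model inputs are presumably built with the usual classifier recipe: wrap, truncate, then right-pad. A hedged sketch (the ID constants below are placeholders, not the tokenizer's actual vocabulary IDs):

```python
# Placeholder special-token IDs; look up the real values in the
# trained tokenizer's vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 2, 3
MAX_LENGTH = 512

def prepare_input(token_ids, max_length=MAX_LENGTH):
    """Wrap encoded IDs with <cls>/<sep>, truncate to max_length,
    and right-pad with <pad>."""
    # Reserve two slots for <cls> and <sep>.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [PAD_ID] * pad, attention_mask + [0] * pad
```

A call like `prepare_input(tokenizer.encode("Your text here"))` would then yield a fixed-length ID sequence plus an attention mask marking the non-padding positions.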