# TinyGuardrail Tokenizer

A byte-pair encoding (BPE) tokenizer for the TinyGuardrail safety model.
## Specifications
- Vocabulary Size: 16,000
- Max Length: 512
- Min Frequency: 2
- Special Tokens: , , ,
- BPE Merges: 141
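The merge count above refers to learned BPE merge rules. As a minimal, self-contained sketch of how such merge rules are learned from word frequencies (illustrative only; the project's actual training code may differ and the function name here is hypothetical):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# On a toy corpus, the two most frequent pairs are merged in order.
merges = learn_bpe_merges(["low", "lower", "lowest"], num_merges=2)
# merges == [('l', 'o'), ('lo', 'w')]
```

Training stops once `num_merges` rules are learned (141 for this tokenizer) or no pairs remain; the vocabulary is the base symbols plus one entry per merge plus the special tokens.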
## Usage
```python
from src.data.tokenizer import load_tokenizer

# Load from HuggingFace
tokenizer = load_tokenizer(hf_repo="2796gauravc/tinyguardrail-tokenizer")

# Or load from a local path
tokenizer = load_tokenizer("outputs/tokenizer.pkl")

# Encode text to token IDs
tokens = tokenizer.encode("Your text here")

# Decode token IDs back to text
text = tokenizer.decode(tokens)
```
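At encode time, a BPE tokenizer applies its learned merge rules to each word greedily, always fusing the lowest-ranked (earliest-learned) pair first. A minimal sketch of that step, assuming a hypothetical `apply_bpe` helper and a toy merge table rather than this tokenizer's real internals:

```python
def apply_bpe(word, merges):
    # Lower rank means the merge was learned earlier and is applied first.
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        candidates = [(ranks.get((a, b), float("inf")), i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(candidates)
        if rank == float("inf"):
            break  # No applicable merges remain.
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table: 'l'+'o' was learned first, then 'lo'+'w'.
merges = [('l', 'o'), ('lo', 'w')]
pieces = apply_bpe("lowest", merges)
# pieces == ['low', 'e', 's', 't']
```

The resulting subword pieces are then mapped to integer IDs via the vocabulary, and sequences longer than the 512-token max length are truncated.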