# TinyGuardrail Tokenizer
A BPE-based tokenizer for the TinyGuardrail safety model.
## Specifications
- **Vocabulary Size**: 16,000
- **Max Length**: 512
- **Min Frequency**: 2
- **Special Tokens**: <pad>, <unk>, <cls>, <sep>
- **BPE Merges**: 141
## Usage
```python
from src.data.tokenizer import load_tokenizer
# Load from HuggingFace
tokenizer = load_tokenizer(hf_repo="2796gauravc/tinyguardrail-tokenizer")
# Or load from local path
tokenizer = load_tokenizer("outputs/tokenizer.pkl")
# Encode text
tokens = tokenizer.encode("Your text here")
# Decode tokens
text = tokenizer.decode(tokens)
```
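As a rough illustration of how the specs above fit together, the sketch below wraps a raw BPE id sequence in `<cls>`/`<sep>` and pads or truncates it to the 512-token max length. The token IDs and the `prepare` helper are assumptions for illustration only, not part of the tokenizer's actual API.

```python
# Assumed special-token IDs for <pad>, <unk>, <cls>, <sep> -- illustrative only.
PAD, UNK, CLS, SEP = 0, 1, 2, 3
MAX_LENGTH = 512  # matches the Max Length spec above

def prepare(token_ids, max_length=MAX_LENGTH):
    """Wrap raw BPE ids in <cls> ... <sep>, then pad/truncate to max_length."""
    ids = [CLS] + token_ids[: max_length - 2] + [SEP]
    ids += [PAD] * (max_length - len(ids))
    return ids

print(prepare([10, 11, 12], max_length=8))  # [2, 10, 11, 12, 3, 0, 0, 0]
```

In practice the model-facing encode step would perform this wrapping internally; the sketch only shows the shape of the resulting sequence.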