# TinyGuardrail Tokenizer

A byte-pair encoding (BPE) tokenizer for the TinyGuardrail safety model.

## Specifications

- **Vocabulary Size**: 16,000
- **Max Length**: 512
- **Min Frequency**: 2
- **Special Tokens**: `<pad>`, `<unk>`, `<cls>`, `<sep>`
- **BPE Merges**: 141
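For intuition, BPE training repeatedly merges the most frequent adjacent symbol pair into a new vocabulary entry; the 141 merges above are learned this way. Below is a minimal, generic sketch of a single merge step in pure Python. It is an illustration of the algorithm, not the TinyGuardrail implementation; the function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# One training step: start from characters, merge the top pair.
tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)   # ('l', 'o') appears three times
tokens = merge_pair(tokens, pair)   # 'l','o' pairs become a single 'lo'
```

Repeating this step 141 times (and recounting pairs each round) yields the merge table this tokenizer uses at encode time.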

## Usage

```python
from src.data.tokenizer import load_tokenizer

# Load from HuggingFace
tokenizer = load_tokenizer(hf_repo="2796gauravc/tinyguardrail-tokenizer")

# Or load from local path
tokenizer = load_tokenizer("outputs/tokenizer.pkl")

# Encode text
tokens = tokenizer.encode("Your text here")

# Decode tokens
text = tokenizer.decode(tokens)
```