Indic Tokenizer v2
Custom SentencePiece Unigram tokenizer trained on:
- Hindi, Tamil, Telugu corpora
- Code-mixed Hinglish data
Features
- 40–70% fewer tokens than the GPT-2 tokenizer
- Script-aware tokenization
- Better handling of Indic languages
Usage
from transformers import AutoTokenizer
# trust_remote_code=True runs Python code shipped in the repo; review it first.
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/indic-tokenizer-v2",
    trust_remote_code=True,
)
print(tokenizer.tokenize("नमस्ते मित्र, कैसे हो?"))
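Checking the token savings

The token-count figure under Features can be checked on your own text. The following is a minimal sketch, not the procedure used to produce that figure: the helper names token_reduction and compare_with_gpt2 are hypothetical, and it assumes the baseline is the "gpt2" tokenizer from the Hugging Face Hub and that the repo id matches the placeholder used above.

```python
from typing import Callable, Iterable, List

Tokenize = Callable[[str], List[str]]


def token_reduction(
    texts: Iterable[str], baseline: Tokenize, candidate: Tokenize
) -> float:
    """Return the fractional drop in total token count (0.5 means 50% fewer)."""
    texts = list(texts)
    baseline_total = sum(len(baseline(t)) for t in texts)
    candidate_total = sum(len(candidate(t)) for t in texts)
    if baseline_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return 1.0 - candidate_total / baseline_total


def compare_with_gpt2(
    texts: Iterable[str],
    repo_id: str = "your-username/indic-tokenizer-v2",  # placeholder repo id
) -> float:
    """Compare this tokenizer against GPT-2 (downloads both from the Hub)."""
    # Imported lazily so token_reduction stays usable without transformers.
    from transformers import AutoTokenizer

    gpt2 = AutoTokenizer.from_pretrained("gpt2")  # assumed baseline
    indic = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    return token_reduction(texts, gpt2.tokenize, indic.tokenize)
```

Calling compare_with_gpt2(["नमस्ते मित्र, कैसे हो?"]) returns the reduction as a fraction. The result depends heavily on the sample, so a representative set of Hindi, Tamil, Telugu, and Hinglish sentences gives a fairer number than a single line.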