# Indic Tokenizer v2

Custom SentencePiece Unigram tokenizer trained on:

- Hindi, Tamil, and Telugu corpora
- Code-mixed Hinglish data

## Features

- 40–70% fewer tokens than the GPT-2 tokenizer
- Script-aware tokenization
- Better handling of Indic languages
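This README does not describe how script awareness is implemented. One common reading is that text is split at script boundaries before subword segmentation, so that Devanagari and Latin runs in Hinglish input are never merged into a single piece. The sketch below illustrates only that idea, using the standard library. The helper names `script_of` and `split_by_script` are hypothetical and are not part of this tokenizer.

```python
import unicodedata
from itertools import groupby


def script_of(ch: str) -> str:
    """Rough script label for one character (illustrative heuristic only)."""
    # Anything that is not a letter or combining mark (spaces, digits,
    # punctuation) is treated as script-neutral.
    if unicodedata.category(ch)[0] not in ("L", "M"):
        return "COMMON"
    # Unicode character names start with the script name,
    # e.g. "DEVANAGARI LETTER NA" or "LATIN SMALL LETTER A".
    name = unicodedata.name(ch, "")
    return name.split()[0] if name else "UNKNOWN"


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a script label."""
    return [
        (script, "".join(chars))
        for script, chars in groupby(text, key=script_of)
    ]


if __name__ == "__main__":
    # Code-mixed input: one Devanagari run, a space, one Latin run.
    print(split_by_script("नमस्ते hello"))
```

A real pipeline would likely attach neutral characters such as spaces to a neighbouring run rather than keep them separate; this sketch keeps them apart to make the boundaries visible.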
## Usage

    from transformers import AutoTokenizer

    # trust_remote_code=True runs Python code shipped in the model repo;
    # enable it only for repositories you trust.
    tokenizer = AutoTokenizer.from_pretrained(
        "your-username/indic-tokenizer-v2",
        trust_remote_code=True,
    )

    # Hindi: "Hello friend, how are you?"
    print(tokenizer.tokenize("नमस्ते मित्र, कैसे हो?"))
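To check the token-count claim from Features on your own text, one option is to tokenize the same sentences with a baseline and with this tokenizer and compare the totals. The sketch below is a minimal, assumption-laden version: `token_reduction` is a hypothetical helper, not something this repository provides, and it accepts any two callables that map a string to a list of tokens (for example the `tokenize` method of a loaded tokenizer).

```python
from typing import Callable, Iterable, Sequence


def token_reduction(
    texts: Iterable[str],
    baseline_tokenize: Callable[[str], Sequence],
    candidate_tokenize: Callable[[str], Sequence],
) -> float:
    """Percentage of tokens saved by the candidate relative to the baseline."""
    texts = list(texts)  # allow generators; we iterate twice
    baseline_total = sum(len(baseline_tokenize(t)) for t in texts)
    candidate_total = sum(len(candidate_tokenize(t)) for t in texts)
    if baseline_total == 0:
        raise ValueError("baseline produced no tokens; nothing to compare")
    return 100.0 * (baseline_total - candidate_total) / baseline_total


if __name__ == "__main__":
    # Stand-in tokenizers so the sketch runs offline:
    # character-level as the baseline, whitespace splitting as the candidate.
    sample = ["a b c"]
    print(token_reduction(sample, list, str.split))

    # With real tokenizers (requires network access and `transformers`):
    #   gpt2 = AutoTokenizer.from_pretrained("gpt2")
    #   print(token_reduction(my_texts, gpt2.tokenize, tokenizer.tokenize))
```

The result depends heavily on the evaluation text, so a figure measured this way is only comparable to the one quoted above if the sample resembles the text that figure was measured on.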