# Indic Tokenizer v2

Custom SentencePiece Unigram tokenizer trained on:

- Hindi, Tamil, Telugu corpora
- Code-mixed Hinglish data

## Features

- 40–70% fewer tokens vs GPT-2
- Script-aware tokenization
- Better handling of Indic languages

## Usage

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "your-username/indic-tokenizer-v2",
        trust_remote_code=True
    )

    print(tokenizer.tokenize("नमस्ते मित्र, कैसे हो?"))
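
To check the "fewer tokens vs GPT-2" figure on your own text, one option is to count tokens from both tokenizers over the same sentences and compute the percentage reduction. The sketch below is a minimal illustration, not the procedure used to produce the number above: the helper names `count_tokens` and `token_reduction` are hypothetical, and the two toy tokenizers are stand-ins so the snippet runs without downloading any model.

```python
from typing import Callable, Iterable, List

# Any function that maps a string to a list of tokens.
Tokenize = Callable[[str], List[str]]


def count_tokens(tokenize: Tokenize, texts: Iterable[str]) -> int:
    """Total number of tokens produced across all texts."""
    return sum(len(tokenize(text)) for text in texts)


def token_reduction(baseline: Tokenize, candidate: Tokenize,
                    texts: Iterable[str]) -> float:
    """Percentage of tokens saved by `candidate` relative to `baseline`."""
    texts = list(texts)  # materialize so both tokenizers see the same data
    base = count_tokens(baseline, texts)
    cand = count_tokens(candidate, texts)
    if base == 0:
        raise ValueError("baseline produced no tokens; nothing to compare")
    return 100.0 * (base - cand) / base


# Toy stand-ins (assumptions for illustration only):
def char_tokenize(text: str) -> List[str]:
    # One token per non-space character.
    return list(text.replace(" ", ""))


def word_tokenize(text: str) -> List[str]:
    # One token per whitespace-separated word.
    return text.split()


if __name__ == "__main__":
    sample = ["ab cd"]
    print(f"{token_reduction(char_tokenize, word_tokenize, sample):.1f}% fewer tokens")
```

With real tokenizers, pass their `tokenize` methods in place of the toy functions, for example `token_reduction(gpt2_tokenizer.tokenize, tokenizer.tokenize, sentences)`, where `gpt2_tokenizer` is assumed to be loaded separately via `AutoTokenizer.from_pretrained("gpt2")`. The result depends heavily on the sample, so use text that reflects your actual mix of scripts and code-mixed content.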