Indic Tokenizer v2

Custom SentencePiece Unigram tokenizer trained on:

  • Hindi, Tamil, Telugu corpora
  • Code-mixed Hinglish data

Features

  • 40–70% fewer tokens than the GPT-2 tokenizer
  • Script-aware tokenization
  • Better handling of Indic languages
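To make "script-aware" concrete, here is a minimal sketch of one way text could be split into per-script runs before subword tokenization. It is an illustration only and not necessarily how this tokenizer works internally: the names `SCRIPT_RANGES`, `script_of`, and `script_runs` are hypothetical, and only the Unicode block ranges for Devanagari, Tamil, and Telugu are standard.

```python
from typing import List, Tuple

# Unicode block ranges for the scripts named in this README.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),  # Hindi
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}


def script_of(ch: str) -> str:
    """Return a coarse script label for a single character."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"  # covers romanized Hinglish words
    return "Other"      # spaces, digits, punctuation, everything else


def script_runs(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters that share a script label."""
    runs: List[Tuple[str, str]] = []
    for ch in text:
        label = script_of(ch)
        if runs and runs[-1][0] == label:
            # Same script as the previous character: extend the current run.
            runs[-1] = (label, runs[-1][1] + ch)
        else:
            runs.append((label, ch))
    return runs


if __name__ == "__main__":
    for label, chunk in script_runs("नमस्ते bhai"):
        print(label, repr(chunk))
```

Splitting on script boundaries like this keeps a code-mixed sentence from producing subword pieces that straddle two writing systems, which is one plausible reading of the feature listed above.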

Usage

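from transformers import AutoTokenizer

# "your-username" is a placeholder for the account hosting this repository.
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/indic-tokenizer-v2",
    trust_remote_code=True,  # runs custom tokenizer code from the repo
)

print(tokenizer.tokenize("नमस्ते मित्र, कैसे हो?"))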
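
The token-savings figure listed under Features can be checked on your own text with a small helper like the one below. This is a rough sketch, not the evaluation used for this README: `token_reduction` is a hypothetical helper, and the two stand-in tokenizers (UTF-8 bytes and whitespace splitting) are placeholders so the example runs without downloading any model.

```python
from typing import Callable, Iterable, Sequence


def token_reduction(
    texts: Iterable[str],
    baseline_tokenize: Callable[[str], Sequence],
    candidate_tokenize: Callable[[str], Sequence],
) -> float:
    """Percent fewer tokens the candidate emits relative to the baseline."""
    texts = list(texts)
    baseline_total = sum(len(baseline_tokenize(t)) for t in texts)
    candidate_total = sum(len(candidate_tokenize(t)) for t in texts)
    if baseline_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return 100.0 * (baseline_total - candidate_total) / baseline_total


if __name__ == "__main__":
    samples = ["नमस्ते मित्र, कैसे हो?"]

    # Stand-in tokenizers (assumptions for illustration only).
    def byte_tokens(text: str) -> list:
        return list(text.encode("utf-8"))  # one token per UTF-8 byte

    def word_tokens(text: str) -> list:
        return text.split()                # one token per whitespace-separated word

    print(f"{token_reduction(samples, byte_tokens, word_tokens):.1f}% fewer tokens")
```

To compare real tokenizers, pass the `tokenize` methods of two loaded `AutoTokenizer` instances (for example, one loaded from `"gpt2"` and the one loaded above) in place of the stand-ins, and use a sample of text that reflects your actual workload.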