# Indic Tokenizer v2
Custom SentencePiece Unigram tokenizer trained on:
- Hindi, Tamil, Telugu corpora
- Code-mixed Hinglish data
## Features
- 40–70% fewer tokens than the GPT-2 tokenizer
- Script-aware tokenization
- Better handling of Indic languages
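This README does not describe how script awareness is implemented. As a rough illustration only, one common reading is that text is split into runs of the same Unicode script before segmentation, so that Devanagari, Tamil, Telugu, and Latin (Hinglish) spans are handled separately. The helper names `script_of` and `split_by_script` below are hypothetical and are not part of this tokenizer's API.

```python
import unicodedata
from itertools import groupby
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Return a coarse script label for one character (hypothetical helper).

    Letters and combining marks are labelled with the first word of their
    Unicode name (e.g. "DEVANAGARI", "TAMIL", "LATIN"); everything else
    (spaces, digits, punctuation) is labelled "COMMON".
    """
    is_mark = unicodedata.category(ch).startswith("M")
    if not (ch.isalpha() or is_mark):
        return "COMMON"
    name = unicodedata.name(ch, "")
    return name.split()[0] if name else "UNKNOWN"


def split_by_script(text: str) -> List[Tuple[str, str]]:
    """Group consecutive characters that share a script label."""
    return [
        (script, "".join(chars))
        for script, chars in groupby(text, key=script_of)
    ]


if __name__ == "__main__":
    # Code-mixed example: Latin-script Hindi next to Devanagari.
    for script, run in split_by_script("hello नमस्ते"):
        print(script, repr(run))
```

A pre-splitting step like this keeps a single token from spanning two scripts, which is one plausible way to get cleaner pieces on code-mixed input.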
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "your-username/indic-tokenizer-v2",
    trust_remote_code=True,
)
print(tokenizer.tokenize("नमस्ते मित्र, कैसे हो?"))
```
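To check a token-reduction figure on your own text, a minimal sketch along these lines may help. It takes any two tokenize functions and reports how many fewer tokens the second one produces. The function name `token_reduction_pct` and the two stand-in tokenizers are assumptions made for illustration so the example runs offline; they are not part of this repository.

```python
from typing import Callable, List, Sequence

Tokenize = Callable[[str], List[str]]


def token_reduction_pct(
    texts: Sequence[str], baseline: Tokenize, candidate: Tokenize
) -> float:
    """Percentage by which `candidate` uses fewer tokens than `baseline`."""
    base_total = sum(len(baseline(t)) for t in texts)
    cand_total = sum(len(candidate(t)) for t in texts)
    if base_total == 0:
        raise ValueError("baseline produced no tokens")
    return 100.0 * (base_total - cand_total) / base_total


# Stand-in tokenizers (hypothetical): one token per UTF-8 byte, which is the
# worst case for a byte-level vocabulary, versus one token per code point.
def byte_level(text: str) -> List[str]:
    return [format(b, "02x") for b in text.encode("utf-8")]


def char_level(text: str) -> List[str]:
    return list(text)


if __name__ == "__main__":
    samples = ["नमस्ते मित्र, कैसे हो?"]
    pct = token_reduction_pct(samples, byte_level, char_level)
    print(f"{pct:.1f}% fewer tokens")
```

With `transformers` installed, the same function can be called with `AutoTokenizer.from_pretrained("gpt2").tokenize` as the baseline and `tokenizer.tokenize` from the snippet above as the candidate.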