Tokenizer

#2
by krishna-chaitanya1 - opened

I tested the tokenizer on a variety of Hindi sentences, including simple phrases, questions, code-mixed inputs (Hindi + English), and sentences with numbers and symbols.

test_sentences = [
"राम स्कूल से आ रहा था", # simple Hindi
"आज मौसम बहुत अच्छा है", # common sentence
"मुझे Python programming पसंद है", # code-mixed
"₹5000 में अच्छा फोन कौन सा है?", # numbers + symbols
"दिल्ली में बारिश हो रही है", # named entity
"क्या तुम कल आओगे?", # question
"AI और LLMs का future क्या है?", # English + Hindi + abbreviations
]

In these tests, encoding and then decoding each sentence preserved its original meaning. The tokenizer also handled mixed-language input well, keeping English words such as “Python”, “AI”, and “LLMs” intact alongside the Hindi text.
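For anyone who wants to repeat this kind of check, here is a minimal sketch of an encode/decode round-trip test. It is not the exact script used above: `round_trip_report`, `byte_encode`, and `byte_decode` are hypothetical names, and the UTF-8 byte codec is only a stand-in so the snippet runs without downloading anything. The comparison is done after NFC normalization, on the assumption that a tokenizer may normalize Devanagari text internally.

```python
import unicodedata


def round_trip_report(encode, decode, sentences):
    """Return a (sentence, token_count, round_trip_ok) tuple for each sentence."""
    report = []
    for sentence in sentences:
        ids = encode(sentence)
        restored = decode(ids)
        # Compare NFC-normalized forms so canonically equivalent
        # Devanagari sequences are not flagged as mismatches.
        ok = unicodedata.normalize("NFC", restored) == unicodedata.normalize("NFC", sentence)
        report.append((sentence, len(ids), ok))
    return report


# Stand-in codec (raw UTF-8 bytes). This is NOT the tokenizer discussed
# in this thread; swap in the real encode/decode functions to test it.
def byte_encode(text):
    return list(text.encode("utf-8"))


def byte_decode(ids):
    return bytes(ids).decode("utf-8")


samples = [
    "राम स्कूल से आ रहा था",          # simple Hindi
    "AI और LLMs का future क्या है?",  # code-mixed
]

for sentence, token_count, ok in round_trip_report(byte_encode, byte_decode, samples):
    print(ok, token_count, sentence)
```

If the tokenizer is loaded through the `transformers` library (an assumption, since the loading code is not shown here), `tok.encode` and `lambda ids: tok.decode(ids, skip_special_tokens=True)` could be passed in place of the stand-in functions. The token count per sentence is also a rough indicator of how compactly Hindi text is being segmented.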

Overall, the tokenizer performed well across these varied Hindi inputs.

Ṛta AI Labs org

Thank you so much for your feedback, Krishna.
