Tokenizer

by krishna-chaitanya1 - opened 1 day ago

I tested the tokenizer on a variety of Hindi sentences, including simple phrases, questions, code-mixed inputs (Hindi + English), and sentences with numbers and symbols.

test_sentences = [
    "राम स्कूल से आ रहा था",  # simple Hindi
    "आज मौसम बहुत अच्छा है",  # common sentence
    "मुझे Python programming पसंद है",  # code-mixed
    "₹5000 में अच्छा फोन कौन सा है?",  # numbers + symbols
    "दिल्ली में बारिश हो रही है",  # named entity
    "क्या तुम कल आओगे?",  # question
    "AI और LLMs का future क्या है?",  # English + Hindi + abbreviations
]

Encoding and then decoding each sentence preserved its original meaning. The tokenizer also handled the code-mixed inputs well: English words such as “Python”, “AI”, and “LLMs” stayed intact alongside the Hindi text.
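For anyone who wants to repeat this kind of check, here is a minimal sketch of an encode/decode round-trip test. It is not the script used for the results above. The helper name `round_trip_report` is hypothetical, and `byte_encode` / `byte_decode` are a stand-in tokenizer (raw UTF-8 bytes) so the example runs on its own; swap in the real tokenizer's encode and decode functions to test it.

```python
import unicodedata
from typing import Callable, Iterable, List, Tuple


def round_trip_report(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    sentences: Iterable[str],
) -> List[Tuple[str, bool, int]]:
    """Return (sentence, round_trip_ok, token_count) for each sentence."""
    report = []
    for text in sentences:
        ids = encode(text)
        restored = decode(ids)
        # Compare after NFC normalization so that canonically equivalent
        # Devanagari sequences are not reported as mismatches.
        ok = unicodedata.normalize("NFC", restored) == unicodedata.normalize("NFC", text)
        report.append((text, ok, len(ids)))
    return report


# Stand-in tokenizer (assumption, not the tokenizer under discussion):
# every UTF-8 byte becomes one token id.
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    sample = [
        "मुझे Python programming पसंद है",
        "₹5000 में अच्छा फोन कौन सा है?",
        "AI और LLMs का future क्या है?",
    ]
    for text, ok, n_tokens in round_trip_report(byte_encode, byte_decode, sample):
        print("OK  " if ok else "FAIL", n_tokens, text)
```

The token count is included because a tokenizer can round-trip perfectly and still split Hindi words into many small pieces, so it is worth looking at alongside the pass/fail flag. If the tokenizer is loaded through a library that adds special tokens on encode, those probably need to be disabled or skipped on decode before comparing.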

Overall, the tokenizer shows solid performance across diverse Hindi inputs.

vishesh-t27

Ṛta AI Labs org 1 day ago

Thank you so much for your feedback, Krishna.
