Tokenization issue with diacritics

#1
by banouz - opened

I tried using this tokenizer on diacritized text, but it fails to recognize the SHADDA (◌ّ) and replaces it with a token. It also fails to tokenize multi-word tokens correctly when the input text is diacritized (I tried it with the input "وَإنَّما،").
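For anyone trying to reproduce this, here is a minimal sketch that lists the combining marks in the input, so it is clear which code points the tokenizer is being given. It uses only the Python standard library and does not touch this tokenizer's API; the helper name `combining_marks` is hypothetical, chosen just for illustration.

```python
import unicodedata


def combining_marks(text):
    """Return (code point, Unicode name) for every nonspacing mark in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        # Arabic diacritics such as SHADDA and FATHA have category "Mn".
        if unicodedata.category(ch) == "Mn"
    ]


# Input string copied from the report above.
sample = "وَإنَّما،"
for code, name in combining_marks(sample):
    print(code, name)
```

If SHADDA (U+0651) shows up in this listing but not in the tokenizer output, the mark is probably being lost inside the tokenizer rather than during input handling.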
