I tried this tokenizer on diacritized text, but it fails to recognize the SHADDA (◌ّ) and replaces it with a token. It also fails to tokenize MultiWordTokens correctly when the input is diacritized (tried with the input "وَإنَّما،").
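To make the report easier to reproduce, here is a minimal sketch that uses only Python's standard `unicodedata` module and makes no calls to the tokenizer itself. It lists the code points in the test input, confirming that the SHADDA (U+0651) is a separate combining mark, and builds an undiacritized copy for comparing tokenizer output. The helper names `describe` and `strip_diacritics` are hypothetical. The string is written with explicit escapes, and the order of the fatha and shadda after the noon is an assumption that may differ from the original input.

```python
import unicodedata

SHADDA = "\u0651"  # ARABIC SHADDA, a combining mark

# Assumed code-point sequence for the reported input:
# waw, fatha, alef-with-hamza-below, noon, shadda, fatha, meem, alef, Arabic comma
text = "\u0648\u064E\u0625\u0646\u0651\u064E\u0645\u0627\u060C"


def describe(s):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), unicodedata.category(ch))
        for ch in s
    ]


def strip_diacritics(s):
    """Drop nonspacing marks (category Mn), which include the harakat and shadda."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    for row in describe(text):
        print(row)
    print("contains shadda:", SHADDA in text)
    print("undiacritized:", strip_diacritics(text))
```

Running the tokenizer on both `text` and `strip_diacritics(text)` and comparing the results should show whether the failure is tied to the diacritics.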