Tokenization issue with diacritics

by banouz - opened Jul 25, 2024

I tried using this tokenizer on diacritized text, but it fails to recognize the SHADDA (◌ّ) and replaces it with a token. It also fails to tokenize multi-word tokens correctly when the input text is diacritized (tried with the input "وَإنَّما،").
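To help narrow this down, here is a minimal sketch that lists the code points in the example input and strips the diacritics before tokenization as a possible workaround. It uses only Python's standard `unicodedata` module and does not call the tokenizer itself. The helper names `describe` and `strip_diacritics` are hypothetical, and the order of the SHADDA and FATHA on the noon is an assumption about how the input was typed.

```python
import unicodedata

SHADDA = "\u0651"  # ARABIC SHADDA

# The example input from this post, spelled out by code point.
# Assumption: SHADDA precedes FATHA on the noon; the actual order may differ.
sample = "\u0648\u064E\u0625\u0646\u0651\u064E\u0645\u0627\u060C"


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


def strip_diacritics(text: str) -> str:
    """Drop nonspacing marks (category Mn), which includes SHADDA and the short vowels."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    for code_point, name in describe(sample):
        print(code_point, name)
    # Undiacritized form that could be passed to the tokenizer instead.
    print(strip_diacritics(sample))
```

If the tokenizer handles the stripped form correctly, that would suggest the diacritic code points are simply missing from its vocabulary or normalization step. Stripping loses information, so this is only a stopgap and not a fix for the tokenizer.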
