Unknown Token Problem with Swedish Characters in TrOCR
I am working on Swedish handwritten OCR using Hugging Face Transformers and TrOCR. My dataset and ground truth are UTF-8 encoded and contain Swedish characters (å, ä, ö, Å, Ä, Ö). The tokenizer vocabularies of both the Microsoft and the Riksarkivet models include these characters.
However, during inference and evaluation, TrOCR never predicts these Swedish characters. They are always replaced by the unknown token ("�"), even though they are present in the vocabulary and in the training data.
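One thing worth ruling out is whether "�" is really the tokenizer's unknown token. The character is U+FFFD, the Unicode replacement character, which a UTF-8 decoder emits when it is handed an incomplete byte sequence. The sketch below uses only the Python standard library and shows that effect for "å". It assumes, without my having verified it for these checkpoints, that the tokenizer is a byte-level BPE in which a character like "å" can be split across more than one token; the helper name `decode_bytes` is made up for illustration.

```python
# Sketch: "å" is two bytes in UTF-8. Decoding only part of that
# sequence with errors="replace" yields U+FFFD ("�"), not "å".
REPLACEMENT = "\ufffd"


def decode_bytes(chunk: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for invalid input."""
    return chunk.decode("utf-8", errors="replace")


full = "å".encode("utf-8")         # b'\xc3\xa5'
whole = decode_bytes(full)         # complete sequence
partial = decode_bytes(full[:1])   # first byte only

print(len(full), repr(whole), repr(partial))
print(partial == REPLACEMENT)
```

If the same thing happens here, the model would be emitting only part of the byte sequence for each Swedish character, which points at the predicted token IDs rather than at the vocabulary.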
To debug, I have tried several combinations of model and processor:
Model: microsoft/trocr-base-handwritten or Riksarkivet/trocr-base-handwritten-hist-swe-2
Processor: microsoft/trocr-base-handwritten or Riksarkivet/trocr-base-handwritten-hist-swe-2
This means I have tested:
Microsoft model + Microsoft processor
Microsoft model + Riksarkivet processor
Riksarkivet model + Riksarkivet processor
In all cases, the Swedish characters are never predicted. The tokenizer appears to be configured correctly, and the data is valid. What could be causing this persistent unknown token problem for å, ä, ö, Å, Ä, Ö?
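To back up the claim that the data is valid, one check I can run is whether the ground-truth strings store å, ä, ö as single precomposed code points (NFC) or as a base letter plus a combining mark (NFD). Both forms are valid UTF-8 and look identical on screen, but they are different character sequences. This is a minimal standard-library sketch; `is_nfc` is a hypothetical helper name, and whether normalization is actually involved in my problem is only an assumption.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in NFC (precomposed) form."""
    return unicodedata.normalize("NFC", text) == text


composed = "\u00e5"      # "å" as one code point
decomposed = "a\u030a"   # "a" followed by COMBINING RING ABOVE

print(is_nfc(composed), is_nfc(decomposed))
print(len(composed), len(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Running `is_nfc` over every label in the ground truth would show whether any of them need `unicodedata.normalize("NFC", ...)` before tokenization.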