Instructions to use microsoft/trocr-small-printed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/trocr-small-printed with Transformers:
# Use a pipeline as a high-level helper
# Warning: pipeline type "image-to-text" is no longer supported in transformers v5.
# You must load the model directly (see below) or downgrade to v4.x with:
# pip install "transformers<5.0.0"
from transformers import pipeline
pipe = pipeline("image-to-text", model="microsoft/trocr-small-printed")

# Load model directly
from transformers import AutoTokenizer, AutoModelForImageTextToText
tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-printed")
model = AutoModelForImageTextToText.from_pretrained("microsoft/trocr-small-printed")
- Notebooks
- Google Colab
- Kaggle
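For actual OCR inference (not just loading the tokenizer), TrOCR is typically driven through `TrOCRProcessor` together with `VisionEncoderDecoderModel`. A minimal sketch, using a blank placeholder image that you should replace with a real scan of printed text:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Processor bundles the image preprocessor and the tokenizer;
# the model is a vision encoder + text decoder.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-small-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-printed")

# Placeholder image; replace with Image.open("your_scan.png").convert("RGB").
image = Image.new("RGB", (384, 384), "white")

# Preprocess the image, generate token ids, and decode them back to text.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```

A blank image will of course not produce meaningful text; the sketch only shows the preprocess-generate-decode flow.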
Why doesn't 'microsoft/trocr-small-printed' have a vocab.json?
Hello, thank you for your great work. I have a question: why doesn't 'microsoft/trocr-small-printed' have a vocab.json? Where is it?
Hey yaop, I had the same problem. After checking the issue I installed the SentencePiece library:
pip install sentencepiece
and the problem disappeared. I'm guessing the *sentencepiece.bpe.model* file is the representative vocab.
It looks like the TrOCR authors used a different tokenization algorithm for the small variants (SentencePiece instead of Byte Pair Encoding).
Hence, you indeed need the SentencePiece library. You can load the tokenizer as follows:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-printed")
>>> type(tokenizer)
<class 'transformers.models.xlm_roberta.tokenization_xlm_roberta_fast.XLMRobertaTokenizerFast'>
Thank you, it really helped me.
Thanks a lot, I will try it.