Instructions to use microsoft/trocr-small-printed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/trocr-small-printed with Transformers:
# Use a pipeline as a high-level helper
# Warning: pipeline type "image-to-text" is no longer supported in transformers v5.
# You must load the model directly (see below) or downgrade to v4.x with:
# pip install "transformers<5.0.0"
from transformers import pipeline
pipe = pipeline("image-to-text", model="microsoft/trocr-small-printed")

# Load model directly
from transformers import AutoTokenizer, AutoModelForImageTextToText
tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-printed")
model = AutoModelForImageTextToText.from_pretrained("microsoft/trocr-small-printed")
- Notebooks
- Google Colab
- Kaggle
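For actual OCR inference (not just loading the tokenizer), TrOCR is typically driven through `TrOCRProcessor` together with `VisionEncoderDecoderModel`. A minimal sketch, using a blank placeholder image that you should replace with a real scan of printed text:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Processor bundles the image preprocessor and the tokenizer;
# the model is a vision encoder + text decoder.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-small-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-printed")

# Placeholder image; replace with Image.open("your_scan.png").convert("RGB").
image = Image.new("RGB", (384, 384), "white")

# Preprocess the image, generate token ids, and decode them back to text.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```

A blank image will of course not produce meaningful text; the sketch only shows the preprocess-generate-decode flow.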
Why doesn't 'microsoft/trocr-small-printed' have a vocab.json?
Hello, thank you for your great work. I have a question: why doesn't 'microsoft/trocr-small-printed' have a vocab.json? Where is it?
Hey yaop, I had the same problem. After checking the issue I installed the SentencePiece library:
pip install sentencepiece
and the problem disappeared. I'm guessing the *sentencepiece.bpe.model* file is the representative vocab.
It looks like the TrOCR authors used a different tokenization algorithm for the small variants (SentencePiece instead of Byte Pair Encoding).
Hence, you indeed need the SentencePiece library. You can load the tokenizer as follows:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-printed")
>>> type(tokenizer)
<class 'transformers.models.xlm_roberta.tokenization_xlm_roberta_fast.XLMRobertaTokenizerFast'>
Thank you, it really helped me.
Thanks a lot, I will try it.