No output(?)

#5
by jbarth-ubhd - opened

Tried it this way:

```python
#!/usr/bin/env python3
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# load a Sütterlin sample line from the Heidelberg digitization portal
url = "https://digi.ub.uni-heidelberg.de/diglitData/v/suetterlin/0001_page_370214_line_r1l5_docId_17239.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = TrOCRProcessor.from_pretrained('dh-unibe/trocr-kurrent')
model = VisionEncoderDecoderModel.from_pretrained('dh-unibe/trocr-kurrent')
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

(a minimally modified version of the example from trocr-base-handwritten), but I get no output, just an empty line...

The trocr-base-handwritten example works.

PS: Ubuntu 24.04. Tried Python 3.11 ... 3.14. GPU is an RTX 5060 Ti (16 GB).

Digital Humanities @ University of Bern org
edited Mar 5

I guess the error is in the processor. As described in the other issue (#3), you can load the original processor and feed the model with its output:

```python
#!/usr/bin/env python3
import logging

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# log debug messages
logging.basicConfig(level=logging.DEBUG)

# load a Sütterlin sample line from the Heidelberg digitization portal
url = "https://digi.ub.uni-heidelberg.de/diglitData/v/suetterlin/0001_page_370214_line_r1l5_docId_17239.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# load the original processor instead of the one from the model, as described
# in issue 3 (huggingface.co/dh-unibe/trocr-kurrent/discussions/3)
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained('dh-unibe/trocr-kurrent')
pixel_values = processor(images=image, return_tensors="pt").pixel_values
print(pixel_values)
generated_ids = model.generate(pixel_values)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

The above script outputs `Vogelschutz vom 17 . September`. I'm also not so sure about the influence of the different processors, though; I guess the results will not be as expected...

By the way, I'm not from the Digital Humanities Bern team, just part of this HF group.

Thanks! And for segmentation we could use kraken (https://github.com/mittagessen/kraken): `kraken -i image.tif lines.json segment -bl`
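Once kraken has written `lines.json`, each detected line can be cropped out and fed to TrOCR one at a time. A minimal sketch, assuming the baseline segmentation JSON contains a `lines` list whose entries carry a `boundary` polygon as `[[x, y], ...]` (field names may differ between kraken versions; `bbox_from_boundary` and `crop_lines` are hypothetical helper names):

```python
# Sketch: crop line images from a kraken baseline segmentation.
# Assumption: the JSON has a "lines" list with a "boundary" polygon per line.
import json
from PIL import Image

def bbox_from_boundary(boundary):
    """Axis-aligned bounding box (left, top, right, bottom) of a polygon."""
    xs = [p[0] for p in boundary]
    ys = [p[1] for p in boundary]
    return (min(xs), min(ys), max(xs), max(ys))

def crop_lines(image_path, segmentation_path):
    """Yield one PIL image per detected text line."""
    page = Image.open(image_path).convert("RGB")
    with open(segmentation_path) as f:
        seg = json.load(f)
    for line in seg["lines"]:
        yield page.crop(bbox_from_boundary(line["boundary"]))
```

Each cropped line image could then go through the `processor` / `model.generate` pipeline from the scripts above.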

Digital Humanities @ University of Bern org

For segmentation you could use SAM with prompts; this may give you even better results.

Digital Humanities @ University of Bern org
edited Mar 11

There are several approaches to using SAM to segment text in images. Perhaps you want to have a look at Hi-SAM. You can find the basic idea in the following paper (https://arxiv.org/abs/2401.17904) and the weights for the pre-trained model on GitHub. Yukinori Yamamoto applied this approach to Spanish handwriting from the 18th century with good results. I guess this would also transfer to other languages and historical domains. You can find a blog post about this on Medium.

Or PP-OCRv5 for segmentation. It seems relatively reliable, but the "polygons" only have 4 corners. I don't know to what extent text from the lines below/above affects the OCR.
