No output(?)

#5
by jbarth-ubhd - opened

Tried it this way:

```python
#!/usr/bin/env python3
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# load a Sütterlin sample line from the Heidelberg digitization portal
url = "https://digi.ub.uni-heidelberg.de/diglitData/v/suetterlin/0001_page_370214_line_r1l5_docId_17239.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = TrOCRProcessor.from_pretrained('dh-unibe/trocr-kurrent')
model = VisionEncoderDecoderModel.from_pretrained('dh-unibe/trocr-kurrent')
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

(a minimally modified version of the example from trocr-base-handwritten), but I get no output, just an empty line...

The trocr-base-handwritten example works.

PS: Ubuntu 24.04. Tried Python 3.11 ... 3.14. GPU is an RTX 5060 Ti (16 GB).

Digital Humanities @ University of Bern org
edited Mar 5

I guess the error is in the processor. As described in the other issue (#3), you can load the original processor and feed the model with its output:

```python
#!/usr/bin/env python3
import logging

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# log debug messages
logging.basicConfig(level=logging.DEBUG)

# load a Sütterlin sample line from the Heidelberg digitization portal
url = "https://digi.ub.uni-heidelberg.de/diglitData/v/suetterlin/0001_page_370214_line_r1l5_docId_17239.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# load the original processor instead of the one from the model, as described
# in issue 3 (huggingface.co/dh-unibe/trocr-kurrent/discussions/3)
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained('dh-unibe/trocr-kurrent')
pixel_values = processor(images=image, return_tensors="pt").pixel_values
print(pixel_values)
generated_ids = model.generate(pixel_values)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```

The above script outputs `Vogelschutz vom 17 . September`. I'm also not so sure about the influence of the different processors, though; I guess the results will not be as expected...

By the way, I'm not from the Digital Humanities Bern team, just part of this HF group.

Thanks! And for segmentation we could use kraken (https://github.com/mittagessen/kraken): `kraken -i image.tif lines.json segment -bl`
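Once kraken has written `lines.json`, each detected line can be cropped out and fed to TrOCR one at a time. A minimal sketch, assuming the baseline segmentation JSON contains a `lines` list whose entries carry a `boundary` polygon as `[[x, y], ...]` (field names may differ between kraken versions; `bbox_from_boundary` and `crop_lines` are hypothetical helper names):

```python
# Sketch: crop line images from a kraken baseline segmentation.
# Assumption: the JSON has a "lines" list with a "boundary" polygon per line.
import json
from PIL import Image

def bbox_from_boundary(boundary):
    """Axis-aligned bounding box (left, top, right, bottom) of a polygon."""
    xs = [p[0] for p in boundary]
    ys = [p[1] for p in boundary]
    return (min(xs), min(ys), max(xs), max(ys))

def crop_lines(image_path, segmentation_path):
    """Yield one PIL image per detected text line."""
    page = Image.open(image_path).convert("RGB")
    with open(segmentation_path) as f:
        seg = json.load(f)
    for line in seg["lines"]:
        yield page.crop(bbox_from_boundary(line["boundary"]))
```

Each cropped line image could then go through the `processor` / `model.generate` pipeline from the scripts above.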

Digital Humanities @ University of Bern org

For segmentation you could use SAM with prompts; this may give you even better results.

Digital Humanities @ University of Bern org
edited Mar 11

There are several approaches to using SAM to segment text in images. Perhaps you want to have a look at Hi-SAM. You can find the basic idea in the following paper (https://arxiv.org/abs/2401.17904) and the weights for the pre-trained model on GitHub. Yukinori Yamamoto applied this approach to Spanish handwriting from the 18th century with good results. I guess this would also transfer to other languages and historical domains. You can find a blog post about this on Medium.

Or PP-OCRv5 for segmentation. It seems relatively reliable, but the "polygons" only have 4 corners. I don't know to what extent text from the lines below/above affects the OCR.
