How to correctly determine the coordinates for this prompt: "OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>"

#6
by Anaudia - opened

I would like to use the model on specific parts of my image, but I am not sure how to transform the bounding boxes I have into the loc parameters used in the prompt.

Docling org

Hello, thanks for pointing this out. Perhaps we need to make a helper function more visible. You can find a function that takes normalized or pixel coordinates in [xmin, ymin, xmax, ymax] format in the demo here:
https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo/blob/12df581e7fb68a527eb8e857c6a1caea6da3828c/app.py#L35
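For reference, a minimal sketch of what such a conversion might look like, assuming the model expects coordinates scaled to a 0-500 grid as in the linked demo (the function name and signature here are hypothetical, not from the demo itself):

```python
def bbox_to_loc_tokens(bbox, image_width, image_height, target_max=500):
    """Convert a pixel bounding box (xmin, ymin, xmax, ymax) into
    <loc_...> prompt tokens scaled to a 0-target_max grid.

    Hypothetical helper; check the linked app.py for the demo's
    actual implementation.
    """
    xmin, ymin, xmax, ymax = bbox
    scaled = [
        round(xmin / image_width * target_max),
        round(ymin / image_height * target_max),
        round(xmax / image_width * target_max),
        round(ymax / image_height * target_max),
    ]
    return "".join(f"<loc_{v}>" for v in scaled)

# Example: a box from (100, 100) to (200, 200) in a 1000x1000 px image
# bbox_to_loc_tokens((100, 100, 200, 200), 1000, 1000)
# -> "<loc_50><loc_50><loc_100><loc_100>"
```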

asnassar changed discussion status to closed

Hello,
Looking at https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo/blob/12df581e7fb68a527eb8e857c6a1caea6da3828c/app.py#L35 - should the coordinates passed to the OCR prompt be in the 0-500 range, or should they be pixel values?

In the normalize_values function, the OCR region coordinates are normalized by the maximum region coordinate rather than by the actual image size, which looks rather strange. Additionally:

examples=[[{"text": "Convert this page to docling.", "files": ["example_images/2d0fbcc50e88065a040a537b717620e964fb4453314b71d83f3ed3425addcef6.png"]}],
          [{"text": "Convert this table to OTSL.", "files": ["example_images/image-2.jpg"]}],
          [{"text": "Convert code to text.", "files": ["example_images/7666.jpg"]}],
          [{"text": "Convert formula to latex.", "files": ["example_images/2433.jpg"]}],
          [{"text": "Convert chart to OTSL.", "files": ["example_images/06236926002285.png"]}],
          [{"text": "OCR the text in location [47, 531, 167, 565]", "files": ["example_images/s2w_example.png"]}],
          [{"text": "Extract all section header elements on the page.", "files": ["example_images/paper_3.png"]}],
          [{"text": "Identify element at location [123, 413, 1059, 1061]", "files": ["example_images/redhat.png"]}],
          [{"text": "Convert this page to docling.", "files": ["example_images/gazette_de_france.jpg"]}],
          ]

Is this a typo in the demo source? It seems normalization does not work at all for the "OCR the text in location [47, 531, 167, 565]" prompt, because normalize_values is called only if the prompt contains the substring "OCR at text at":

    if "OCR at text at" in text or "Identify element" in text or "formula" in text:
        text = normalize_values(text, target_max=500)

Indeed, if I pass "OCR the text in location [47, 531, 167, 565]" with example_images/s2w_example.png, without normalization and assuming the values are pixel coordinates, I get the expected result. However, it does not work for other regions for me.
