Is a special instruction needed to trigger the OCR location function?

#38

by liupei0408 - opened Oct 27, 2023

Oct 27, 2023

As mentioned above, is a special instruction needed for the OCR location feature in Fuyu-8b to get the same result as shown in the blog?

Nooodles

Oct 30, 2023

Molbap

Nov 3, 2023

Hi @liupei0408, @Nooodles: you can try this with the new release of transformers! @pcuenq worked on the bbox postprocessing, so you can localise text by doing:

from PIL import Image
import requests
import io
from transformers import FuyuForCausalLM, FuyuProcessor

# Load the processor and model weights
pretrained_path = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(pretrained_path)
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map='auto')

# The instruction is followed by the text to locate ("Williams")
bbox_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n Williams"
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.jpeg"
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))

model_inputs = processor(text=bbox_prompt, images=bbox_image_pil).to('cuda')

outputs = model.generate(**model_inputs, max_new_tokens=10)
# Convert the generated box tokens into coordinates, then decode to text
post_processed_bbox_tokens = processor.post_process_box_coordinates(outputs)[0]
model_outputs = processor.decode(post_processed_bbox_tokens, skip_special_tokens=True)
# Keep only the text after the '\x04' separator, i.e. the model's answer
prediction = model_outputs.split('\x04', 1)[1] if '\x04' in model_outputs else ''

prediction will then contain the coordinates of the text "Williams" in the image.
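If you want the coordinates as numbers rather than a string, something like the sketch below might help. It assumes the decoded prediction follows the `<box>top, left, bottom, right</box>` pattern described for `post_process_box_coordinates`; please check that against what your transformers version actually prints. `parse_box` and `to_pil_crop_box` are hypothetical helper names, not part of transformers.

```python
import re
from typing import Optional, Tuple

# Assumption: the decoded prediction contains "<box>top, left, bottom, right</box>".
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_box(prediction: str) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) as ints, or None if no box is found."""
    match = BOX_PATTERN.search(prediction)
    if match is None:
        return None
    top, left, bottom, right = (int(group) for group in match.groups())
    return top, left, bottom, right


def to_pil_crop_box(box: Tuple[int, int, int, int]) -> Tuple[int, int, int, int]:
    """Reorder (top, left, bottom, right) into PIL's (left, upper, right, lower)."""
    top, left, bottom, right = box
    return left, top, right, bottom


# Made-up string for illustration only, not real model output.
example = "<box>10, 20, 30, 40</box>"
box = parse_box(example)
print(box)
if box is not None:
    print(to_pil_crop_box(box))
```

With a real prediction you could then call `bbox_image_pil.crop(to_pil_crop_box(box))` to look at the region the model picked. An empty prediction gives `None`, so check for that first.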

Nooodles

Dec 1, 2023

This comment has been hidden
