Instructions to use microsoft/Phi-4-multimodal-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Phi-4-multimodal-instruct with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="microsoft/Phi-4-multimodal-instruct", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-multimodal-instruct", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
Getting Bounding Boxes for Vision
generate_output = model.generate(
**inputs,
max_new_tokens=1000,
generation_config=generation_config,
output_scores=True,
return_dict_in_generate=True,
)
After generating the output, I tried to fetch the bounding boxes like this:
bounding_boxes = getattr(generate_output, "box_coordinates", None)
I am pretty sure Phi-4-multimodal-instruct doesn't provide bounding boxes the way Florence-2 does.
However, it would be great if Phi-4-multimodal-instruct provided that information, since it already performs optical character recognition.
Any idea how to get bounding boxes out of the model's vision capability would be a great help.
Or am I missing something?
Any suggestion or idea would be highly appreciated.
Regards.
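For what it's worth, one way to see what the model actually returns, rather than probing attribute names one by one, is to list the fields present on the generate output. A minimal sketch, using a stand-in object since the real output requires running the model (a real GenerateDecoderOnlyOutput carries fields like sequences and scores, but no box_coordinates):

```python
from types import SimpleNamespace

# Stand-in for model.generate(..., return_dict_in_generate=True);
# the real output object exposes its populated fields as attributes.
generate_output = SimpleNamespace(sequences=[[0, 1, 2]], scores=None)

# List the fields that are actually present instead of guessing names:
available = sorted(vars(generate_output))
print(available)  # ['scores', 'sequences']

# getattr with a default quietly returns None for fields the model never produces:
boxes = getattr(generate_output, "box_coordinates", None)
print(boxes)  # None
```

If box_coordinates is not in that list, the model simply never emits it, and no amount of getattr will recover it.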
I have also attempted to get the attention patterns like this:
with torch.no_grad():
    outputs = model(
        **inputs,
        output_attentions=True,
        output_hidden_states=True,
    )
attention_patterns = outputs.attentions
But it seems that outputs doesn't have any attentions attribute.
Any suggestion would be highly appreciated.
Thanks.
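In case it helps others hitting the same wall: with recent Transformers, attention weights are generally only materialized under the "eager" attention implementation; the sdpa and flash_attention_2 kernels never build the full attention matrix, so outputs.attentions can come back as None even when output_attentions=True is passed. A sketch of a guard for this, with a dummy object standing in for the real forward pass so the snippet runs on its own:

```python
from types import SimpleNamespace

# Dummy stand-in for `outputs = model(**inputs, output_attentions=True)`;
# with non-eager attention kernels the attentions field is typically None.
outputs = SimpleNamespace(attentions=None)

attentions = getattr(outputs, "attentions", None)
if attentions is None:
    # Reload the model with the eager implementation before asking for attentions:
    # AutoModelForCausalLM.from_pretrained(..., attn_implementation="eager")
    message = "no attentions returned; try attn_implementation='eager'"
else:
    message = f"got {len(attentions)} layers of attention maps"
print(message)
```

Whether the custom remote code behind trust_remote_code honors attn_implementation="eager" is something I have not verified for this particular model, so treat this as a first thing to try rather than a guaranteed fix.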