Instructions to use microsoft/Phi-4-multimodal-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Phi-4-multimodal-instruct with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="microsoft/Phi-4-multimodal-instruct", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-multimodal-instruct", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
Getting Bounding Boxes for Vision
generate_output = model.generate(
**inputs,
max_new_tokens=1000,
generation_config=generation_config,
output_scores=True,
return_dict_in_generate=True,
)
After generating the output, I tried to fetch the bounding boxes like this:
bounding_boxes = getattr(generate_output, "box_coordinates", None)
I am pretty sure Phi-4-multimodal-instruct doesn't provide bounding boxes the way Florence-2 does.
However, it would be great if Phi-4-multimodal-instruct provided that information, since it already performs optical character recognition.
Any idea how to get bounding boxes out of the model's vision capability would be a great help.
Or am I missing something?
Any suggestion or idea would be highly appreciated.
Regards.
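For what it's worth, one way to see what the model actually returns, rather than probing attribute names one by one, is to list the fields present on the generate output. A minimal sketch, using a stand-in object since the real output requires running the model (a real GenerateDecoderOnlyOutput carries fields like sequences and scores, but no box_coordinates):

```python
from types import SimpleNamespace

# Stand-in for model.generate(..., return_dict_in_generate=True);
# the real output object exposes its populated fields as attributes.
generate_output = SimpleNamespace(sequences=[[0, 1, 2]], scores=None)

# List the fields that are actually present instead of guessing names:
available = sorted(vars(generate_output))
print(available)  # ['scores', 'sequences']

# getattr with a default quietly returns None for fields the model never produces:
boxes = getattr(generate_output, "box_coordinates", None)
print(boxes)  # None
```

If box_coordinates is not in that list, the model simply never emits it, and no amount of getattr will recover it.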
I have also attempted to get the attention patterns like this:
with torch.no_grad():
    outputs = model(
        **inputs,
        output_attentions=True,
        output_hidden_states=True,
    )
attention_patterns = outputs.attentions
But it seems that outputs doesn't have any attentions attribute.
Any suggestion would be highly appreciated.
Thanks.
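In case it helps others hitting the same wall: with recent Transformers, attention weights are generally only materialized under the "eager" attention implementation; the sdpa and flash_attention_2 kernels never build the full attention matrix, so outputs.attentions can come back as None even when output_attentions=True is passed. A sketch of a guard for this, with a dummy object standing in for the real forward pass so the snippet runs on its own:

```python
from types import SimpleNamespace

# Dummy stand-in for `outputs = model(**inputs, output_attentions=True)`;
# with non-eager attention kernels the attentions field is typically None.
outputs = SimpleNamespace(attentions=None)

attentions = getattr(outputs, "attentions", None)
if attentions is None:
    # Reload the model with the eager implementation before asking for attentions:
    # AutoModelForCausalLM.from_pretrained(..., attn_implementation="eager")
    message = "no attentions returned; try attn_implementation='eager'"
else:
    message = f"got {len(attentions)} layers of attention maps"
print(message)
```

Whether the custom remote code behind trust_remote_code honors attn_implementation="eager" is something I have not verified for this particular model, so treat this as a first thing to try rather than a guaranteed fix.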