Instructions to use microsoft/Phi-4-multimodal-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Phi-4-multimodal-instruct with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("automatic-speech-recognition", model="microsoft/Phi-4-multimodal-instruct", trust_remote_code=True)# Load model directly from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-multimodal-instruct", trust_remote_code=True, dtype="auto") - Notebooks
- Google Colab
- Kaggle
Chat template support
Hi, thanks for the nice model! Would it be possible to add official support for HF's chat templates? For instance, at this time when using the tokenizer.apply_chat_templatefunction on a dictionary of messages I get
line 119, in generate
batch_prompts = [self.processor.apply_chat_template(messages, add_generation_prompt=True) for messages in all_messages]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
line 1219, in apply_chat_template
prompt = self.tokenizer.apply_chat_template(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
line 1695, in apply_chat_template
rendered_chat = compiled_template.render(
^^^^^^^^^^^^^^^^^^^^^^^^^
line 1304, in render
self.environment.handle_exception()
line 939, in handle_exception
raise rewrite_traceback_stack(source=source)
File "<template>", line 1, in top-level template code
TypeError: can only concatenate str (not "list") to str
Is there a way this could be fixed, in order to avoid hard-coding the chat template specifically for this model?
Thanks!
For reference, here I am passing a list of list of dictionaries as the messages, since I'm using a batch of prompts. So for example, I would have something like:
[
[ {'role: 'system', 'content': 'System prompt'}, {'role: 'user', 'content': 'User prompt'} ],
[ {'role: 'system', 'content': 'System prompt'}, {'role': 'user', 'content': [{'type': 'image'}, {'type': 'text', 'text': 'Question about the image'}]}]
]
Maybe you can take a look at the sample inference code
https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_inference_phi4mm.py
@merlerm currently system message is only enable in language, and not in vision and speech.
Hope this help