Instructions for using microsoft/Phi-4-multimodal-instruct with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use microsoft/Phi-4-multimodal-instruct with Transformers (a usage sketch follows the list below):

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
)
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
    dtype="auto",
)
```

- Notebooks
- Google Colab
- Kaggle
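As referenced above, a minimal usage sketch for the pipeline: it transcribes a local audio file. `sample.wav` is a hypothetical placeholder path, and `trust_remote_code=True` is required because the model ships custom code.

```python
# Minimal sketch: transcribe a local audio file with the ASR pipeline.
# "sample.wav" is a hypothetical placeholder; substitute your own file.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
)

result = pipe("sample.wav")
print(result["text"])  # the pipeline returns a dict with a "text" field
```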
Unable to run on V100
Hi, I'm trying to run on a V100 GPU and am following the recommendation of setting `attn_implementation='eager'`, but this still returns `RuntimeError: FlashAttention only supports Ampere GPUs or newer.`
Any idea what is going on here?
Thanks!
V100 uses the Volta architecture. Try it on A100, A6000, or A40 GPUs!
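A quick way to confirm this: FlashAttention 2 requires Ampere-class hardware (CUDA compute capability 8.0 or newer), and PyTorch can report what your GPU supports. A minimal sketch:

```python
import torch

# FlashAttention 2 requires compute capability >= 8.0 (Ampere or newer);
# a V100 (Volta) reports (7, 0) and therefore does not qualify.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("FlashAttention-capable:", (major, minor) >= (8, 0))
```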
I don't have access to those, but the model card says it should work on V100
Try installing flash-attn as a subprocess with the CUDA build skipped?

```python
import subprocess

# Install flash-attn without compiling its CUDA kernels
subprocess.run(
    "pip install flash-attn --no-build-isolation",
    env={"FLASH_ATTENTION_SKIP_CUDA_BUILD": "TRUE"},
    shell=True,
)
```
Flash attention will not work on this type of GPU. My question is why the model still tries to run flash attention when `attn_implementation='eager'` is set.
Because the model configuration overrides it: the repo's config.json sets `"_attn_implementation": "flash_attention_2"`.
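One possible workaround (a sketch only, not verified against this repo's remote code) is to load the config first, replace the pinned value, and pass the modified config to `from_pretrained`:

```python
# Sketch of a possible workaround (unverified): override the attention
# implementation in the config before loading, since the repo's config.json
# pins "_attn_implementation": "flash_attention_2".
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)
config._attn_implementation = "eager"  # replace the pinned flash_attention_2

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    config=config,
    trust_remote_code=True,
    dtype="auto",
)
```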