Instructions to use microsoft/Phi-3.5-vision-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Phi-3.5-vision-instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="microsoft/Phi-3.5-vision-instruct", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-vision-instruct", trust_remote_code=True, dtype="auto")
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use microsoft/Phi-3.5-vision-instruct with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "microsoft/Phi-3.5-vision-instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "microsoft/Phi-3.5-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
                ]
            }
        ]
    }'
```
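The same request can be issued from Python instead of curl. A minimal sketch against the OpenAI-compatible endpoint the server above exposes; the helper names (`build_chat_payload`, `send_chat`) are my own, and the final call assumes the vLLM server is running on localhost:8000 and that `requests` is installed:

```python
def build_chat_payload(model: str, text: str, image_url: str) -> dict:
    # OpenAI-compatible chat payload: one text part plus one image_url part.
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_chat_payload(
    "microsoft/Phi-3.5-vision-instruct",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

def send_chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    # Needs `pip install requests` and the vLLM server from above already running.
    import requests
    resp = requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Because the endpoint follows the OpenAI schema, the official `openai` client also works by pointing `base_url` at the vLLM server.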
- SGLang
How to use microsoft/Phi-3.5-vision-instruct with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/Phi-3.5-vision-instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "microsoft/Phi-3.5-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
                ]
            }
        ]
    }'
```

Use Docker images
```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/Phi-3.5-vision-instruct" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "microsoft/Phi-3.5-vision-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
                ]
            }
        ]
    }'
```

- Docker Model Runner
How to use microsoft/Phi-3.5-vision-instruct with Docker Model Runner:
```shell
docker model run hf.co/microsoft/Phi-3.5-vision-instruct
```
total images must be the same as the number of image tags
I've extracted 6 frames from a short video, but when I create the inputs I receive an error. I'm using 2 V100 GPUs on Databricks. Any ideas? The AutoProcessor does not have an image_tag parameter. I confirmed there are 6 images in the keyframes directory, and these are loaded using Image.open.
Code:
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True,
    torch_dtype="auto", _attn_implementation="eager"
)

placeholder = ""
messages = [
    {"role": "user", "content": placeholder + "Please summarize"},
]

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=4)
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

loaded_images = [Image.open(path) for path in keyframe_paths]
inputs = processor(prompt, loaded_images, return_tensors="pt")
```
Error message:
```
AssertionError: total images must be the same as the number of image tags, got 0 image tags and 6 images

File , line 4
      1 loaded_images = [Image.open(path) for path in keyframe_paths]
      2 #inputs = processor(text=prompt, images=loaded_images, return_tensors="pt").to("cuda:0")
----> 4 inputs = processor(prompt, loaded_images, return_tensors="pt")

File ~/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3.5-vision-instruct/c68f85286eac3fb376a17068e820e738a89c194a/processing_phi3_v.py:377, in Phi3VProcessor.__call__(self, text, images, padding, truncation, max_length, return_tensors)
    375 else:
    376     image_inputs = {}
--> 377 inputs = self._convert_images_texts_to_inputs(image_inputs, text, padding=padding, truncation=truncation, max_length=max_length, return_tensors=return_tensors)
    378 return inputs

File ~/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3.5-vision-instruct/c68f85286eac3fb376a17068e820e738a89c194a/processing_phi3_v.py:435, in Phi3VProcessor._convert_images_texts_to_inputs(self, images, texts, padding, truncation, max_length, return_tensors)
    433 assert unique_image_ids == list(range(1, len(unique_image_ids)+1)), f"image_ids must start from 1, and must be continuous int, e.g. [1, 2, 3], cannot be {unique_image_ids}"
    434 # total images must be the same as the number of image tags
--> 435 assert len(unique_image_ids) == len(images), f"total images must be the same as the number of image tags, got {len(unique_image_ids)} image tags and {len(images)} images"
    437 image_ids_pad = [[-iid]*num_img_tokens[iid-1] for iid in image_ids]
    439 def insert_separator(X, sep_list):
```
Hi @wvangils, your prompt needs to include one image tag per image. Build the placeholder like this:

```python
placeholder = ""
for i in range(len(loaded_images)):
    placeholder += f"<|image_{i+1}|>\n"

messages = [
    {"role": "user", "content": f"{placeholder}Please summarize."},
]
```

Note that image numbering starts at 1 (attempting to start with 0 will raise an error).
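Putting the fix together, the tag-building step can be factored into a small helper (`build_image_prompt` is a hypothetical name, not part of the Phi-3.5 API) so the tag count always matches the image count that the processor asserts on:

```python
def build_image_prompt(num_images: int, question: str) -> str:
    # One "<|image_k|>" tag per image, numbered from 1, followed by the question.
    tags = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return tags + question

# Six keyframes -> six tags, so the "total images must be the same as the
# number of image tags" assertion is satisfied.
content = build_image_prompt(6, "Please summarize.")
messages = [{"role": "user", "content": content}]
```

With `loaded_images` in scope, calling `build_image_prompt(len(loaded_images), "Please summarize.")` keeps the counts in sync even if the number of extracted frames changes.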