phi3 image tokens
I am looking at using phi-3-vision models to describe an image. However, I couldn't help but notice that the number of tokens an image takes is quite large (~2000). Is this correct, or a potential bug? I have included a code snippet below so that you can check my assumptions.
From my understanding of VLMs, they simply take an image and use CLIP or similar to project it to one (or a few) tokens, so that it becomes a "language token".
Side questions
In case it helps me understand Phi:
- Where is the 17 coming from in the image shape below?
- Why is the image_sizes shape (1, 2) and not (1, 1), given that I have only referenced one image?
from PIL import Image
import requests
from transformers import AutoProcessor
model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
messages = [
{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]
url = "https://sm.ign.com/t/ign_ap/review/d/deadpool-r/deadpool-review_2s7s.1200.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt")
{k: v.shape for k, v in inputs.items()}
# {'input_ids': torch.Size([1, 2371]),
# 'attention_mask': torch.Size([1, 2371]),
# 'pixel_values': torch.Size([1, 17, 3, 336, 336]),
# 'image_sizes': torch.Size([1, 2])}
I tried to do image.resize((128, 128)) but this only increased the number of tokens to 2500+.
I think phi3-vision is based on LLaVA-1.6.
It looks like the 17 comes from the number of image crops.
You could take a look at the LLaVA-1.6 blog post:
https://llava-vl.github.io/blog/2024-01-30-llava-next/
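To make the crop idea concrete, here is a rough sketch of how the token count could be estimated from the image size alone. It is based on my reading of the model's remote processing code, not on anything official, so treat the constants as assumptions: a crop budget of 16 (plus one global view, which would give the 17), 144 tokens per 336x336 crop, and a few separator tokens per row of crops. The function name `estimate_image_tokens` and its `max_crops` parameter are made up for illustration.

```python
import math

CROP_SIZE = 336        # crop side in pixels, matching the 336x336 in pixel_values
MAX_CROPS = 16         # assumed crop budget (16 local crops + 1 global view = 17)
TOKENS_PER_CROP = 144  # assumed number of tokens contributed by each crop


def estimate_image_tokens(width, height, max_crops=MAX_CROPS):
    """Return (rows, cols, tokens) for an image of the given pixel size."""
    # Work in landscape orientation and swap the grid back at the end.
    portrait = width < height
    if portrait:
        width, height = height, width
    ratio = width / height

    # Find the largest grid, `scale` crops along the long side, within the budget.
    scale = 1
    while scale * math.ceil(scale / ratio) <= max_crops:
        scale += 1
    scale -= 1

    # The long side is resized to scale * 336 pixels; the short side is
    # padded up to the next multiple of 336.
    cols = scale
    rows = math.ceil(int(scale * CROP_SIZE / ratio) / CROP_SIZE)
    if portrait:
        rows, cols = cols, rows

    # Local crops + one global view, one extra token, and 12 separator tokens
    # per row of crops (plus one row for the global view).
    tokens = (rows * cols + 1) * TOKENS_PER_CROP + 1 + (rows + 1) * 12
    return rows, cols, tokens


print(estimate_image_tokens(1200, 675))  # (3, 5, 2353)
print(estimate_image_tokens(128, 128))   # (4, 4, 2509)
```

If the image in the question is 1200x675 (a guess from the file name, the actual size is not stated), 2353 image tokens plus a short text prompt would be in line with the 2371 reported. The sketch would also explain why resizing to 128x128 made things worse: under these assumptions a small image is upscaled to fill the crop budget, and a square image uses the full 4x4 grid, which lands at 2500+ tokens.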