Instructions to use microsoft/Phi-3-vision-128k-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use microsoft/Phi-3-vision-128k-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-3-vision-128k-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-vision-128k-instruct", trust_remote_code=True, dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use microsoft/Phi-3-vision-128k-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/Phi-3-vision-128k-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3-vision-128k-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/microsoft/Phi-3-vision-128k-instruct

SGLang

How to use microsoft/Phi-3-vision-128k-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/Phi-3-vision-128k-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3-vision-128k-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/Phi-3-vision-128k-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3-vision-128k-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use microsoft/Phi-3-vision-128k-instruct with Docker Model Runner:
```
docker model run hf.co/microsoft/Phi-3-vision-128k-instruct
```

phi3 image tokens

#42

by sachin - opened Jun 22, 2024

Discussion

sachin

Jun 22, 2024

I am looking at using phi-3-vision models to try and describe an image. However, I couldn't help but notice that the number of tokens that an image takes is quite large (~2000). Is this correct, or a potential bug? I have included a code snippet so that you can check my assumptions:

From my understanding of VLMs they simply take an image, and use CLIP or similar to project one image to one (or few tokens), so that they become a "language token".

Side questions

Incase it helps me understand phi,

Where is the 17 coming from in the below image shape.
Why is the image_sizes (1, 2) and not (1, 1) given that I have only referenced one image.

from PIL import Image
import requests
from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]
url = "https://sm.ign.com/t/ign_ap/review/d/deadpool-r/deadpool-review_2s7s.1200.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt")

{k: v.shape for k, v in inputs.items()}
# {'input_ids': torch.Size([1, 2371]),
# 'attention_mask': torch.Size([1, 2371]),
# 'pixel_values': torch.Size([1, 17, 3, 336, 336]),
# 'image_sizes': torch.Size([1, 2])}

I tried to do image.resize((128, 128)) but this only increased the number of tokens to 2500+.

2U1

Jun 25, 2024

I think phi3-vision is based on llava-1.6.
It looks like the 17 is from number of image crops.

You could take a look of this llava-1.6.
https://llava-vl.github.io/blog/2024-01-30-llava-next/

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment