How to achieve faster inference speed?

#18

by zkDU - opened May 25, 2024

Discussion

zkDU

May 25, 2024

I ran the sample inference code on a 4090, and inference takes about 3.14-4.85 seconds to complete. Is there any way to speed it up?
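For comparing numbers like these across machines, it helps to time several runs after a warmup rather than a single call. The following is a minimal sketch of such a harness, not the method used in the post; `time_calls` and `summarize` are hypothetical helper names, and the callable passed in stands for whatever inference call is being measured.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(fn: Callable[[], object], warmup: int = 2, runs: int = 5) -> List[float]:
    """Return wall-clock durations in seconds for `runs` calls to `fn`.

    `warmup` untimed calls run first so one-off costs (model loading,
    cache allocation, kernel selection) do not distort the measurements.
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

If the callable launches GPU work without reading the result back, it should end with a synchronizing step (in PyTorch, `torch.cuda.synchronize()`), otherwise the timer may stop before the GPU has finished.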

nmstoker

May 25, 2024

Am AFK now so will need to time it later to compare. Purely anecdotally I found it seemed quite snappy on my 3090.

Also that timing isn't bad compared to the free trial speeds on Azure AI (granted that's using shared resources thus you probably wouldn't expect it to be particularly fast): https://ai.azure.com/explore/models/Phi-3-vision-128k-instruct/version/1/registry/azureml

It would be interesting to analyse the impact of downsampling images, to see if there is a sweet spot for improved speed at an acceptable loss in reading/observation accuracy (assuming this does improve speed at all).
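One way to run that experiment is to cap the longer side of the image before handing it to the processor. This is a rough sketch assuming Pillow is installed; `downsampled_size` and `downsample_image` are hypothetical helper names, and whether a smaller input actually reduces latency depends on how the model's processor resizes and tiles images, which this sketch does not address.

```python
from typing import Tuple


def downsampled_size(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most `max_side`.

    The aspect ratio is preserved and images are never upscaled.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def downsample_image(path: str, max_side: int):
    """Open an image with Pillow and shrink it to fit within `max_side`."""
    from PIL import Image  # imported lazily; requires `pip install pillow`

    image = Image.open(path).convert("RGB")
    return image.resize(downsampled_size(image.width, image.height, max_side))
```

Timing the same prompt at a few values of `max_side` and checking the answers by hand would show whether a sweet spot exists.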

2U1

May 31, 2024

@nmstoker I have used the 4-bit model and also downsampled the image, but it doesn't speed up at all.

haiduc32

Jun 12, 2024

I changed "torch_dtype": "bfloat16" to float16 in the config, which gave some marginal improvement in speed. For example, on an RTX 4090 the average went from 1.36s to 1.25s, and on an RTX 3090 from 2s to 1.8s.
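The same dtype change can also be made at load time instead of editing the config file. A minimal sketch, assuming `torch`, `transformers`, and `accelerate` are installed and a CUDA GPU is available; `load_phi3_vision` is a hypothetical helper name, and the attention implementation is left at the model's default here, which may need adjusting on some setups.

```python
def load_phi3_vision(model_id: str = "microsoft/Phi-3-vision-128k-instruct",
                     dtype_name: str = "float16"):
    """Load the model with an explicit dtype, overriding the config's torch_dtype."""
    # Imported lazily so this file can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=getattr(torch, dtype_name),  # e.g. torch.float16 or torch.bfloat16
        trust_remote_code=True,                  # the model ships custom modeling code
        device_map="cuda",                       # needs the `accelerate` package
    )
```

Loading once with `dtype_name="bfloat16"` and once with `"float16"`, then timing the same prompt and image on each, would reproduce the comparison described here.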
It also depends on the OS. On Windows, on a 4080S, I got about 3 sec. The same scripts (a slightly modified Python example from Hugging Face) took 1.66 sec on Linux. I have no idea why. Disclaimer: these were different machines. But I saw the same behavior on my home GTX 1080: from 30 sec on Windows down to about 22 on WSL on the same machine.
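To put those timings on a common scale, the relative reductions can be worked out from the figures quoted in this post. The snippet below is only arithmetic on those numbers, several of which are approximate ("about 3 sec", "about 22"); `percent_reduction` is a hypothetical helper name.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


# (before, after) timings in seconds, as quoted in the post.
timings = {
    "RTX 4090, bfloat16 -> float16": (1.36, 1.25),
    "RTX 3090, bfloat16 -> float16": (2.0, 1.8),
    "4080S, Windows -> Linux (different machines)": (3.0, 1.66),
    "GTX 1080, Windows -> WSL (same machine)": (30.0, 22.0),
}

for label, (before, after) in timings.items():
    print(f"{label}: {percent_reduction(before, after):.1f}% less time")
```

By these figures the dtype change saves roughly 8-10%, while the OS difference is considerably larger, though the 4080S comparison mixes two machines.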

wamozart

Jun 13, 2024

@haiduc32 I was testing it on an EC2 g5.xl instance and the inference time is much slower. Which instances do you recommend? I need to achieve the inference times you got.

haiduc32

Jun 14, 2024

Sorry, maybe I did not explain it correctly. My results are for my own images with my custom prompt, but based on the Python example; I just changed the prompt and image. So it would depend on what image and prompt you provide. The results I provided are only meant to show the comparative difference when using float16, and between the different GPUs I tested.
I did not try any dedicated cloud GPUs. I found a kind of GPU-sharing service, which allowed me to test on different consumer-grade GPUs, as I was wondering what the minimum budget for my scenario is.
I do want to try ONNX (I've seen your post), but it's not a priority for me right now: for the moment I gave up on Phi 3 Vision and I'm using ChatGPT-4o, which is more cost-effective for my scenario, though the response times are worse.

nguyenbh changed discussion status to closed Aug 4, 2024
