How to achieve faster inference speed?
I ran the sample inference code on a 4090 and found that inference takes about 3.14-4.85 seconds. Is there any way to speed it up?
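For anyone comparing numbers in this thread, it may help to time runs the same way. The thread does not say whether the 3.14-4.85 seconds includes a first-call warm-up. Below is a minimal sketch of a timing helper that discards warm-up runs. `time_inference`, `run_once` and `sync` are hypothetical names, not part of the sample code. `run_once` is assumed to wrap one full inference call, and `sync` can be set to `torch.cuda.synchronize` so that queued GPU work finishes before the clock is read.

```python
import statistics
import time
from typing import Callable, Dict, Optional


def time_inference(
    run_once: Callable[[], object],
    sync: Optional[Callable[[], None]] = None,
    warmup: int = 1,
    repeats: int = 5,
) -> Dict[str, float]:
    """Time a callable several times, discarding warm-up runs.

    run_once: hypothetical wrapper around one complete inference call.
    sync:     optional barrier (for example torch.cuda.synchronize) called
              before and after each timed run.
    """
    def barrier() -> None:
        if sync is not None:
            sync()

    # Warm-up runs absorb one-time costs and are not recorded.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(repeats):
        barrier()
        start = time.perf_counter()
        run_once()
        barrier()
        samples.append(time.perf_counter() - start)

    return {
        "mean_s": statistics.mean(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "runs": len(samples),
    }
```

Reporting the mean together with the min and max makes it easier to tell a consistently slow setup from one with occasional slow runs.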
Am AFK now so will need to time it later to compare. Purely anecdotally I found it seemed quite snappy on my 3090.
Also that timing isn't bad compared to the free trial speeds on Azure AI (granted that's using shared resources thus you probably wouldn't expect it to be particularly fast): https://ai.azure.com/explore/models/Phi-3-vision-128k-instruct/version/1/registry/azureml
It would be interesting to analyse the impact of downsampling images, to see if there is a sweet spot for improved speed at an acceptable loss in reading/observation accuracy (assuming this does improve speed at all).
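One way to set up that experiment is to pick a few target resolutions and time each one. The sketch below only computes the target sizes, keeping the aspect ratio and never upscaling. `downscaled_size` and the 4032x3024 example photo are hypothetical, and whether smaller inputs actually speed up this model is exactly what the experiment would need to show.

```python
from typing import Tuple


def downscaled_size(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most max_side."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# Candidate resolutions to benchmark for a hypothetical 4032x3024 photo.
for max_side in (2016, 1344, 672):
    print(max_side, downscaled_size(4032, 3024, max_side))
```

The resulting size can be passed to an image library's resize call (for example Pillow's `Image.resize`) before the image goes to the processor. The thread does not cover whether the model's own preprocessing rescales images internally, so it is worth measuring rather than assuming.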
In the config I changed "torch_dtype": "bfloat16" to "float16", which gave a marginal improvement in speed. For example, on an RTX 4090 the average went from 1.36s to 1.25s, and on an RTX 3090 from 2s to 1.8s.
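A minimal sketch of that config change, assuming a local copy of the model's `config.json`. `set_torch_dtype` is a hypothetical helper and the path is whatever your local snapshot uses. As far as I know, passing `torch_dtype=torch.float16` to `from_pretrained` should have a similar effect without editing the file.

```python
import json
from pathlib import Path
from typing import Optional


def set_torch_dtype(config_path: str, dtype: str = "float16") -> Optional[str]:
    """Set "torch_dtype" in a config.json file and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("torch_dtype")  # None if the key was absent
    config["torch_dtype"] = dtype
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

Note that float16 has a narrower numeric range than bfloat16, so it is worth checking that outputs still look right after the switch.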
It also depends on the OS. On Windows, on a 4080S, I got about 3 sec. The same scripts (a slightly modified Python example from Hugging Face) took 1.66 sec on Linux. I have no idea why - disclaimer: these were different machines. But I saw the same behavior on my home GTX 1080: from 30 sec on Windows down to about 22 sec on WSL on the same machine.
Sorry, maybe I did not explain it correctly. My results are for my own images with my custom prompt, based on the Python example. I only changed the prompt and image, so the timings depend on what image and prompt you provide. The results I provided are only meant to show the comparative difference when using float16, and between the different GPUs I tested.
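To make the comparative differences explicit, the timings reported above can be converted to relative reductions. This is a small sketch using only the numbers quoted in this thread. `percent_faster` is a hypothetical helper, and the "about 3 sec" and "about 22 sec" figures are approximate, so those percentages are rough.

```python
def percent_faster(before_s: float, after_s: float) -> float:
    """Relative reduction in latency, in percent, rounded to one decimal."""
    if before_s <= 0:
        raise ValueError("before_s must be positive")
    return round((before_s - after_s) / before_s * 100, 1)


# (before, after) timings in seconds, as reported in the thread.
reported = {
    "RTX 4090, bfloat16 -> float16": (1.36, 1.25),
    "RTX 3090, bfloat16 -> float16": (2.0, 1.8),
    "Windows (4080S) -> Linux, different machines": (3.0, 1.66),
    "GTX 1080, Windows -> WSL": (30.0, 22.0),
}

for label, (before_s, after_s) in reported.items():
    print(f"{label}: {percent_faster(before_s, after_s)}% less time")
```

On these numbers the dtype change is worth roughly 8-10%, while the OS difference is considerably larger.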
I did not try any cloud dedicated GPUs. I found kind of a GPU sharing service, and that allowed me to test on different consumer grade GPUs as I was wondering what's the minimum budget for my scenario.
I do want to try ONNX (I've seen your post), but it's not a priority for me right now. For the moment I gave up on Phi 3 Vision and I'm using ChatGPT-4o, which is more cost effective for my scenario, though the response times are worse.