| --- |
| license: mit |
| base_model: |
| - microsoft/Phi-3.5-vision-instruct |
| tags: |
| - GUI |
| - Agent |
| - Grounding |
| - CUA |
| --- |
| |
| # Microsoft Phi-Ground-4B-7C |
|
|
| <p align="center"> |
| <a href="https://microsoft.github.io/Phi-Ground/" target="_blank">HomePage</a> | <a href="https://huggingface.co/papers/2507.23779" target="_blank">Paper</a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">arXiv</a> | <a href="https://huggingface.co/microsoft/Phi-Ground" target="_blank">Model</a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/new_annotations" target="_blank">Eval data</a> |
| </p> |
|
|
|  |
|
|
| **Phi-Ground-4B-7C** is a member of the Phi-Ground model family, fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672. The Phi-Ground |
| model family achieves state-of-the-art (SOTA) performance on all five grounding benchmarks among |
| models under 10B parameters in agent settings. In the end-to-end model setting, our model still |
| achieves SOTA results, scoring 43.2 on ScreenSpot-Pro and 27.2 on UI-Vision. We believe |
| that the details discussed in the tech report, along with our successes and failures, not only clarify |
| how grounding models are built but also benefit other perception tasks. |
|
|
| ### Main results |
|
|
|  |
|
|
| ### Usage |
| The current `transformers` version can be verified with: `pip list | grep transformers`. |
|
|
| Examples of required packages: |
| ``` |
| flash_attn==2.5.8 |
| numpy==1.24.4 |
| Pillow==10.3.0 |
| Requests==2.31.0 |
| torch==2.3.0 |
| torchvision==0.18.0 |
| transformers==4.43.0 |
| accelerate==0.30.0 |
| ``` |
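As a complement to the `pip list` check, the sketch below compares installed versions against the pins listed above from within Python. It uses only the standard-library `importlib.metadata`; the names `PINNED`, `installed_version`, and `report_mismatches` are illustrative and not part of the Phi-Ground repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the package list above.
PINNED = {
    "flash_attn": "2.5.8",
    "numpy": "1.24.4",
    "Pillow": "10.3.0",
    "Requests": "2.31.0",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "transformers": "4.43.0",
    "accelerate": "0.30.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report_mismatches(pinned=PINNED):
    """Return {package: (pinned, installed)} for every package that differs."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = installed_version(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

Calling `report_mismatches()` returns an empty dict when every pin matches exactly. The comparison is a plain string match, so a build with a local suffix (for example a CUDA-tagged `torch` wheel) would be reported as a mismatch even if the base version agrees.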
|
|
|
|
| ### Input Formats |
|
|
| The model requires a strict input format: a fixed image resolution, instruction-first ordering, and the system prompt shown below. |
|
|
| Input preprocessing |
|
|
| ```python |
| from PIL import Image |
| |
| |
| def process_image(img): |
|     # Fixed model input resolution: 1008 x 672 (3 x 2 tiles of 336 px). |
|     target_width, target_height = 336 * 3, 336 * 2 |
| |
|     img_ratio = img.width / img.height |
|     target_ratio = target_width / target_height |
| |
|     # Scale to fit inside the target canvas while keeping the aspect ratio. |
|     if img_ratio > target_ratio: |
|         new_width = target_width |
|         new_height = int(new_width / img_ratio) |
|     else: |
|         new_height = target_height |
|         new_width = int(new_height * img_ratio) |
|     # Scale factor from original to resized pixels (not used further here). |
|     reshape_ratio = new_width / img.width |
| |
|     # Resize, then pad with white; the image sits in the top-left corner. |
|     img = img.resize((new_width, new_height), Image.LANCZOS) |
|     new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255)) |
|     paste_position = (0, 0) |
|     new_img.paste(img, paste_position) |
|     return new_img |
| |
| |
| instruction = "<your instruction>" |
| # The instruction comes first, followed by the image placeholder. |
| prompt = """<|user|> |
| The description of the element: |
| {RE} |
| |
| Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000. |
| <|image_1|> |
| <|end|> |
| <|assistant|>""".format(RE=instruction) |
| |
| image_path = "<your image path>" |
| image = process_image(Image.open(image_path)) |
| ``` |
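Because the prompt asks for a bounding box in relative coordinates multiplied by 1000, a predicted box usually has to be mapped back to pixels of the original screenshot. The sketch below shows one way to do that. It assumes the four values are `(x1, y1, x2, y2)` measured relative to the padded 1008x672 canvas; this is an assumption, not something stated in this card. Parsing the model's raw text output is not covered. The helper names `resize_ratio` and `box_to_original_pixels` are hypothetical.

```python
# Same canvas size as process_image: 1008 x 672.
TARGET_WIDTH, TARGET_HEIGHT = 336 * 3, 336 * 2


def resize_ratio(orig_width, orig_height):
    """Scale factor that process_image applies to the original image."""
    img_ratio = orig_width / orig_height
    target_ratio = TARGET_WIDTH / TARGET_HEIGHT
    if img_ratio > target_ratio:
        new_width = TARGET_WIDTH
    else:
        new_width = int(TARGET_HEIGHT * img_ratio)
    return new_width / orig_width


def box_to_original_pixels(box, orig_width, orig_height):
    """Map (x1, y1, x2, y2) in 0-1000 relative units to original-image pixels.

    Assumption: the values are relative to the padded 1008x672 canvas.
    The image is pasted at (0, 0), so no offset needs to be subtracted.
    """
    ratio = resize_ratio(orig_width, orig_height)
    x1, y1, x2, y2 = box
    return (
        x1 / 1000 * TARGET_WIDTH / ratio,
        y1 / 1000 * TARGET_HEIGHT / ratio,
        x2 / 1000 * TARGET_WIDTH / ratio,
        y2 / 1000 * TARGET_HEIGHT / ratio,
    )
```

For a 2016x1344 screenshot, the scale factor is 0.5, so `box_to_original_pixels((500, 500, 1000, 1000), 2016, 1344)` maps the lower-right quadrant of the canvas to the lower-right quadrant of the original image. Coordinates that fall in the white padding map to points outside the original image and may need clamping.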
|
|
|
|
| You can then run inference with the Hugging Face model or with [vLLM](https://github.com/vllm-project/vllm). We also provide [end-to-end examples](https://github.com/microsoft/Phi-Ground/tree/main/examples/call_example.py) and a [script for reproducing benchmark results](https://github.com/microsoft/Phi-Ground/tree/main/benchmark/test_sspro.sh). |