---
base_model:
- microsoft/Phi-3.5-vision-instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- GUI
- Agent
- Grounding
- CUA
---

# Microsoft Phi-Ground-4B-7C

<p align="center">
<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">HomePage</a> | <a href="https://huggingface.co/papers/2507.23779" target="_blank">Paper</a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">Arxiv</a> | <a href="https://huggingface.co/microsoft/Phi-Ground" target="_blank">Model</a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/new_annotations" target="_blank">Eval data</a>
</p>
 |
|
|
|
|
|

**Phi-Ground-4B-7C** is a member of the Phi-Ground model family, introduced in the technical report [Phi-Ground Tech Report: Advancing Perception in GUI Grounding](https://huggingface.co/papers/2507.23779). It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672.

The Phi-Ground model family achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, this model achieves SOTA results with scores of **43.2** on ScreenSpot-pro and **27.2** on UI-Vision.

### Main results

![Main results](score1.png)

### Usage

The current `transformers` version can be verified with: `pip list | grep transformers`.

Examples of required packages:

```
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```

### Input Formats

The model requires a strict input format, including a fixed image resolution, instruction-first ordering, and the system prompt.

**Input Preprocessing**

```python
from PIL import Image


def process_image(img):
    # Pad the screenshot to the fixed input resolution of 1008x672 (3x2 tiles of 336 px).
    target_width, target_height = 336 * 3, 336 * 2

    img_ratio = img.width / img.height
    target_ratio = target_width / target_height

    # Fit the image inside the target canvas while preserving its aspect ratio.
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    # Uniform scale factor applied to the original image; keep it if you need to
    # map predicted coordinates back to the original screenshot.
    reshape_ratio = new_width / img.width

    img = img.resize((new_width, new_height), Image.LANCZOS)

    # Paste the resized image onto a white canvas, anchored at the top-left corner.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img


# Instruction-first prompt template expected by the model.
instruction = "<your instruction>"
prompt = """<|user|>
The description of the element:
{RE}

Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
<|image_1|>
<|end|>
<|assistant|>""".format(RE=instruction)

image_path = "<your image path>"
image = process_image(Image.open(image_path))
```
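
The prompt above asks for a bounding box in relative coordinates multiplied by 1000, measured on the padded 1008x672 canvas the model sees. The exact textual form of the reply is not pinned down here, so the helper below is only a minimal sketch: `decode_box` is a hypothetical name, and the parsing simply takes the first four numbers in the reply as `x1, y1, x2, y2` before mapping them (and their center) back to pixel coordinates in the original screenshot.

```python
import re


def decode_box(response, original_width, original_height):
    # Assumption: the reply contains at least four numbers, read as x1, y1, x2, y2
    # in relative coordinates multiplied by 1000 on the padded canvas.
    nums = [float(n) for n in re.findall(r"-?\d+\.?\d*", response)][:4]
    if len(nums) < 4:
        raise ValueError(f"Could not parse a bounding box from: {response!r}")
    x1, y1, x2, y2 = nums

    target_width, target_height = 336 * 3, 336 * 2
    # Same scale factor that process_image applied when shrinking the screenshot.
    reshape_ratio = min(target_width / original_width, target_height / original_height)

    def to_original(value, size):
        # relative*1000 -> pixels on the padded canvas -> pixels on the original image
        # (the resized screenshot is pasted at (0, 0), so no offset is needed).
        return value / 1000 * size / reshape_ratio

    box = (
        to_original(x1, target_width),
        to_original(y1, target_height),
        to_original(x2, target_width),
        to_original(y2, target_height),
    )
    center = ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)
    return box, center
```

Because `process_image` pastes the resized screenshot at the top-left corner of the canvas, undoing the transform only requires dividing by the resize ratio; no offset correction is needed.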

You can use the Hugging Face `transformers` library or [vLLM](https://github.com/vllm-project/vllm) for inference. For further details, including end-to-end examples and benchmark reproduction, please visit the [official GitHub repository](https://github.com/microsoft/Phi-Ground).
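
As a minimal sketch of the `transformers` path, running the `prompt` and `image` prepared above could look like the following. The checkpoint id, attention implementation, and generation settings are illustrative assumptions rather than settings taken from the report; see the GitHub repository for the exact loading and decoding code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-Ground"  # assumed checkpoint id; point this at the weights you downloaded

# The model is built on Phi-3.5-vision, whose processing code ships with the checkpoint,
# hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # requires flash_attn; drop this to use the default attention
).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# `prompt` and `image` come from the preprocessing snippet above.
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

with torch.no_grad():
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Strip the prompt tokens and decode only the newly generated answer.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)  # expected: a bounding box in relative coordinates multiplied by 1000
```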

### Citation

If you find this work useful, please cite:

```bibtex
@article{zhang2025phi,
  title={Phi-Ground Tech Report: Advancing Perception in GUI Grounding},
  author={Zhang, Miaosen and Xu, Ziqiang and Zhu, Jialiang and Dai, Qi and Qiu, Kai and Yang, Yifan and Luo, Chong and Chen, Tianyi and Wagle, Justin and Franklin, Tim and others},
  journal={arXiv preprint arXiv:2507.23779},
  year={2025}
}
```