---
license: mit
base_model:
- microsoft/Phi-3.5-vision-instruct
tags:
- GUI
- Agent
- Grounding
- CUA
---

# Microsoft Phi-Ground-4B-7C

<p align="center">
   <a href="https://microsoft.github.io/Phi-Ground/" target="_blank">🤖 HomePage</a> | <a href="https://huggingface.co/papers/2507.23779" target="_blank">📄 Paper</a> | <a href="https://arxiv.org/abs/2507.23779" target="_blank">📄 Arxiv</a> | <a href="https://huggingface.co/microsoft/Phi-Ground" target="_blank">😊 Model</a> | <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/new_annotations" target="_blank">😊 Eval data</a>
</p>

![overview](docs/images/abstract.png)

**Phi-Ground-4B-7C** is a member of the Phi-Ground model family, finetuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672. In agent settings, the Phi-Ground
 model family achieves state-of-the-art performance across all five grounding benchmarks for
 models under 10B parameters. In the end-to-end model setting, our model still
 achieves SOTA results, with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe
 that the details discussed in the tech report, along with our successes and failures, not only clarify
 how grounding models are constructed but also benefit other perception tasks.

### Main results

![overview](docs/images/r1.png)

### Usage
You can check the installed `transformers` version with `pip list | grep transformers`.

Examples of required packages:
```
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```
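
As a complement to the `pip list` check, the sketch below compares installed package versions against the example pins listed above. The helper name `check_versions` and the `PINNED` dictionary are illustrative and not part of the Phi-Ground repository. The comparison ignores local build suffixes such as `+cu121`, which is an assumption about how you want to treat CUDA-specific wheels.

```python
from importlib.metadata import PackageNotFoundError, version

# Example pins copied from the list above (hypothetical helper data).
PINNED = {
    "flash_attn": "2.5.8",
    "numpy": "1.24.4",
    "Pillow": "10.3.0",
    "Requests": "2.31.0",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "transformers": "4.43.0",
    "accelerate": "0.30.0",
}


def check_versions(pins):
    """Return {name: (installed, expected, matches)} for each pinned package."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Drop a local build tag such as "+cu121" before comparing.
        base = installed.split("+")[0] if installed is not None else None
        report[name] = (installed, expected, base == expected)
    return report


if __name__ == "__main__":
    for name, (installed, expected, ok) in check_versions(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={expected} [{status}]")
```

A mismatch does not necessarily mean the model will fail to run; the list above gives examples of versions rather than strict requirements.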


### Input Formats

The model requires a strict input format, including a fixed image resolution, instruction-first ordering, and a system prompt.

Input preprocessing

```python
from PIL import Image
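def process_image(img):
    """Resize img to fit 1008x672, keeping aspect ratio, and pad with white."""
    # Fixed input resolution expected by the model: 1008 x 672.
    target_width, target_height = 336 * 3, 336 * 2

    img_ratio = img.width / img.height
    target_ratio = target_width / target_height

    if img_ratio > target_ratio:
        # Image is wider than the target: fit to the target width.
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        # Image is taller than (or matches) the target: fit to the target height.
        new_height = target_height
        new_width = int(new_height * img_ratio)
    # Scale factor from original to resized pixels (not used below).
    reshape_ratio = new_width / img.width

    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Paste the resized image at the top-left corner of a white canvas.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img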

instruction = "<your instruction>"
prompt = """<|user|>
The description of the element: 
{RE}

Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
<|image_1|>
<|end|>
<|assistant|>""".format(RE=instruction)

image_path = "<your image path>"
image = process_image(Image.open(image_path))
```
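
Because `process_image` resizes and pads the screenshot, a predicted box has to be mapped back before it can be used on the original image. The sketch below shows one way to do that. It assumes the model's box is given as `(x1, y1, x2, y2)` in relative coordinates multiplied by 1000 with respect to the padded 1008x672 input. The coordinate order and the helper name `to_original_pixels` are assumptions, not something this card specifies.

```python
def to_original_pixels(box_rel, orig_width, orig_height,
                       target_width=336 * 3, target_height=336 * 2):
    """Map a 0-1000 relative box on the padded image to original-image pixels.

    Assumes box_rel is (x1, y1, x2, y2); this order is an assumption.
    """
    # Recompute the resize factor the same way process_image does.
    img_ratio = orig_width / orig_height
    target_ratio = target_width / target_height
    if img_ratio > target_ratio:
        new_width = target_width
    else:
        new_width = int(target_height * img_ratio)
    reshape_ratio = new_width / orig_width

    # Relative (x1000) -> padded-image pixels -> original-image pixels.
    # The image is pasted at (0, 0), so no offset needs to be subtracted.
    scale_x = target_width / 1000 / reshape_ratio
    scale_y = target_height / 1000 / reshape_ratio
    x1, y1, x2, y2 = box_rel
    return (x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)
```

For example, for a 2016x1344 screenshot the resize factor is 0.5, so a box of `(500, 500, 1000, 1000)` maps to roughly `(1008, 672, 2016, 1344)` in original pixels. Coordinates that fall inside the white padding map to points outside the original image and may need clamping.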


You can then run inference with the Hugging Face model or with [vllm](https://github.com/vllm-project/vllm). We also provide [end-to-end examples](https://github.com/microsoft/Phi-Ground/tree/main/examples/call_example.py) and a script for [reproducing the benchmark results](https://github.com/microsoft/Phi-Ground/tree/main/benchmark/test_sspro.sh).