Add pipeline_tag, library_name and improve model card

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +22 -13
README.md CHANGED
@@ -1,7 +1,9 @@
  ---
- license: mit
  base_model:
  - microsoft/Phi-3.5-vision-instruct
  tags:
  - GUI
  - Agent
@@ -17,12 +19,9 @@ tags:

  ![overview](docs/images/abstract.png)

- **Phi-Ground-4B-7C** is one of the Phi-Ground model family, finetuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with fixed input resolution 1008x672. The Phi-Ground
- model family achieves state-of-the-art performance across all five grounding benchmarks for
- models under 10B parameters in agent settings. In the end-to-end model setting, our model still
- achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe
- that the various details discussed in the tech report, along with our successes and failures, not only clarify
- the construction of grounding models but also benefit other perception tasks.

  ### Main results

@@ -46,15 +45,15 @@ accelerate==0.30.0

  ### Input Formats

- The model require strict input format including fixed image resolution, instruction-first order and system prompt.

- Input preprocessing

  ```python
  from PIL import Image
  def process_image(img):

- target_width, target_height = 336 * 3, 336 *2

  img_ratio = img.width / img.height
  target_ratio = target_width / target_height
@@ -81,11 +80,21 @@ The description of the element:
  Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
  <|image_1|>
  <|end|>
- <|assistant|>""".format(RE=instriuction)

  image_path = "<your image path>"
  image = process_image(Image.open(image_path))
  ```

-
- Then you can use huggingface model or [vllm](https://github.com/vllm-project/vllm) to inference. We also provide [End-to-end examples](https://github.com/microsoft/Phi-Ground/tree/main/examples/call_example.py) and [benchmark results reproduction](https://github.com/microsoft/Phi-Ground/tree/main/benchmark/test_sspro.sh).
 
 
 
 
 
 
 
 
 
 
 
  ---
  base_model:
  - microsoft/Phi-3.5-vision-instruct
+ license: mit
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  tags:
  - GUI
  - Agent
 

  ![overview](docs/images/abstract.png)

+ **Phi-Ground-4B-7C** is a member of the Phi-Ground model family, introduced in the technical report [Phi-Ground Tech Report: Advancing Perception in GUI Grounding](https://huggingface.co/papers/2507.23779). It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672.
+
+ The Phi-Ground model family achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, this model achieves SOTA results with scores of **43.2** on ScreenSpot-pro and **27.2** on UI-Vision.

  ### Main results


  ### Input Formats

+ The model requires a strict input format including fixed image resolution, instruction-first order and system prompt.

+ **Input Preprocessing**

  ```python
  from PIL import Image
  def process_image(img):

+ target_width, target_height = 336 * 3, 336 * 2

  img_ratio = img.width / img.height
  target_ratio = target_width / target_height
 
  Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
  <|image_1|>
  <|end|>
+ <|assistant|>""".format(RE=instruction)

  image_path = "<your image path>"
  image = process_image(Image.open(image_path))
  ```

+ You can use the Hugging Face `transformers` library or [vLLM](https://github.com/vllm-project/vllm) for inference. For further details, including end-to-end examples and benchmark reproduction, please visit the [official GitHub repository](https://github.com/microsoft/Phi-Ground).
+
+ ### Citation
+ If you find this work useful, please cite:
+ ```bibtex
+ @article{zhang2025phi,
+ title={Phi-Ground Tech Report: Advancing Perception in GUI Grounding},
+ author={Zhang, Miaosen and Xu, Ziqiang and Zhu, Jialiang and Dai, Qi and Qiu, Kai and Yang, Yifan and Luo, Chong and Chen, Tianyi and Wagle, Justin and Franklin, Tim and others},
+ journal={arXiv preprint arXiv:2507.23779},
+ year={2025}
+ }
+ ```
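
The prompt shown in the diff asks for a bounding box "using relative coordinates multiplying 1000", and the model card fixes the input resolution at 1008x672. A minimal sketch of scaling such a box back to pixel coordinates on the processed image follows. It assumes the box arrives as four numbers in (x1, y1, x2, y2) order, which the diff does not confirm. The helper names `relative_box_to_pixels` and `box_center` are hypothetical and not part of the Phi-Ground repository. Parsing the raw model output string is left out because its exact format is not shown here.

```python
def relative_box_to_pixels(box, image_width=1008, image_height=672):
    """Scale a box from relative coordinates (multiplied by 1000) to pixels.

    Assumption: `box` is (x1, y1, x2, y2), each value in the range 0..1000.
    The defaults match the fixed 1008x672 input resolution from the model card.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 / 1000 * image_width),
        round(y1 / 1000 * image_height),
        round(x2 / 1000 * image_width),
        round(y2 / 1000 * image_height),
    )


def box_center(pixel_box):
    """Return the center point of a pixel box, e.g. as a click target."""
    x1, y1, x2, y2 = pixel_box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


# Example: a box covering the lower-middle part of the processed image.
pixel_box = relative_box_to_pixels((250, 500, 750, 1000))
center = box_center(pixel_box)
```

These pixel values refer to the processed 1008x672 image. If preprocessing pads or rescales the original screenshot, a further mapping back to the original screenshot would be needed. That step depends on the elided body of `process_image` and is not sketched here.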