microsoft/Phi-Ground

@@ -1,7 +1,9 @@
 ---
-license: mit
 base_model:
 - microsoft/Phi-3.5-vision-instruct
 tags:
 - GUI
 - Agent
@@ -17,12 +19,9 @@ tags:
 ![overview](docs/images/abstract.png)
-**Phi-Ground-4B-7C** is one of the Phi-Ground model family, finetuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with fixed input resolution 1008x672. The Phi-Ground
- model family achieves state-of-the-art performance across all five grounding benchmarks for
- models under 10B parameters in agent settings. In the end-to-end model setting, our model still
- achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe
- that the various details discussed in the tech report, along with our successes and failures, not only clarify
- the construction of grounding models but also benefit other perception tasks.
 ### Main results
@@ -46,15 +45,15 @@ accelerate==0.30.0
 ### Input Formats
-The model require strict input format including fixed image resolution, instruction-first order and system prompt.
-Input preprocessing
 ```python
 from PIL import Image
 def process_image(img):
-    target_width, target_height = 336 * 3, 336 *2
     img_ratio = img.width / img.height
     target_ratio = target_width / target_height
@@ -81,11 +80,21 @@ The description of the element:
 Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
 <|image_1|>
 <|end|>
-<|assistant|>""".format(RE=instriuction)
 image_path = "<your image path>"
 image = process_image(Image.open(image_path))
 ```
-Then you can use huggingface model or [vllm](https://github.com/vllm-project/vllm) to inference. We also provide [End-to-end examples](https://github.com/microsoft/Phi-Ground/tree/main/examples/call_example.py) and [benchmark results reproduction](https://github.com/microsoft/Phi-Ground/tree/main/benchmark/test_sspro.sh).

 ---
 base_model:
 - microsoft/Phi-3.5-vision-instruct
+license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
 tags:
 - GUI
 - Agent
 ![overview](docs/images/abstract.png)
+**Phi-Ground-4B-7C** is a member of the Phi-Ground model family, introduced in the technical report [Phi-Ground Tech Report: Advancing Perception in GUI Grounding](https://huggingface.co/papers/2507.23779). It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672.
+The Phi-Ground model family achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, this model achieves SOTA results with scores of **43.2** on ScreenSpot-Pro and **27.2** on UI-Vision.
 ### Main results
 ### Input Formats
+The model requires a strict input format including fixed image resolution, instruction-first order and system prompt.
+**Input Preprocessing**
 ```python
 from PIL import Image
 def process_image(img):
+    target_width, target_height = 336 * 3, 336 * 2
     img_ratio = img.width / img.height
     target_ratio = target_width / target_height
 Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
 <|image_1|>
 <|end|>
+<|assistant|>""".format(RE=instruction)
 image_path = "<your image path>"
 image = process_image(Image.open(image_path))
 ```
+You can use the Hugging Face `transformers` library or [vLLM](https://github.com/vllm-project/vllm) for inference. For further details, including end-to-end examples and benchmark reproduction, please visit the [official GitHub repository](https://github.com/microsoft/Phi-Ground).
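The prompt asks for a bounding box in relative coordinates multiplied by 1000, so the response has to be scaled back to pixels before it can drive a click. The following is a minimal sketch of that arithmetic, not code from the repository. The helper names (`parse_relative_bbox`, `relative_to_pixels`, `bbox_center`) are hypothetical, and it assumes the response text contains four numbers in `x1, y1, x2, y2` order; check the official examples for the exact output format.

```python
import re


def parse_relative_bbox(text):
    """Extract the first four numbers from a response string.

    Assumption: the numbers appear in x1, y1, x2, y2 order.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    if len(numbers) < 4:
        raise ValueError(f"expected four coordinates, got: {text!r}")
    return tuple(float(n) for n in numbers[:4])


def relative_to_pixels(bbox, width, height, scale=1000):
    """Scale a 0..scale relative box to pixel units of a width x height image."""
    x1, y1, x2, y2 = bbox
    return (
        x1 / scale * width,
        y1 / scale * height,
        x2 / scale * width,
        y2 / scale * height,
    )


def bbox_center(bbox):
    """Return the center point of a box, e.g. as a click target."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


# Example with a made-up response, using the 1008x672 model input size.
relative = parse_relative_bbox("[250, 500, 750, 1000]")
pixels = relative_to_pixels(relative, width=1008, height=672)
click_point = bbox_center(pixels)
```

These pixel values refer to the 1008x672 image passed to the model. Mapping them back to the original screenshot depends on how `process_image` resizes or pads the input, which this sketch does not cover.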
+### Citation
+If you find this work useful, please cite:
+```bibtex
+@article{zhang2025phi,
+  title={Phi-Ground Tech Report: Advancing Perception in GUI Grounding},
+  author={Zhang, Miaosen and Xu, Ziqiang and Zhu, Jialiang and Dai, Qi and Qiu, Kai and Yang, Yifan and Luo, Chong and Chen, Tianyi and Wagle, Justin and Franklin, Tim and others},
+  journal={arXiv preprint arXiv:2507.23779},
+  year={2025}
+}
+```