---
base_model:
- microsoft/Phi-3.5-vision-instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- GUI
- Agent
- Grounding
- CUA
---

# Microsoft Phi-Ground-4B-7C

🤖 HomePage | 📄 Paper | 📄 Arxiv | 😊 Model | 😊 Eval data

![overview](docs/images/abstract.png)

**Phi-Ground-4B-7C** is a member of the Phi-Ground model family, introduced in the technical report [Phi-Ground Tech Report: Advancing Perception in GUI Grounding](https://huggingface.co/papers/2507.23779). It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672. In agent settings, the Phi-Ground model family achieves state-of-the-art performance on all five grounding benchmarks among models under 10B parameters. In the end-to-end model setting, this model achieves SOTA results with scores of **43.2** on ScreenSpot-pro and **27.2** on UI-Vision.

### Main results

![overview](docs/images/r1.png)

### Usage

You can check the installed `transformers` version with `pip list | grep transformers`.

Example package versions:

```
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
```

### Input Formats

The model requires a strict input format: a fixed image resolution, the instruction placed before the image, and a fixed prompt template.

**Input Preprocessing**

```python
from PIL import Image


def process_image(img):
    # Fixed model input size: 1008 x 672 (3 x 2 tiles of 336 px).
    target_width, target_height = 336 * 3, 336 * 2

    # Scale the image to fit inside the target canvas, keeping its aspect ratio.
    img_ratio = img.width / img.height
    target_ratio = target_width / target_height
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    img = img.resize((new_width, new_height), Image.LANCZOS)

    # Paste the resized image at the top-left corner of a white canvas.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    paste_position = (0, 0)
    new_img.paste(img, paste_position)
    return new_img


instruction = ""  # description of the target UI element
# Instruction-first prompt; see the official repository for the reference template.
prompt = """<|user|>
The description of the element: 
{RE}

Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
<|image_1|>
<|end|>
<|assistant|>""".format(RE=instruction)

image_path = ""  # path to the screenshot
image = process_image(Image.open(image_path))
```

You can use the Hugging Face `transformers` library or [vLLM](https://github.com/vllm-project/vllm) for inference. For further details, including end-to-end examples and benchmark reproduction, please visit the [official GitHub repository](https://github.com/microsoft/Phi-Ground).

### Citation

If you find this work useful, please cite:

```bibtex
@article{zhang2025phi,
  title={Phi-Ground Tech Report: Advancing Perception in GUI Grounding},
  author={Zhang, Miaosen and Xu, Ziqiang and Zhu, Jialiang and Dai, Qi and Qiu, Kai and Yang, Yifan and Luo, Chong and Chen, Tianyi and Wagle, Justin and Franklin, Tim and others},
  journal={arXiv preprint arXiv:2507.23779},
  year={2025}
}
```
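
A note on the Input Preprocessing step: because `process_image` scales the screenshot and pads it onto a 1008x672 canvas, a predicted box refers to the padded canvas rather than the original screenshot. The sketch below shows one way to map such a box back to original pixel coordinates. It is not taken from the official repository: the helper name `box_to_original_pixels` is hypothetical, and the `(x1, y1, x2, y2)` ordering of the box is an assumption, since the card only states that the output is a bounding box in relative coordinates multiplied by 1000.

```python
TARGET_W, TARGET_H = 336 * 3, 336 * 2  # 1008 x 672 canvas used by process_image


def box_to_original_pixels(box, orig_width, orig_height):
    """Map a box in 0-1000 relative canvas coordinates to original-image pixels.

    Assumption: `box` is (x1, y1, x2, y2). Hypothetical helper, not part of
    the official code.
    """
    # Same fit-inside scale that process_image applies (ignoring its
    # sub-pixel int() truncation of the resized dimensions).
    if orig_width / orig_height > TARGET_W / TARGET_H:
        scale = TARGET_W / orig_width
    else:
        scale = TARGET_H / orig_height

    def to_x(v):
        # relative*1000 -> canvas pixels -> original pixels, clamped to the image
        return min(max(v / 1000 * TARGET_W / scale, 0.0), float(orig_width))

    def to_y(v):
        return min(max(v / 1000 * TARGET_H / scale, 0.0), float(orig_height))

    x1, y1, x2, y2 = box
    # The image is pasted at (0, 0), so no offset needs to be subtracted.
    return (to_x(x1), to_y(y1), to_x(x2), to_y(y2))


# Example: a 1920x1080 screenshot and a hypothetical predicted box.
print(box_to_original_pixels((500, 250, 1000, 500), 1920, 1080))
```

Coordinates that fall in the white padding are clamped to the image border, which is a design choice of this sketch; a caller may prefer to treat such predictions as invalid instead.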