Phi-Ground / README.md

Add pipeline_tag, library_name and improve model card

e432222 verified about 2 months ago

3.63 kB

	---
	base_model:
	- microsoft/Phi-3.5-vision-instruct
	license: mit
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- GUI
	- Agent
	- Grounding
	- CUA
	---

	# Microsoft Phi-Ground-4B-7C

	<p align="center">
	<a href="https://microsoft.github.io/Phi-Ground/" target="_blank">🤖 HomePage</a> \| <a href="https://huggingface.co/papers/2507.23779" target="_blank">📄 Paper </a> \| <a href="https://arxiv.org/abs/2507.23779" target="_blank">📄 Arxiv </a> \| <a href="https://huggingface.co/microsoft/Phi-Ground" target="_blank"> 😊 Model </a> \| <a href="https://github.com/microsoft/Phi-Ground/tree/main/benchmark/new_annotations" target="_blank"> 😊 Eval data </a>
	</p>

	![overview](docs/images/abstract.png)

	Phi-Ground-4B-7C is a member of the Phi-Ground model family, introduced in the technical report [Phi-Ground Tech Report: Advancing Perception in GUI Grounding](https://huggingface.co/papers/2507.23779). It is fine-tuned from [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) with a fixed input resolution of 1008x672.

	The Phi-Ground model family achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, this model achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision.

	### Main results

	![overview](docs/images/r1.png)

	### Usage
	The current `transformers` version can be verified with: `pip list \| grep transformers`.

	Examples of required packages:
	```
	flash_attn==2.5.8
	numpy==1.24.4
	Pillow==10.3.0
	Requests==2.31.0
	torch==2.3.0
	torchvision==0.18.0
	transformers==4.43.0
	accelerate==0.30.0
	```


	### Input Formats

	The model requires a strict input format including fixed image resolution, instruction-first order and system prompt.

	Input Preprocessing

	```python
	from PIL import Image
	def process_image(img):

	target_width, target_height = 336 * 3, 336 * 2

	img_ratio = img.width / img.height
	target_ratio = target_width / target_height

	if img_ratio > target_ratio:
	new_width = target_width
	new_height = int(new_width / img_ratio)
	else:
	new_height = target_height
	new_width = int(new_height * img_ratio)
	reshape_ratio = new_width / img.width

	img = img.resize((new_width, new_height), Image.LANCZOS)
	new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
	paste_position = (0, 0)
	new_img.paste(img, paste_position)
	return new_img

	instruction = "<your instruction>"
	prompt = """<\|user\|>
	The description of the element:
	{RE}

	Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
	<\|image_1\|>
	<\|end\|>
	<\|assistant\|>""".format(RE=instruction)

	image_path = "<your image path>"
	image = process_image(Image.open(image_path))
	```

	You can use the Hugging Face `transformers` library or [vLLM](https://github.com/vllm-project/vllm) for inference. For further details, including end-to-end examples and benchmark reproduction, please visit the [official GitHub repository](https://github.com/microsoft/Phi-Ground).

	### Citation
	If you find this work useful, please cite:
	```bibtex
	@article{zhang2025phi,
	title={Phi-Ground Tech Report: Advancing Perception in GUI Grounding},
	author={Zhang, Miaosen and Xu, Ziqiang and Zhu, Jialiang and Dai, Qi and Qiu, Kai and Yang, Yifan and Luo, Chong and Chen, Tianyi and Wagle, Justin and Franklin, Tim and others},
	journal={arXiv preprint arXiv:2507.23779},
	year={2025}
	}
	```