base_model:
  - microsoft/Phi-3.5-vision-instruct
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - GUI
  - Agent
  - Grounding
  - CUA

Microsoft Phi-Ground-4B-7C

🤖 HomePage | 📄 Paper | 📄 Arxiv | 😊 Model | 😊 Eval data


Phi-Ground-4B-7C is a member of the Phi-Ground model family, introduced in the technical report Phi-Ground Tech Report: Advancing Perception in GUI Grounding. It is fine-tuned from microsoft/Phi-3.5-vision-instruct with a fixed input resolution of 1008x672.

In agent settings, the Phi-Ground model family achieves state-of-the-art performance on all five grounding benchmarks among models under 10B parameters. In the end-to-end model setting, this model achieves state-of-the-art results, scoring 43.2 on ScreenSpot-Pro and 27.2 on UI-Vision.

Main results

[Figure: overview of main results]

Usage

The current transformers version can be verified with: pip list | grep transformers.

Examples of required packages:

flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
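
As an alternative to `pip list | grep transformers`, the installed versions can be compared against the list above from Python. The following is a minimal sketch, not part of the official instructions: the `PINNED` dictionary and the `check_pins` helper are hypothetical names introduced here, and the lookup assumes the distribution names match the spelling used in the list. Because the list gives examples of required packages, a mismatch is a hint to investigate rather than a definite error.

```python
from importlib.metadata import PackageNotFoundError, version

# Example versions copied from the list above (hypothetical helper data).
PINNED = {
    "flash_attn": "2.5.8",
    "numpy": "1.24.4",
    "Pillow": "10.3.0",
    "Requests": "2.31.0",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "transformers": "4.43.0",
    "accelerate": "0.30.0",
}


def check_pins(pins):
    """Map each package name to (expected, installed, matches).

    `installed` is None when the package cannot be found.
    """
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (expected, installed, installed == expected)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "differs"
        print(f"{name}: expected {expected}, installed {installed} ({status})")
```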

Input Formats

The model requires a strict input format: a fixed image resolution, instruction-first ordering, and a system prompt.

Input Preprocessing

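from PIL import Image


def process_image(img):
    """Resize an image to fit 1008x672, then pad it onto a white canvas."""
    # Fixed model input resolution: 336 * 3 by 336 * 2 = 1008x672.
    target_width, target_height = 336 * 3, 336 * 2

    img_ratio = img.width / img.height
    target_ratio = target_width / target_height

    # Scale along the limiting dimension so the aspect ratio is preserved.
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    # Scale factor from original to resized pixels; not used below.
    reshape_ratio = new_width / img.width

    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Paste the resized image at the top-left corner; the rest stays white.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    new_img.paste(img, (0, 0))
    return new_img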

instruction = "<your instruction>"
prompt = """<|user|>
The description of the element: 
{RE}

Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
<|image_1|>
<|end|>
<|assistant|>""".format(RE=instruction)

image_path = "<your image path>"
image = process_image(Image.open(image_path))
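
Because the image is resized and padded before inference, a predicted box has to be mapped back to pixel coordinates of the original screenshot. The sketch below is one possible way to do that and rests on assumptions that are not stated in this card: that the box has already been parsed into four numbers ordered (x1, y1, x2, y2), and that the values are relative to the full padded 1008x672 canvas, multiplied by 1000. The helper names `resized_size` and `to_original_pixels` are hypothetical; the official GitHub repository remains the reference for the actual output format.

```python
TARGET_W, TARGET_H = 336 * 3, 336 * 2  # 1008x672, as in process_image


def resized_size(orig_w, orig_h):
    """Reproduce the resize arithmetic of process_image for a given image size."""
    img_ratio = orig_w / orig_h
    target_ratio = TARGET_W / TARGET_H
    if img_ratio > target_ratio:
        new_w = TARGET_W
        new_h = int(new_w / img_ratio)
    else:
        new_h = TARGET_H
        new_w = int(new_h * img_ratio)
    return new_w, new_h


def to_original_pixels(box_1000, orig_w, orig_h):
    """Map an (x1, y1, x2, y2) box in thousandths of the padded canvas
    to pixel coordinates of the original image (assumed convention)."""
    new_w, new_h = resized_size(orig_w, orig_h)
    scale_x = orig_w / new_w  # resized pixels -> original pixels
    scale_y = orig_h / new_h
    x1, y1, x2, y2 = box_1000

    def clamp(value, limit):
        # Predictions that fall in the white padding are clipped to the image.
        return min(max(value, 0.0), float(limit))

    return (
        clamp(x1 / 1000 * TARGET_W * scale_x, orig_w),
        clamp(y1 / 1000 * TARGET_H * scale_y, orig_h),
        clamp(x2 / 1000 * TARGET_W * scale_x, orig_w),
        clamp(y2 / 1000 * TARGET_H * scale_y, orig_h),
    )
```

For example, under these assumptions a 2016x1344 screenshot is scaled by 0.5 to fill the canvas exactly, so `to_original_pixels((250, 500, 500, 750), 2016, 1344)` yields a box spanning x from 504 to 1008 and y from 672 to 1008 in the original image.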

You can use the Hugging Face transformers library or vLLM for inference. For further details, including end-to-end examples and benchmark reproduction, please visit the official GitHub repository.

Citation

If you find this work useful, please cite:

@article{zhang2025phi,
  title={Phi-Ground Tech Report: Advancing Perception in GUI Grounding},
  author={Zhang, Miaosen and Xu, Ziqiang and Zhu, Jialiang and Dai, Qi and Qiu, Kai and Yang, Yifan and Luo, Chong and Chen, Tianyi and Wagle, Justin and Franklin, Tim and others},
  journal={arXiv preprint arXiv:2507.23779},
  year={2025}
}