Improve model card: Add project page link and evaluation section, update citation
#2
by nielsr HF Staff - opened
README.md
CHANGED
|
@@ -18,6 +18,7 @@ tags:
|
|
| 18 |
This repository contains the InfiGUI-G1-3B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731)**.
|
| 19 |
|
| 20 |
[](https://github.com/InfiXAI/InfiGUI-G1)
|
|
|
|
| 21 |
|
| 22 |
## Paper Abstract
|
| 23 |
|
|
@@ -217,7 +218,92 @@ On the widely-used ScreenSpot-V2 benchmark, which provides comprehensive coverag
|
|
| 217 |
<img src="https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/results_screenspot-v2.png" width="90%" alt="ScreenSpot-V2 Results">
|
| 218 |
</div>
|
| 219 |
|
| 220 |
-
##
| 221 |
|
| 222 |
If you find this work useful, we would be grateful if you consider citing the following papers:
|
| 223 |
|
|
@@ -245,8 +331,12 @@ If you find this work useful, we would be grateful if you consider citing the fo
|
|
| 245 |
```bibtex
|
| 246 |
@article{liu2025infiguiagent,
|
| 247 |
title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
|
| 248 |
-
author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and
|
| 249 |
journal={arXiv preprint arXiv:2501.04575},
|
| 250 |
year={2025}
|
| 251 |
}
|
| 252 |
-
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 18 |
This repository contains the InfiGUI-G1-3B model from the paper **[InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization](https://arxiv.org/abs/2508.05731)**.
|
| 19 |
|
| 20 |
[](https://github.com/InfiXAI/InfiGUI-G1)
|
| 21 |
+
[](https://osatlas.github.io/)
|
| 22 |
|
| 23 |
## Paper Abstract
|
| 24 |
|
|
|
|
| 218 |
<img src="https://raw.githubusercontent.com/InfiXAI/InfiGUI-G1/main/assets/results_screenspot-v2.png" width="90%" alt="ScreenSpot-V2 Results">
|
| 219 |
</div>
|
| 220 |
|
| 221 |
+
## ⚙️ Evaluation
|
| 222 |
+
|
| 223 |
+
This section provides instructions for reproducing the evaluation results reported in our paper.
|
| 224 |
+
|
| 225 |
+
### 1. Getting Started
|
| 226 |
+
|
| 227 |
+
Clone the repository and navigate to the project directory:
|
| 228 |
+
|
| 229 |
+
```bash
|
| 230 |
+
git clone https://github.com/InfiXAI/InfiGUI-G1.git
|
| 231 |
+
cd InfiGUI-G1
|
| 232 |
+
```
|
| 233 |
+
|
| 234 |
+
### 2. Environment Setup
|
| 235 |
+
|
| 236 |
+
The evaluation pipeline is built upon the [vLLM](https://github.com/vllm-project/vllm) library for efficient inference. For detailed installation guidance, please refer to the official vLLM repository. The specific versions used to obtain the results reported in our paper are as follows:
|
| 237 |
+
|
| 238 |
+
- **Python**: `3.10.12`
|
| 239 |
+
- **PyTorch**: `2.6.0`
|
| 240 |
+
- **Transformers**: `4.50.1`
|
| 241 |
+
- **vLLM**: `0.8.2`
|
| 242 |
+
- **CUDA**: `12.6`
|
| 243 |
+
|
| 244 |
+
The reported results were obtained on a server equipped with 4 x NVIDIA H800 GPUs.
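
As a quick sanity check, the versions listed above can be compared with what is installed locally. The following is a minimal sketch and not part of the repository: the helper names `installed_version` and `check_pin` are hypothetical, and it assumes `python3` is on the `PATH` and that the PyPI distribution names are `torch`, `transformers`, and `vllm`. It only reports; it does not install or change anything.

```bash
# Hypothetical helpers (not shipped with InfiGUI-G1): report whether the
# locally installed packages match the versions listed in this section.

installed_version() {
  # Print the installed version of a Python distribution, or nothing if absent.
  python3 -c 'import sys, importlib.metadata as m
try:
    print(m.version(sys.argv[1]))
except m.PackageNotFoundError:
    pass' "$1" 2>/dev/null
}

check_pin() {
  # $1 = distribution name, $2 = version reported in the paper
  actual="$(installed_version "$1" || true)"
  if [ -z "$actual" ]; then
    echo "$1: not installed (paper used $2)"
  elif [ "${actual%%+*}" = "$2" ]; then
    # "%%+*" drops a local build suffix such as "+cu126" before comparing
    echo "$1: $actual (matches)"
  else
    echo "$1: $actual (paper used $2)"
  fi
}

check_pin torch 2.6.0
check_pin transformers 4.50.1
check_pin vllm 0.8.2
```

A mismatch does not necessarily break the evaluation; it only means the numbers may not be directly comparable with the reported ones.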
|
| 245 |
+
|
| 246 |
+
### 3. Model Download
|
| 247 |
+
|
| 248 |
+
Download the InfiGUI-G1 models from the Hugging Face Hub into the `./models` directory.
|
| 249 |
+
|
| 250 |
+
```bash
|
| 251 |
+
# Create a directory for models
|
| 252 |
+
mkdir -p ./models
|
| 253 |
+
|
| 254 |
+
# Download InfiGUI-G1-3B
|
| 255 |
+
huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-3B --local-dir ./models/InfiGUI-G1-3B
|
| 256 |
+
|
| 257 |
+
# Download InfiGUI-G1-7B
|
| 258 |
+
huggingface-cli download --resume-download InfiX-ai/InfiGUI-G1-7B --local-dir ./models/InfiGUI-G1-7B
|
| 259 |
+
```
|
| 260 |
+
|
| 261 |
+
### 4. Dataset Download and Preparation
|
| 262 |
+
|
| 263 |
+
Download the required evaluation benchmarks into the `./data` directory.
|
| 264 |
+
|
| 265 |
+
```bash
|
| 266 |
+
# Create a directory for datasets
|
| 267 |
+
mkdir -p ./data
|
| 268 |
+
|
| 269 |
+
# Download benchmarks
|
| 270 |
+
huggingface-cli download --repo-type dataset --resume-download likaixin/ScreenSpot-Pro --local-dir ./data/ScreenSpot-Pro
|
| 271 |
+
huggingface-cli download --repo-type dataset --resume-download ServiceNow/ui-vision --local-dir ./data/ui-vision
|
| 272 |
+
huggingface-cli download --repo-type dataset --resume-download OS-Copilot/ScreenSpot-v2 --local-dir ./data/ScreenSpot-v2
|
| 273 |
+
huggingface-cli download --repo-type dataset --resume-download OpenGVLab/MMBench-GUI --local-dir ./data/MMBench-GUI
|
| 274 |
+
huggingface-cli download --repo-type dataset --resume-download vaundys/I2E-Bench --local-dir ./data/I2E-Bench
|
| 275 |
+
```
|
| 276 |
+
|
| 277 |
+
After downloading, some datasets require unzipping compressed image files.
|
| 278 |
+
|
| 279 |
+
```bash
|
| 280 |
+
# Unzip images for ScreenSpot-v2
|
| 281 |
+
unzip ./data/ScreenSpot-v2/screenspotv2_image.zip -d ./data/ScreenSpot-v2/
|
| 282 |
+
|
| 283 |
+
# Unzip images for MMBench-GUI
|
| 284 |
+
unzip ./data/MMBench-GUI/MMBench-GUI-OfflineImages.zip -d ./data/MMBench-GUI/
|
| 285 |
+
```
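
Before starting a long evaluation run, it can help to confirm that the download targets used above exist. This is a minimal sketch under stated assumptions: `check_dirs` is a hypothetical helper, and it only tests that each directory is present, not that its contents are complete or that the archives were unzipped correctly.

```bash
# Hypothetical helper (not part of the repository): list which of the
# expected directories exist under a base directory.

check_dirs() {
  # $1 = base directory, remaining arguments = expected subdirectory names.
  # Returns the number of missing directories as the exit status.
  base="$1"
  shift
  missing=0
  for name in "$@"; do
    if [ -d "$base/$name" ]; then
      echo "ok $base/$name"
    else
      echo "missing $base/$name"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

# Directory names are taken from the --local-dir arguments used above.
check_dirs ./models InfiGUI-G1-3B InfiGUI-G1-7B || true
check_dirs ./data ScreenSpot-Pro ui-vision ScreenSpot-v2 MMBench-GUI I2E-Bench || true
```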
|
| 286 |
+
|
| 287 |
+
### 5. Running the Evaluation
|
| 288 |
+
|
| 289 |
+
To run the evaluation, use the `eval/eval.py` script. You must specify the path to the model, the benchmark name, and the tensor parallel size.
|
| 290 |
+
|
| 291 |
+
Here is an example command to evaluate the `InfiGUI-G1-3B` model on the `screenspot-pro` benchmark using 4 GPUs:
|
| 292 |
+
|
| 293 |
+
```bash
|
| 294 |
+
python eval/eval.py \
|
| 295 |
+
./models/InfiGUI-G1-3B \
|
| 296 |
+
--benchmark screenspot-pro \
|
| 297 |
+
--tensor-parallel 4
|
| 298 |
+
```
|
| 299 |
+
|
| 300 |
+
- **`model_path`**: The first positional argument specifies the path to the downloaded model directory (e.g., `./models/InfiGUI-G1-3B`).
|
| 301 |
+
- **`--benchmark`**: Specifies the benchmark to evaluate. Available options include `screenspot-pro`, `screenspot-v2`, `ui-vision`, `mmbench-gui`, and `i2e-bench`.
|
| 302 |
+
- **`--tensor-parallel`**: Sets the tensor parallelism size, which should typically match the number of available GPUs.
|
| 303 |
+
|
| 304 |
+
Evaluation results, including detailed logs and performance metrics, will be saved to the `./output/{model_name}/{benchmark}/` directory.
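
To evaluate one model on every benchmark listed above, the example command can be wrapped in a loop. The following is a sketch rather than a script from the repository: the variables `MODEL_PATH`, `TP_SIZE`, and `DRY_RUN` and the function `run_benchmark` are hypothetical names introduced here. `DRY_RUN` defaults to `1`, so the script only prints the commands it would run; set `DRY_RUN=0` to execute them.

```bash
# Hypothetical wrapper around the eval/eval.py invocation shown above.
MODEL_PATH="${MODEL_PATH:-./models/InfiGUI-G1-3B}"
TP_SIZE="${TP_SIZE:-4}"
DRY_RUN="${DRY_RUN:-1}"

# Benchmark names as listed for the --benchmark option.
BENCHMARKS="screenspot-pro screenspot-v2 ui-vision mmbench-gui i2e-bench"

run_benchmark() {
  # $1 = benchmark name; rebuild the argument list as the full command.
  set -- python eval/eval.py "$MODEL_PATH" --benchmark "$1" --tensor-parallel "$TP_SIZE"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"   # print the command instead of running it
  else
    "$@"
  fi
}

for benchmark in $BENCHMARKS; do
  # Keep going if one benchmark fails, but report it on stderr.
  run_benchmark "$benchmark" || echo "benchmark $benchmark failed" >&2
done
```

Running the benchmarks sequentially keeps all GPUs available to each run, which matches the single-command usage shown above.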
|
| 305 |
+
|
| 306 |
+
## 📚 Citation Information
|
| 307 |
|
| 308 |
If you find this work useful, we would be grateful if you consider citing the following papers:
|
| 309 |
|
|
|
|
| 331 |
```bibtex
|
| 332 |
@article{liu2025infiguiagent,
|
| 333 |
title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
|
| 334 |
+
author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
|
| 335 |
journal={arXiv preprint arXiv:2501.04575},
|
| 336 |
year={2025}
|
| 337 |
}
|
| 338 |
+
```
|
| 339 |
+
|
| 340 |
+
## 🙏 Acknowledgements
|
| 341 |
+
|
| 342 |
+
We would like to express our gratitude to the following open-source projects: [VERL](https://github.com/volcengine/verl), [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), and [vLLM](https://github.com/vllm-project/vllm).
|