---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vista
- qwen3.5
- vision-language
- gui-grounding
- reinforcement-learning
base_model:
- Qwen3.5-4B
- Qwen3.5-9B
metrics:
- accuracy
---

# VISTA-9B

VISTA-9B is a GUI-grounding vision-language model trained from the Qwen3.5-9B backbone with **VISTA: View-Consistent Self-Verified Training for GUI Grounding**.

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Paper](https://img.shields.io/badge/Paper-PDF-red?logo=adobeacrobatreader&logoColor=white)](https://zjuscl.github.io/VISTA/static/pdfs/vista.pdf)
[![Website](https://img.shields.io/badge/🌐%20Website-VISTA-blue)](https://zjuscl.github.io/VISTA)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-green?logo=github)](https://github.com/ZJUSCL/VISTA)


## Model Description

VISTA-9B is a GUI-grounding model that maps a screenshot and a natural-language instruction to a click coordinate in the normalized `0-1000` image frame.

- **View-consistent GRPO training.** VISTA builds each GRPO comparison group from target-preserving views of the same GUI instance, with exact coordinate remapping across cropped views. This exposes localization behavior under semantically equivalent but geometrically different screenshots.
- **Self-verified cross-view anchoring.** The training objective adds an oracle-format center-point anchor only when model-generated rollouts have already produced a maximum-reward prediction, stabilizing short coordinate generation without unconditional imitation on all-fail groups.
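
The coordinate remapping across cropped views described in the first bullet comes down to simple arithmetic. The following is a minimal sketch, not the authors' implementation: it assumes axis-aligned rectangular crops given in pixels and a `0-1000` frame applied independently per axis. The names `remap_to_crop` and `crop_box` are hypothetical and used only for illustration.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def remap_to_crop(
    point_norm: Point,
    image_size: Tuple[int, int],
    crop_box: Box,
    scale: float = 1000.0,
) -> Optional[Point]:
    """Map a 0-1000 point in the full screenshot to the 0-1000 frame of a crop.

    Returns None when the point falls outside the crop, i.e. the crop is not
    target-preserving for this point. (Assumed convention, not from the paper.)
    """
    width, height = image_size
    left, top, right, bottom = crop_box

    # Normalized full-image coordinates -> pixel coordinates.
    px = point_norm[0] / scale * width
    py = point_norm[1] / scale * height

    # Reject crops that cut the target point away.
    if not (left <= px < right and top <= py < bottom):
        return None

    # Pixel coordinates -> normalized coordinates inside the crop.
    return (
        (px - left) / (right - left) * scale,
        (py - top) / (bottom - top) * scale,
    )


# The center of a 2000x1000 screenshot is also the center of a centered crop.
print(remap_to_crop((500, 500), (2000, 1000), (500, 250, 1500, 750)))  # (500.0, 500.0)
```

Returning `None` for crops that exclude the point is one way to keep only target-preserving views; a real pipeline would likely check the whole target element rather than a single point.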


## Evaluation

Accuracy is reported for GUI grounding. The model predicts a normalized coordinate in the `0-1000` frame, and a prediction counts as correct if the point lies inside the target element. All reported results use deterministic decoding at temperature 0 and single-view inference. In the table below, each Δ row is the VISTA score minus the GRPO score at the same model size, in accuracy points.
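
The correctness rule above can be sketched as a point-in-box test. This is an illustrative sketch rather than the official evaluation script: it assumes each target element is an axis-aligned box `(left, top, right, bottom)` expressed in the same `0-1000` frame as the prediction, treats box edges as inside, and counts unparseable outputs (`None`) as misses. The function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def is_hit(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(
    predictions: Sequence[Optional[Point]], boxes: Sequence[Box]
) -> float:
    """Percentage of predictions that land inside their target box."""
    if not predictions:
        return 0.0
    hits = sum(
        pred is not None and is_hit(pred, box)
        for pred, box in zip(predictions, boxes)
    )
    return 100.0 * hits / len(predictions)


preds = [(512, 384), (10, 10), None, (150, 150)]
targets = [
    (500, 370, 530, 400),
    (100, 100, 200, 200),
    (0, 0, 10, 10),
    (100, 100, 200, 200),
]
print(grounding_accuracy(preds, targets))  # 50.0
```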

### Results on GUI Grounding benchmarks

| Model | SSPro | SSV2 | OSWorld-G | OSWorld-G-R |
| --- | ---: | ---: | ---: | ---: |
| Qwen3.5-4B | 60.3 | 90.4 | 54.4 | 66.8 |
| GRPO-4B | 62.2 | 94.2 | 59.9 | 69.2 |
| **VISTA-4B** | **64.2** | 93.8 | **61.2** | **69.7** |
| Δ  | **+2.0** | -0.4 | **+1.3** | **+0.5** |
| Qwen3.5-9B | 65.2 | 91.9 | 63.1 | 74.6 |
| GRPO-9B | 68.3 | 95.2 | 67.5 | 75.2 |
| **VISTA-9B** | **69.2** | **95.8** | **68.1** | **75.5** |
| Δ  | **+0.9** | **+0.6** | **+0.6** | **+0.3** |
| Qwen3.5-35B-A3B | 68.6 | 93.8 | 65.8 | 72.5 |
| GRPO-35B-A3B | 71.7 | 95.7 | 70.4 | 74.3 |
| **VISTA-35B-A3B** | **72.9** | **95.8** | **71.5** | **75.3** |
| Δ  | **+1.2** | **+0.1** | **+1.1** | **+1.0** |

## Quick Start

Use the same image-chat interface as the underlying Qwen3.5 vision-language model. The recommended prompt is:

```text
Output the center point of the position corresponding to the instruction: {instruction}. The output should just be the coordinates of a point, in the format [x,y].
```

Example:

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "inclusionAI/VISTA-9B"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the search button"
prompt = (
    "Output the center point of the position corresponding to the instruction: "
    f"{instruction}. The output should just be the coordinates of a point, "
    "in the format [x,y]."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

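# Render the chat template to a prompt string; the image is passed separately below.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(
    text=[text],
    images=[image],
    padding=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding (do_sample=False); the answer is a short coordinate string.
generated = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
)
# Drop the prompt tokens so that only the newly generated answer is decoded.
new_tokens = generated[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
print(response)  # e.g. [512,384]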
```
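
Because the model answers in the normalized `0-1000` frame, the response usually has to be converted back to pixels before it can drive a click. The following is a minimal sketch under stated assumptions: the response contains a point in the `[x,y]` format requested by the prompt, and the `0-1000` frame maps linearly onto the image width and height. The helpers `parse_point` and `to_pixel` are hypothetical and not part of the released code.

```python
import re
from typing import Optional, Tuple

# Matches "[x,y]" with optional spaces and optional decimal parts.
_POINT_RE = re.compile(r"\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]")


def parse_point(response: str) -> Optional[Tuple[float, float]]:
    """Extract the first [x,y] pair from the model response, or None."""
    match = _POINT_RE.search(response)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def to_pixel(
    point: Tuple[float, float],
    image_size: Tuple[int, int],
    scale: float = 1000.0,
) -> Tuple[int, int]:
    """Convert a 0-1000 point to integer pixel coordinates, clamped to the image."""
    width, height = image_size
    px = round(point[0] / scale * width)
    py = round(point[1] / scale * height)
    # Clamp so that a prediction of exactly 1000 still indexes a valid pixel.
    return min(max(px, 0), width - 1), min(max(py, 0), height - 1)


point = parse_point("[512,384]")
print(point)                          # (512.0, 384.0)
print(to_pixel(point, (1920, 1080)))  # (983, 415)
```

With the Quick Start variables, this would be called as `to_pixel(parse_point(response), image.size)`, since `PIL.Image.size` is a `(width, height)` tuple; check for a `None` result from `parse_point` first.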

## Citation
Please consider citing our work if you find it useful:
```bibtex
@misc{qiu2026vista,
      title={VISTA: View-Consistent Self-Verified Training for GUI Grounding},
      author={Xinyu Qiu and Yunzhu Zhang and Heng Jia and Shuheng Shen and Changhua Meng and Linchao Zhu},
      year={2026},
      eprint={2606.14579},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.14579},
}
```