File size: 5,910 Bytes
bd49f7d
 
a11346e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
bd49f7d
a11346e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
---
license: apache-2.0
language:
- en
base_model:
- ByteDance-Seed/UI-TARS-1.5-7B
pipeline_tag: image-text-to-text
tags:
- gui-agent
- computer-use
- multimodal
- vision-language
- qwen2_5_vl
- ui-tars
- robustness
- reinforcement-learning
- grpo
library_name: transformers
---

# AgentHijack-Agent

**AgentHijack-Agent** is the action-generation model released with the paper
[*AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions*](https://AgentHijack.github.io) (ICML 2026).

It is fine-tuned from [`UI-TARS-1.5-7B`](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) (Qwen2.5-VL architecture) using **Data-Augmented Group Relative Policy Optimization (DA-GRPO)** on the AgentHijack benchmark, with the goal of producing a computer-use agent that remains reliable under *common environment corruptions* (pop-ups, resolution changes, UI marks, subtitles, multi-apps, accidental touches, app minimization, network errors, and verification prompts).

The same checkpoint serves a dual role in the AgentHijack-Agent framework:

1. **Action generator** โ€” produces the next GUI action from screenshots + history.
2. **Onlooker** โ€” summarizes behavioral changes between consecutive screenshots and performs an initial environment check before execution.

- ๐Ÿ“„ **Paper:** *AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions* (ICML 2026)
- ๐ŸŒ **Project page:** https://AgentHijack.github.io
- ๐Ÿงฉ **Base model:** `ByteDance-Seed/UI-TARS-1.5-7B` (Qwen2.5-VL-7B architecture)
- ๐Ÿ›๏ธ **Affiliations:** TMLR Group, Hong Kong Baptist University

---

## Highlights

Compared with the base `UI-TARS-1.5-7B`, AgentHijack-Agent:

- **Improves average task success rate on the AgentHijack benchmark by +4.15%** (and a larger margin on UI-TARS-7B-DPO baseline).
- Maintains accurate grounding under **visual disruptors** (pop-ups, resolution change, marks, subtitle, multi-apps).
- Recovers from **unexpected operations** (accidental touch, app minimization) via behavioral summarization.
- Detects **environment errors** (network failure, login/verification prompts) up-front instead of looping on meaningless attempts.

See Table 2 and Figure 8 of the paper for full results and qualitative trajectories.

---

## Model details

| Field | Value |
|---|---|
| Architecture | `Qwen2_5_VLForConditionalGeneration` |
| Parameters | ~7B |
| Precision | `bfloat16` |
| Context length | 128k tokens |
| Image resolution | 1920 ร— 1080 (native, paper default) |
| Sharding | 4 ร— `safetensors` shards |
| Tokenizer | Inherited from UI-TARS-1.5-7B / Qwen2.5-VL |

### Training

- **Algorithm:** Data-Augmented GRPO (DA-GRPO), an extension of GRPO that rolls out the same instruction across *different corrupted environments* drawn from a corruption set `C`, instead of a single clean environment.
- **Framework:** [VERL](https://github.com/volcengine/verl).
- **Data:** 128 tasks sampled from the AgentHijack benchmark (built on top of OSWorld with 9 configurable corruption types, 3,321 tasks total).
- **Schedule:** 15 epochs.
- **Reward:** `r = r_success + r_format`, with an experience-replay buffer (following ARPO) to mitigate sparse-reward batches.
- **Optimization:** clip range [0.2, 0.3], KL loss disabled to encourage exploration.

---

## Usage

The model uses the standard Qwen2.5-VL / UI-TARS interface and is compatible with `transformers` and `vllm`.

### Action space

AgentHijack-Agent uses the same action space as UI-TARS-1.5-7B:

```
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='xxx')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait()
finished(content='xxx')
```

### Prompt template (action generator)

```
You are a GUI agent. You are given a task and your action history, with
screenshots. You need to perform the next action to complete the task.

## Output Format
```
Thought: ...
Action: ...
```

## Action Space

{action_space}

## Note
- Use {language} in `Thought` part.
- Write a small plan and finally summarize your next action (with its target
  element) in one sentence in `Thought` part.

## User Instruction
{instruction}
```

### Minimal inference example

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "<your-username>/AgentHijack-Agent"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat with screenshot(s) + the action-generator prompt above,
# then run model.generate(...) as usual.
```

For the full agent framework (action generator + onlooker + environment checking), please refer to the code at [AgentHijack.github.io](https://AgentHijack.github.io).

---

## Citation

If you use this model or the AgentHijack benchmark, please cite:

```bibtex
@inproceedings{sun2026agenthijack,
  title     = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions},
  author    = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=0H5Im3Xvuf}
}
```

---

## Acknowledgements

This model is built on top of [UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) and the [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) family, with training infrastructure based on [VERL](https://github.com/volcengine/verl). The benchmark environment extends [OSWorld](https://os-world.github.io/).