AgentHijack-Agent / README.md
Superjw's picture
Add files using upload-large-folder tool
a11346e verified
---
license: apache-2.0
language:
- en
base_model:
- ByteDance-Seed/UI-TARS-1.5-7B
pipeline_tag: image-text-to-text
tags:
- gui-agent
- computer-use
- multimodal
- vision-language
- qwen2_5_vl
- ui-tars
- robustness
- reinforcement-learning
- grpo
library_name: transformers
---
# AgentHijack-Agent
**AgentHijack-Agent** is the action-generation model released with the paper
[*AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions*](https://AgentHijack.github.io) (ICML 2026).
It is fine-tuned from [`UI-TARS-1.5-7B`](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) (Qwen2.5-VL architecture) using **Data-Augmented Group Relative Policy Optimization (DA-GRPO)** on the AgentHijack benchmark, with the goal of producing a computer-use agent that remains reliable under *common environment corruptions* (pop-ups, resolution changes, UI marks, subtitles, multi-apps, accidental touches, app minimization, network errors, and verification prompts).
The same checkpoint serves a dual role in the AgentHijack-Agent framework:
1. **Action generator** โ€” produces the next GUI action from screenshots + history.
2. **Onlooker** โ€” summarizes behavioral changes between consecutive screenshots and performs an initial environment check before execution.
- ๐Ÿ“„ **Paper:** *AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions* (ICML 2026)
- ๐ŸŒ **Project page:** https://AgentHijack.github.io
- ๐Ÿงฉ **Base model:** `ByteDance-Seed/UI-TARS-1.5-7B` (Qwen2.5-VL-7B architecture)
- ๐Ÿ›๏ธ **Affiliations:** TMLR Group, Hong Kong Baptist University
---
## Highlights
Compared with the base `UI-TARS-1.5-7B`, AgentHijack-Agent:
- **Improves average task success rate on the AgentHijack benchmark by +4.15%** (and a larger margin on UI-TARS-7B-DPO baseline).
- Maintains accurate grounding under **visual disruptors** (pop-ups, resolution change, marks, subtitle, multi-apps).
- Recovers from **unexpected operations** (accidental touch, app minimization) via behavioral summarization.
- Detects **environment errors** (network failure, login/verification prompts) up-front instead of looping on meaningless attempts.
See Table 2 and Figure 8 of the paper for full results and qualitative trajectories.
---
## Model details
| Field | Value |
|---|---|
| Architecture | `Qwen2_5_VLForConditionalGeneration` |
| Parameters | ~7B |
| Precision | `bfloat16` |
| Context length | 128k tokens |
| Image resolution | 1920 ร— 1080 (native, paper default) |
| Sharding | 4 ร— `safetensors` shards |
| Tokenizer | Inherited from UI-TARS-1.5-7B / Qwen2.5-VL |
### Training
- **Algorithm:** Data-Augmented GRPO (DA-GRPO), an extension of GRPO that rolls out the same instruction across *different corrupted environments* drawn from a corruption set `C`, instead of a single clean environment.
- **Framework:** [VERL](https://github.com/volcengine/verl).
- **Data:** 128 tasks sampled from the AgentHijack benchmark (built on top of OSWorld with 9 configurable corruption types, 3,321 tasks total).
- **Schedule:** 15 epochs.
- **Reward:** `r = r_success + r_format`, with an experience-replay buffer (following ARPO) to mitigate sparse-reward batches.
- **Optimization:** clip range [0.2, 0.3], KL loss disabled to encourage exploration.
---
## Usage
The model uses the standard Qwen2.5-VL / UI-TARS interface and is compatible with `transformers` and `vllm`.
### Action space
AgentHijack-Agent uses the same action space as UI-TARS-1.5-7B:
```
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='xxx')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait()
finished(content='xxx')
```
### Prompt template (action generator)
```
You are a GUI agent. You are given a task and your action history, with
screenshots. You need to perform the next action to complete the task.
## Output Format
```
Thought: ...
Action: ...
```
## Action Space
{action_space}
## Note
- Use {language} in `Thought` part.
- Write a small plan and finally summarize your next action (with its target
element) in one sentence in `Thought` part.
## User Instruction
{instruction}
```
### Minimal inference example
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
model_id = "<your-username>/AgentHijack-Agent"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
# Build a chat with screenshot(s) + the action-generator prompt above,
# then run model.generate(...) as usual.
```
For the full agent framework (action generator + onlooker + environment checking), please refer to the code at [AgentHijack.github.io](https://AgentHijack.github.io).
---
## Citation
If you use this model or the AgentHijack benchmark, please cite:
```bibtex
@inproceedings{sun2026agenthijack,
title = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions},
author = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han},
booktitle = {Forty-third International Conference on Machine Learning},
year = {2026},
url = {https://openreview.net/forum?id=0H5Im3Xvuf}
}
```
---
## Acknowledgements
This model is built on top of [UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) and the [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) family, with training infrastructure based on [VERL](https://github.com/volcengine/verl). The benchmark environment extends [OSWorld](https://os-world.github.io/).