---
license: apache-2.0
language:
- en
base_model:
- ByteDance-Seed/UI-TARS-1.5-7B
pipeline_tag: image-text-to-text
tags:
- gui-agent
- computer-use
- multimodal
- vision-language
- qwen2_5_vl
- ui-tars
- robustness
- reinforcement-learning
- grpo
library_name: transformers
---

# AgentHijack-Agent

**AgentHijack-Agent** is the action-generation model released with the paper [*AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions*](https://AgentHijack.github.io) (ICML 2026). It is fine-tuned from [`UI-TARS-1.5-7B`](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) (Qwen2.5-VL architecture) using **Data-Augmented Group Relative Policy Optimization (DA-GRPO)** on the AgentHijack benchmark. The goal is a computer-use agent that remains reliable under *common environment corruptions*: pop-ups, resolution changes, UI marks, subtitles, multi-apps, accidental touches, app minimization, network errors, and verification prompts.

The same checkpoint serves a dual role in the AgentHijack-Agent framework:

1. **Action generator:** produces the next GUI action from screenshots and action history.
2. **Onlooker:** summarizes behavioral changes between consecutive screenshots and performs an initial environment check before execution.

- 📄 **Paper:** *AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions* (ICML 2026)
- 🌐 **Project page:** https://AgentHijack.github.io
- 🧩 **Base model:** `ByteDance-Seed/UI-TARS-1.5-7B` (Qwen2.5-VL-7B architecture)
- 🏛️ **Affiliations:** TMLR Group, Hong Kong Baptist University

---

## Highlights

Compared with the base `UI-TARS-1.5-7B`, AgentHijack-Agent:

- **Improves average task success rate on the AgentHijack benchmark by +4.15%** (and by a larger margin over the UI-TARS-7B-DPO baseline).
- Maintains accurate grounding under **visual disruptors** (pop-ups, resolution change, marks, subtitle, multi-apps).
- Recovers from **unexpected operations** (accidental touch, app minimization) via behavioral summarization.
- Detects **environment errors** (network failure, login/verification prompts) up-front instead of looping on meaningless attempts.

See Table 2 and Figure 8 of the paper for full results and qualitative trajectories.

---

## Model details

| Field | Value |
|---|---|
| Architecture | `Qwen2_5_VLForConditionalGeneration` |
| Parameters | ~7B |
| Precision | `bfloat16` |
| Context length | 128k tokens |
| Image resolution | 1920 × 1080 (native, paper default) |
| Sharding | 4 × `safetensors` shards |
| Tokenizer | Inherited from UI-TARS-1.5-7B / Qwen2.5-VL |

### Training

- **Algorithm:** Data-Augmented GRPO (DA-GRPO), an extension of GRPO that rolls out the same instruction across *different corrupted environments* drawn from a corruption set `C`, instead of a single clean environment.
- **Framework:** [VERL](https://github.com/volcengine/verl).
- **Data:** 128 tasks sampled from the AgentHijack benchmark (built on top of OSWorld with 9 configurable corruption types, 3,321 tasks total).
- **Schedule:** 15 epochs.
- **Reward:** `r = r_success + r_format`, with an experience-replay buffer (following ARPO) to mitigate sparse-reward batches.
- **Optimization:** clip range [0.2, 0.3], KL loss disabled to encourage exploration.

---

## Usage

The model uses the standard Qwen2.5-VL / UI-TARS interface and is compatible with `transformers` and `vllm`.

### Action space

AgentHijack-Agent uses the same action space as UI-TARS-1.5-7B:

```
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='xxx')
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait()
finished(content='xxx')
```

### Prompt template (action generator)

````
You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

## Output Format
```
Thought: ...
Action: ...
```

## Action Space
{action_space}

## Note
- Use {language} in `Thought` part.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.

## User Instruction
{instruction}
````

### Minimal inference example

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "/AgentHijack-Agent"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat with screenshot(s) + the action-generator prompt above,
# then run model.generate(...) as usual.
```

For the full agent framework (action generator + onlooker + environment checking), please refer to the code at [AgentHijack.github.io](https://AgentHijack.github.io).
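
A reply in the `Thought: ... Action: ...` format has to be turned into a structured call before an executor can act on it. The sketch below shows one way to do that with regular expressions. `ParsedStep`, `parse_step`, and `box_to_point` are hypothetical names introduced for illustration and are not part of the released code. The sketch assumes single-quoted argument values exactly as in the action space listing, and it does not rescale coordinates, since this card does not specify the coordinate convention.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ParsedStep:
    """Hypothetical container for one parsed model reply."""
    thought: str
    action: str            # action name, e.g. "click", "type", "finished"
    args: Dict[str, str]   # raw keyword arguments, kept as strings


# "Thought: ... Action: ..." as requested by the prompt's Output Format.
# Assumption: the text "Action:" does not occur inside the thought itself.
_STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<call>.*)", re.DOTALL
)
# name(arg='value', ...) as written in the action space listing.
_CALL_RE = re.compile(r"^(?P<name>\w+)\((?P<body>.*)\)$", re.DOTALL)
# key='value' pairs; backslash escapes inside the value are left untouched.
_ARG_RE = re.compile(r"(\w+)='((?:[^'\\]|\\.)*)'")
# "(x,y)" inside a <|box_start|>...<|box_end|> span.
_POINT_RE = re.compile(r"\((\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?)\)")


def parse_step(text: str) -> ParsedStep:
    """Split a reply into its thought, action name, and keyword arguments."""
    step = _STEP_RE.search(text)
    if step is None:
        raise ValueError("reply does not match 'Thought: ... Action: ...'")
    call = _CALL_RE.match(step.group("call").strip())
    if call is None:
        raise ValueError("action is not of the form name(arg='value', ...)")
    args = dict(_ARG_RE.findall(call.group("body")))
    return ParsedStep(step.group("thought"), call.group("name"), args)


def box_to_point(box: str) -> Optional[Tuple[float, float]]:
    """Extract (x, y) from a box string, or None if no point is present."""
    m = _POINT_RE.search(box)
    return (float(m.group(1)), float(m.group(2))) if m else None


if __name__ == "__main__":
    reply = (
        "Thought: A dialog covers the page, so I will close it first.\n"
        "Action: click(start_box='<|box_start|>(412,96)<|box_end|>')"
    )
    parsed = parse_step(reply)
    print(parsed.action, box_to_point(parsed.args["start_box"]))
```

Keeping the arguments as raw strings leaves the choice of coordinate scaling and key-name mapping to whatever executor consumes the parsed step.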

---

## Citation

If you use this model or the AgentHijack benchmark, please cite:

```bibtex
@inproceedings{sun2026agenthijack,
  title     = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions},
  author    = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=0H5Im3Xvuf}
}
```

---

## Acknowledgements

This model is built on top of [UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) and the [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) family, with training infrastructure based on [VERL](https://github.com/volcengine/verl). The benchmark environment extends [OSWorld](https://os-world.github.io/).