| license: apache-2.0 | |
| language: | |
| - en | |
| base_model: | |
| - ByteDance-Seed/UI-TARS-1.5-7B | |
| pipeline_tag: image-text-to-text | |
| tags: | |
| - gui-agent | |
| - computer-use | |
| - multimodal | |
| - vision-language | |
| - qwen2_5_vl | |
| - ui-tars | |
| - robustness | |
| - reinforcement-learning | |
| - grpo | |
| library_name: transformers | |
| # AgentHijack-Agent | |
| **AgentHijack-Agent** is the action-generation model released with the paper | |
| [*AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions*](https://AgentHijack.github.io) (ICML 2026). | |
| It is fine-tuned from [`UI-TARS-1.5-7B`](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) (Qwen2.5-VL architecture) using **Data-Augmented Group Relative Policy Optimization (DA-GRPO)** on the AgentHijack benchmark, with the goal of producing a computer-use agent that remains reliable under *common environment corruptions* (pop-ups, resolution changes, UI marks, subtitles, multi-apps, accidental touches, app minimization, network errors, and verification prompts). | |
| The same checkpoint serves a dual role in the AgentHijack-Agent framework: | |
| 1. **Action generator**: produces the next GUI action from screenshots + history. | |
| 2. **Onlooker**: summarizes behavioral changes between consecutive screenshots and performs an initial environment check before execution. | |
| - **Paper:** *AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions* (ICML 2026) | |
| - **Project page:** https://AgentHijack.github.io | |
| - **Base model:** `ByteDance-Seed/UI-TARS-1.5-7B` (Qwen2.5-VL-7B architecture) | |
| - **Affiliations:** TMLR Group, Hong Kong Baptist University | |
| --- | |
| ## Highlights | |
| Compared with the base `UI-TARS-1.5-7B`, AgentHijack-Agent: | |
| - **Improves average task success rate on the AgentHijack benchmark by +4.15%** (and by a larger margin over the UI-TARS-7B-DPO baseline). | |
| - Maintains accurate grounding under **visual disruptors** (pop-ups, resolution changes, UI marks, subtitles, multi-apps). | |
| - Recovers from **unexpected operations** (accidental touch, app minimization) via behavioral summarization. | |
| - Detects **environment errors** (network failure, login/verification prompts) up-front instead of looping on meaningless attempts. | |
| See Table 2 and Figure 8 of the paper for full results and qualitative trajectories. | |
| --- | |
| ## Model details | |
| | Field | Value | | |
| |---|---| | |
| | Architecture | `Qwen2_5_VLForConditionalGeneration` | | |
| | Parameters | ~7B | | |
| | Precision | `bfloat16` | | |
| | Context length | 128k tokens | | |
| Image resolution | 1920 × 1080 (native, paper default) | |
| Sharding | 4 × `safetensors` shards | |
| | Tokenizer | Inherited from UI-TARS-1.5-7B / Qwen2.5-VL | | |
| ### Training | |
| - **Algorithm:** Data-Augmented GRPO (DA-GRPO), an extension of GRPO that rolls out the same instruction across *different corrupted environments* drawn from a corruption set `C`, instead of a single clean environment. | |
| - **Framework:** [VERL](https://github.com/volcengine/verl). | |
| - **Data:** 128 tasks sampled from the AgentHijack benchmark (built on top of OSWorld with 9 configurable corruption types, 3,321 tasks total). | |
| - **Schedule:** 15 epochs. | |
| - **Reward:** `r = r_success + r_format`, with an experience-replay buffer (following ARPO) to mitigate sparse-reward batches. | |
| - **Optimization:** clip range [0.2, 0.3], KL loss disabled to encourage exploration. | |
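
The training recipe above can be illustrated with a small, dependency-free sketch. It is not the released training code: the corruption names, the 0/1 reward terms, and the reading of the clip range `[0.2, 0.3]` as asymmetric lower/upper bounds are all assumptions made for illustration. It shows the three pieces the list describes: one corruption drawn per rollout so that a group spans different environments, the summed reward, and the standard GRPO group normalisation followed by a clipped surrogate with no KL term.

```python
import random
import statistics

# Hypothetical identifiers for the nine corruption types named in this card.
CORRUPTIONS = [
    "popup", "resolution_change", "ui_marks", "subtitles", "multi_apps",
    "accidental_touch", "app_minimization", "network_error",
    "verification_prompt",
]


def sample_environments(group_size, rng):
    """Draw one corruption per rollout so a group covers varied environments."""
    return [rng.choice(CORRUPTIONS) for _ in range(group_size)]


def total_reward(success, well_formatted):
    """r = r_success + r_format; each term is assumed to be 0 or 1 here."""
    return float(success) + float(well_formatted)


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO normalisation within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_low=0.2, clip_high=0.3):
    """PPO-style clipped surrogate for one sample, without a KL penalty.

    Treating [0.2, 0.3] as (lower, upper) clip widths is an assumption.
    """
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)
```

Note that a group whose rollouts all receive the same reward yields all-zero advantages and therefore no learning signal, which is presumably the sparse-reward situation the experience-replay buffer is meant to mitigate.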
| --- | |
| ## Usage | |
| The model uses the standard Qwen2.5-VL / UI-TARS interface and is compatible with `transformers` and `vllm`. | |
| ### Action space | |
| AgentHijack-Agent uses the same action space as UI-TARS-1.5-7B: | |
| ``` | |
| click(start_box='<|box_start|>(x1,y1)<|box_end|>') | |
| left_double(start_box='<|box_start|>(x1,y1)<|box_end|>') | |
| right_single(start_box='<|box_start|>(x1,y1)<|box_end|>') | |
| drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>') | |
| hotkey(key='') | |
| type(content='xxx') | |
| scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left') | |
| wait() | |
| finished(content='xxx') | |
| ``` | |
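
Before an action string can be dispatched to an executor it usually has to be turned into structured data. The following is a minimal parsing sketch under stated assumptions: `parse_action` and its regular expressions are illustrative helpers, not part of the released code, and coordinates are assumed to be integer `(x,y)` pairs as in the listing above.

```python
import re

# name(args...) with the argument list captured as a single string.
ACTION_RE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)$", re.DOTALL)
# key='value' pairs; backslash-escaped characters inside the value are allowed.
ARG_RE = re.compile(r"(\w+)='((?:[^'\\]|\\.)*)'")
# An integer point such as (100,200), found inside the box markers.
POINT_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def parse_action(text):
    """Return (action_name, args) for one action call string.

    Arguments whose name ends in '_box' are converted to (x, y) integer
    tuples when a point is present; all other values stay as raw strings.
    """
    match = ACTION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an action call: {text!r}")
    args = {}
    for key, value in ARG_RE.findall(match.group("args")):
        point = POINT_RE.search(value) if key.endswith("_box") else None
        if point is not None:
            args[key] = (int(point.group(1)), int(point.group(2)))
        else:
            args[key] = value
    return match.group("name"), args
```

For example, `parse_action("wait()")` gives `("wait", {})`, while a `drag(...)` call gives a dictionary with both `start_box` and `end_box` as coordinate tuples.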
| ### Prompt template (action generator) | |
| ``` | |
| You are a GUI agent. You are given a task and your action history, with | |
| screenshots. You need to perform the next action to complete the task. | |
| ## Output Format | |
| ``` | |
| Thought: ... | |
| Action: ... | |
| ``` | |
| ## Action Space | |
| {action_space} | |
| ## Note | |
| - Use {language} in `Thought` part. | |
| - Write a small plan and finally summarize your next action (with its target | |
| element) in one sentence in `Thought` part. | |
| ## User Instruction | |
| {instruction} | |
| ``` | |
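
The template has three placeholders: `{action_space}`, `{language}`, and `{instruction}`. A sketch of filling them in follows; `build_prompt` is a hypothetical helper, the default language of English is an assumption, and the template text is copied from this card with the inner code fence built programmatically only so that this example's own fence is not broken.

```python
FENCE = "`" * 3  # the literal fence used inside the prompt's Output Format

PROMPT_TEMPLATE = "\n".join([
    "You are a GUI agent. You are given a task and your action history, with",
    "screenshots. You need to perform the next action to complete the task.",
    "## Output Format",
    FENCE,
    "Thought: ...",
    "Action: ...",
    FENCE,
    "## Action Space",
    "{action_space}",
    "## Note",
    "- Use {language} in `Thought` part.",
    "- Write a small plan and finally summarize your next action (with its target",
    "element) in one sentence in `Thought` part.",
    "## User Instruction",
    "{instruction}",
])

# Action space exactly as listed in this card.
ACTION_SPACE = "\n".join([
    "click(start_box='<|box_start|>(x1,y1)<|box_end|>')",
    "left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')",
    "right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')",
    "drag(start_box='<|box_start|>(x1,y1)<|box_end|>', "
    "end_box='<|box_start|>(x3,y3)<|box_end|>')",
    "hotkey(key='')",
    "type(content='xxx')",
    "scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', "
    "direction='down or up or right or left')",
    "wait()",
    "finished(content='xxx')",
])


def build_prompt(instruction, language="English", action_space=ACTION_SPACE):
    """Fill the three placeholders of the action-generator template."""
    return PROMPT_TEMPLATE.format(
        action_space=action_space, language=language, instruction=instruction
    )
```

Because `str.format` substitutes values without re-parsing them, an instruction that itself contains braces is inserted verbatim.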
| ### Minimal inference example | |
| ```python | |
| from transformers import AutoProcessor, AutoModelForImageTextToText | |
| import torch | |
| model_id = "TMLR-Group-HF/AgentHijack-Agent" | |
| processor = AutoProcessor.from_pretrained(model_id) | |
| model = AutoModelForImageTextToText.from_pretrained( | |
|     model_id, torch_dtype=torch.bfloat16, device_map="auto" | |
| ) | |
| # Build a chat with screenshot(s) + the action-generator prompt above, | |
| # then run model.generate(...) as usual. | |
| ``` | |
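
The remaining steps (building the chat, generating, and separating the `Thought` from the `Action`) might look like the sketch below. It assumes `model` and `processor` were loaded as in the snippet above and that screenshots are reachable by URL; `build_messages`, `generate_response`, and `split_response` are illustrative names, and `max_new_tokens=256` is an arbitrary choice rather than a recommended setting.

```python
import re


def build_messages(prompt, screenshot_urls):
    """Single user turn: the filled-in prompt followed by the screenshots."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image", "url": url} for url in screenshot_urls]
    return [{"role": "user", "content": content}]


def generate_response(model, processor, messages, max_new_tokens=256):
    """Run one generation step and return only the newly generated text."""
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)


def split_response(text):
    """Split a 'Thought: ... Action: ...' response into its two parts."""
    match = re.search(
        r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.*)",
        text,
        re.DOTALL,
    )
    if match is None:
        raise ValueError("response does not follow the Thought/Action format")
    return match.group("thought").strip(), match.group("action").strip()
```

A typical call sequence would be `messages = build_messages(prompt, [url])`, then `text = generate_response(model, processor, messages)`, then `thought, action = split_response(text)`.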
| For the full agent framework (action generator + onlooker + environment checking), please refer to the code at [AgentHijack.github.io](https://AgentHijack.github.io). | |
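
As a rough orientation only, the two roles described earlier in this card (action generator and onlooker) could be composed along the following lines. This is a hypothetical sketch, not the released framework: `run_episode`, the `env` interface (`screenshot()` and `execute()`), the callables `act`, `summarize`, and `check_env`, and the `max_steps` default are all assumptions introduced for illustration.

```python
def run_episode(instruction, env, act, summarize, check_env, max_steps=15):
    """Compose an initial environment check, action generation, and
    per-step behavioral summaries into one control loop.

    act(instruction, history, screenshot) -> action string
    summarize(prev_screenshot, curr_screenshot) -> short text summary
    check_env(screenshot) -> None if usable, else a reason string
    """
    previous = env.screenshot()

    # Onlooker role, part 1: check the environment before executing anything.
    problem = check_env(previous)
    if problem is not None:
        return {"status": "blocked", "reason": problem, "history": []}

    history = []
    for _ in range(max_steps):
        # Action-generator role: next action from screenshot + history.
        action = act(instruction, history, previous)
        if action.startswith("finished("):
            return {"status": "finished", "reason": action, "history": history}

        env.execute(action)
        current = env.screenshot()

        # Onlooker role, part 2: describe what changed between screenshots.
        history.append((action, summarize(previous, current)))
        previous = current

    return {"status": "max_steps", "reason": None, "history": history}
```

In this arrangement both `act` and the onlooker callables would be backed by the same checkpoint with different prompts, which matches the dual-role description but is not confirmed in detail by this card.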
| --- | |
| ## Citation | |
| If you use this model or the AgentHijack benchmark, please cite: | |
| ```bibtex | |
| @inproceedings{sun2026agenthijack, | |
| title = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions}, | |
| author = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han}, | |
| booktitle = {Forty-third International Conference on Machine Learning}, | |
| year = {2026}, | |
| url = {https://openreview.net/forum?id=0H5Im3Xvuf} | |
| } | |
| ``` | |
| --- | |
| ## Acknowledgements | |
| This model is built on top of [UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) and the [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) family, with training infrastructure based on [VERL](https://github.com/volcengine/verl). The benchmark environment extends [OSWorld](https://os-world.github.io/). | |