Instructions for using TMLR-Group-HF/AgentHijack-Agent with libraries and local apps.

Libraries

How to use TMLR-Group-HF/AgentHijack-Agent with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="TMLR-Group-HF/AgentHijack-Agent")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("TMLR-Group-HF/AgentHijack-Agent")
model = AutoModelForImageTextToText.from_pretrained("TMLR-Group-HF/AgentHijack-Agent")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use TMLR-Group-HF/AgentHijack-Agent with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "TMLR-Group-HF/AgentHijack-Agent"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TMLR-Group-HF/AgentHijack-Agent",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
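
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is reachable at `http://localhost:8000`, and the helper names `build_payload` and `chat` are hypothetical, chosen for illustration.

```python
import json
import urllib.request


def build_payload(model: str, prompt: str, image_url: str) -> dict:
    # Mirrors the JSON body of the curl call: one user turn with text and an image URL.
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }


def chat(base_url: str, payload: dict, timeout: float = 60.0) -> str:
    # POST the payload to the OpenAI-compatible chat completions endpoint.
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses place the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", build_payload("TMLR-Group-HF/AgentHijack-Agent", "Describe this image in one sentence.", image_url))` should return the model's reply as a string.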

Use Docker

docker model run hf.co/TMLR-Group-HF/AgentHijack-Agent

SGLang

How to use TMLR-Group-HF/AgentHijack-Agent with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "TMLR-Group-HF/AgentHijack-Agent" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TMLR-Group-HF/AgentHijack-Agent",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "TMLR-Group-HF/AgentHijack-Agent" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "TMLR-Group-HF/AgentHijack-Agent",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'



	---
	license: apache-2.0
	language:
	- en
	base_model:
	- ByteDance-Seed/UI-TARS-1.5-7B
	pipeline_tag: image-text-to-text
	tags:
	- gui-agent
	- computer-use
	- multimodal
	- vision-language
	- qwen2_5_vl
	- ui-tars
	- robustness
	- reinforcement-learning
	- grpo
	library_name: transformers
	---

	# AgentHijack-Agent

	AgentHijack-Agent is the action-generation model released with the paper
	[AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions](https://AgentHijack.github.io) (ICML 2026).

	It is fine-tuned from [`UI-TARS-1.5-7B`](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) (Qwen2.5-VL architecture) using Data-Augmented Group Relative Policy Optimization (DA-GRPO) on the AgentHijack benchmark, with the goal of producing a computer-use agent that remains reliable under common environment corruptions (pop-ups, resolution changes, UI marks, subtitles, multi-apps, accidental touches, app minimization, network errors, and verification prompts).

	The same checkpoint serves a dual role in the AgentHijack-Agent framework:

	1. Action generator — produces the next GUI action from screenshots + history.
	2. Onlooker — summarizes behavioral changes between consecutive screenshots and performs an initial environment check before execution.

	- 📄 Paper: AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions (ICML 2026)
	- 🌐 Project page: https://AgentHijack.github.io
	- 🧩 Base model: `ByteDance-Seed/UI-TARS-1.5-7B` (Qwen2.5-VL-7B architecture)
	- 🏛️ Affiliations: TMLR Group, Hong Kong Baptist University

	---

	## Highlights

	Compared with the base `UI-TARS-1.5-7B`, AgentHijack-Agent:

	- Improves the average task success rate on the AgentHijack benchmark by 4.15% (and by a larger margin over the UI-TARS-7B-DPO baseline).
	- Maintains accurate grounding under visual disruptors (pop-ups, resolution changes, UI marks, subtitles, multi-apps).
	- Recovers from unexpected operations (accidental touch, app minimization) via behavioral summarization.
	- Detects environment errors (network failure, login/verification prompts) up front instead of looping on meaningless attempts.

	See Table 2 and Figure 8 of the paper for full results and qualitative trajectories.

	---

	## Model details

	| Field | Value |
	|---|---|
	| Architecture | `Qwen2_5_VLForConditionalGeneration` |
	| Parameters | ~7B |
	| Precision | `bfloat16` |
	| Context length | 128k tokens |
	| Image resolution | 1920 × 1080 (native, paper default) |
	| Sharding | 4 × `safetensors` shards |
	| Tokenizer | Inherited from UI-TARS-1.5-7B / Qwen2.5-VL |

	### Training

	- Algorithm: Data-Augmented GRPO (DA-GRPO), an extension of GRPO that rolls out the same instruction across different corrupted environments drawn from a corruption set `C`, instead of a single clean environment.
	- Framework: [VERL](https://github.com/volcengine/verl).
	- Data: 128 tasks sampled from the AgentHijack benchmark (built on top of OSWorld with 9 configurable corruption types, 3,321 tasks total).
	- Schedule: 15 epochs.
	- Reward: `r = r_success + r_format`, with an experience-replay buffer (following ARPO) to mitigate sparse-reward batches.
	- Optimization: clip range [0.2, 0.3], KL loss disabled to encourage exploration.
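
The reward and grouping described above can be illustrated with a minimal sketch. It assumes the standard GRPO group-normalized advantage (reward minus group mean, divided by group standard deviation) and reads the clip range as a lower bound of 0.2 and an upper bound of 0.3 on the probability ratio. Both are assumptions, as are the binary reward values; the names `Rollout`, `group_advantages`, and `clipped_term` are hypothetical and not taken from the released training code. The experience-replay buffer is omitted.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    corruption: str   # corruption type drawn from the set C (illustrative label)
    r_success: float  # assumed: 1.0 if the task succeeded, else 0.0
    r_format: float   # assumed: 1.0 if the output is well-formed, else 0.0


def total_reward(rollout: Rollout) -> float:
    # r = r_success + r_format, as stated in the training notes.
    return rollout.r_success + rollout.r_format


def group_advantages(group: list[Rollout], eps: float = 1e-6) -> list[float]:
    # Assumed standard GRPO normalization over one instruction's rollouts,
    # which here come from different corrupted environments.
    rewards = [total_reward(r) for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float,
                 clip_low: float = 0.2, clip_high: float = 0.3) -> float:
    # PPO-style clipped surrogate with an asymmetric range (assumed reading
    # of "clip range [0.2, 0.3]"); no KL term is added.
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


# One instruction rolled out under four different corruptions.
group = [
    Rollout("popup", 1.0, 1.0),
    Rollout("resolution", 0.0, 1.0),
    Rollout("network_error", 0.0, 0.0),
    Rollout("subtitle", 1.0, 1.0),
]
advantages = group_advantages(group)
```

Because the advantages are normalized within the group, rollouts that succeed under a corruption are pushed up relative to rollouts of the same instruction that fail under a different one.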

	---

	## Usage

	The model uses the standard Qwen2.5-VL / UI-TARS interface and is compatible with `transformers` and `vllm`.

	### Action space

	AgentHijack-Agent uses the same action space as UI-TARS-1.5-7B:

	```
	click(start_box='<|box_start|>(x1,y1)<|box_end|>')
	left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
	right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
	drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
	hotkey(key='')
	type(content='xxx')
	scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
	wait()
	finished(content='xxx')
	```
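
To execute a predicted action, the call string has to be turned into structured data. The sketch below shows one way to do that with regular expressions. It is not the parser used by the AgentHijack framework: `parse_action` is a hypothetical helper, and it assumes argument values contain no single quotes and that coordinates are plain numbers.

```python
import re

# Matches a call such as click(start_box='...') and captures name and arguments.
CALL_RE = re.compile(r"^\s*(?P<name>\w+)\((?P<args>.*)\)\s*$", re.DOTALL)
# Matches key='value' pairs; values are assumed not to contain single quotes.
ARG_RE = re.compile(r"(\w+)='([^']*)'")
# Matches a coordinate pair, with or without the surrounding box tokens.
POINT_RE = re.compile(r"\((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\)")


def parse_action(text: str) -> dict:
    match = CALL_RE.match(text)
    if match is None:
        raise ValueError(f"not an action call: {text!r}")
    args = {}
    for key, value in ARG_RE.findall(match.group("args")):
        point = POINT_RE.search(value)
        if key.endswith("_box") and point:
            # Box arguments become (x, y) tuples of floats.
            args[key] = (float(point.group(1)), float(point.group(2)))
        else:
            # Everything else (content, key, direction) stays a string.
            args[key] = value
    return {"name": match.group("name"), "args": args}
```

For example, `parse_action("type(content='hello')")` yields `{"name": "type", "args": {"content": "hello"}}`. How the returned coordinates map onto the screen is left to the caller.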

	### Prompt template (action generator)

	````
	You are a GUI agent. You are given a task and your action history, with
	screenshots. You need to perform the next action to complete the task.

	## Output Format
	```
	Thought: ...
	Action: ...
	```

	## Action Space

	{action_space}

	## Note
	- Use {language} in `Thought` part.
	- Write a small plan and finally summarize your next action (with its target
	element) in one sentence in `Thought` part.

	## User Instruction
	{instruction}
	````
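
The template has three placeholders: `{action_space}`, `{language}`, and `{instruction}`. A minimal sketch of filling it and of splitting a reply into its `Thought` and `Action` parts follows. `build_prompt` and `split_response` are hypothetical helpers, the `"English"` default is an assumption, and the inner code fence is built programmatically only so that the example nests cleanly in Markdown.

```python
import re

FENCE = "`" * 3  # three backticks, built in code to avoid nesting fences here

PROMPT_TEMPLATE = "\n".join([
    "You are a GUI agent. You are given a task and your action history, with",
    "screenshots. You need to perform the next action to complete the task.",
    "",
    "## Output Format",
    FENCE,
    "Thought: ...",
    "Action: ...",
    FENCE,
    "",
    "## Action Space",
    "",
    "{action_space}",
    "",
    "## Note",
    "- Use {language} in `Thought` part.",
    "- Write a small plan and finally summarize your next action (with its target",
    "element) in one sentence in `Thought` part.",
    "",
    "## User Instruction",
    "{instruction}",
])


def build_prompt(action_space: str, instruction: str, language: str = "English") -> str:
    # Substitute the three placeholders; the template contains no other braces.
    return PROMPT_TEMPLATE.format(
        action_space=action_space, language=language, instruction=instruction
    )


def split_response(text: str) -> tuple[str, str]:
    # Split on the "Thought:" / "Action:" markers from the output format.
    match = re.search(r"Thought:\s*(.*?)\s*Action:\s*(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not follow the Thought/Action format")
    return match.group(1), match.group(2).strip()
```

`build_prompt` takes the action-space listing as a plain string, so the same function works if the action space is later extended or trimmed.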

	### Minimal inference example

	```python
	from transformers import AutoProcessor, AutoModelForImageTextToText
	import torch

	model_id = "TMLR-Group-HF/AgentHijack-Agent"
	processor = AutoProcessor.from_pretrained(model_id)
	model = AutoModelForImageTextToText.from_pretrained(
	    model_id, torch_dtype=torch.bfloat16, device_map="auto"
	)

	# Build a chat with screenshot(s) + the action-generator prompt above,
	# then run model.generate(...) as usual.
	```
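
The step left as a comment above (building the chat and calling `generate`) could look like the following sketch. It follows the generic Transformers chat-template pattern rather than the framework's own inference code. The helper names, the single-turn layout with the screenshot placed before the prompt text, and the `max_new_tokens` default are all assumptions.

```python
def build_messages(prompt: str, screenshot_url: str) -> list[dict]:
    # One user turn: the screenshot followed by the filled-in prompt text.
    # (The ordering is an assumption; multi-step history is not handled.)
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": screenshot_url},
            {"type": "text", "text": prompt},
        ],
    }]


def generate_action(model_id: str, prompt: str, screenshot_url: str,
                    max_new_tokens: int = 256) -> str:
    # Imports are local so build_messages can be used without torch installed.
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = processor.apply_chat_template(
        build_messages(prompt, screenshot_url),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```

`generate_action` reloads the model on every call to keep the sketch short; a real agent loop would load the processor and model once and reuse them across steps.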

	For the full agent framework (action generator + onlooker + environment checking), please refer to the code at [AgentHijack.github.io](https://AgentHijack.github.io).

	---

	## Citation

	If you use this model or the AgentHijack benchmark, please cite:

	```bibtex
	@inproceedings{sun2026agenthijack,
	title = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions},
	author = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han},
	booktitle = {Forty-third International Conference on Machine Learning},
	year = {2026},
	url = {https://openreview.net/forum?id=0H5Im3Xvuf}
	}
	```

	---

	## Acknowledgements

	This model is built on top of [UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) and the [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) family, with training infrastructure based on [VERL](https://github.com/volcengine/verl). The benchmark environment extends [OSWorld](https://os-world.github.io/).