---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: visual-question-answering
---


This repository contains the model presented in [UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning](https://huggingface.co/papers/2503.21620).

Project page: https://github.com/lll6gg/UI-R1

New version: [UI-R1-E-3B](https://huggingface.co/LZXzju/Qwen2.5-VL-3B-UI-R1-E)

## Benchmark 1: ScreenSpotV2

| ScreenSpotV2 | Inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T | Web-I | Avg↑ / Len↓ |
| ------------- | -------------- | -------- | -------- | --------- | --------- | -------- | -------- | ----------------- |
| OS-ATLAS-7B | w/o thinking | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 / - |
| UI-TARS-7B | w/o thinking | 95.2 | 79.1 | 90.7 | 68.6 | 90.6 | 78.3 | 84.7 / - |
| UI-R1-3B (v1) | w/ thinking | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 / 67 |
| GUI-R1-3B | w/ thinking | 97.6 | 78.2 | 94.3 | 64.3 | 91.0 | 72.4 | 85.0 / 80 |
| UI-R1-3B (v2) | w/ thinking | 97.6 | 79.6 | 92.3 | 67.9 | 88.9 | 77.8 | 85.8 / 60 |
| UI-R1-E-3B | w/o thinking | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 / 28 |

## Benchmark 2: ScreenSpot-Pro

| ScreenSpot-Pro | Inference mode | Average Length↓ | Average Accuracy↑ |
| -------------- | -------------- | --------------- | ---------------- |
| UGround-7B | w/o thinking | - | 16.5 |
| OS-ATLAS-7B | w/o thinking | - | 18.9 |
| UI-R1-3B (v1) | w/ thinking | 102 | 17.8 |
| GUI-R1-3B | w/ thinking | 114 | 26.6 |
| UI-R1-3B (v2) | w/ thinking | 129 | 29.8 |
| UI-R1-E-3B | w/o thinking | 28 | 33.5 |

## Leaderboard: UI-I2E-Bench

| Model | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Avg |
| :------------: | :--------: | :--------------: | :------------: | :--: |
| UI-TARS-1.5-7B | 88.1 | 73.2 | 42.2 | 67.8 |
| Uground-V1-72B | 89.7 | 76.3 | 34.3 | 66.8 |
| UI-TARS-72B | 88.4 | 73.7 | 38.1 | 66.7 |
| UI-R1-E-3B | 89.2 | 69.1 | 33.5 | 63.9 |
| Uground-V1-7B | 87.1 | 70.3 | 31.1 | 62.8 |
| InfiGUI-R1 | 87.5 | 69.7 | 29.6 | 62.3 |
| UI-TARS-7B | 89.5 | 61.4 | 35.7 | 62.2 |
| Qwen2.5-VL-72B | 87.1 | 51.4 | 43.6 | 60.7 |
| UI-I2E-VLM-7B | 82.5 | 69.5 | 23.6 | 58.5 |
| UI-TARS-2B | 82.3 | 62 | 27.7 | 57.3 |
| Qwen2.5-VL-7B | 84.7 | 53.8 | 29 | 55.8 |
| OmniParser-V2 | 72 | 54.8 | 39.6 | 55.5 |
| Uground-V1-2B | 78.8 | 57.4 | 26.6 | 54.3 |
| OS-Atlas-7B | 82.5 | 58.6 | 18.9 | 53.3 |
| UI-R1-3B | 83.3 | 58.5 | 17.8 | 53.2 |
| UGround-7B | 74.1 | 54.2 | 16.5 | 48.3 |
| UI-I2E-VLM-4B | 70.4 | 53.4 | 12.2 | 45.3 |
| OmniParser | 73.9 | 53.1 | 8.3 | 45.1 |
| ShowUI-2B | 76.8 | 41.5 | 7.7 | 42 |
| Qwen2.5-VL-3B | 55.5 | 41.7 | 23.9 | 41.3 |
| Aguvis-7B | 84.4 | 53.2 | 22.9 | 40.4 |
| OS-Atlas-4B | 70.1 | 44.3 | 3.7 | 39.4 |
| Qwen2-VL-7B | 42.6 | 48.7 | 1.6 | 31 |
| Seeclick | 55.8 | 26.4 | 1.1 | 27.8 |
| InternVL2-4B | 4.2 | 0.9 | 0.3 | 1.8 |

## Evaluation Code for GUI Grounding

1. Generation for UI-R1-E-3B:

```python
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# args.model_path, ori_processor_path, rank, task_prompt and image_path
# come from the surrounding evaluation script (rank is used as the device index).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cpu",
)
model = model.to(torch.device(rank))
model = model.eval()
processor = AutoProcessor.from_pretrained(ori_processor_path)

# The prompt text is kept verbatim.
question_template = (
    f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
    "Please provide the action to perform (enumerate in ['click', 'scroll']) and the coordinate where the cursor is moved to(integer) if click is performed.\n"
    "Output the thinking process in <think> </think> and final answer in <answer> </answer> tags."
    "The output answer format should be as follows:\n"
    "<think> ... </think> <answer>[{'action': enum['click', 'scroll'], 'coordinate': [x, y]}]</answer>\n"
    "Please strictly follow the format."
)
query = '<image>\n' + question_template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path}
        ] + [{"type": "text", "text": query}],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the same device as the model before generating.
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
response = response[0]
# extract_coord is not defined in this snippet; it is expected to parse
# the predicted [x, y] out of the <answer> tag.
pred_coord, _ = extract_coord(response)
```
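The snippet above calls `extract_coord`, which this README does not define. The following is a minimal sketch of such a parser, not the project's own implementation. It assumes the response follows the prompt's answer format, and it assumes the second return value is a success flag (the generation code discards it, so its real meaning is not shown here). The `[0, 0]` fallback is also an assumption.

```python
import ast
import re


def extract_coord(response: str):
    """Hypothetical parser: return ([x, y], True) on success, ([0, 0], False) otherwise."""
    # Take the text between <answer> and </answer>, if present.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return [0, 0], False
    try:
        # The answer is expected to look like
        # [{'action': 'click', 'coordinate': [x, y]}], a Python-style literal.
        actions = ast.literal_eval(match.group(1).strip())
        coord = actions[0]["coordinate"]
        x, y = int(coord[0]), int(coord[1])
    except (ValueError, SyntaxError, KeyError, IndexError, TypeError):
        # Malformed answer, or an action without a coordinate (e.g. 'scroll').
        return [0, 0], False
    # Return a list (not a tuple) so the coordinate can be rescaled in place.
    return [x, y], True


example = (
    "<think>The search box is at the top.</think> "
    "<answer>[{'action': 'click', 'coordinate': [520, 88]}]</answer>"
)
print(extract_coord(example))
```

`ast.literal_eval` is used instead of `json.loads` because the answer format in the prompt uses single quotes, which is not valid JSON.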



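2. Rescale the predicted coordinate from the resized image back to the original image size. This matters most when the image has more than 12845056 pixels, the `max_pixels` value passed to `smart_resize`.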

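```python
from PIL import Image

image = Image.open(image_path)
origin_width, origin_height = image.size  # PIL returns (width, height)
# smart_resize takes and returns (height, width)
resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
scale_x = origin_width / resized_width
scale_y = origin_height / resized_height
# Map the prediction from resized-image pixels to original-image pixels
pred_coord[0] = int(pred_coord[0] * scale_x)
pred_coord[1] = int(pred_coord[1] * scale_y)
```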

The `smart_resize` function is taken from Qwen2-VL:

```python
import math


def smart_resize(
    height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
):
    """Rescales the image so that the following conditions are met:

    1. Both dimensions (height and width) are divisible by 'factor'.

    2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].

    3. The aspect ratio of the image is maintained as closely as possible.

    """
    if height < factor or width < factor:
        raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(
            f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
        )
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```
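Once the resized size is known, the rescaling step and a per-sample check can be wrapped in two small helpers. This is only a sketch under stated assumptions: `rescale_point` and `is_hit` are hypothetical names, sizes are passed as `(width, height)` (so the `(height, width)` result of `smart_resize` must be swapped), the clamping is an added safeguard, and the hit criterion (the point lies inside a `(left, top, right, bottom)` box) is assumed rather than taken from this README.

```python
def rescale_point(pred, origin_size, resized_size):
    """Map [x, y] from resized-image pixels back to original-image pixels.

    Both sizes are (width, height) tuples.
    """
    origin_w, origin_h = origin_size
    resized_w, resized_h = resized_size
    x = int(pred[0] * origin_w / resized_w)
    y = int(pred[1] * origin_h / resized_h)
    # Clamp so the point always stays inside the original image.
    return [min(max(x, 0), origin_w - 1), min(max(y, 0), origin_h - 1)]


def is_hit(point, bbox):
    """Return True if the point lies inside bbox = (left, top, right, bottom)."""
    left, top, right, bottom = bbox
    return left <= point[0] <= right and top <= point[1] <= bottom


# A 2000x1000 screenshot that was resized to 1000x500 (illustrative numbers).
point = rescale_point([250, 125], origin_size=(2000, 1000), resized_size=(1000, 500))
print(point, is_hit(point, (480, 230, 560, 270)))  # [500, 250] True
```

Keeping the resized size as an explicit argument means the helper does not depend on how the resize was computed, so it can be checked without loading a model or an image.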