---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
tags:
- GUI-Agent
- GUI-Perception
- Screen-Understanding
---

## Introduction

**HAR-GUI-3B** is a GUI-tailored native model (a native end-to-end GUI agent) built on Qwen2.5-VL-3B-Instruct. It was developed through our HAR framework, which applies a series of tailored training strategies. HAR-GUI-3B maintains a stable short-term memory for episodic reasoning: it flexibly perceives the sequential cues of an episode and makes reasonable use of them. This reasoning enhancement helps the agent execute long-horizon interactions and achieve consistent, persistent improvement across GUI-oriented tasks. Further details can be found in our article.

## Quick Start
The following Python script demonstrates how to use HAR-GUI-3B for GUI automation. It assumes a local vLLM server is serving the model; adapt the code to your needs.
```bash
# Start the vLLM service (adjust -tp to the number of GPUs used for tensor parallelism)
nohup python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2.5-VL-3B-Instruct --model ./HAR-GUI-3B -tp 4 > log.txt &
# nohup python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2.5-VL-72B-Instruct --model ./Qwen2.5-VL-72B-Instruct -tp 8 > log.txt &

# Serve your local image directory over HTTP so the server can fetch image URLs
cd ./your_directory/
python3 -m http.server 6666
```
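Before sending requests, it can help to confirm the vLLM server is actually up. This is a small sketch, not part of the original script; it assumes the server's OpenAI-compatible `/v1/models` endpoint at the default port 8000.

```python
import requests

def server_ready(base_url="http://localhost:8000"):
    """Return True if the vLLM OpenAI-compatible server answers /v1/models."""
    try:
        resp = requests.get(f"{base_url}/v1/models", timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False
```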

```python
# Instruction for Screen Grounding
temp = '''Locate the element on the screen with the function or description: [DESCRIPTIONS].
Keep the following output format: {"point 2d": [x,y], "label": description of the target element.}'''

elem_desc = "Description of the UI elements on the screen."
inst = temp.replace("[DESCRIPTIONS]", elem_desc)
```
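The grounding instruction above asks the model to reply with a JSON object. A minimal parsing sketch (not part of the original script, and assuming the reply contains a single `{"point 2d": [x, y], "label": ...}` object, possibly surrounded by extra text):

```python
import json
import re

def parse_grounding(output: str):
    """Extract the predicted (x, y) point and label from a grounding reply.

    Assumes the reply contains one JSON object in the instructed format,
    e.g. {"point 2d": [512, 128], "label": "search bar"}.
    Returns ((x, y), label), or None if no JSON object is found.
    """
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if match is None:
        return None
    obj = json.loads(match.group(0))
    x, y = obj["point 2d"]
    return (int(x), int(y)), obj.get("label", "")
```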

```python
import requests
import json
from tqdm import tqdm

#############################################################################################
ACTION_SPACE = """
CLICK:(x,y): Click on the element at the coordinate point (x,y) on the screen, e.g., CLICK:(1980,224).
TYPE:typed_text: An action of typing a piece of text, e.g., TYPE:"Macbook-Pro 16G Black".
COMPLETE: The goal has been completed in the current screen state.
SCROLL:UP/DOWN/LEFT/RIGHT: Scroll in a specific direction, e.g., SCROLL:UP.
LONG_PRESS:(x,y): Long press at a specific point (x,y) on the screen, e.g., LONG_PRESS:(345,2218).
BACK: Go back to the previous screen, e.g., BACK.
HOME: Go to the home screen, e.g., HOME.
=========================================
OTHER_CUSTOM_ACTIONS: ...
"""

INFERENCE_INSTRUCTION = f"""
You are a skilled assistant, interacting with the screen to accomplish the user's goals.
Here is the action space:
{ACTION_SPACE}
Your overall goal is: <goal>(goal)</goal>
Actions completed at previous steps: <history>(history)</history>

The output format should be as follows:
<think>Analyze step by step based on guidance and screen state to choose the action.</think>
<answer>The action you finally choose from "action space".</answer>"""

ACT2SUM_INSTRUCTION = f"""
Step-by-step GUI navigation task. Briefly summarize the current action.
Action space:
{ACTION_SPACE}
Goal: <goal>(goal)</goal>
Current action: <action>(action)</action>

Output Format: <summary>One-sentence summary of the action based on the screen image.</summary>"""
#############################################################################################
def execute(meta_data):
    goal, hist, img_url = meta_data
    inference_temp = INFERENCE_INSTRUCTION.replace("(goal)", goal).replace("(history)", hist)
    pred = chat_HAR_GUI_3B(img_url, inference_temp)
    return pred

def act2sum_fn(meta_data):
    goal, cur_action, img_url = meta_data
    act2sum_temp = ACT2SUM_INSTRUCTION.replace("(goal)", goal).replace("(action)", cur_action)
    pred = chat_HAR_GUI_3B(img_url, act2sum_temp)
    # pred = chat_72B(img_url, act2sum_temp)
    return pred
#############################################################################################

url = "http://localhost:8000/v1/chat/completions"
headers = {
     "Content-Type": "application/json"
}

def chat_HAR_GUI_3B(img_url, query):
    content = []
    content.append({"type": "image_url", "image_url": {"url": img_url}})
    content.append({"type": "text", "text": query})
    data = {
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": content}
        ],
        "temperature": 0
    }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    response = response.json()
    response = response['choices'][0]['message']['content']

    return response

def chat_72B(img_url, query):
    content = []
    content.append({"type": "image_url", "image_url": {"url": img_url}})
    content.append({"type": "text", "text": query})
    data = {
        "model": "Qwen2.5-VL-72B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": content}
        ],
        "temperature": 0
    }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    response = response.json()
    response = response['choices'][0]['message']['content']

    return response

## You can also load the model directly with Transformers (or use the Swift inference framework for faster inference), e.g.:
##################################################################################
# import torch
# from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
# MAX_IMAGE_PIXELS = 2048*28*28
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "./models/HAR-GUI-3B", 
#     torch_dtype=torch.bfloat16, 
#     attn_implementation="flash_attention_2", 
#     device_map="auto"
# )
# processor = AutoProcessor.from_pretrained("./HAR-GUI-3B", max_pixels=MAX_IMAGE_PIXELS, padding_side="left")
##################################################################################

if __name__ == "__main__":

    folder = "./your_data_folder/"
    episodes = json.load(open(folder + "your_data_path.json", "r"))
    k = 4
    inference_data = []
    for i, episode in tqdm(enumerate(episodes)):
        hist_horizon = []
        for t, step in tqdm(enumerate(episode)):
            cur_hist = ""
            # You can also build an ADB pipeline for online execution: https://developer.android.com/tools/adb
            goal, gt_action, img, ep_id = step["goal"], step["ground_truth"], step["image_path"], step["episode_id"]
            img_url = 'http://localhost:6666/' + img
            
            if len(hist_horizon) == 0:
                cur_hist = "This is the task's initial state."
            else:
                # Use j here to avoid shadowing the episode index i from the outer loop
                for j, act2sum_ in enumerate(hist_horizon[-k:]):
                    cur_hist += 'Step' + str(j+1) + ': ' + act2sum_ + ".\n"
            
            pred = execute((goal, cur_hist, img_url))
            think, pred_action = pred.split("<think>")[-1].split("</think>")[0].strip(), pred.split("<answer>")[-1].split("</answer>")[0].strip()
            
            #############
            # act2sum = act2sum_fn((goal, gt_action, img_url))  # Can be used for static inference
            act2sum = act2sum_fn((goal, pred_action, img_url)) # Can be used for online inference
            hist_horizon.append(act2sum.split("<summary>")[-1].split("</summary>")[0])
            
            inference_data.append({
                "episode_id": ep_id,
                "image_path": img,
                "goal": goal,
                "pred": pred,
                "history": cur_hist,
                "ground_truth": gt_action
            })
    # evaluate(inference_data)
    with open("your_saving_path.json", "w") as f:
        f.write(json.dumps(inference_data, indent=4))
```
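The `<answer>` string extracted above follows the ACTION_SPACE formats. A minimal parser sketch for turning such a string into a structured (name, argument) pair, e.g. for driving an ADB pipeline (this helper is an illustration, not part of the original script):

```python
import re

def parse_action(action: str):
    """Split a predicted action string into (name, argument).

    Follows the ACTION_SPACE formats, e.g. "CLICK:(1980,224)",
    'TYPE:"Macbook-Pro 16G Black"', "SCROLL:UP", "BACK".
    Coordinates are returned as an (x, y) tuple of ints.
    """
    action = action.strip()
    # Point-based actions: CLICK:(x,y) / LONG_PRESS:(x,y)
    m = re.match(r"(CLICK|LONG_PRESS):\((\d+),\s*(\d+)\)", action)
    if m:
        return m.group(1), (int(m.group(2)), int(m.group(3)))
    # Text entry, with or without surrounding quotes
    m = re.match(r'TYPE:"?(.*?)"?$', action)
    if m:
        return "TYPE", m.group(1)
    m = re.match(r"SCROLL:(UP|DOWN|LEFT|RIGHT)", action)
    if m:
        return "SCROLL", m.group(1)
    if action in ("COMPLETE", "BACK", "HOME"):
        return action, None
    return "UNKNOWN", action
```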