---
license: apache-2.0
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLM
- GUI
- agent
---

# Introduction

Models from the paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".

**GitHub:** https://github.com/GUI-Libra/GUI-Libra
**Website:** https://GUI-Libra.github.io

# Usage
## 1) Start an OpenAI-compatible vLLM server

```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-3B --port 8000 --api-key token-abc123
```

* Endpoint: `http://localhost:8000/v1`
* The `api_key` used by the client must match the `--api-key` passed to the server.

## 2) Minimal Python example (prompt + image → request)

Install dependencies:

```bash
pip install -U openai pillow
```

Create `minimal_infer.py`:

```python
import base64
from openai import OpenAI
from PIL import Image

MODEL = "GUI-Libra/GUI-Libra-3B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# 1) Your screenshot path
img_path = "screen.png"
img_b64 = b64_image(img_path)
img_size = Image.open(img_path).size  # (width, height), used in the prompt below

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
## Explanation: Tap or click a specific UI element and provide its coordinates

action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
## Explanation: Select an item from a list or dropdown menu

action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
## Explanation: Press a specified key on the keyboard

action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)

query = query + '\n' + '''The response should be structured in the following format:
<think>Your step-by-step thought process here...</think>
<answer>
{
  "action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
  "action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
  "value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
  "point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    max_tokens=1024,
)

print(resp.choices[0].message.content)
```
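
The model is prompted to reply as `<think>…</think>` followed by an `<answer>` block containing a JSON action. A minimal parsing helper might look like the sketch below; it assumes the response actually follows the requested format (the `parse_action` name and the sample response string are illustrative, not part of the model card):

```python
import json
import re

def parse_action(response_text: str) -> dict:
    """Extract the JSON action dict from the <answer>...</answer> block."""
    match = re.search(r"<answer>\s*(\{.*?\})\s*</answer>", response_text, re.DOTALL)
    if match is None:
        raise ValueError("No <answer> block found in model output")
    return json.loads(match.group(1))

# Synthetic example of a well-formed response:
example = """<think>The search box is at the top of the page.</think>
<answer>
{
  "action_type": "Click",
  "action_target": "search box at the top of the page",
  "value": "None",
  "point_2d": [512, 40]
}
</answer>"""

action = parse_action(example)
print(action["action_type"], action["point_2d"])  # Click [512, 40]
```

Note that real model output may occasionally deviate from the format, so wrapping `parse_action` in a try/except and retrying the request is a reasonable safeguard.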

Run:

```bash
python minimal_infer.py
```

---

## Notes

* Replace `screen.png` with your own screenshot file.
* If you hit OOM or slowdowns, reduce the image size or run fewer concurrent requests.
* The example assumes your vLLM server is running locally on port `8000`.
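
One way to reduce the image size before encoding is to downscale the screenshot with Pillow. This is a sketch, not part of the model's requirements; the `downscale` helper and the `max_side` limit of 1280 are assumptions you can tune:

```python
from PIL import Image

def downscale(path: str, out_path: str, max_side: int = 1280) -> tuple:
    """Resize so the longer side is at most max_side, preserving aspect ratio."""
    img = Image.open(path)
    w, h = img.size
    scale = min(1.0, max_side / max(w, h))  # never upscale
    new_size = (round(w * scale), round(h * scale))
    img.resize(new_size, Image.LANCZOS).save(out_path)
    return new_size
```

Remember to report the resized dimensions in `img_size_string` so the coordinates the model returns match the image it actually saw.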