---
license: apache-2.0
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
tags:
- VLM
- GUI
- agent
---

# Introduction

The models from the paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".

**GitHub:** https://github.com/GUI-Libra/GUI-Libra

**Website:** https://GUI-Libra.github.io

# Usage

## 1) Start an OpenAI-compatible vLLM server

```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-7B --port 8000 --api-key token-abc123
```

* Endpoint: `http://localhost:8000/v1`
* The `api_key` passed by the client must match `--api-key`.

## 2) Minimal Python example (prompt + image → request)

Install dependencies:

```bash
pip install -U openai pillow
```

Create `minimal_infer.py`:

```python
import base64

from openai import OpenAI

MODEL = "GUI-Libra/GUI-Libra-7B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")


def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# 1) Your screenshot path
img_b64 = b64_image("screen.png")

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen.
You need to choose actions from the following list:

action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
## Explanation: Tap or click a specific UI element and provide its coordinates

action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
## Explanation: Select an item from a list or dropdown menu

action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
## Explanation: Press a specified key on the keyboard

action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''

# Read the screenshot's pixel size so it can be reported in the prompt (requires Pillow)
from PIL import Image
img_size = Image.open("screen.png").size  # (width, height)

img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)
query = query + '\n' + '''The response should be structured in the following format:

Your step-by-step thought process here...

{
"action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc.
Please follow the system prompt for available actions.",
"action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
"value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
"point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    max_completion_tokens=1024,
)

print(resp.choices[0].message.content)
```

Run:

```bash
python minimal_infer.py
```

---

## Notes

* Replace `screen.png` with your own screenshot file.
* If you hit OOM or slowdowns, reduce the image size or run fewer concurrent requests.
* The example assumes your vLLM server is running locally on port `8000`.
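## 3) Parsing the reply

Per the output format above, the model replies with free-form thought text followed by a JSON action object. A minimal sketch of extracting that object from the reply; the `parse_action` helper and the sample reply below are illustrative, not part of the release:

```python
import json
import re


def parse_action(reply: str) -> dict:
    """Extract the JSON action object that follows the thought text."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON action found in model reply")
    return json.loads(match.group(0))


# Hypothetical reply in the format requested by the prompt above.
reply = (
    "The search box is focused, so the query should be typed first...\n"
    '{"action_type": "Write", "action_target": "search box", '
    '"value": "math book", "point_2d": [512, 88]}'
)
action = parse_action(reply)
print(action["action_type"], action["point_2d"])
```

Note that the format description shows `"point_2d": [x, y]` with a trailing `#` comment; the model is expected to emit plain JSON, so a greedy brace match plus `json.loads` is usually sufficient.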