nielsr (HF Staff) committed on Jan 13

Commit b347340 (verified) · 1 Parent(s): dc34666

Add metadata and paper link to model card

Hi! I'm Niels, part of the community science team at Hugging Face.

This PR improves the model card for GUI-Drag-7B:
- Adds metadata for `pipeline_tag` and `library_name`.
- Links the model repository to the associated research paper: [Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging](https://huggingface.co/papers/2601.06031).
- Includes links to the project page and GitHub repository for better visibility.
- Maintains the existing demo code while ensuring the literal new line characters (`\n`) in strings are preserved.
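
The last bullet concerns how newlines are written in the prompt template. As a small illustration (not taken from the PR; `escaped` and `literal` are made-up names, and the text is a shortened stand-in for the tool description), a `\n` escape and an actual line break produce the same character in a regular, non-raw triple-quoted Python string, so either spelling yields the same prompt text:

```python
# Hypothetical two-line excerpt in the style of the tool description.
escaped = """Use a mouse and keyboard.\n* This is an interface to a desktop GUI."""
literal = """Use a mouse and keyboard.
* This is an interface to a desktop GUI."""

# Both spellings contain a single newline character at the same position.
same = escaped == literal
line_count = len(escaped.splitlines())
print(same, line_count)
```

A raw string (`r"""..."""`) would behave differently, because it keeps `\n` as a backslash followed by the letter `n`.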

Files changed (1)

README.md +47 -21

@@ -1,29 +1,34 @@
 ---
 license: apache-2.0
 ---
-Our models are trained based on [Jedi models](https://huggingface.co/xlangai/Jedi-3B-1080p) via an efficient continual training strategy, which enhances the models' text dragging performance while perserving their original click-based performance.
-For details of how to employ the models, please refer to our [repo](https://github.com/OSU-NLP-Group/GUI-Drag/blob/48a3480fe580e93fb747f0eb8ae549d5eb18f57b/evaluation/cli_run_drag.sh#L14) examples.
-Project page: https://osu-nlp-group.github.io/GUI-Drag
-Below is the code of a quick demo (demo.png can be found at [here](https://github.com/OSU-NLP-Group/GUI-Drag/blob/main/demo.png)):
-```
-# pip install openai pillow transformers
-# start the vllm server:
-'''
 vllm serve osunlp/GUI-Drag-7B \
 --host 0.0.0.0 \
 --port 8000 \
 --max-model-len 16384 \
 --tensor-parallel-size 2
-'''
 import base64
 import json
 import re
@@ -33,18 +38,32 @@ from openai import OpenAI
 from PIL import Image, ImageDraw
 from transformers.models.qwen2_vl.image_processing_qwen2_vl_fast import smart_resize as qwen_smart_resize
 MODEL_ID = "osunlp/GUI-Drag-7B"
 BASE_URL = "http://localhost:8000/v1"
 FN_CALL_TEMPLATE = """You are a helpful assistant.
 # Tools
 You may call one or more functions to assist with the user query.
 You are provided with function signatures within <tools></tools> XML tags:
 <tools>
-{{"type": "function", "function": {{"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\n* The screen's resolution is {width}x{height}.\n* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {{"properties": {{"action": {{"description": "The action to perform. 
The available actions are:\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\n* `type`: Type a string of text on the keyboard.\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\n* `left_click`: Click the left mouse button.\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\n* `right_click`: Click the right mouse button.\n* `middle_click`: Click the middle mouse button.\n* `double_click`: Double-click the left mouse button.\n* `scroll`: Performs a scroll of the mouse scroll wheel.\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "scroll", "wait", "terminate"], "type": "string"}}, "keys": {{"description": "Required only by `action=key`.", "type": "array"}}, "text": {{"description": "Required only by `action=type`.", "type": "string"}}, "coordinate": {{"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move`, `action=left_click_drag`, `action=left_click`, `action=right_click`, `action=double_click`.", "type": "array"}}, "pixels": {{"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll`.", "type": "number"}}, "time": {{"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}}, "status": {{"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}}}, "required": ["action"], "type": "object"}}}}}}
 </tools>
 For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
 <tool_call>
@@ -52,13 +71,9 @@ For each function call, return a json object with function name and arguments wi
 </tool_call>
 """
 IMAGE_PATH = Path("demo.png")
 INSTRUCTION = "Drag to select the last sentence."
 def encode_image(image: Image) -> str:
     """Encode PIL image to base64 string"""
     output_buffer = io.BytesIO()
@@ -67,8 +82,6 @@ def encode_image(image: Image) -> str:
     base64_str = base64.b64encode(byte_data).decode("utf-8")
     return base64_str
 def resize_coordinates(coord, size_pred, size_to_be_mapped):
     return (
         round(coord[0] * size_to_be_mapped[0] / size_pred[0]),
@@ -140,4 +153,17 @@ def main():
 if __name__ == "__main__":
     main()
 ```

 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+tags:
+- gui-grounding
+- text-dragging
+- vision-language
+- qwen2.5-vl
 ---
+# Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging
+[**Project Page**](https://osu-nlp-group.github.io/GUI-Drag) | [**GitHub Repository**](https://github.com/OSU-NLP-Group/GUI-Drag) | [**Paper**](https://huggingface.co/papers/2601.06031)
+GUI-Drag-7B is a multimodal model designed for GUI grounding, with a specific focus on text dragging interactions. While traditional models focus on clicking, GUI-Drag enables autonomous agents to select and manipulate textual content through dragging actions. The model is trained based on [Jedi models](https://huggingface.co/xlangai/Jedi-3B-1080p) via an efficient continual training strategy, which enhances text dragging performance while preserving original click-based capabilities.
+## Quick Demo
+Below is the code of a quick demo (demo.png can be found at [here](https://github.com/OSU-NLP-Group/GUI-Drag/blob/main/demo.png)). To use the model, you can start a [vLLM](https://github.com/vllm-project/vllm) server:
+```bash
 vllm serve osunlp/GUI-Drag-7B \
 --host 0.0.0.0 \
 --port 8000 \
 --max-model-len 16384 \
 --tensor-parallel-size 2
+```
+```python
+# pip install openai pillow transformers
 import base64
 import json
 import re
 from PIL import Image, ImageDraw
 from transformers.models.qwen2_vl.image_processing_qwen2_vl_fast import smart_resize as qwen_smart_resize
 MODEL_ID = "osunlp/GUI-Drag-7B"
 BASE_URL = "http://localhost:8000/v1"
 FN_CALL_TEMPLATE = """You are a helpful assistant.
 # Tools
 You may call one or more functions to assist with the user query.
 You are provided with function signatures within <tools></tools> XML tags:
 <tools>
+{{"type": "function", "function": {{"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.
+* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.
+* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.
+* The screen's resolution is {width}x{height}.
+* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.
+* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.
+* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {{"properties": {{"action": {{"description": "The action to perform. The available actions are:
+* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.
+* `type`: Type a string of text on the keyboard.
+* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.
+* `left_click`: Click the left mouse button.
+* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.
+* `right_click`: Click the right mouse button.
+* `middle_click`: Click the middle mouse button.
+* `double_click`: Double-click the left mouse button.
+* `scroll`: Performs a scroll of the mouse scroll wheel.
+* `wait`: Wait specified seconds for the change to happen.
+* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "scroll", "wait", "terminate"], "type": "string"}}, "keys": {{"description": "Required only by `action=key`.", "type": "array"}}, "text": {{"description": "Required only by `action=type`.", "type": "string"}}, "coordinate": {{"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move`, `action=left_click_drag`, `action=left_click`, `action=right_click`, `action=double_click`.", "type": "array"}}, "pixels": {{"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll`.", "type": "number"}}, "time": {{"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}}, "status": {{"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}}}, "required": ["action"], "type": "object"}}}}}}
 </tools>
 For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
 <tool_call>
 </tool_call>
 """
 IMAGE_PATH = Path("demo.png")
 INSTRUCTION = "Drag to select the last sentence."
 def encode_image(image: Image) -> str:
     """Encode PIL image to base64 string"""
     output_buffer = io.BytesIO()
     base64_str = base64.b64encode(byte_data).decode("utf-8")
     return base64_str
 def resize_coordinates(coord, size_pred, size_to_be_mapped):
     return (
         round(coord[0] * size_to_be_mapped[0] / size_pred[0]),
 if __name__ == "__main__":
     main()
+```
+## Citation
+If you find this work useful, please consider citing:
+```bibtex
+@article{cheng2025beyond,
+  title={Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging},
+  author={Cheng, Kanzhi and Wu, Zhiyong and Wu, Zhenyu and Sun, Qiushi and Liang, Paul Pu and Qiao, Yu and Zhang, Ming and Luo, Xiao and others},
+  journal={arXiv preprint arXiv:2601.06031},
+  year={2026}
+}
 ```
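
The diff shows the demo only in part: the hunks omit the code that reads the model's reply and the second half of `resize_coordinates`. Below is a minimal sketch of those two steps, not the repository's implementation. It assumes the reply wraps a JSON object of the form `{"name": ..., "arguments": {...}}` in `<tool_call>` tags, that `arguments` carries a `coordinate` pair, and that the y component is scaled the same way as the x component shown in the diff. `parse_tool_calls`, `rescale`, and the sample reply are hypothetical.

```python
import json
import re

# Matches the JSON payload between <tool_call> tags (assumed reply format).
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(reply: str) -> list[dict]:
    """Return every well-formed tool call found in the reply text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(reply):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing
    return calls


def rescale(coord, size_pred, size_orig):
    """Map an (x, y) point from the resized image back to the original image."""
    return (
        round(coord[0] * size_orig[0] / size_pred[0]),
        round(coord[1] * size_orig[1] / size_pred[1]),
    )


# Made-up reply; a real reply may contain different actions or several calls.
reply = (
    "<tool_call>\n"
    '{"name": "computer_use", "arguments": '
    '{"action": "left_click_drag", "coordinate": [100, 50]}}\n'
    "</tool_call>"
)

calls = parse_tool_calls(reply)
# The model saw a 1000x500 image; the original screenshot is 2000x1000.
point = rescale(calls[0]["arguments"]["coordinate"], (1000, 500), (2000, 1000))
print(point)  # (200, 100)
```

Rescaling matters because the screenshot is resized before it reaches the model (the demo imports `smart_resize` for this), so predicted coordinates refer to the resized image and must be mapped back before they are drawn or executed.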