---
language:
- en
license: mit
library_name: mlx
pipeline_tag: image-text-to-text
base_model: mPLUG/GUI-Owl-7B
tags:
- mlx
- mlx-vlm
- safetensors
- apple-silicon
- conversational
- gui
- vision-language-model
- qwen2_5_vl
- gui-owl
- mobile-agent-v3
- computer-use
- grounding
- osworld
- screenspot
- bf16
---

# GUI-Owl-7B bf16

This is an MLX conversion of [mPLUG/GUI-Owl-7B](https://huggingface.co/mPLUG/GUI-Owl-7B), optimized for Apple Silicon.

GUI-Owl is a GUI automation model family developed as part of the Mobile-Agent-V3 project. Upstream, it is positioned for screen understanding, GUI grounding, and agentic action planning across benchmark suites such as ScreenSpot and OSWorld-style tasks.

This MLX artifact was converted with `mlx-vlm` and validated locally with both `mlx_vlm` prompt-packet checks and `vllm-mlx` OpenAI-compatible serve checks.

## Conversion Details

| Field | Value |
|---|---|
| Upstream model | `mPLUG/GUI-Owl-7B` |
| Artifact type | `bf16 MLX conversion` |
| Source posture | `direct upstream conversion` |
| Conversion tool | `mlx_vlm.convert` via `mlx-vlm 0.3.12` |
| Python | `3.11.14` |
| MLX | `0.31.0` |
| Transformers | `5.2.0` |
| Validation backend | `vllm-mlx (phase/p1 @ 8a5d41b)` |
| Quantization | `bf16` |
| Group size | `n/a` |
| Quantization mode | `n/a` |
| Artifact size | `15.47G` |
| Template repair | `tokenizer_config.json["chat_template"]` was re-injected after conversion |

Additional notes:

- Direct upstream conversion from `mPLUG/GUI-Owl-7B` succeeded on `mlx-vlm 0.3.12`; no local source mirror was required.
- `chat_template.json`, `chat_template.jinja`, and `tokenizer_config.json["chat_template"]` were aligned for downstream compatibility checks.
- Root-level `preprocessor_config.json` and `processor_config.json` are present intentionally for multimodal detection compatibility.
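
The template alignment mentioned above can be spot-checked with a short script. This is a minimal sketch, not part of any released tooling; it assumes the conventional Hugging Face layout where `chat_template.json` stores the template under a `"chat_template"` key, mirroring the same key in `tokenizer_config.json`:

```python
import json
from pathlib import Path

def templates_aligned(repo_dir: str) -> bool:
    """Return True when chat_template.json and tokenizer_config.json
    carry byte-identical chat template strings."""
    repo = Path(repo_dir)
    # Standalone template file: {"chat_template": "..."}
    with open(repo / "chat_template.json") as f:
        standalone = json.load(f)["chat_template"]
    # Template embedded in the tokenizer config under the same key
    with open(repo / "tokenizer_config.json") as f:
        embedded = json.load(f)["chat_template"]
    return standalone == embedded
```

Run it against a local snapshot of the repo; a `False` result means the re-injection step above needs to be repeated.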

## Validation

This artifact passed local validation in this workspace:

- `mlx_vlm` prompt-packet validation: `PASS`
- `vllm-mlx` OpenAI-compatible serve validation: `PASS`

Local validation notes:

- This family stayed on the original Track A packet; no ShowUI-style packet split was required.
- Grounding returned the correct object shape, but coordinates were not normalized to the requested `0-1000` grid.
- On one serve-path check, the streamed answer drifted into Chinese even though the non-streamed answer stayed in correct English.
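
If downstream tooling expects the `0-1000` grid, one workaround is to rescale the model's raw coordinates yourself. A minimal sketch, assuming the model emits absolute pixel `(x, y)` points and the screenshot dimensions are known; `to_grid_1000` is a hypothetical helper, not part of this repo:

```python
def to_grid_1000(x_px: float, y_px: float, width: int, height: int) -> tuple[int, int]:
    """Rescale absolute pixel coordinates onto a 0-1000 grid."""
    gx = round(x_px / width * 1000)
    gy = round(y_px / height * 1000)
    # Clamp to guard against points predicted slightly outside the image
    return min(max(gx, 0), 1000), min(max(gy, 0), 1000)

# Center of a 1920x1080 screenshot maps to (500, 500)
print(to_grid_1000(960, 540, 1920, 1080))
```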

## Performance

- Artifact size on disk: `15.47G`
- Local fixed-packet `mlx_vlm` validation used about `18.12 GB` peak memory
- Local `vllm-mlx` serve validation completed in about `22.14s` (non-streamed) and `23.39s` (streamed)

These are local validation measurements, not a full benchmark suite.

## Usage

### Install

```bash
pip install -U mlx-vlm
```

### CLI

```bash
python -m mlx_vlm.generate \
  --model mlx-community/GUI-Owl-7B-bf16 \
  --image path/to/image.png \
  --prompt "Describe the visible controls on this screen in five short bullet points." \
  --max-tokens 256 \
  --temperature 0.0
```

### Python

```python
from mlx_vlm import load, generate

model, processor = load("mlx-community/GUI-Owl-7B-bf16")
result = generate(
    model,
    processor,
    prompt="Describe the visible controls on this screen in five short bullet points.",
    image="path/to/image.png",
    max_tokens=256,
    temp=0.0,
)
print(result.text)
```

### vllm-mlx Serve

```bash
python -m vllm_mlx.cli serve mlx-community/GUI-Owl-7B-bf16 --mllm --localhost --port 8000
```
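
Once the server is up it speaks the standard OpenAI chat-completions protocol, so any OpenAI-compatible client works. A minimal sketch of the request payload; the base64 data-URL image encoding shown here is the standard OpenAI multimodal message shape, not anything vllm-mlx-specific:

```python
import base64

def build_request(image_path: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one inline image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "mlx-community/GUI-Owl-7B-bf16",
        "messages": [{
            "role": "user",
            "content": [
                # Image first, then the text instruction
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 256,
        "temperature": 0.0,
    }

# POST this as JSON to http://localhost:8000/v1/chat/completions
```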

## Links

- Upstream model: [mPLUG/GUI-Owl-7B](https://huggingface.co/mPLUG/GUI-Owl-7B)
- Paper: [Mobile-Agent-v3: Foundamental Agents for GUI Automation](https://arxiv.org/abs/2508.15144)
- Technical PDF: [Mobile-Agent-V3 Technical Report](https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/MobileAgentV3_Tech.pdf)
- GitHub: [X-PLUG/MobileAgent](https://github.com/X-PLUG/MobileAgent)
- Base model lineage: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- MLX framework: [ml-explore/mlx](https://github.com/ml-explore/mlx)
- mlx-vlm: [Blaizzy/mlx-vlm](https://github.com/Blaizzy/mlx-vlm)

## Other Quantizations

Planned sibling repos in this wave:

- [`mlx-community/GUI-Owl-7B-bf16`](https://huggingface.co/mlx-community/GUI-Owl-7B-bf16) - this model
- [`mlx-community/GUI-Owl-7B-6bit`](https://huggingface.co/mlx-community/GUI-Owl-7B-6bit)

## Notes and Limitations

- This card reports local MLX conversion and validation results only.
- Upstream benchmark claims belong to the original GUI-Owl family and were not re-run here unless explicitly stated.
- This family is better aligned to the Track A packet than ShowUI, but local validation still showed weak structured-action targeting and grounding normalization issues.
- Streamed response quality can diverge from the non-stream path even when the serve path itself stays healthy.

## Citation

If you use this MLX conversion, please also cite the original GUI-Owl work:

```bibtex
@misc{ye2025mobileagentv3foundamentalagentsgui,
      title={Mobile-Agent-v3: Foundamental Agents for GUI Automation},
      author={Jiabo Ye and Xi Zhang and Haiyang Xu and Haowei Liu and Junyang Wang and Zhaoqing Zhu and Ziwei Zheng and Feiyu Gao and Junjie Cao and Zhengxi Lu and Jitong Liao and Qi Zheng and Fei Huang and Jingren Zhou and Ming Yan},
      year={2025},
      eprint={2508.15144},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.15144},
}
```

## License

This repo follows the upstream model license: MIT.
See the upstream model card for the authoritative license details:
[mPLUG/GUI-Owl-7B](https://huggingface.co/mPLUG/GUI-Owl-7B).