UGround-V1-2B bf16
This is an MLX conversion of osunlp/UGround-V1-2B, optimized for Apple Silicon.
UGround is a GUI visual grounding model built on Qwen2-VL; upstream frames it around point-based grounding of screen elements. This refreshed MLX artifact is intended to replace the stale mlx-community/UGround-V1-2B conversion as the structurally trustworthy Track E reference row.
This MLX artifact was converted with mlx-vlm, structurally triaged locally, and checked for basic runtime viability with direct mlx_vlm probes.
Conversion Details
| Field | Value |
|---|---|
| Upstream model | osunlp/UGround-V1-2B |
| Artifact type | bf16 MLX conversion |
| Conversion tool | mlx_vlm.convert via mlx-vlm 0.3.12 |
| Python | 3.11.14 |
| MLX | 0.31.0 |
| Validation backend | vllm-mlx (phase/p1 @ 48b51ed) |
| Quantization | bf16 |
| Group size | n/a |
| Quantization mode | n/a |
| Artifact size | 4.5G |
| Template repair | tokenizer_config.json["chat_template"] was re-injected from chat_template.jinja after conversion |
Additional notes:
- Root-level processor_config.json is present. This is the key structural fix relative to the older stale mlx-community/UGround-V1-2B artifact.
- chat_template.jinja, chat_template.json["chat_template"], and tokenizer_config.json["chat_template"] were verified to match exactly after repair.
- Local multimodal detection now passes on this refreshed artifact.
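The template repair and the three-way match check described above could look roughly like the following. This is a minimal sketch, not the script actually used for this conversion: the function names `reinject_chat_template` and `templates_match` are hypothetical, and it assumes all three files sit at the root of the converted model directory.

```python
import json
from pathlib import Path


def reinject_chat_template(model_dir: str) -> str:
    """Copy chat_template.jinja into tokenizer_config.json["chat_template"].

    Hypothetical helper: other keys in tokenizer_config.json are left untouched.
    """
    root = Path(model_dir)
    template = (root / "chat_template.jinja").read_text(encoding="utf-8")

    config_path = root / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["chat_template"] = template
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return template


def templates_match(model_dir: str) -> bool:
    """Return True only if all three template copies are identical strings."""
    root = Path(model_dir)
    from_jinja = (root / "chat_template.jinja").read_text(encoding="utf-8")
    from_json = json.loads(
        (root / "chat_template.json").read_text(encoding="utf-8")
    ).get("chat_template")
    from_tokenizer = json.loads(
        (root / "tokenizer_config.json").read_text(encoding="utf-8")
    ).get("chat_template")
    return from_jinja == from_json == from_tokenizer
```

Calling `reinject_chat_template("path/to/model")` followed by `templates_match("path/to/model")` would reproduce the repair-then-verify order noted in the table.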
Validation
This artifact passed local structural triage in this workspace:
- root packaging / multimodal structure: PASS
- tokenizer-visible template repair: PASS
- safetensors and index structure: PASS
- minimum runtime viability: PASS with caution
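For readers who want to repeat the packaging and safetensors checks on their own copy, a rough sketch follows. It is not the triage tooling used for this card: `check_structure` is a hypothetical helper, and it assumes the usual Hugging Face layout in which a sharded checkpoint ships a model.safetensors.index.json whose "weight_map" maps tensor names to shard filenames, while an unsharded one ships a single model.safetensors.

```python
import json
from pathlib import Path


def check_structure(model_dir: str) -> dict:
    """Report whether the root-level files a multimodal MLX repo needs are present."""
    root = Path(model_dir)
    report = {
        # Root-level processor config: the file missing from the stale artifact.
        "processor_config": (root / "processor_config.json").is_file(),
    }

    index_path = root / "model.safetensors.index.json"
    if index_path.is_file():
        # Sharded checkpoint: every shard named in weight_map must exist on disk.
        index = json.loads(index_path.read_text(encoding="utf-8"))
        shards = set(index.get("weight_map", {}).values())
        missing = sorted(s for s in shards if not (root / s).is_file())
        report["safetensors_index"] = bool(shards) and not missing
        report["missing_shards"] = missing
    else:
        # Unsharded checkpoint: assume a single model.safetensors file.
        report["safetensors_index"] = (root / "model.safetensors").is_file()
        report["missing_shards"] = []
    return report
```

This only confirms that files exist and that the index is internally consistent; it says nothing about runtime viability or grounding quality.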
Local notes:
- source posture: fresh-source conversion from osunlp/UGround-V1-2B
- root structure: corrected relative to the older stale public MLX artifact
- structural triage verdict: triage-pass
- quantization status: still blocked pending stronger semantic evidence
Important limitation:
- the refreshed artifact is structurally sound, but the first local grounding/classification probes were weak
- this repo should be treated as the authoritative fresh MLX reference artifact for Track E, not yet as a recommended winning row for GUI grounding
Usage
Install
pip install -U mlx-vlm
CLI
python -m mlx_vlm.generate \
--model mlx-community/UGround-V1-2B-bf16 \
--image path/to/image.png \
--prompt "Your task is to help the user identify the precise coordinates (x, y) of a specific area or element on the screen based on a description. Your response should be a single string (x, y) corresponding to the point of interest on a 0-1000 grid. Description: API Host input field. Answer:" \
--max-tokens 64 \
--temperature 0.0
Python
from mlx_vlm import load, generate
model, processor = load("mlx-community/UGround-V1-2B-bf16")
result = generate(
    model,
    processor,
    prompt=(
        "Your task is to help the user identify the precise coordinates (x, y) "
        "of a specific area or element on the screen based on a description. "
        "Your response should be a single string (x, y) corresponding to the "
        "point of interest on a 0-1000 grid. Description: API Host input field. "
        "Answer:"
    ),
    image="path/to/image.png",
    max_tokens=64,
    temp=0.0,
)
print(result.text)
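Because the prompt asks for a point on a 0-1000 grid, the returned text usually needs to be parsed and rescaled before it can drive a click. The sketch below shows one way to do that; `parse_point` and `to_pixels` are hypothetical helpers, and the regular expression assumes the model answers with a plain `(x, y)` pair of integers, which may not hold for every output.

```python
import re
from typing import Optional, Tuple


def parse_point(text: str) -> Optional[Tuple[int, int]]:
    """Extract the first "(x, y)" integer pair from model output, or None."""
    match = re.search(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def to_pixels(
    point: Tuple[int, int], width: int, height: int, grid: int = 1000
) -> Tuple[float, float]:
    """Rescale a point on the 0-`grid` coordinate grid to image pixel coordinates."""
    x, y = point
    return x / grid * width, y / grid * height


# Example: a 1920x1080 screenshot and a made-up model answer.
point = parse_point("(500, 250)")
if point is not None:
    print(to_pixels(point, width=1920, height=1080))  # (960.0, 270.0)
```

Returning `None` on a failed parse keeps malformed answers from being mistaken for a click target, which matters given the weak first grounding probes reported above.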
Links
- Upstream model: osunlp/UGround-V1-2B
- Homepage: osu-nlp-group.github.io/UGround
- Repository: OSU-NLP-Group/UGround
- Paper: Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
- Original stale MLX artifact: mlx-community/UGround-V1-2B
- MLX framework: ml-explore/mlx
- mlx-vlm: Blaizzy/mlx-vlm
Other Quantizations
Not published from this Track E lane:
- 6bit: not authorized
- 4bit: not authorized
If a later operator decision reopens quantization, those rows should be derived from this refreshed bf16 artifact rather than from the stale public MLX repo.
Notes and Limitations
- This card reports local MLX conversion and structural-triage results only.
- Upstream benchmark tables belong to the original UGround family and were not re-run here.
- The older mlx-community/UGround-V1-2B artifact should be treated as stale comparison context, not as the authoritative root for later quantization work.
- First local semantic probes were weak enough that quantization remained blocked after triage.
Citation
If you use this MLX conversion, please also cite the original UGround work:
@article{gou2024uground,
title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2410.05243},
year={2024},
url={https://arxiv.org/abs/2410.05243},
}
License
This repo follows the upstream model license: Apache 2.0. See the upstream model card for the authoritative license details: osunlp/UGround-V1-2B.