UGround-V1-2B bf16

This is an MLX conversion of osunlp/UGround-V1-2B, optimized for Apple Silicon.

UGround is a GUI visual grounding model built on Qwen2-VL; upstream frames it around point-based grounding of screen elements. This refreshed MLX artifact is intended to replace the stale mlx-community/UGround-V1-2B conversion as the structurally trustworthy Track E reference row.

This MLX artifact was converted with mlx-vlm, structurally triaged locally, and checked for basic runtime viability with direct mlx_vlm probes.

Conversion Details

Field               Value
Upstream model      osunlp/UGround-V1-2B
Artifact type       bf16 MLX conversion
Conversion tool     mlx_vlm.convert via mlx-vlm 0.3.12
Python              3.11.14
MLX                 0.31.0
Validation backend  vllm-mlx (phase/p1 @ 48b51ed)
Quantization        bf16
Group size          n/a
Quantization mode   n/a
Artifact size       4.5 GB
Template repair     tokenizer_config.json["chat_template"] re-injected from chat_template.jinja after conversion

Additional notes:

  • Root-level processor_config.json is present. This is the key structural fix relative to the stale mlx-community/UGround-V1-2B artifact.
  • chat_template.jinja, chat_template.json["chat_template"], and tokenizer_config.json["chat_template"] were verified to match exactly after repair.
  • Local multimodal detection now passes on this refreshed artifact.
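The three-way template check above can be sketched as follows. This is a minimal illustration, not the triage tooling actually used; the file names are the ones listed in this card, and the repo directory layout is assumed to be a standard local snapshot:

```python
import json
from pathlib import Path


def chat_templates_match(repo: Path) -> bool:
    """Return True if all three tokenizer-visible chat template copies agree exactly."""
    jinja = (repo / "chat_template.jinja").read_text()
    from_json = json.loads((repo / "chat_template.json").read_text())["chat_template"]
    from_tok = json.loads((repo / "tokenizer_config.json").read_text())["chat_template"]
    return jinja == from_json == from_tok
```

Running this against the artifact root should return True after the template repair described above.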

Validation

This artifact passed local structural triage in this workspace:

  • root packaging / multimodal structure: PASS
  • tokenizer-visible template repair: PASS
  • safetensors and index structure: PASS
  • minimum runtime viability: PASS with caution
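The structural checks above can be approximated with a short script. This is a hedged sketch, not the actual triage harness: the REQUIRED file list and the safetensors index name are assumptions based on the typical MLX artifact layout described in this card:

```python
import json
from pathlib import Path

# Root-level files this card treats as structurally required (assumed list).
REQUIRED = [
    "config.json",
    "processor_config.json",   # the key fix vs. the stale public artifact
    "tokenizer_config.json",
    "chat_template.jinja",
]


def triage(repo: Path) -> list[str]:
    """Return a list of structural problems; an empty list means triage-pass."""
    problems = [f"missing {name}" for name in REQUIRED if not (repo / name).exists()]
    index = repo / "model.safetensors.index.json"
    if index.exists():
        # Every shard named in the index must exist on disk.
        weight_map = json.loads(index.read_text())["weight_map"]
        for shard in sorted(set(weight_map.values())):
            if not (repo / shard).exists():
                problems.append(f"index references missing shard {shard}")
    return problems
```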

Local notes:

  • source posture: fresh-source conversion from osunlp/UGround-V1-2B
  • root structure: corrected relative to the stale public MLX artifact
  • structural triage verdict: triage-pass
  • quantization status: still blocked pending stronger semantic evidence

Important limitation:

  • the refreshed artifact is structurally sound, but the first local grounding/classification probes were weak
  • this repo should be treated as the authoritative fresh MLX reference artifact for Track E, not yet as a recommended winning row for GUI grounding

Usage

Install

pip install -U mlx-vlm

CLI

python -m mlx_vlm.generate \
  --model mlx-community/UGround-V1-2B-bf16 \
  --image path/to/image.png \
  --prompt "Your task is to help the user identify the precise coordinates (x, y) of a specific area or element on the screen based on a description. Your response should be a single string (x, y) corresponding to the point of interest on a 0-1000 grid. Description: API Host input field. Answer:" \
  --max-tokens 64 \
  --temperature 0.0

Python

from mlx_vlm import load, generate

model, processor = load("mlx-community/UGround-V1-2B-bf16")
result = generate(
    model,
    processor,
    prompt=(
        "Your task is to help the user identify the precise coordinates (x, y) "
        "of a specific area or element on the screen based on a description. "
        "Your response should be a single string (x, y) corresponding to the "
        "point of interest on a 0-1000 grid. Description: API Host input field. "
        "Answer:"
    ),
    image="path/to/image.png",
    max_tokens=64,
    temp=0.0,
)
print(result.text)
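UGround answers on a 0-1000 grid, so the generated string still needs to be mapped back to image pixels before it can drive a click. A minimal post-processing sketch (the regex and the rounding policy are assumptions, not part of the upstream spec):

```python
import re


def grid_to_pixels(answer: str, width: int, height: int) -> tuple[int, int]:
    """Parse a '(x, y)' answer on the 0-1000 grid into pixel coordinates."""
    match = re.search(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", answer)
    if match is None:
        raise ValueError(f"no (x, y) point found in: {answer!r}")
    gx, gy = int(match.group(1)), int(match.group(2))
    return round(gx / 1000 * width), round(gy / 1000 * height)
```

For example, an answer of "(500, 250)" on a 1920x1080 screenshot maps to pixel (960, 270).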


Other Quantizations

Not published from this Track E lane:

  • 6-bit: not authorized
  • 4-bit: not authorized

If a later operator decision reopens quantization, those rows should be derived from this refreshed bf16 artifact rather than from the stale public MLX repo.

Notes and Limitations

  • This card reports local MLX conversion and structural-triage results only.
  • Upstream benchmark tables belong to the original UGround family and were not re-run here.
  • The older mlx-community/UGround-V1-2B artifact should be treated as stale comparison context, not as the authoritative root for later quantization work.
  • First local semantic probes were weak enough that quantization remained blocked after triage.

Citation

If you use this MLX conversion, please also cite the original UGround work:

@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

License

This repo follows the upstream model license: Apache 2.0. See the upstream model card for the authoritative license details: osunlp/UGround-V1-2B.
