What should the input to the MobileVLM_stage2_Grounding model look like?

by hengyuanheng - opened Apr 23, 2025

edited Apr 23, 2025

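Hello, I downloaded the xwk123/mobilevlm_test dataset to test the capabilities of the MobileVLM_stage2_Grounding model. Specifically, I am testing the model on the dataset's ref_test subset (700 seen samples and 700 unseen samples).
However, I cannot get correct output. Could you help me figure out what is going wrong?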

Case 1:
query='Question: In the image < img>/ui_data/mobilevlm_test/test_data/unseen_data/ref_test/QQmail0_3_47_383/QQmail0_3_47_383-screen.png</ img>, 如何确定“删除”在页面上的具体位置？. \nAnswer: '
Response:
click(< ref>详情< /ref>< box>[634,232][692,268]</ box>)
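The query asks, in Chinese, how to determine the exact position of “删除” (Delete) on the page, but the control returned in this response is “详情” (Details). I also tried asking for the positions of other controls, and every query returned this same response.
(The Hugging Face discussion board automatically parses markdown, so I added a space inside tags such as < img>. Those spaces are not in my actual input.)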
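
To make this kind of mismatch easy to detect across the whole ref_test subset, a response can be parsed and its label compared with the target control. The following is a minimal sketch, assuming every output follows the `click(<ref>…</ref><box>[x1,y1][x2,y2]</box>)` pattern shown in Case 1. The names `parse_grounding_response` and `label_matches` are hypothetical helpers for illustration and are not part of the MobileVLM code.

```python
import re

# Matches outputs such as: click(<ref>详情</ref><box>[634,232][692,268]</box>)
# Optional whitespace inside the tags is tolerated, because the tags quoted in
# this post were spaced out to avoid markdown rendering.
RESPONSE_RE = re.compile(
    r"<\s*ref\s*>(?P<label>.*?)<\s*/\s*ref\s*>\s*"
    r"<\s*box\s*>\[(\d+),\s*(\d+)\]\[(\d+),\s*(\d+)\]<\s*/\s*box\s*>"
)


def parse_grounding_response(response):
    """Return (label, (x1, y1, x2, y2)), or None if the text does not match."""
    match = RESPONSE_RE.search(response)
    if match is None:
        return None
    # Group 1 is the label; groups 2 to 5 are the box coordinates.
    x1, y1, x2, y2 = (int(match.group(i)) for i in range(2, 6))
    return match.group("label"), (x1, y1, x2, y2)


def label_matches(response, target):
    """Check whether the control named in the response is the requested one."""
    parsed = parse_grounding_response(response)
    return parsed is not None and parsed[0] == target


if __name__ == "__main__":
    response = "click(< ref>详情< /ref>< box>[634,232][692,268]</ box>)"
    print(parse_grounding_response(response))
    print(label_matches(response, "删除"))
```

Run on the Case 1 response, `label_matches` reports that the returned control is not “删除”. This only compares the label text. It does not check whether the box coordinates are correct.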

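I suspected the model was not producing the expected output because I had not included few-shot examples, so I tried a version using the few-shot prompt given in the paper's appendix:
Case 2: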
image1 = '/ui_data/mobilevlm_test/test_data/seen_data/ref_test/didi0_17_36_262/didi0_17_36_262-screen.png'
image2 = '/ui_data/mobilevlm_test/test_data/seen_data/ref_test/kugou0_14_674_35_1440/kugou0_14_674_35_1440-screen.png'
image3 = '/ui_data/mobilevlm_test/test_data/seen_data/ref_test/youdao0_8_125_618_888_5477_5601/youdao0_8_125_618_888_5477_5601-screen.png'

query = fewshot + query
But the result was still: click(< ref>详情</ ref>< box>[634,232][692,268]</ box>)

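I then wondered whether my phrasing of the question was wrong, so I tried the same prompt with the xwk123/MobileVLM_stage3_AUTO-UI model, but that did not give the desired result either. Could you help me find where the problem lies?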
