Finetuning for pointing, object detection task

#48

by yaneivan - opened Jan 11, 2025

Jan 11, 2025

This is a fantastic model! I really appreciate the pointing feature—it's something I’ve only seen in the Molmo model before. However, unlike Molmo, which outputs point coordinates as HTML tags in its text output, this model appears to have dedicated heads for generating points and bounding boxes, which is very impressive.

I’m particularly curious about how these new features can be fine-tuned. Do you plan to release a notebook demonstrating the process (for pointing and object detection)? A blog post explaining the model's architecture would also be incredibly helpful for understanding its unique capabilities.

Additionally, I noticed that the example code includes special methods like model.caption and model.query. Does this mean the model cannot be used like a traditional vision-language model? Is it possible to input a chat history for conversational use?

Thank you again for this amazing model!

vikhyatk

Owner Jan 12, 2025

I am working on a post detailing how the pointing/bounding box heads work, as well as scripts for finetuning. Will reply here when that's up.

The model currently cannot be used in a multi-turn conversational setting. We're focused on maximizing single-turn visual understanding, since that's what is most useful for developers building vision applications.

RonanMcGovern

Jan 13, 2025

very cool, looking forward to that vik and planning to do a vid

OJ-1

Aug 9, 2025

Did this ever happen?

gpu-poor

Sep 27, 2025

Hey @vikhyatk, any plans for publishing finetuning code for this model or moondream 3, especially for point detection?
