Testing a QLoRA adaptor for allenai/Molmo2-4B that leverages Molmo2's numerically grounded pointing capability so the VLM can output action vectors. Trained on a custom dataset of 1K+ images: reubk/Molmo2toVLA-Mouse.
Currently, the VLA is trained on single-image inputs with the following prompt:
Point to the {target} and determine the action to be taken by the camera to align the centre of the image with it.
The model is trained to respond in a simple format containing the action vector (dx, dy), which is meant to rotate the camera's perspective:
The {target} in the image is at \<points coords="1 1 533 523"\>{target}\</points\> while the centre of the image is at \<points coords="1 1 500 500"\>centre of image\</points\>. The action to be taken is therefore (-33, -23)
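The arithmetic implied by the example above can be sketched as follows (a minimal illustration assuming the action is simply the centre coordinate minus the target coordinate, per axis, in the same units as the point annotations; the helper name is hypothetical):

```python
def compute_action(target_xy, centre_xy):
    """Action vector that moves the target toward the image centre:
    action = centre - target, computed per axis."""
    dx = centre_xy[0] - target_xy[0]
    dy = centre_xy[1] - target_xy[1]
    return (dx, dy)

# From the example: target at (533, 523), centre at (500, 500)
print(compute_action((533, 523), (500, 500)))  # (-33, -23)
```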
The action vector can then be parsed from the model output and passed to actuation code.
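As a sketch, the trailing (dx, dy) tuple can be extracted from the generated text with a regular expression (the pattern and helper below are illustrative, not part of the training code):

```python
import re

def parse_action(text):
    """Extract the trailing (dx, dy) action tuple from the model's output.
    Returns None if no action tuple is found."""
    match = re.search(r"\((-?\d+),\s*(-?\d+)\)", text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))

output = (
    'The {target} in the image is at <points coords="1 1 533 523">{target}</points> '
    'while the centre of the image is at <points coords="1 1 500 500">centre of image</points>. '
    'The action to be taken is therefore (-33, -23)'
)
print(parse_action(output))  # (-33, -23)
```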
Custom NF4 quantisation is applied to the LLM backbone of the base model before loading the LoRA adaptor:
```python
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
import torch
import re
from PIL import Image
import requests
from peft import PeftModel

# NF4 quantisation for the LLM backbone; the vision backbone and output
# layers are skipped and kept in higher precision
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=[
        # Module names can also be relative like "ff_norm",
        # which would apply to all such layers
        "model.vision_backbone", "model.transformer.ff_out", "model.transformer.ln_f"
    ]
)

model_id = "allenai/Molmo2-4B"

# load the processor
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.float16,
    device_map="auto",
    token=True
)

# load the quantised base model
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.float16,
    device_map="auto",
    quantization_config=nf4_config,
    token=True
)

# attach the LoRA adaptor
model = PeftModel.from_pretrained(model, "path_to_lora_adaptor")
```
My next steps are to evaluate its performance on sample runs, collate actuation data from before and after the VLA executes an action, and run RL on that data.