Testing a QLoRA adaptor for allenai/Molmo2-4B that leverages Molmo2's numerically grounded pointing capability so the VLM can output action vectors. Trained on a custom dataset of 1K+ images: reubk/Molmo2toVLA-Mouse.
Currently, the VLA is trained on single-image inputs with the following prompt:
Point to the {target} and determine the action to be taken by the camera to align the centre of the image with it.
The model is trained to respond in a simple format containing the action vector (dx, dy), which is meant to rotate the camera's perspective:
The {target} in the image is at \<points coords="1 1 533 523"\>{target}\</points\> while the centre of the image is at \<points coords="1 1 500 500"\>centre of image\</points\>. The action to be taken is therefore (-33, -23)
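The arithmetic implied by the example above can be sketched as follows (a minimal illustration assuming the action is simply the centre coordinate minus the target coordinate, per axis, in the same units as the point annotations; the helper name is hypothetical):

```python
def compute_action(target_xy, centre_xy):
    """Action vector that moves the target toward the image centre:
    action = centre - target, computed per axis."""
    dx = centre_xy[0] - target_xy[0]
    dy = centre_xy[1] - target_xy[1]
    return (dx, dy)

# From the example: target at (533, 523), centre at (500, 500)
print(compute_action((533, 523), (500, 500)))  # (-33, -23)
```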
The action vector can then be parsed from the model output and passed to actuation code.
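As a sketch, the trailing (dx, dy) tuple can be extracted from the generated text with a regular expression (the pattern and helper below are illustrative, not part of the training code):

```python
import re

def parse_action(text):
    """Extract the trailing (dx, dy) action tuple from the model's output.
    Returns None if no action tuple is found."""
    match = re.search(r"\((-?\d+),\s*(-?\d+)\)", text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))

output = (
    'The {target} in the image is at <points coords="1 1 533 523">{target}</points> '
    'while the centre of the image is at <points coords="1 1 500 500">centre of image</points>. '
    'The action to be taken is therefore (-33, -23)'
)
print(parse_action(output))  # (-33, -23)
```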
Custom NF4 quantisation is applied to the LLM backbone of the base model before loading the LoRA adaptor:
```python
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
import torch
import re
from PIL import Image
import requests
from peft import PeftModel

# NF4 quantisation for the LLM backbone; the vision backbone and output
# layers are skipped and kept in higher precision
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=[
        # Module names can also be relative like "ff_norm",
        # which would apply to all such layers
        "model.vision_backbone", "model.transformer.ff_out", "model.transformer.ln_f"
    ]
)

model_id = "allenai/Molmo2-4B"

# load the processor
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.float16,
    device_map="auto",
    token=True
)

# load the quantised base model
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype=torch.float16,
    device_map="auto",
    quantization_config=nf4_config,
    token=True
)

# attach the LoRA adaptor
model = PeftModel.from_pretrained(model, "path_to_lora_adaptor")
```
My next steps are to evaluate its performance on sample runs, collate actuation data from before and after the VLA executes an action, and run RL on that data.