# PitVQA Spatial Model
A spatial localization vision-language model for pituitary surgery, specialized in point and bounding box detection of surgical instruments and anatomical structures.
## Model Description
This model is a LoRA fine-tune of Qwen2-VL-2B-Instruct for spatial localization tasks in surgical images. It emits structured coordinate tags for precise point and bounding-box localization.
## Capabilities

| Task | Output Format | Example |
|---|---|---|
| Point | `<point x='X' y='Y'>target</point>` | `<point x='75.8' y='75.1'>suction device</point>` |
| BBox | `<box x1='X1' y1='Y1' x2='X2' y2='Y2'>target</box>` | `<box x1='20' y1='30' x2='60' y2='70'>tumor</box>` |

Coordinates are normalized to the [0, 100] range.
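Because outputs live in the normalized [0, 100] space, mapping them back to pixels is a simple rescale. A minimal helper (illustrative, not part of the model API):

```python
def to_pixels(x_norm, y_norm, width, height):
    """Map normalized [0, 100] coordinates to pixel coordinates."""
    return x_norm / 100.0 * width, y_norm / 100.0 * height

# e.g. a point at (75.8, 75.1) on a 1280x720 frame
x_px, y_px = to_pixels(75.8, 75.1, 1280, 720)  # -> (970.24, 540.72)
```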
## Usage

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Attach the spatial LoRA adapter
model = PeftModel.from_pretrained(base, "mmrech/pitvqa-qwen2vl-spatial")
```
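If you prefer standalone weights over the base-plus-adapter setup (essentially what the merged model listed under Related Models provides), the adapter can be folded into an unquantized base. A sketch, assuming enough memory for bf16 weights:

```python
# Optional: merge the adapter into an unquantized base for standalone weights.
# Merging into a 4-bit quantized model is not supported, so reload in bf16 first.
base_fp = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base_fp, "mmrech/pitvqa-qwen2vl-spatial").merge_and_unload()
```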
### Point Localization

```python
from PIL import Image

image = Image.open("surgical_frame.jpg")

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Point to the suction device in this surgical image."}
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the echoed prompt
response = processor.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
# Output: <point x='75.8' y='75.1'>suction device</point>
```
### Bounding Box Detection

```python
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Draw a bounding box around the tumor region."}
]}]
# Same inference pattern as above...
# Output: <box x1='30' y1='30' x2='70' y2='70'>tumor region</box>
```
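To sanity-check a prediction visually, rescale the normalized box and overlay it on the frame. A minimal sketch with PIL (illustrative only; `draw_box` is not part of this repo):

```python
from PIL import ImageDraw

def draw_box(image, box_norm, label):
    """Overlay a normalized [0, 100] box on a PIL image."""
    w, h = image.size
    # Even indices (x1, x2) scale by width, odd indices (y1, y2) by height
    x1, y1, x2, y2 = [v / 100.0 * (w if i % 2 == 0 else h) for i, v in enumerate(box_norm)]
    draw = ImageDraw.Draw(image)
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(y1 - 12, 0)), label, fill="red")
    return image

annotated = draw_box(image.copy(), [30, 30, 70, 70], "tumor region")
```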
### Coordinate Extraction

```python
import re

def extract_point(text):
    """Parse a <point> tag into (x, y), or (None, None) if absent."""
    match = re.search(r"<point x='([\d.]+)' y='([\d.]+)'>", text)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None, None

def extract_bbox(text):
    """Parse a <box> tag into [x1, y1, x2, y2], or None if absent."""
    match = re.search(r"<box x1='([\d.]+)' y1='([\d.]+)' x2='([\d.]+)' y2='([\d.]+)'>", text)
    if match:
        return [float(match.group(i)) for i in range(1, 5)]
    return None

x, y = extract_point(response)  # normalized coordinates in [0, 100]
```
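Both helpers return sentinels when no tag is found, so guard before unpacking a box:

```python
bbox = extract_bbox(response)
if bbox is not None:
    x1, y1, x2, y2 = bbox  # normalized [0, 100]
```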
## Supported Targets

### Surgical Instruments
- Suction device
- Curette
- Drill
- Forceps
- Scissors
### Anatomical Structures
- Tumor
- Pituitary gland
- Sellar floor
- Sphenoid sinus
- Carotid artery
## Training Details

- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Method: LoRA (r=16, alpha=32)
- Trainable Parameters: ~18M (0.9% of base)
- Dataset: mmrech/pitvqa-comprehensive-spatial
- Training: SFT with TRL (configuration sketch below)
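A plausible reconstruction of the adapter configuration with peft: only `r` and `alpha` come from this card; the target modules and dropout are assumptions, not published settings.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                # from the card
    lora_alpha=32,       # from the card
    lora_dropout=0.05,   # assumed; not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```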
## Related Models
- Unified Model: pitvqa-qwen2vl-unified-v2 - Multi-task (spatial + classification)
- Merged Model: pitvqa-qwen2vl-merged - Ready-to-use deployment
## Citation

```bibtex
@misc{pitvqa2026,
  title={PitVQA: Multi-Task Vision-Language Model for Pituitary Surgery},
  author={Matheus Rech},
  year={2026},
  url={https://huggingface.co/mmrech/pitvqa-qwen2vl-spatial}
}
```
## License
Apache 2.0