# PitVQA Spatial Model
A spatial localization vision-language model for pituitary surgery, specialized in point and bounding box detection of surgical instruments and anatomical structures.
## Model Description
This model is a LoRA fine-tune of Qwen2-VL-2B-Instruct for spatial localization tasks in surgical images. It emits structured coordinate tags for precise point and bounding-box localization.
## Capabilities

| Task | Output Format | Example |
|---|---|---|
| Point | `<point x='X' y='Y'>target</point>` | `<point x='75.8' y='75.1'>suction device</point>` |
| BBox | `<box x1='X1' y1='Y1' x2='X2' y2='Y2'>target</box>` | `<box x1='20' y1='30' x2='60' y2='70'>tumor</box>` |

Coordinates are normalized to the [0, 100] range.
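Because outputs live in the normalized [0, 100] space, mapping them back to pixels is a simple rescale. A minimal helper (illustrative, not part of the model API):

```python
def to_pixels(x_norm, y_norm, width, height):
    """Map normalized [0, 100] coordinates to pixel coordinates."""
    return x_norm / 100.0 * width, y_norm / 100.0 * height

# e.g. a point at (75.8, 75.1) on a 1280x720 frame
x_px, y_px = to_pixels(75.8, 75.1, 1280, 720)  # -> (970.24, 540.72)
```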
## Usage

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Attach the spatial LoRA adapter
model = PeftModel.from_pretrained(base, "mmrech/pitvqa-qwen2vl-spatial")
```
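If you prefer standalone weights over the base-plus-adapter setup (essentially what the merged model listed under Related Models provides), the adapter can be folded into an unquantized base. A sketch, assuming enough memory for bf16 weights:

```python
# Optional: merge the adapter into an unquantized base for standalone weights.
# Merging into a 4-bit quantized model is not supported, so reload in bf16 first.
base_fp = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base_fp, "mmrech/pitvqa-qwen2vl-spatial").merge_and_unload()
```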
### Point Localization

```python
from PIL import Image

image = Image.open("surgical_frame.jpg")

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Point to the suction device in this surgical image."}
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the echoed prompt
response = processor.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
# Output: <point x='75.8' y='75.1'>suction device</point>
```
### Bounding Box Detection

```python
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Draw a bounding box around the tumor region."}
]}]
# Same inference pattern as above...
# Output: <box x1='30' y1='30' x2='70' y2='70'>tumor region</box>
```
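To sanity-check a prediction visually, rescale the normalized box and overlay it on the frame. A minimal sketch with PIL (illustrative only; `draw_box` is not part of this repo):

```python
from PIL import ImageDraw

def draw_box(image, box_norm, label):
    """Overlay a normalized [0, 100] box on a PIL image."""
    w, h = image.size
    # Even indices (x1, x2) scale by width, odd indices (y1, y2) by height
    x1, y1, x2, y2 = [v / 100.0 * (w if i % 2 == 0 else h) for i, v in enumerate(box_norm)]
    draw = ImageDraw.Draw(image)
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(y1 - 12, 0)), label, fill="red")
    return image

annotated = draw_box(image.copy(), [30, 30, 70, 70], "tumor region")
```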
### Coordinate Extraction

```python
import re

def extract_point(text):
    """Parse a <point> tag into (x, y), or (None, None) if absent."""
    match = re.search(r"<point x='([\d.]+)' y='([\d.]+)'>", text)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None, None

def extract_bbox(text):
    """Parse a <box> tag into [x1, y1, x2, y2], or None if absent."""
    match = re.search(r"<box x1='([\d.]+)' y1='([\d.]+)' x2='([\d.]+)' y2='([\d.]+)'>", text)
    if match:
        return [float(match.group(i)) for i in range(1, 5)]
    return None

x, y = extract_point(response)  # normalized coordinates in [0, 100]
```
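Both helpers return sentinels when no tag is found, so guard before unpacking a box:

```python
bbox = extract_bbox(response)
if bbox is not None:
    x1, y1, x2, y2 = bbox  # normalized [0, 100]
```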
## Supported Targets

### Surgical Instruments
- Suction device
- Curette
- Drill
- Forceps
- Scissors
### Anatomical Structures
- Tumor
- Pituitary gland
- Sellar floor
- Sphenoid sinus
- Carotid artery
## Training Details

- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Method: LoRA (r=16, alpha=32)
- Trainable Parameters: ~18M (0.9% of base)
- Dataset: mmrech/pitvqa-comprehensive-spatial
- Training: SFT with TRL (configuration sketch below)
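A plausible reconstruction of the adapter configuration with peft: only `r` and `alpha` come from this card; the target modules and dropout are assumptions, not published settings.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                # from the card
    lora_alpha=32,       # from the card
    lora_dropout=0.05,   # assumed; not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```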
## Related Models
- Unified Model: pitvqa-qwen2vl-unified-v2 - Multi-task (spatial + classification)
- Merged Model: pitvqa-qwen2vl-merged - Ready-to-use deployment
## Citation

```bibtex
@misc{pitvqa2026,
  title={PitVQA: Multi-Task Vision-Language Model for Pituitary Surgery},
  author={Matheus Rech},
  year={2026},
  url={https://huggingface.co/mmrech/pitvqa-qwen2vl-spatial}
}
```
## License
Apache 2.0