PaliGemma2-3B Path-VQA

A PaliGemma2-3B vision-language model fine-tuned with QLoRA on the Path-VQA dataset for medical pathology visual question answering.

Given a pathology slide image and a question, the model generates an answer about the tissue, cells, or pathological findings visible in the image.

What is Path-VQA?

Path-VQA is a medical visual question answering dataset containing 32,632 question-answer pairs derived from 5,004 pathology images. The images include histology slides, hematoxylin and eosin (H&E) stains, immunohistochemistry stains, and other pathological preparations sourced from medical textbooks and the PEIR digital library.

Questions range from simple identification ("What type of cell is shown?") to complex reasoning about pathological processes ("What do the areas of white chalky deposits represent?").

Training Details

  • Base model: google/paligemma2-3b-pt-224
  • Method: SFT with QLoRA (4-bit NF4, LoRA r=16, alpha=32)
  • Dataset: flaviagiammarino/path-vqa (train split)
  • Training examples: 19,654 image-question-answer triplets
  • Trainable parameters: 23.7M of 3.05B total (0.78%)
  • Hardware: NVIDIA RTX 5090 (32GB VRAM)
  • Training time: ~48 minutes
  • Epochs: 1
  • Effective batch size: 16 (2 per device x 8 gradient accumulation)
  • Learning rate: 2e-5 (cosine schedule, 50 warmup steps)
  • Precision: bf16 compute, 4-bit NF4 base weights
  • Framework: Transformers 5.3.0 + PEFT 0.18.1 + bitsandbytes
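The settings above can be expressed roughly as the following configuration sketch. Only r, alpha, the NF4 quantization, and the optimizer schedule come from the table; the LoRA target modules and output directory name are assumptions:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen base weights, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings from the table; target_modules is an assumption
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Effective batch size 16 = 2 per device x 8 accumulation steps
training_args = TrainingArguments(
    output_dir="paligemma2-pathvqa-qlora",  # assumed name
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    num_train_epochs=1,
    bf16=True,
)
```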

Training Curves

Training Metrics

  • Training Loss: Dropped from 3.5 to ~1.3 over 1,228 steps, showing clear learning
  • Learning Rate: Cosine decay from 2e-5 to 0 with 50-step warmup
  • Gradient Norm: Started around 2.0, decreased to ~1.0 mid-training, then gradually increased late in training (normal for single-epoch runs as the model encounters harder examples)
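The learning-rate curve described above follows the standard linear-warmup-then-cosine shape. A minimal sketch of the schedule, using the values from the table (2e-5 peak, 50 warmup steps, 1,228 total steps):

```python
import math

PEAK_LR = 2e-5
WARMUP_STEPS = 50
TOTAL_STEPS = 1228

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(50))    # peak LR at the end of warmup
print(lr_at(1228))  # decays to ~0 at the final step
```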

Example Use Cases

This model can answer questions about pathology images such as:

  • "Where are liver stem cells (oval cells) located?" -> "in the canals of hering"
  • "What are stained here with an immunohistochemical stain for cytokeratin 7?" -> "bile duct cells and canals of hering"
  • "What do the areas of white chalky deposits represent?" -> "foci of fat necrosis"

Usage

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from peft import PeftModel
from PIL import Image
import torch

# Load base model + adapter
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")
processor = AutoProcessor.from_pretrained("usama10/paligemma2-3b-pathvqa")

# Load an image and ask a question
image = Image.open("pathology_slide.png").convert("RGB")
prompt = "answer What type of tissue is shown in this image?"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

# generate() returns the prompt tokens followed by the answer,
# so slice off the prompt before decoding
input_len = inputs["input_ids"].shape[-1]
answer = processor.decode(outputs[0][input_len:], skip_special_tokens=True)
print(answer)

With 4-bit Quantization (lower memory)

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")

Prompt Format

PaliGemma uses a specific prompt format. For VQA tasks, prefix the question with answer:

answer What type of cell is shown in this image?

The model will generate the answer text directly.
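For cleaner application code, the prefixing and prompt-stripping can be wrapped in small helpers (the function names here are illustrative, not part of the model's API):

```python
PREFIX = "answer "

def build_prompt(question: str) -> str:
    """Prefix a raw question with the 'answer' task token PaliGemma expects."""
    return PREFIX + question.strip()

def strip_prompt(decoded: str, prompt: str) -> str:
    """Remove the echoed prompt from a decoded generation, if present."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()

print(build_prompt("What type of cell is shown in this image?"))
# answer What type of cell is shown in this image?
```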

Dataset

The Path-VQA dataset contains:

  • 19,654 training / 6,259 validation / 6,719 test question-answer pairs
  • 5,004 unique pathology images (some in CMYK format, auto-converted to RGB during training)
  • Mix of open-ended and yes/no questions covering cell identification, tissue classification, stain interpretation, and pathological process recognition
  • Sourced from medical textbooks and the PEIR digital library
  • MIT license
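The CMYK images mentioned above also need conversion at inference time, since the processor expects RGB input. A minimal sketch with Pillow:

```python
from PIL import Image

def to_rgb(image: Image.Image) -> Image.Image:
    """Convert CMYK (or any non-RGB) pathology image to RGB for the processor."""
    return image if image.mode == "RGB" else image.convert("RGB")

# Example with an in-memory CMYK image
cmyk = Image.new("CMYK", (8, 8))
print(to_rgb(cmyk).mode)  # RGB
```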

Limitations

  • Trained for 1 epoch only; additional epochs would likely improve accuracy
  • The base model (PaliGemma2-3B) uses 224x224 image resolution, which may lose fine-grained detail in high-resolution pathology slides
  • QLoRA training introduces some quantization noise compared to full-precision fine-tuning
  • This model is for research and educational purposes only and should NOT be used for clinical diagnosis
  • Performance on out-of-distribution pathology images (different staining methods, magnifications, or tissue types not in Path-VQA) may be limited
  • LoRA adapter requires the base PaliGemma2-3B model for inference