# PaliGemma2-3B Path-VQA
A PaliGemma2-3B vision-language model fine-tuned with QLoRA on the Path-VQA dataset for medical pathology visual question answering.
Given a pathology slide image and a question, the model generates an answer about the tissue, cells, or pathological findings visible in the image.
## What is Path-VQA?
Path-VQA is a medical visual question answering dataset containing 32,632 question-answer pairs derived from 5,004 pathology images. The images include histology slides, hematoxylin and eosin (H&E) stains, immunohistochemistry stains, and other pathological preparations sourced from medical textbooks and the PEIR digital library.
Questions range from simple identification ("What type of cell is shown?") to complex reasoning about pathological processes ("What do the areas of white chalky deposits represent?").
## Training Details
| Parameter | Value |
|---|---|
| Base model | google/paligemma2-3b-pt-224 |
| Method | SFT with QLoRA (4-bit NF4, LoRA r=16, alpha=32) |
| Dataset | flaviagiammarino/path-vqa (train split) |
| Training examples | 19,654 image-question-answer triplets |
| Trainable parameters | 23.7M / 3.05B total (0.78%) |
| Hardware | NVIDIA RTX 5090 (32GB VRAM) |
| Training time | ~48 minutes |
| Epochs | 1 |
| Effective batch size | 16 (2 per device x 8 gradient accumulation) |
| Learning rate | 2e-5 (cosine schedule, 50 warmup steps) |
| Precision | bf16 compute, 4-bit NF4 base weights |
| Framework | Transformers 5.3.0 + PEFT 0.18.1 + bitsandbytes |
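A quick arithmetic check ties the table together (illustrative only, not training code): the trainable fraction, the effective batch size, and the resulting optimizer-step count all follow from the numbers above.

```python
# Sanity-check the reported QLoRA training numbers (values from the table above).
trainable = 23.7e6   # LoRA adapter parameters
total = 3.05e9       # full PaliGemma2-3B parameter count
print(f"trainable fraction: {trainable / total:.2%}")  # ~0.78%

per_device, grad_accum = 2, 8
effective_batch = per_device * grad_accum
print(f"effective batch size: {effective_batch}")      # 16

train_examples = 19_654
print(f"optimizer steps: {train_examples // effective_batch}")  # 1228
```

The step count matches the ~1,228 steps visible in the training-loss curve below.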
## Training Curves
- Training Loss: Dropped from 3.5 to ~1.3 over 1,228 steps, showing clear learning
- Learning Rate: Cosine decay from 2e-5 to 0 with 50-step warmup
- Gradient Norm: Started around 2.0, decreased to ~1.0 mid-training, then gradually increased late in training (normal for single-epoch runs as the model encounters harder examples)
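The learning-rate curve can be reproduced from the standard linear-warmup plus cosine-decay formula. The sketch below assumes the common Transformers-style schedule (linear warmup to the peak rate, then cosine decay to zero) with the step counts reported above; it is an illustration of the schedule shape, not the training code.

```python
import math

def cosine_lr(step, peak_lr=2e-5, warmup_steps=50, total_steps=1228):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0))     # 0.0   (start of warmup)
print(cosine_lr(50))    # 2e-05 (peak, end of warmup)
print(cosine_lr(1228))  # ~0.0  (fully decayed)
```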
## Example Use Cases
This model can answer questions about pathology images such as:
- "Where are liver stem cells (oval cells) located?" -> "in the canals of hering"
- "What are stained here with an immunohistochemical stain for cytokeratin 7?" -> "bile duct cells and canals of hering"
- "What do the areas of white chalky deposits represent?" -> "foci of fat necrosis"
## Usage
```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from peft import PeftModel
from PIL import Image
import torch

# Load base model + adapter
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")
processor = AutoProcessor.from_pretrained("usama10/paligemma2-3b-pathvqa")

# Load an image and ask a question
image = Image.open("pathology_slide.png").convert("RGB")
prompt = "answer What type of tissue is shown in this image?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

# generate() returns the prompt tokens followed by the answer,
# so decode only the newly generated tokens
input_len = inputs["input_ids"].shape[-1]
answer = processor.decode(outputs[0][input_len:], skip_special_tokens=True)
print(answer)
```
### With 4-bit Quantization (lower memory)

```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")
```
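For intuition on the memory savings: weight storage scales with bits per parameter. The estimate below covers weights only and ignores activations, the KV cache, and quantization block overhead, so treat it as a rough lower bound.

```python
params = 3.05e9  # PaliGemma2-3B parameter count

def weight_gib(bits_per_param):
    """Approximate weight memory in GiB for a given precision."""
    return params * bits_per_param / 8 / 2**30

print(f"bf16: {weight_gib(16):.1f} GiB")  # ~5.7 GiB
print(f"nf4:  {weight_gib(4):.1f} GiB")   # ~1.4 GiB
```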
## Prompt Format

PaliGemma uses a specific prompt format. For VQA tasks, prefix the question with `answer `:

```
answer What type of cell is shown in this image?
```

The model then generates the answer text directly.
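The mapping from a raw Path-VQA question to this format can be captured in a tiny helper (hypothetical; `build_prompt` is not part of the released code):

```python
def build_prompt(question: str) -> str:
    """Prefix a raw VQA question with PaliGemma's 'answer' task token."""
    return f"answer {question.strip()}"

print(build_prompt("What type of cell is shown in this image?"))
# answer What type of cell is shown in this image?
```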
## Dataset
The Path-VQA dataset contains:
- 19,654 training / 6,259 validation / 6,719 test question-answer pairs
- 5,004 unique pathology images (some in CMYK format, auto-converted to RGB during training)
- Mix of open-ended and yes/no questions covering cell identification, tissue classification, stain interpretation, and pathological process recognition
- Sourced from medical textbooks and the PEIR digital library
- MIT license
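The CMYK-to-RGB conversion mentioned above amounts to a one-line Pillow check before preprocessing. This is a sketch of the idea, not the exact training-script code.

```python
from PIL import Image

def ensure_rgb(image: Image.Image) -> Image.Image:
    """Convert CMYK (or any other mode) images to RGB for the processor."""
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image

# Example: a synthetic CMYK image becomes RGB
cmyk = Image.new("CMYK", (64, 64))
print(ensure_rgb(cmyk).mode)  # RGB
```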
## Limitations
- Trained for 1 epoch only; additional epochs would likely improve accuracy
- The base model (PaliGemma2-3B) uses 224x224 image resolution, which may lose fine-grained detail in high-resolution pathology slides
- QLoRA training introduces some quantization noise compared to full-precision fine-tuning
- This model is for research and educational purposes only and should NOT be used for clinical diagnosis
- Performance on out-of-distribution pathology images (different staining methods, magnifications, or tissue types not in Path-VQA) may be limited
- The LoRA adapter is not standalone; it must be loaded on top of the base PaliGemma2-3B model for inference
## Model tree for usama10/paligemma2-3b-pathvqa

- Base model: google/paligemma2-3b-pt-224
- Dataset used to train usama10/paligemma2-3b-pathvqa: flaviagiammarino/path-vqa

## Evaluation results

- Final training loss on Path-VQA (self-reported): 1.280
