# PaliGemma2-3B Path-VQA
A PaliGemma2-3B vision-language model fine-tuned with QLoRA on the Path-VQA dataset for medical pathology visual question answering.
Given a pathology slide image and a question, the model generates an answer about the tissue, cells, or pathological findings visible in the image.
## What is Path-VQA?
Path-VQA is a medical visual question answering dataset containing 32,632 question-answer pairs derived from 5,004 pathology images. The images include histology slides, hematoxylin and eosin (H&E) stains, immunohistochemistry stains, and other pathological preparations sourced from medical textbooks and the PEIR digital library.
Questions range from simple identification ("What type of cell is shown?") to complex reasoning about pathological processes ("What do the areas of white chalky deposits represent?").
## Training Details
| Parameter | Value |
|---|---|
| Base model | google/paligemma2-3b-pt-224 |
| Method | SFT with QLoRA (4-bit NF4, LoRA r=16, alpha=32) |
| Dataset | flaviagiammarino/path-vqa (train split) |
| Training examples | 19,654 image-question-answer triplets |
| Trainable parameters | 23.7M / 3.05B total (0.78%) |
| Hardware | NVIDIA RTX 5090 (32GB VRAM) |
| Training time | ~48 minutes |
| Epochs | 1 |
| Effective batch size | 16 (2 per device x 8 gradient accumulation) |
| Learning rate | 2e-5 (cosine schedule, 50 warmup steps) |
| Precision | bf16 compute, 4-bit NF4 base weights |
| Framework | Transformers 5.3.0 + PEFT 0.18.1 + bitsandbytes |
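A quick arithmetic check ties the table together (illustrative only, not training code): the trainable fraction, the effective batch size, and the resulting optimizer-step count all follow from the numbers above.

```python
# Sanity-check the reported QLoRA training numbers (values from the table above).
trainable = 23.7e6   # LoRA adapter parameters
total = 3.05e9       # full PaliGemma2-3B parameter count
print(f"trainable fraction: {trainable / total:.2%}")  # ~0.78%

per_device, grad_accum = 2, 8
effective_batch = per_device * grad_accum
print(f"effective batch size: {effective_batch}")      # 16

train_examples = 19_654
print(f"optimizer steps: {train_examples // effective_batch}")  # 1228
```

The step count matches the ~1,228 steps visible in the training-loss curve below.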
## Training Curves
- Training Loss: Dropped from 3.5 to ~1.3 over 1,228 steps, showing clear learning
- Learning Rate: Cosine decay from 2e-5 to 0 with 50-step warmup
- Gradient Norm: Started around 2.0, decreased to ~1.0 mid-training, then gradually increased late in training (normal for single-epoch runs as the model encounters harder examples)
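The learning-rate curve can be reproduced from the standard linear-warmup plus cosine-decay formula. The sketch below assumes the common Transformers-style schedule (linear warmup to the peak rate, then cosine decay to zero) with the step counts reported above; it is an illustration of the schedule shape, not the training code.

```python
import math

def cosine_lr(step, peak_lr=2e-5, warmup_steps=50, total_steps=1228):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0))     # 0.0   (start of warmup)
print(cosine_lr(50))    # 2e-05 (peak, end of warmup)
print(cosine_lr(1228))  # ~0.0  (fully decayed)
```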
## Example Use Cases
This model can answer questions about pathology images such as:
- "Where are liver stem cells (oval cells) located?" -> "in the canals of hering"
- "What are stained here with an immunohistochemical stain for cytokeratin 7?" -> "bile duct cells and canals of hering"
- "What do the areas of white chalky deposits represent?" -> "foci of fat necrosis"
## Usage
```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from peft import PeftModel
from PIL import Image
import torch

# Load base model + adapter
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")
processor = AutoProcessor.from_pretrained("usama10/paligemma2-3b-pathvqa")

# Load an image and ask a question
image = Image.open("pathology_slide.png").convert("RGB")
prompt = "answer What type of tissue is shown in this image?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

# generate() returns the prompt tokens followed by the answer,
# so decode only the newly generated tokens
input_len = inputs["input_ids"].shape[-1]
answer = processor.decode(outputs[0][input_len:], skip_special_tokens=True)
print(answer)
```
### With 4-bit Quantization (lower memory)

```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")
```
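For intuition on the memory savings: weight storage scales with bits per parameter. The estimate below covers weights only and ignores activations, the KV cache, and quantization block overhead, so treat it as a rough lower bound.

```python
params = 3.05e9  # PaliGemma2-3B parameter count

def weight_gib(bits_per_param):
    """Approximate weight memory in GiB for a given precision."""
    return params * bits_per_param / 8 / 2**30

print(f"bf16: {weight_gib(16):.1f} GiB")  # ~5.7 GiB
print(f"nf4:  {weight_gib(4):.1f} GiB")   # ~1.4 GiB
```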
## Prompt Format

PaliGemma uses a specific prompt format. For VQA tasks, prefix the question with `answer `:

```
answer What type of cell is shown in this image?
```

The model then generates the answer text directly.
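The mapping from a raw Path-VQA question to this format can be captured in a tiny helper (hypothetical; `build_prompt` is not part of the released code):

```python
def build_prompt(question: str) -> str:
    """Prefix a raw VQA question with PaliGemma's 'answer' task token."""
    return f"answer {question.strip()}"

print(build_prompt("What type of cell is shown in this image?"))
# answer What type of cell is shown in this image?
```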
## Dataset
The Path-VQA dataset contains:
- 19,654 training / 6,259 validation / 6,719 test question-answer pairs
- 5,004 unique pathology images (some in CMYK format, auto-converted to RGB during training)
- Mix of open-ended and yes/no questions covering cell identification, tissue classification, stain interpretation, and pathological process recognition
- Sourced from medical textbooks and the PEIR digital library
- MIT license
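The CMYK-to-RGB conversion mentioned above amounts to a one-line Pillow check before preprocessing. This is a sketch of the idea, not the exact training-script code.

```python
from PIL import Image

def ensure_rgb(image: Image.Image) -> Image.Image:
    """Convert CMYK (or any other mode) images to RGB for the processor."""
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image

# Example: a synthetic CMYK image becomes RGB
cmyk = Image.new("CMYK", (64, 64))
print(ensure_rgb(cmyk).mode)  # RGB
```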
## Limitations
- Trained for 1 epoch only; additional epochs would likely improve accuracy
- The base model (PaliGemma2-3B) uses 224x224 image resolution, which may lose fine-grained detail in high-resolution pathology slides
- QLoRA training introduces some quantization noise compared to full-precision fine-tuning
- This model is for research and educational purposes only and should NOT be used for clinical diagnosis
- Performance on out-of-distribution pathology images (different staining methods, magnifications, or tissue types not in Path-VQA) may be limited
- The LoRA adapter is not standalone; it must be loaded on top of the base PaliGemma2-3B model for inference
## Model tree for usama10/paligemma2-3b-pathvqa

- Base model: google/paligemma2-3b-pt-224
- Dataset used to train usama10/paligemma2-3b-pathvqa: flaviagiammarino/path-vqa

## Evaluation results

- Final training loss on Path-VQA (self-reported): 1.280
