usama10/paligemma2-3b-pathvqa

@@ -1,60 +1,152 @@
 ---
-library_name: peft
-license: gemma
 base_model: google/paligemma2-3b-pt-224
 tags:
-- base_model:adapter:google/paligemma2-3b-pt-224
-- lora
-- transformers
-pipeline_tag: text-generation
 model-index:
-- name: paligemma2-3b-pathvqa
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# paligemma2-3b-pathvqa
-This model is a fine-tuned version of [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) on an unknown dataset.
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 2e-05
-- train_batch_size: 2
-- eval_batch_size: 8
-- seed: 42
-- gradient_accumulation_steps: 8
-- total_train_batch_size: 16
-- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: cosine
-- lr_scheduler_warmup_steps: 50
-- num_epochs: 1
-### Training results
-### Framework versions
-- PEFT 0.18.1
-- Transformers 5.3.0
-- Pytorch 2.10.0+cu128
-- Datasets 4.8.3
-- Tokenizers 0.22.2

 ---
+license: apache-2.0
 base_model: google/paligemma2-3b-pt-224
 tags:
+  - paligemma
+  - vision-language-model
+  - vlm
+  - medical-imaging
+  - pathology
+  - visual-question-answering
+  - vqa
+  - qlora
+  - lora
+datasets:
+  - flaviagiammarino/path-vqa
+pipeline_tag: image-text-to-text
 model-index:
+  - name: paligemma2-3b-pathvqa
+    results:
+      - task:
+          type: image-text-to-text
+          name: Medical Pathology VQA
+        dataset:
+          name: Path-VQA
+          type: flaviagiammarino/path-vqa
+          split: train
+        metrics:
+          - type: loss
+            value: 1.28
+            name: Final Training Loss
 ---
+# PaliGemma2-3B Path-VQA
+A **PaliGemma2-3B** vision-language model fine-tuned with **QLoRA** on the [Path-VQA](https://huggingface.co/datasets/flaviagiammarino/path-vqa) dataset for **medical pathology visual question answering**.
+Given a pathology slide image and a question, the model generates an answer about the tissue, cells, or pathological findings visible in the image.
+## What is Path-VQA?
+[Path-VQA](https://huggingface.co/datasets/flaviagiammarino/path-vqa) is a medical visual question answering dataset containing 32,632 question-answer pairs derived from 5,004 pathology images. The images include histology slides, hematoxylin and eosin (H&E) stains, immunohistochemistry stains, and other pathological preparations sourced from medical textbooks and the PEIR digital library.
+Questions range from simple identification ("What type of cell is shown?") to complex reasoning about pathological processes ("What do the areas of white chalky deposits represent?").
+## Training Details
+| Parameter | Value |
+|-----------|-------|
+| **Base model** | [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) |
+| **Method** | Supervised fine-tuning (SFT) with QLoRA (4-bit NF4 base weights, LoRA r=16, alpha=32) |
+| **Dataset** | [flaviagiammarino/path-vqa](https://huggingface.co/datasets/flaviagiammarino/path-vqa) (train split) |
+| **Training examples** | 19,654 image-question-answer triplets |
+| **Trainable parameters** | 23.7M / 3.05B total (0.78%) |
+| **Hardware** | NVIDIA RTX 5090 (32GB VRAM) |
+| **Training time** | ~48 minutes |
+| **Epochs** | 1 |
+| **Effective batch size** | 16 (2 per device × 8 gradient accumulation steps) |
+| **Learning rate** | 2e-5 (cosine schedule, 50 warmup steps) |
+| **Precision** | bf16 compute, 4-bit NF4 base weights |
+| **Framework** | Transformers 5.3.0 + PEFT 0.18.1 + bitsandbytes |
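
As a quick sanity check, the figures in the table can be recomputed from one another. The sketch below assumes a single GPU and one optimizer step per full gradient-accumulation window; how the trainer rounds the final partial window is an assumption, so both roundings are shown. The floor value matches the 1,228 steps reported under Training Curves.

```python
import math

# Values copied from the Training Details table.
train_examples = 19_654
per_device_batch = 2
grad_accum_steps = 8
trainable_params = 23.7e6
total_params = 3.05e9

# Effective batch size = per-device batch x accumulation steps (single GPU assumed).
effective_batch = per_device_batch * grad_accum_steps

# Assumption: a trailing partial accumulation window is either dropped (floor)
# or counted as one extra step (ceil); the card does not say which applies.
steps_floor = train_examples // effective_batch
steps_ceil = math.ceil(train_examples / effective_batch)

# Share of parameters updated by the LoRA adapter, in percent.
trainable_pct = 100 * trainable_params / total_params

print("effective batch size:", effective_batch)
print("optimizer steps per epoch:", steps_floor, "or", steps_ceil)
print("trainable share (%):", round(trainable_pct, 2))
```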
+## Training Curves
+![Training Metrics](vlm_training_metrics_plots.png)
+- **Training Loss**: Fell from about 3.5 to about 1.3 over 1,228 optimizer steps
+- **Learning Rate**: Linear warmup over the first 50 steps to 2e-5, then cosine decay to 0
+- **Gradient Norm**: Started around 2.0, fell to about 1.0 mid-training, then rose gradually late in training, possibly because the model encounters harder examples late in a single-epoch run
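
The learning-rate curve can be approximated in a few lines. This is a minimal sketch of a linear-warmup, cosine-decay schedule under the settings listed above (peak 2e-5, 50 warmup steps, 1,228 total steps); `lr_at` is a hypothetical helper written for illustration, not the trainer's own scheduler code.

```python
import math

def lr_at(step, peak_lr=2e-5, warmup_steps=50, total_steps=1228):
    """Learning rate after `step` optimizer steps (hypothetical helper)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup schedule completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from the peak down to 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Print the rate at the start, mid-warmup, peak, halfway through decay, and the end.
for s in (0, 25, 50, 639, 1228):
    print(s, lr_at(s))
```

The rate peaks at the end of warmup and reaches zero on the final step, which is the shape described in the Learning Rate bullet.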
+## Example Use Cases
+This model can answer questions about pathology images such as:
+- "Where are liver stem cells (oval cells) located?" -> "in the canals of hering"
+- "What are stained here with an immunohistochemical stain for cytokeratin 7?" -> "bile duct cells and canals of hering"
+- "What do the areas of white chalky deposits represent?" -> "foci of fat necrosis"
+## Usage
+```python
+from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
+from peft import PeftModel
+from PIL import Image
+import torch
+# Load base model + adapter
+base_model = PaliGemmaForConditionalGeneration.from_pretrained(
+    "google/paligemma2-3b-pt-224",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")
+processor = AutoProcessor.from_pretrained("usama10/paligemma2-3b-pathvqa")
+# Load an image and ask a question (note the "answer" task prefix)
+image = Image.open("pathology_slide.png").convert("RGB")
+prompt = "answer What type of tissue is shown in this image?"
+inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
+prompt_len = inputs["input_ids"].shape[-1]
+with torch.no_grad():
+    outputs = model.generate(**inputs, max_new_tokens=64)
+# generate() returns prompt + completion; decode only the newly generated tokens
+answer = processor.decode(outputs[0][prompt_len:], skip_special_tokens=True)
+print(answer)
+```
+### With 4-bit Quantization (lower memory)
+```python
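+import torch
+from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration
+from peft import PeftModel
+# Quantize the base model weights to 4-bit NF4 and compute in bf16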
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+)
+base_model = PaliGemmaForConditionalGeneration.from_pretrained(
+    "google/paligemma2-3b-pt-224",
+    quantization_config=bnb_config,
+    device_map="auto",
+)
+model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")
+```
+## Prompt Format
+PaliGemma models are prompted with a short task prefix followed by the input text. For VQA with this model, prefix the question with `answer`:
+```
+answer What type of cell is shown in this image?
+```
+The model will generate the answer text directly.
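
The prompt convention can be wrapped in small helpers. The sketch below is illustrative only: `build_prompt` and `strip_prompt` are hypothetical names, not part of Transformers or PEFT. The second helper covers the case where the full output sequence of `generate` is decoded, which returns the prompt followed by the answer.

```python
def build_prompt(question: str, prefix: str = "answer") -> str:
    """Prefix a question with the task word (hypothetical helper)."""
    return f"{prefix} {question.strip()}"

def strip_prompt(decoded: str, prompt: str) -> str:
    """Drop an echoed prompt from decoded text, if present (hypothetical helper)."""
    text = decoded.strip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()

prompt = build_prompt("What type of cell is shown in this image?")
print(prompt)

# Simulated decoded output: the prompt echoed back, then the answer text.
echoed = prompt + "\nfoci of fat necrosis"
print(strip_prompt(echoed, prompt))
```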
+## Dataset
+The [Path-VQA](https://huggingface.co/datasets/flaviagiammarino/path-vqa) dataset contains:
+- **19,654 training** / **6,259 validation** / **6,719 test** question-answer pairs
+- **5,004 unique pathology images** (some in CMYK format, auto-converted to RGB during training)
+- Mix of open-ended and yes/no questions covering cell identification, tissue classification, stain interpretation, and pathological process recognition
+- Sourced from medical textbooks and the PEIR digital library
+- MIT license
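
The split sizes above can be checked against the 32,632 total quoted earlier in the card. A short sketch using only the numbers listed here:

```python
# Split sizes as listed in the Dataset section.
splits = {"train": 19_654, "validation": 6_259, "test": 6_719}

total = sum(splits.values())

# Percentage of all question-answer pairs in each split, to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print("total pairs:", total)
print("split shares (%):", shares)
```

Roughly 60% of the pairs are in the training split, which is the only split this model was trained on.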
+## Limitations
+- Trained for 1 epoch only, and only training loss is reported; additional epochs would likely improve accuracy
+- The base model (PaliGemma2-3B) uses 224×224 image resolution, which may lose fine-grained detail in high-resolution pathology slides
+- QLoRA training introduces some quantization noise compared to full-precision fine-tuning
+- This model is for research and educational purposes only and should NOT be used for clinical diagnosis
+- Performance on out-of-distribution pathology images (different staining methods, magnifications, or tissue types not in Path-VQA) may be limited
+- The LoRA adapter requires the base PaliGemma2-3B model for inference