usama10 committed on
Commit a810f17 · verified · 1 Parent(s): 10f4ce3

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +130 -38
README.md CHANGED
@@ -1,60 +1,152 @@
  ---
- library_name: peft
- license: gemma
  base_model: google/paligemma2-3b-pt-224
  tags:
- - base_model:adapter:google/paligemma2-3b-pt-224
- - lora
- - transformers
- pipeline_tag: text-generation
  model-index:
- - name: paligemma2-3b-pathvqa
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # paligemma2-3b-pathvqa

- This model is a fine-tuned version of [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) on an unknown dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 2e-05
- - train_batch_size: 2
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 16
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 50
- - num_epochs: 1

- ### Training results

- ### Framework versions

- - PEFT 0.18.1
- - Transformers 5.3.0
- - Pytorch 2.10.0+cu128
- - Datasets 4.8.3
- - Tokenizers 0.22.2
  ---
+ license: apache-2.0
  base_model: google/paligemma2-3b-pt-224
  tags:
+ - paligemma
+ - vision-language-model
+ - vlm
+ - medical-imaging
+ - pathology
+ - visual-question-answering
+ - vqa
+ - qlora
+ - lora
+ datasets:
+ - flaviagiammarino/path-vqa
+ pipeline_tag: image-text-to-text
  model-index:
+ - name: paligemma2-3b-pathvqa
+   results:
+   - task:
+       type: image-text-to-text
+       name: Medical Pathology VQA
+     dataset:
+       name: Path-VQA
+       type: flaviagiammarino/path-vqa
+       split: train
+     metrics:
+     - type: loss
+       value: 1.28
+       name: Final Training Loss
  ---
+ # PaliGemma2-3B Path-VQA
+ 
+ A **PaliGemma2-3B** vision-language model fine-tuned with **QLoRA** on the [Path-VQA](https://huggingface.co/datasets/flaviagiammarino/path-vqa) dataset for **medical pathology visual question answering**.
+ 
+ Given a pathology slide image and a question, the model generates an answer about the tissue, cells, or pathological findings visible in the image.
+ 
+ ## What is Path-VQA?
+ 
+ [Path-VQA](https://huggingface.co/datasets/flaviagiammarino/path-vqa) is a medical visual question answering dataset containing 32,632 question-answer pairs derived from 5,004 pathology images. The images include histology slides, hematoxylin and eosin (H&E) stains, immunohistochemistry stains, and other pathological preparations sourced from medical textbooks and the PEIR digital library.
+ 
+ Questions range from simple identification ("What type of cell is shown?") to complex reasoning about pathological processes ("What do the areas of white chalky deposits represent?").
+ 
+ ## Training Details
+ | Parameter | Value |
+ |-----------|-------|
+ | **Base model** | [google/paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) |
+ | **Method** | SFT with QLoRA (4-bit NF4, LoRA r=16, alpha=32) |
+ | **Dataset** | [flaviagiammarino/path-vqa](https://huggingface.co/datasets/flaviagiammarino/path-vqa) (train split) |
+ | **Training examples** | 19,654 image-question-answer triplets |
+ | **Trainable parameters** | 23.7M / 3.05B total (0.78%) |
+ | **Hardware** | NVIDIA RTX 5090 (32 GB VRAM) |
+ | **Training time** | ~48 minutes |
+ | **Epochs** | 1 |
+ | **Effective batch size** | 16 (2 per device × 8 gradient accumulation) |
+ | **Learning rate** | 2e-5 (cosine schedule, 50 warmup steps) |
+ | **Precision** | bf16 compute, 4-bit NF4 base weights |
+ | **Framework** | Transformers 5.3.0 + PEFT 0.18.1 + bitsandbytes |
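The QLoRA setup reported in the table could be expressed with roughly the following configuration. This is an editor's sketch, not the card author's training script: the `target_modules` list and `lora_dropout` value are assumptions, since the card only states r=16, alpha=32, and 4-bit NF4 with bf16 compute.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 base weights with bf16 compute, as reported in the table
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA r=16, alpha=32 as reported; dropout and target modules are assumed
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,  # assumption: not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```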
+ ## Training Curves
+ 
+ ![Training Metrics](vlm_training_metrics_plots.png)
+ 
+ - **Training Loss**: Dropped from 3.5 to ~1.3 over 1,228 steps, showing clear learning
+ - **Learning Rate**: Cosine decay from 2e-5 to 0 with 50-step warmup
+ - **Gradient Norm**: Started around 2.0, decreased to ~1.0 mid-training, then gradually increased late in training (normal for single-epoch runs as the model encounters harder examples)
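The learning-rate curve described above (linear warmup to 2e-5 over 50 steps, then cosine decay to 0) can be sketched in plain Python. This is illustrative only; the step total of 1,228 is taken from the loss curve, and the warmup shape is assumed to be linear, which is the Transformers default for this scheduler.

```python
import math

def lr_at(step, total_steps=1228, warmup_steps=50, peak_lr=2e-5):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at(50)` returns the peak value 2e-5 and `lr_at(1228)` returns 0, matching the plotted schedule.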
+ ## Example Use Cases
+ 
+ This model can answer questions about pathology images such as:
+ 
+ - "Where are liver stem cells (oval cells) located?" -> "in the canals of hering"
+ - "What are stained here with an immunohistochemical stain for cytokeratin 7?" -> "bile duct cells and canals of hering"
+ - "What do the areas of white chalky deposits represent?" -> "foci of fat necrosis"
+ ## Usage
+ 
+ ```python
+ from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
+ from peft import PeftModel
+ from PIL import Image
+ import torch
+ 
+ # Load base model + adapter
+ base_model = PaliGemmaForConditionalGeneration.from_pretrained(
+     "google/paligemma2-3b-pt-224",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")
+ processor = AutoProcessor.from_pretrained("usama10/paligemma2-3b-pathvqa")
+ 
+ # Load an image and ask a question
+ image = Image.open("pathology_slide.png").convert("RGB")
+ prompt = "answer What type of tissue is shown in this image?"
+ 
+ inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
+ with torch.no_grad():
+     outputs = model.generate(**inputs, max_new_tokens=64)
+ 
+ # generate() returns the prompt tokens followed by the answer, so decode only the new tokens
+ answer = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
+ print(answer)
+ ```
+ ### With 4-bit Quantization (lower memory)
+ 
+ ```python
+ from transformers import BitsAndBytesConfig
+ 
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+ base_model = PaliGemmaForConditionalGeneration.from_pretrained(
+     "google/paligemma2-3b-pt-224",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ model = PeftModel.from_pretrained(base_model, "usama10/paligemma2-3b-pathvqa")
+ ```
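As a rough sanity check on why 4-bit loading lowers memory: with the ~3.05B total parameters reported above, NF4 stores base weights at roughly 0.5 bytes each (ignoring the small overhead of quantization constants), versus 2 bytes per weight in bf16. A back-of-the-envelope comparison, not a measured figure:

```python
params = 3.05e9  # total parameters reported in the card

bf16_gb = params * 2 / 1e9    # 2 bytes per weight in bf16
nf4_gb = params * 0.5 / 1e9   # ~0.5 bytes per weight in NF4 (approximation)

print(f"bf16 weights: ~{bf16_gb:.1f} GB, NF4 weights: ~{nf4_gb:.1f} GB")
```

That is roughly 6.1 GB versus 1.5 GB for the base weights alone; activations, the KV cache, and the LoRA adapter add to both figures.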
+ ## Prompt Format
+ 
+ PaliGemma uses a specific prompt format. For VQA tasks, prefix the question with `answer`:
+ 
+ ```
+ answer What type of cell is shown in this image?
+ ```
+ 
+ The model will generate the answer text directly.
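A tiny helper for building prompts in this format; `make_prompt` is a hypothetical convenience function for illustration, not part of the released code:

```python
def make_prompt(question: str) -> str:
    """Prefix a VQA question with PaliGemma's 'answer' task prefix."""
    # Hypothetical helper: strips stray whitespace, then applies the prefix
    return f"answer {question.strip()}"

make_prompt("What type of cell is shown in this image?")
# -> "answer What type of cell is shown in this image?"
```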
+ ## Dataset
+ 
+ The [Path-VQA](https://huggingface.co/datasets/flaviagiammarino/path-vqa) dataset contains:
+ 
+ - **19,654 training** / **6,259 validation** / **6,719 test** question-answer pairs
+ - **5,004 unique pathology images** (some in CMYK format, auto-converted to RGB during training)
+ - A mix of open-ended and yes/no questions covering cell identification, tissue classification, stain interpretation, and pathological process recognition
+ - Images sourced from medical textbooks and the PEIR digital library
+ - MIT license
+ ## Limitations
+ 
+ - Trained for 1 epoch only; additional epochs would likely improve accuracy
+ - The base model (PaliGemma2-3B) uses 224x224 image resolution, which may lose fine-grained detail in high-resolution pathology slides
+ - QLoRA training introduces some quantization noise compared to full-precision fine-tuning
+ - This model is for research and educational purposes only and should NOT be used for clinical diagnosis
+ - Performance on out-of-distribution pathology images (different staining methods, magnifications, or tissue types not in Path-VQA) may be limited
+ - The LoRA adapter requires the base PaliGemma2-3B model for inference