How to use jshhhh/PathFLIP with Transformers:
    # Use a pipeline as a high-level helper.
    # Warning: pipeline type "image-to-text" is no longer supported in transformers v5.
    # Load the model directly (see below) or downgrade to v4.x with:
    #   pip install "transformers<5.0.0"
    from transformers import pipeline

    pipe = pipeline("image-to-text", model="jshhhh/PathFLIP")

    # Load model directly
    from transformers import AutoModel

    model = AutoModel.from_pretrained("jshhhh/PathFLIP", dtype="auto")
Update README.md
README.md
CHANGED
@@ -1,3 +1,32 @@
 ---
 license: cc-by-nc-4.0
+language:
+- en
+pipeline_tag: image-to-text
+tags:
+- medical
+- pathology
+- vision-language
+- contrastive-learning
+- fine-grained
+- multimodal
+library_name: transformers
 ---
+# PathFLIP
+
+Model weights for the paper *PathFLIP: Fine-Grained Language-Image Pretraining for Versatile Pathology Image Understanding*.
+
+## Overview
+
+PathFLIP is a pathology vision-language model that aligns fine-grained morphological sub-captions with their corresponding regions in Whole Slide Images. Unlike prior pathology VLMs that pair an entire slide with a single report-level anchor, PathFLIP introduces region-statement correspondence through a region Q-Former and a region-level contrastive objective with caption-swapped negatives, learning region-level alignment without any manual spatial annotation. This fine-grained supervision enables strong slide-level classification and retrieval performance, and gives rise to an emergent visual grounding capability.
+
+## Model Details
+
+- **Base model**: *Qwen3-0.6B*
+- **Training data**: [FGC-4K Dataset](https://huggingface.co/datasets/jshhhh/PathFLIP/)
+- **Tasks**: classification, image-text retrieval, visual grounding, VQA
+- **Languages**: English
+
+## License
+
+This model is released under CC BY-NC 4.0: free for academic and research use, **not for commercial use or clinical deployment**.
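
The Overview text in this README mentions a region-level contrastive objective with caption-swapped negatives but does not give its form. As a rough illustration only, the sketch below assumes an InfoNCE-style loss in which each region embedding is scored against its own sub-caption (the positive) and against the sub-captions of the other regions (the swapped negatives). The function names, the `temperature` value, and the loss form itself are assumptions made for this example, not the PathFLIP implementation, and the region Q-Former that would produce the embeddings is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def region_contrastive_loss(region_embs, caption_embs, temperature=0.07):
    """Hypothetical InfoNCE-style loss over matched region/sub-caption pairs.

    region_embs[i] is assumed to match caption_embs[i]; every other caption
    acts as a swapped negative for region i.
    """
    if len(region_embs) != len(caption_embs):
        raise ValueError("need one sub-caption embedding per region embedding")
    losses = []
    for i, region in enumerate(region_embs):
        logits = [cosine(region, cap) / temperature for cap in caption_embs]
        # Numerically stable log-sum-exp over the positive and the negatives.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching caption.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings: correctly paired captions vs. the same captions swapped.
regions = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = region_contrastive_loss(regions, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = region_contrastive_loss(regions, [[0.0, 1.0], [1.0, 0.0]])
```

Under these assumptions the loss is close to zero when every region is paired with its own sub-caption and grows large when the captions are swapped, which is the behaviour a region-level alignment objective of this kind would be expected to reward.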