How to use jshhhh/PathFLIP with Transformers:

```python
# Use a pipeline as a high-level helper.
# Warning: the "image-to-text" pipeline type is no longer supported in transformers v5.
# Load the model directly (see below) or downgrade to v4.x with:
#   pip install "transformers<5.0.0"
from transformers import pipeline

pipe = pipeline("image-to-text", model="jshhhh/PathFLIP")
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("jshhhh/PathFLIP", dtype="auto")
```
---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-to-text
tags:
- medical
- pathology
- vision-language
- contrastive-learning
- fine-grained
- multimodal
library_name: transformers
---
# PathFLIP

Model weights for the paper *PathFLIP: Fine-Grained Language-Image Pretraining for Versatile Pathology Image Understanding*.
## Overview

PathFLIP is a pathology vision-language model that aligns fine-grained morphological sub-captions with their corresponding regions in whole slide images (WSIs). Prior pathology VLMs pair an entire slide with a single report-level anchor. PathFLIP instead introduces region-statement correspondence through a region Q-Former and a region-level contrastive objective with caption-swapped negatives, so it learns region-level alignment without any manual spatial annotation. This fine-grained supervision enables strong slide-level classification and retrieval performance and gives rise to an emergent visual grounding capability.
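
To make the idea of a region-level contrastive objective more concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss. It is not the authors' implementation: the function names, the temperature value, and the reading of "caption-swapped negatives" as "every other sub-caption from the same slide" are assumptions made purely for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def region_contrastive_loss(region_embs, caption_embs, temperature=0.07):
    """Hypothetical region-to-caption InfoNCE loss.

    Region i is assumed to match sub-caption i. Every other sub-caption
    j != i is treated as a swapped (negative) caption for that region.
    """
    regions = [l2_normalize(r) for r in region_embs]
    captions = [l2_normalize(c) for c in caption_embs]
    n = len(regions)
    total = 0.0
    for i in range(n):
        # Cosine similarity of region i with every sub-caption, scaled by temperature.
        logits = [
            sum(a * b for a, b in zip(regions[i], captions[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidate captions.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matched caption as the target.
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings: two regions and two sub-captions in a 2-D space.
toy_regions = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = region_contrastive_loss(toy_regions, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = region_contrastive_loss(toy_regions, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the correctly paired captions yield a much lower loss than the swapped ones, which is the behaviour a contrastive objective of this kind is meant to reward.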
## Model Details

- **Base model**: *Qwen3-0.6B*
- **Training data**: [FGC-4K Dataset](https://huggingface.co/datasets/jshhhh/PathFLIP/)
- **Tasks**: classification, image-text retrieval, visual grounding, visual question answering (VQA)
- **Languages**: English
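
As a rough illustration of how contrastively trained embeddings are commonly used for image-text retrieval and zero-shot classification, the sketch below ranks candidate text embeddings by cosine similarity to a slide embedding. The vectors are toy values and the helper names are hypothetical; how PathFLIP actually exposes its embeddings is not documented here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_emb, candidate_embs):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_emb, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy values standing in for a slide embedding and three text embeddings.
slide_emb = [0.9, 0.1, 0.0]
text_embs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_by_similarity(slide_emb, text_embs)
```

For zero-shot classification the same ranking can be applied with one text embedding per class prompt, taking the top-ranked index as the predicted class.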
## License

This model is released under CC BY-NC 4.0. It is free for academic and research use, **not for commercial use or clinical deployment**.