---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-to-text
tags:
- medical
- pathology
- vision-language
- contrastive-learning
- fine-grained
- multimodal
library_name: transformers
---

# PathFLIP

Model weights for the paper *PathFLIP: Fine-Grained Language-Image Pretraining for Versatile Pathology Image Understanding*.

## Overview

PathFLIP is a pathology vision-language model that aligns fine-grained morphological sub-captions with their corresponding regions in whole slide images (WSIs). Unlike prior pathology VLMs that pair an entire slide with a single report-level anchor, PathFLIP introduces region-statement correspondence through a region Q-Former and a region-level contrastive objective with caption-swapped negatives, learning region-level alignment without any manual spatial annotation. This fine-grained supervision enables strong slide-level classification and retrieval performance, and gives rise to an emergent visual grounding capability.

## Model Details

- **Base model**: *Qwen3-0.6B*
- **Training data**: [FGC-4K Dataset](https://huggingface.co/datasets/jshhhh/PathFLIP/)
- **Tasks**: classification, image-text retrieval, visual grounding, visual question answering (VQA)
- **Languages**: English

## License

This model is released under CC BY-NC 4.0. It is free for academic and research use, and is **not for commercial use or clinical deployment**.
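
## Illustrative sketch

As a rough illustration of the region-level contrastive objective mentioned in the Overview, the snippet below computes a symmetric InfoNCE-style loss between region embeddings and sub-caption embeddings, where every mismatched (swapped) caption acts as a negative. This is a minimal sketch under stated assumptions, not the PathFLIP training code: the function name `region_contrastive_loss`, the temperature value, the embedding size, and the use of in-batch swaps as negatives are all hypothetical choices made for illustration.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def region_contrastive_loss(region_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the official objective).

    region_emb[i] and caption_emb[i] are assumed to be a matched pair;
    every other caption j != i serves as a swapped negative for region i.
    """
    # L2-normalise so the dot product is a cosine similarity.
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    logits = r @ t.T / temperature  # shape: (num_regions, num_captions)

    idx = np.arange(len(r))
    # Region -> caption direction: softmax over captions for each region.
    loss_r2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Caption -> region direction: softmax over regions for each caption.
    loss_t2r = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_r2t + loss_t2r)


# Toy data: 4 sub-caption embeddings and 4 slightly noisy "region" copies.
rng = np.random.default_rng(0)
captions = rng.normal(size=(4, 16))
regions = captions + 0.05 * rng.normal(size=(4, 16))

aligned = region_contrastive_loss(regions, captions)
# Reversing the caption order pairs every region with a wrong caption.
swapped = region_contrastive_loss(regions, captions[::-1])
print(f"aligned loss: {aligned:.4f}, swapped loss: {swapped:.4f}")
```

In this toy setup the correctly paired batch yields a lower loss than the batch whose captions have been swapped, which is the behaviour a contrastive objective of this kind is meant to reward.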