---
language:
- en
license: cc-by-4.0
tags:
- vision
- image-text-to-text
- medical
- dermatology
- multimodal
- clip
- zero-shot-classification
- image-classification
pipeline_tag: zero-shot-image-classification
library_name: transformers
---

# DermLIP: Dermatology Language-Image Pretraining

## Model Description

**DermLIP** is a vision-language model for dermatology, trained on **Derm1M**, the largest dermatological image-text dataset to date.

### Model Details

- **Model Type:** Pretrained Vision-Language Model (CLIP-style)

- **Architecture:**

  - **Vision encoder**: ViT-B16
  - **Text encoder**: GPT2

- **Resolution:** 224×224 pixels

- **Paper:** https://arxiv.org/abs/2503.14911

- **Repository:** https://github.com/SiyuanYan1/Derm1M

- **License:** CC BY-NC-ND 4.0


## Training Details

- **Training data:** 403,563 skin image-text pairs from the Derm1M dataset, including both dermoscopic and clinical images.
- **Training objective:** image-text contrastive loss
- **Hardware:** 1× NVIDIA H200 (~40 GB memory used)
- **Hours used:** ~5 hours
  
## Intended Uses

### Primary Use Cases

- Zero-shot classification
- Few-shot learning
- Cross-modal retrieval
- Concept annotation/explanation


## How to Use


### Installation

First, clone the Derm1M repository:
```bash
git clone git@github.com:SiyuanYan1/Derm1M.git
cd Derm1M
```

Then install the dependencies following the instructions in the repository's README.
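If you only need the Quick Start snippet below, installing the repository is optional: the example depends on `open_clip_torch`, `torch`, and `Pillow`. As an assumption about a minimal environment (not an official instruction from the repository), these can be installed directly:

```shell
pip install open_clip_torch torch pillow
```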


### Quick Start
```python
import open_clip
from PIL import Image
import torch

# Load model with huggingface checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:redlessone/DermLIP_ViT-B-16'
)
model.eval()

# Initialize tokenizer
tokenizer = open_clip.get_tokenizer('hf-hub:redlessone/DermLIP_ViT-B-16')

# Read example image
image = preprocess(Image.open("your_skin_image.png")).unsqueeze(0)

# Define disease labels (example: PAD dataset classes)
PAD_CLASSNAMES = [
    "nevus",
    "basal cell carcinoma",
    "actinic keratosis",
    "seborrheic keratosis",
    "squamous cell carcinoma",
    "melanoma"
]

# Build text prompts
template = lambda c: f'This is a skin image of {c}'
text = tokenizer([template(c) for c in PAD_CLASSNAMES])

# Inference
with torch.no_grad(), torch.autocast("cuda"):
    # Encode image and text
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Compute similarity
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Get prediction
final_prediction = PAD_CLASSNAMES[text_probs[0].argmax().item()]
print(f'This image is diagnosed as {final_prediction}.')
print("Label probabilities:", text_probs)
```
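Cross-modal retrieval reuses the same encoders: embed images and texts once, L2-normalize, and rank by cosine similarity, which on normalized vectors is just a dot product. The sketch below shows only that ranking step, with small hand-made vectors standing in for `encode_image`/`encode_text` outputs so it runs without downloading the checkpoint; the captions and values are illustrative, not model outputs.

```python
import torch

# Toy gallery of text embeddings to rank against one image embedding.
# In practice these come from model.encode_text / model.encode_image
# (see Quick Start), already L2-normalized.
captions = ["melanoma", "nevus", "basal cell carcinoma"]

text_features = torch.tensor([
    [1.0, 0.0],   # "melanoma"
    [0.0, 1.0],   # "nevus"
    [0.6, 0.8],   # "basal cell carcinoma"
])
image_features = torch.tensor([[0.0, 1.0]])  # closest to "nevus"

# Cosine similarity on normalized vectors = dot product; shape (1, 3)
similarity = image_features @ text_features.T

# Best match first
ranking = similarity.argsort(dim=-1, descending=True)
print([captions[i] for i in ranking[0]])  # → ['nevus', 'basal cell carcinoma', 'melanoma']
```

The same pattern works in the other direction (text-to-image retrieval) by transposing the roles of the two feature matrices.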


## Contact

For any additional questions or comments, contact Siyuan Yan (`siyuan.yan@monash.edu`).

## Cite our Paper
```bibtex
@misc{yan2025derm1m,
  title        = {Derm1M: A Million-Scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology},
  author       = {Siyuan Yan and Ming Hu and Yiwen Jiang and Xieji Li and Hao Fei and Philipp Tschandl and Harald Kittler and Zongyuan Ge},
  year         = {2025},
  eprint       = {2503.14911},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2503.14911}
}

@article{yan2025multimodal,
  title={A multimodal vision foundation model for clinical dermatology},
  author={Yan, Siyuan and Yu, Zhen and Primiero, Clare and Vico-Alonso, Cristina and Wang, Zhonghua and Yang, Litao and Tschandl, Philipp and Hu, Ming and Ju, Lie and Tan, Gin and others},
  journal={Nature Medicine},
  pages={1--12},
  year={2025},
  publisher={Nature Publishing Group}
}
```