---
language:
- en
license: cc-by-4.0
tags:
- vision
- image-text-to-text
- medical
- dermatology
- multimodal
- clip
- zero-shot-classification
- image-classification
pipeline_tag: zero-shot-image-classification
library_name: transformers
---

# DermLIP: Dermatology Language-Image Pretraining

## Model Description

**DermLIP** is a vision-language model for dermatology, trained on **Derm1M**, the largest dermatological image-text dataset to date.

### Model Details

- **Model Type:** Pretrained Vision-Language Model (CLIP-style)

- **Architecture:**

  - **Vision encoder**: ViT-B16
  - **Text encoder**: GPT2

- **Resolution:** 224×224 pixels

- **Paper:** https://arxiv.org/abs/2503.14911

- **Repository:** https://github.com/SiyuanYan1/Derm1M

- **License:** CC BY-NC-ND 4.0


## Training Details

- **Training data:** 403,563 skin image-text pairs from the Derm1M dataset, including both dermoscopic and clinical images.
- **Training objective:** image-text contrastive loss
- **Hardware:** 1× NVIDIA H200 (~40 GB memory used)
- **Hours used:** ~5 hours
  
## Intended Uses

### Primary Use Cases

- Zero-shot classification
- Few-shot learning
- Cross-modal retrieval
- Concept annotation/explanation


## How to Use


### Installation

First, clone the Derm1M repository:
```bash
git clone git@github.com:SiyuanYan1/Derm1M.git
cd Derm1M
```

Then install the dependencies following the instructions in the repository's README.
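If you only need the Quick Start snippet below, installing the repository is optional: the example depends on `open_clip_torch`, `torch`, and `Pillow`. As an assumption about a minimal environment (not an official instruction from the repository), these can be installed directly:

```shell
pip install open_clip_torch torch pillow
```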


### Quick Start
```python
import open_clip
from PIL import Image
import torch

# Load model with huggingface checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:redlessone/DermLIP_ViT-B-16'
)
model.eval()

# Initialize tokenizer
tokenizer = open_clip.get_tokenizer('hf-hub:redlessone/DermLIP_ViT-B-16')

# Read example image
image = preprocess(Image.open("your_skin_image.png")).unsqueeze(0)

# Define disease labels (example: PAD dataset classes)
PAD_CLASSNAMES = [
    "nevus",
    "basal cell carcinoma",
    "actinic keratosis",
    "seborrheic keratosis",
    "squamous cell carcinoma",
    "melanoma"
]

# Build text prompts
template = lambda c: f'This is a skin image of {c}'
text = tokenizer([template(c) for c in PAD_CLASSNAMES])

# Inference
with torch.no_grad(), torch.autocast("cuda"):
    # Encode image and text
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Compute similarity
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Get prediction
final_prediction = PAD_CLASSNAMES[text_probs[0].argmax().item()]
print(f'This image is diagnosed as {final_prediction}.')
print("Label probabilities:", text_probs)
```
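Cross-modal retrieval reuses the same encoders: embed images and texts once, L2-normalize, and rank by cosine similarity, which on normalized vectors is just a dot product. The sketch below shows only that ranking step, with small hand-made vectors standing in for `encode_image`/`encode_text` outputs so it runs without downloading the checkpoint; the captions and values are illustrative, not model outputs.

```python
import torch

# Toy gallery of text embeddings to rank against one image embedding.
# In practice these come from model.encode_text / model.encode_image
# (see Quick Start), already L2-normalized.
captions = ["melanoma", "nevus", "basal cell carcinoma"]

text_features = torch.tensor([
    [1.0, 0.0],   # "melanoma"
    [0.0, 1.0],   # "nevus"
    [0.6, 0.8],   # "basal cell carcinoma"
])
image_features = torch.tensor([[0.0, 1.0]])  # closest to "nevus"

# Cosine similarity on normalized vectors = dot product; shape (1, 3)
similarity = image_features @ text_features.T

# Best match first
ranking = similarity.argsort(dim=-1, descending=True)
print([captions[i] for i in ranking[0]])  # → ['nevus', 'basal cell carcinoma', 'melanoma']
```

The same pattern works in the other direction (text-to-image retrieval) by transposing the roles of the two feature matrices.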


## Contact

For any additional questions or comments, contact Siyuan Yan (`siyuan.yan@monash.edu`).

## Cite our Paper
```bibtex
@misc{yan2025derm1m,
  title        = {Derm1M: A Million-Scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology},
  author       = {Siyuan Yan and Ming Hu and Yiwen Jiang and Xieji Li and Hao Fei and Philipp Tschandl and Harald Kittler and Zongyuan Ge},
  year         = {2025},
  eprint       = {2503.14911},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2503.14911}
}

@article{yan2025multimodal,
  title={A multimodal vision foundation model for clinical dermatology},
  author={Yan, Siyuan and Yu, Zhen and Primiero, Clare and Vico-Alonso, Cristina and Wang, Zhonghua and Yang, Litao and Tschandl, Philipp and Hu, Ming and Ju, Lie and Tan, Gin and others},
  journal={Nature Medicine},
  pages={1--12},
  year={2025},
  publisher={Nature Publishing Group}
}
```