  1. README.md +193 -3
README.md CHANGED
@@ -1,3 +1,193 @@
1
- ---
2
- license: cc-by-nd-4.0
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+ license: cc-by-4.0
5
+ tags:
6
+ - vision
7
+ - image-text-to-text
8
+ - medical
9
+ - dermatology
10
+ - multimodal
11
+ - clip
12
+ - zero-shot-classification
13
+ - image-classification
14
+ pipeline_tag: zero-shot-image-classification
15
+ library_name: transformers
16
+ ---
17
+
18
+ # DermLIP: Dermatology Language-Image Pretraining
19
+
20
+ ## Model Description
21
+
22
+ **DermLIP** is a vision-language model for dermatology, trained on the **Derm1M** dataset—the largest dermatological image-text corpus to date. This model variant (`ViT-B-16`) uses a standard CLIP-Base-16 architecture, providing a strong baseline for dermatological image-text understanding tasks.
23
+
24
+ ### Model Details
25
+
26
+ - **Model Type:** Vision-Language Model (CLIP-style)
27
+
28
+ - **Architecture:**
29
+   - **Pretrained weights**: Training was initialized from the OpenAI CLIP weights
30
+   - **Vision encoder**: ViT-B/16 (12 layers, 768 width, 16×16 patches)
31
+   - **Text encoder**: GPT-2-style Transformer (12 layers, 512 width, 8 heads)
32
+   - **Embedding dimension**: 512
33
+
34
+ - **Resolution:** 224×224 pixels
35
+
36
+ - **Training Data:** Derm1M dataset (1,029,761 image-text pairs)
37
+
38
+ - **Coverage:** 390 skin conditions, 130 clinical concepts
39
+
40
+ - **Language:** English
41
+
42
+ - **License:** cc-by-nc-nd-4.0
43
+
44
+ - **Context Length:** 77 tokens
45
+
46
+ - **Vocabulary Size:** 49,408
47
+
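As a rough illustration of how the resolution and patch size listed above relate, the sketch below computes the number of tokens the vision encoder would process per image. It assumes the standard ViT layout (non-overlapping square patches plus one class token); the helper name `vit_sequence_length` is hypothetical and not part of the DermLIP or OpenCLIP code.

```python
def vit_sequence_length(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Number of tokens a ViT sees for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus an optional class token (assumed here)
    return patches_per_side ** 2 + (1 if class_token else 0)


# 224x224 input with 16x16 patches, as listed in the model details
num_tokens = vit_sequence_length(224, 16)
print(num_tokens)  # 14 * 14 patches + 1 class token = 197
```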
48
+ ### Key Features
49
+
50
+ - **Zero-shot & Few-shot Diagnosis:** Classify skin conditions and ground visual concepts without fine-tuning
51
+
52
+ - **Cross-modal Retrieval:** Find images from text descriptions and vice versa
53
+
54
+
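Cross-modal retrieval with a CLIP-style model usually comes down to ranking embeddings by cosine similarity. The sketch below shows that ranking step in plain Python on toy 3-dimensional vectors. The vectors and the helper names (`l2_normalize`, `rank_by_similarity`) are hypothetical stand-ins; in practice the embeddings would come from the model's image and text encoders, and it has not been confirmed that the authors rank exactly this way.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    q = l2_normalize(query)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (hypothetical values, for illustration only)
text_query = [1.0, 0.0, 0.0]
image_bank = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly aligned with the query
    [0.5, 0.5, 0.0],  # partially aligned
]

ranking = rank_by_similarity(text_query, image_bank)
print(ranking)  # [1, 2, 0]
```

The same function covers image-to-text retrieval by passing an image embedding as the query and text embeddings as the candidates.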
55
+ ## Training Data
56
+
57
+ ### Derm1M Dataset
58
+
59
+ DermLIP is trained on **Derm1M**, which provides:
60
+
61
+ - **1,029,761** dermatological image-text pairs
62
+
63
+ - **257× larger** than any previous dermatology vision-language corpus
64
+
65
+ - **390** distinct skin conditions organized in a four-level expert ontology
66
+
67
+ - **130** clinical visual concepts
68
+
69
+ - **Rich contextual captions** with clinical metadata (average 41 tokens per caption)
70
+
71
+ The dataset enables realistic clinical scenarios including diagnostic support, patient education, and research applications.
72
+
73
+ ## Intended Uses
74
+
75
+ ### Primary Use Cases
76
+
77
+ 1. **Zero-shot Skin Condition Classification**
78
+
79
+ - Identify skin conditions without task-specific training
80
+
81
+ - Supports rare and emerging conditions
82
+
83
+ 2. **Medical Image Retrieval**
84
+
85
+ - Find similar cases from text descriptions
86
+
87
+ - Retrieve relevant images for clinical reference
88
+
89
+ 3. **Clinical Decision Support**
90
+
91
+ - Assist dermatologists with differential diagnosis
92
+
93
+ - Provide visual examples for patient education
94
+
95
+ 4. **Research Applications**
96
+
97
+ - Multimodal dermatology research
98
+
99
+ - Development of downstream clinical AI tools
100
+
101
+ ### Out-of-Scope Uses
102
+
103
+ - **Not for unsupervised clinical diagnosis**: This model should not be used as the sole basis for medical decisions
104
+
105
+ - **Not validated for all skin types**: Performance may vary across different skin tones and demographics
106
+
107
+ - **Not a replacement for medical professionals**: Always consult qualified healthcare providers
108
+
109
+ ## How to Use
110
+
111
+ ### Installation
112
+
113
+ First, clone the Derm1M repository:
114
+ ```bash
115
+ git clone git@github.com:SiyuanYan1/Derm1M.git
116
+ cd Derm1M
117
+ ```
118
+
119
+ Then install the package following the instructions in the repository.
120
+
121
+
122
+ ### Quick Start
123
+ ```python
+ import open_clip
+ from PIL import Image
+ import torch
+
+ # Run on GPU when available, otherwise fall back to CPU
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Load model with huggingface checkpoint
+ model, _, preprocess = open_clip.create_model_and_transforms(
+     'hf-hub:redlessone/DermLIP_ViT-B-16'
+ )
+ model = model.to(device).eval()
+
+ # Initialize tokenizer
+ tokenizer = open_clip.get_tokenizer('hf-hub:redlessone/DermLIP_ViT-B-16')
+
+ # Read example image and add a batch dimension
+ image = preprocess(Image.open("your_skin_image.png")).unsqueeze(0).to(device)
+
+ # Define disease labels (example: PAD dataset classes)
+ PAD_CLASSNAMES = [
+     "nevus",
+     "basal cell carcinoma",
+     "actinic keratosis",
+     "seborrheic keratosis",
+     "squamous cell carcinoma",
+     "melanoma"
+ ]
+
+ # Build text prompts
+ template = lambda c: f'This is a skin image of {c}'
+ text = tokenizer([template(c) for c in PAD_CLASSNAMES]).to(device)
+
+ # Inference (mixed precision is only enabled when running on GPU)
+ with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == "cuda")):
+     # Encode image and text
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+
+     # Normalize features to unit length
+     image_features /= image_features.norm(dim=-1, keepdim=True)
+     text_features /= text_features.norm(dim=-1, keepdim=True)
+
+     # Compute scaled cosine similarity and convert to probabilities
+     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+
+ # Get prediction
+ final_prediction = PAD_CLASSNAMES[torch.argmax(text_probs[0]).item()]
+ print(f'This image is diagnosed as {final_prediction}.')
+ print("Label probabilities:", text_probs)
+ ```
172
+
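To make the last inference step more concrete, the sketch below reproduces the scaled softmax in plain Python. The factor of 100 is the one used in the Quick Start snippet; the similarity values and the helper name `scaled_softmax` are hypothetical and only illustrate how the scaling sharpens the distribution over labels.

```python
import math


def scaled_softmax(similarities, scale=100.0):
    """Softmax over cosine similarities multiplied by a fixed scale."""
    logits = [scale * s for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities between one image and three prompts
sims = [0.30, 0.25, 0.20]

probs = scaled_softmax(sims)             # scale of 100, as in the Quick Start
flat_probs = scaled_softmax(sims, 1.0)   # no scaling, for comparison

print([round(p, 4) for p in probs])
print([round(p, 4) for p in flat_probs])
```

With the scale applied, a small gap in cosine similarity turns into a clear preference for one label; without it, the three probabilities stay close to uniform.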
173
+ ## Cite our Paper
174
+ ```bibtex
175
+ @misc{yan2025derm1m,
176
+ title = {Derm1M: A Million-Scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology},
177
+ author = {Siyuan Yan and Ming Hu and Yiwen Jiang and Xieji Li and Hao Fei and Philipp Tschandl and Harald Kittler and Zongyuan Ge},
178
+ year = {2025},
179
+ eprint = {2503.14911},
180
+ archivePrefix= {arXiv},
181
+ primaryClass = {cs.CV},
182
+ url = {https://arxiv.org/abs/2503.14911}
183
+ }
184
+
185
+ @article{yan2025multimodal,
186
+ title={A multimodal vision foundation model for clinical dermatology},
187
+ author={Yan, Siyuan and Yu, Zhen and Primiero, Clare and Vico-Alonso, Cristina and Wang, Zhonghua and Yang, Litao and Tschandl, Philipp and Hu, Ming and Ju, Lie and Tan, Gin and others},
188
+ journal={Nature Medicine},
189
+ pages={1--12},
190
+ year={2025},
191
+ publisher={Nature Publishing Group}
192
+ }
193
+ ```