+---
+language:
+- en
+license: cc-by-4.0
+tags:
+- vision
+- image-text-to-text
+- medical
+- dermatology
+- multimodal
+- clip
+- zero-shot-classification
+- image-classification
+pipeline_tag: zero-shot-image-classification
+library_name: transformers
+---
+# DermLIP: Dermatology Language-Image Pretraining
+## Model Description
+**DermLIP** is a vision-language model for dermatology, trained on **Derm1M**, the largest dermatological image-text corpus to date. This variant (`ViT-B-16`) uses the standard CLIP ViT-B/16 architecture and serves as a strong baseline for dermatological image-text understanding tasks.
+### Model Details
+- **Model Type:** Vision-Language Model (CLIP-style)
+- **Architecture:**
+  - **Pretrained weights**: Training was initialized from the OpenAI CLIP weights
+  - **Vision encoder**: ViT-B/16 (12 layers, 768 width, 16×16 patches)
+  - **Text encoder**: GPT2 (12 layers, 512 width, 8 heads)
+  - **Embedding dimension**: 512
+- **Resolution:** 224×224 pixels
+- **Training Data:** Derm1M dataset (1,029,761 image-text pairs)
+- **Coverage:** 390 skin conditions, 130 clinical concepts
+- **Language:** English
+- **License:** cc-by-nc-nd-4.0
+- **Context Length:** 77 tokens
+- **Vocabulary Size:** 49,408
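The resolution and patch size listed above determine how many tokens the vision encoder processes per image. The following is a minimal sketch of that arithmetic, assuming the standard ViT layout with a single class token prepended to the patch sequence; `vit_sequence_length` is a hypothetical helper written for illustration, not a function from open_clip or the Derm1M repository.

```python
def vit_sequence_length(image_size, patch_size, class_token=True):
    """Number of tokens a ViT processes for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Assumption: one class token is prepended to the patch tokens
    return num_patches + (1 if class_token else 0)

num_tokens = vit_sequence_length(image_size=224, patch_size=16)
print(num_tokens)  # 197 = 14 * 14 patches + 1 class token
```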
+### Key Features
+- **Zero-shot & Few-shot Diagnosis:** Classify skin conditions and ground visual concepts without fine-tuning
+- **Cross-modal Retrieval:** Find images from text descriptions and vice versa
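As a rough illustration of cross-modal retrieval, the sketch below ranks a small gallery of image embeddings against one text embedding by cosine similarity. It uses plain Python lists and toy 3-dimensional vectors so that it runs without downloading the model; in practice the vectors would come from `model.encode_image` and `model.encode_text`. The helpers `cosine_similarity` and `rank_by_similarity` are hypothetical names chosen for this example, and the numbers are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, gallery, top_k=3):
    """Return (index, score) pairs for the top_k gallery vectors closest to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(gallery)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings standing in for real model outputs (assumption, not real data)
text_query = [1.0, 0.0, 0.0]
image_gallery = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_by_similarity(text_query, image_gallery, top_k=2)
print(ranking)
```

Swapping the roles of the query and the gallery (one image embedding against many text embeddings) gives image-to-text retrieval with the same code.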
+## Training Data
+### Derm1M Dataset
+DermLIP is trained on **Derm1M**, which provides:
+- **1,029,761** dermatological image-text pairs
+- **257× larger** than any previous dermatology vision-language corpus
+- **390** distinct skin conditions organized in a four-level expert ontology
+- **130** clinical visual concepts
+- **Rich contextual captions** with clinical metadata (average 41 tokens per caption)
+The dataset supports clinically realistic applications, including diagnostic support, patient education, and research.
+## Intended Uses
+### Primary Use Cases
+1. **Zero-shot Skin Condition Classification**
+   - Identify skin conditions without task-specific training
+   - Supports rare and emerging conditions
+2. **Medical Image Retrieval**
+   - Find similar cases from text descriptions
+   - Retrieve relevant images for clinical reference
+3. **Clinical Decision Support**
+   - Assist dermatologists with differential diagnosis
+   - Provide visual examples for patient education
+4. **Research Applications**
+   - Multimodal dermatology research
+   - Development of downstream clinical AI tools
+### Out-of-Scope Uses
+- **Not for unsupervised clinical diagnosis**: This model should not be used as the sole basis for medical decisions
+- **Not validated for all skin types**: Performance may vary across different skin tones and demographics
+- **Not a replacement for medical professionals**: Always consult qualified healthcare providers
+## How to Use
+### Installation
+First, clone the Derm1M repository:
+```bash
+git clone git@github.com:SiyuanYan1/Derm1M.git
+cd Derm1M
+```
+Then install the package by following the instructions in the repository.
+### Quick Start
+```python
+import open_clip
+from PIL import Image
+import torch
+# Use a GPU when one is available, otherwise fall back to CPU
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Load the model from the Hugging Face checkpoint
+model, _, preprocess = open_clip.create_model_and_transforms(
+    'hf-hub:redlessone/DermLIP_ViT-B-16'
+)
+model = model.to(device).eval()
+# Initialize tokenizer
+tokenizer = open_clip.get_tokenizer('hf-hub:redlessone/DermLIP_ViT-B-16')
+# Read example image and add a batch dimension
+image = preprocess(Image.open("your_skin_image.png")).unsqueeze(0).to(device)
+# Define disease labels (example: PAD dataset classes)
+PAD_CLASSNAMES = [
+    "nevus",
+    "basal cell carcinoma",
+    "actinic keratosis",
+    "seborrheic keratosis",
+    "squamous cell carcinoma",
+    "melanoma"
+]
+# Build one text prompt per class
+prompts = [f'This is a skin image of {c}' for c in PAD_CLASSNAMES]
+text = tokenizer(prompts).to(device)
+# Inference (mixed precision is enabled only when running on a GPU)
+with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == "cuda")):
+    # Encode image and text
+    image_features = model.encode_image(image)
+    text_features = model.encode_text(text)
+    # Normalize features so the dot product is a cosine similarity
+    image_features /= image_features.norm(dim=-1, keepdim=True)
+    text_features /= text_features.norm(dim=-1, keepdim=True)
+    # Scale similarities and convert them to per-class probabilities
+    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+# Get the most probable class
+final_prediction = PAD_CLASSNAMES[text_probs[0].argmax().item()]
+print(f'Predicted condition: {final_prediction}.')
+print("Label probabilities:", text_probs)
+```
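The `100.0` multiplier in the Quick Start scales the cosine similarities before the softmax, which sharpens the resulting probabilities. The sketch below shows that effect in plain Python and returns the top-k labels rather than only the argmax. The similarity values are invented for illustration and are not model outputs; `softmax` and `top_k_labels` are hypothetical helpers, not part of open_clip.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_labels(similarities, labels, scale=100.0, k=3):
    """Scale cosine similarities, apply softmax, and return the k most probable labels."""
    probs = softmax([scale * s for s in similarities])
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up cosine similarities for illustration only (not model outputs)
labels = ["nevus", "basal cell carcinoma", "melanoma"]
similarities = [0.21, 0.25, 0.30]

sharp = top_k_labels(similarities, labels, scale=100.0, k=3)
flat = top_k_labels(similarities, labels, scale=1.0, k=3)
print(sharp)  # one label takes almost all of the probability mass
print(flat)   # probabilities stay close to uniform
```

The ranking is the same for both scales; only the confidence changes. This is one reason to read the printed probabilities as relative scores among the supplied labels rather than as calibrated clinical likelihoods.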
+## Cite our Paper
+```bibtex
+@misc{yan2025derm1m,
+  title        = {Derm1M: A Million-Scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology},
+  author       = {Siyuan Yan and Ming Hu and Yiwen Jiang and Xieji Li and Hao Fei and Philipp Tschandl and Harald Kittler and Zongyuan Ge},
+  year         = {2025},
+  eprint       = {2503.14911},
+  archivePrefix= {arXiv},
+  primaryClass = {cs.CV},
+  url          = {https://arxiv.org/abs/2503.14911}
+}
+@article{yan2025multimodal,
+  title={A multimodal vision foundation model for clinical dermatology},
+  author={Yan, Siyuan and Yu, Zhen and Primiero, Clare and Vico-Alonso, Cristina and Wang, Zhonghua and Yang, Litao and Tschandl, Philipp and Hu, Ming and Ju, Lie and Tan, Gin and others},
+  journal={Nature Medicine},
+  pages={1--12},
+  year={2025},
+  publisher={Nature Publishing Group}
+}
+```