# ConvNeXt Dual-Modal Skin Lesion Classifier (ISIC 2025 / MILK10k)
**Research prototype – not validated for clinical use.** This model is released for reproducibility and research purposes only. It must not be used to guide clinical decisions, patient triage, or any diagnostic process. See Limitations and Out-of-Scope Uses below.
## Model Description
A dual-input ConvNeXt-Base architecture trained end-to-end on the MILK10k dataset (ISIC 2025 Challenge). The model processes a dermoscopic image and a clinical close-up photograph of the same lesion simultaneously, fusing feature representations before classification. It was developed as a research component submitted to the MedGemma Impact Challenge.
| Property | Value |
|---|---|
| Architecture | Dual ConvNeXt-Base encoders (one per modality), late fusion |
| Input | Paired dermoscopic + clinical images (384×384 px each) |
| Output | Softmax probabilities over 11 ISIC diagnostic classes |
| Training | 5-fold stratified cross-validation, macro F1 optimisation |
| Ensemble | 5 models (one per fold), predictions averaged at inference |
## Intended Use
This model is released strictly for non-commercial research and educational purposes, as part of the SkinAI application submitted to the MedGemma Impact Challenge. It is provided to support reproducibility of the challenge submission and to enable further research into multi-modal skin lesion classification.
Intended users: Researchers and developers working on dermatology AI, machine learning in medical imaging, or related computational fields.
## Out-of-Scope Uses
The following uses are explicitly out of scope and are not supported:
- **Clinical diagnosis or decision support** – the model has not been validated for clinical deployment and must not influence patient care in any setting.
- **Patient triage or screening** – performance has only been evaluated on held-out folds of the MILK10k training distribution; generalisability to other populations, imaging devices, or clinical workflows is unknown.
- **Autonomous or semi-autonomous medical decision making** – any application in which model outputs could directly or indirectly affect patient management.
- **Deployment without independent clinical validation** – any production use would require prospective validation by qualified clinicians under appropriate regulatory oversight.
The performance metrics reported below reflect internal cross-validation on a single dataset and are not sufficient evidence of clinical utility.
## Diagnostic Classes
| Class | Description |
|---|---|
| AKIEC | Actinic keratosis / intraepithelial carcinoma |
| BCC | Basal cell carcinoma |
| BEN_OTH | Other benign lesion |
| BKL | Benign keratosis |
| DF | Dermatofibroma |
| INF | Inflammatory / infectious |
| MAL_OTH | Other malignant lesion |
| MEL | Melanoma |
| NV | Melanocytic nevus |
| SCCKA | Squamous cell carcinoma / keratoacanthoma |
| VASC | Vascular lesion |
## Performance

**Important caveat:** All metrics below are from held-out validation folds of the MILK10k training dataset using 5-fold stratified cross-validation. They represent performance under distribution-matched conditions and should not be interpreted as estimates of real-world clinical performance. External validation has not been performed.
### Aggregate Metrics
| Metric | Value |
|---|---|
| Balanced Multiclass Accuracy | 0.665 |
| Macro F1 (ConvNeXt alone) | 0.555 |
| Macro F1 (MedSigLIP + ConvNeXt ensemble) | 0.591 |
| ISIC 2025 Leaderboard Score (Dice) | 0.538 |
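Balanced multiclass accuracy is the unweighted mean of per-class sensitivities, and macro F1 the unweighted mean of per-class F1 scores; both give each of the 11 classes equal weight regardless of prevalence, which matters given the imbalance noted under Limitations. A minimal pure-Python sketch of the two metrics (the actual evaluation presumably used a standard library such as scikit-learn):

```python
def balanced_accuracy(y_true, y_pred, num_classes):
    """Mean of per-class recalls (sensitivities)."""
    recalls = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        total = sum(t == c for t in y_true)
        recalls.append(tp / total if total else 0.0)
    return sum(recalls) / num_classes

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / num_classes

# Three-class toy example
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
balanced_accuracy(y_true, y_pred, 3)  # -> 0.666...
macro_f1(y_true, y_pred, 3)           # -> 0.655...
```

Because rare classes contribute as much as common ones, a few misclassified INF or MAL_OTH lesions move these aggregates noticeably.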
### Per-Class Metrics (Validation, Single ConvNeXt Fold)
| Class | AUC | AUC (Sens>80%) | Avg Precision | Sensitivity | Specificity | Dice | PPV | NPV |
|---|---|---|---|---|---|---|---|---|
| AKIEC | 0.933 | 0.873 | 0.704 | 0.732 | 0.924 | 0.675 | 0.627 | 0.952 |
| BCC | 0.975 | 0.960 | 0.838 | 0.951 | 0.919 | 0.758 | 0.630 | 0.992 |
| BEN_OTH | 0.978 | 0.953 | 0.505 | 0.429 | 0.998 | 0.545 | 0.750 | 0.992 |
| BKL | 0.881 | 0.713 | 0.746 | 0.750 | 0.865 | 0.664 | 0.595 | 0.929 |
| DF | 0.986 | 0.983 | 0.536 | 0.833 | 0.992 | 0.667 | 0.556 | 0.998 |
| INF | 0.841 | 0.722 | 0.164 | 0.364 | 0.985 | 0.364 | 0.364 | 0.985 |
| MAL_OTH | 0.820 | 0.717 | 0.518 | 0.400 | 0.993 | 0.571 | 1.000 | 0.987 |
| MEL | 0.957 | 0.935 | 0.820 | 0.821 | 0.950 | 0.688 | 0.593 | 0.984 |
| NV | 0.960 | 0.948 | 0.845 | 0.865 | 0.963 | 0.796 | 0.738 | 0.983 |
| SCCKA | 0.949 | 0.911 | 0.857 | 0.863 | 0.903 | 0.798 | 0.743 | 0.953 |
| VASC | 0.993 | 0.991 | 0.614 | 0.800 | 0.994 | 0.667 | 0.571 | 0.998 |
| Mean | 0.934 | 0.883 | 0.650 | 0.710 | 0.954 | 0.654 | 0.651 | 0.978 |
Rare classes (INF: ~11 lesions, MAL_OTH: ~15 lesions, VASC: ~15 lesions) are severely underrepresented in MILK10k. Sensitivity figures for these classes should be interpreted with caution given the small sample sizes involved.
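The AUC column above is a one-vs-rest ranking metric: the probability that a randomly chosen lesion of the class receives a higher score than a randomly chosen lesion outside it (the "AUC (Sens>80%)" column restricts this to the high-sensitivity operating region). A small sketch of the plain rank-based (Mann-Whitney) AUC:

```python
def auc_rank(y_true, scores):
    """One-vs-rest AUC via the Mann-Whitney formulation: the fraction of
    (positive, negative) pairs in which the positive is scored higher
    (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ranked correctly
auc_rank([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])  # -> 0.75
```

For a class with ~11 positives, the AUC is estimated from only a few hundred such pairs, which is why the rare-class figures above carry wide uncertainty.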
## Usage
This code is provided for research reproducibility. Users are responsible for ensuring any application complies with applicable laws and ethical guidelines.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm
from PIL import Image
import torchvision.transforms as transforms
from huggingface_hub import hf_hub_download

# --- Model Definition ---
class DualConvNeXt(nn.Module):
    def __init__(self, num_classes=11, model_name='convnext_base'):
        super().__init__()
        # One encoder per modality; features are concatenated (late fusion)
        self.clinical_encoder = timm.create_model(
            model_name, pretrained=False, num_classes=0
        )
        self.derm_encoder = timm.create_model(
            model_name, pretrained=False, num_classes=0
        )
        feat_dim = self.clinical_encoder.num_features
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim * 2, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, clinical, derm):
        c = self.clinical_encoder(clinical)
        d = self.derm_encoder(derm)
        return self.classifier(torch.cat([c, d], dim=1))

# --- Load Model ---
CLASS_NAMES = ['AKIEC', 'BCC', 'BEN_OTH', 'BKL', 'DF',
               'INF', 'MAL_OTH', 'MEL', 'NV', 'SCCKA', 'VASC']

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = DualConvNeXt(num_classes=11)

weights_path = hf_hub_download(
    repo_id="tech-doc/ConvNeXt_Milk10k",
    filename="convnext_fold0_best.pth"
)
checkpoint = torch.load(weights_path, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval().to(device)

# --- Preprocessing ---
transform = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],  # ImageNet statistics
        std=[0.229, 0.224, 0.225]
    )
])

# --- Inference ---
def predict(clinical_image_path: str, derm_image_path: str) -> dict:
    """
    Research inference only. Output must not be used for clinical decisions.

    Args:
        clinical_image_path: Path to clinical close-up photograph
        derm_image_path: Path to dermoscopic image

    Returns:
        dict with 'prediction', 'confidence', and 'probabilities'
    """
    clinical = transform(Image.open(clinical_image_path).convert('RGB')).unsqueeze(0).to(device)
    derm = transform(Image.open(derm_image_path).convert('RGB')).unsqueeze(0).to(device)
    with torch.no_grad():
        logits = model(clinical, derm)
        probs = F.softmax(logits, dim=1).squeeze(0).cpu().numpy()
    return {
        'prediction': CLASS_NAMES[probs.argmax()],
        'confidence': float(probs.max()),
        'probabilities': {c: float(p) for c, p in zip(CLASS_NAMES, probs)}
    }

# Example
result = predict('clinical.jpg', 'dermoscopy.jpg')
print(f"Prediction: {result['prediction']} ({result['confidence']:.1%})")
```
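The example above loads a single fold checkpoint, while the reported leaderboard result averages the softmax outputs of all five fold models (see the Ensemble row in the description table). Given per-fold probability dicts as returned by `predict()`, the ensembling step reduces to a mean over folds; a minimal sketch (each fold model would first be loaded from its own checkpoint, and the fold 1–4 filenames are assumed to follow the `convnext_fold0_best.pth` pattern):

```python
def average_folds(fold_probabilities):
    """Average per-fold probability dicts (as returned by predict())
    and report the ensemble prediction."""
    classes = list(fold_probabilities[0])
    n = len(fold_probabilities)
    avg = {c: sum(p[c] for p in fold_probabilities) / n for c in classes}
    top = max(avg, key=avg.get)
    return {'prediction': top, 'confidence': avg[top], 'probabilities': avg}

# Toy two-fold example over three classes
folds = [{'MEL': 0.6, 'NV': 0.3, 'BCC': 0.1},
         {'MEL': 0.4, 'NV': 0.5, 'BCC': 0.1}]
average_folds(folds)  # -> prediction 'MEL', confidence 0.5
```

Averaging probabilities rather than hard votes preserves each fold's uncertainty and is what the "predictions averaged at inference" row describes.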
## Training Details
| Parameter | Value |
|---|---|
| Base model | convnext_base (ImageNet-22k pretrained via timm) |
| Image size | 384×384 px |
| Batch size | 32 |
| Optimiser | AdamW, lr=1e-4 |
| Scheduler | Cosine annealing with warm restarts |
| Loss | Cross-entropy with class weights + focal loss |
| Augmentation | Random flips, rotations, colour jitter, RandAugment |
| Folds | 5-fold stratified CV (seed 42) |
| Hardware | NVIDIA A100 (Google Colab) |
| Training time | ~4–6 hours per fold |
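The loss combines class-weighted cross-entropy with a focal term that down-weights well-classified examples, which helps under the severe class imbalance described in Limitations. A single-sample sketch (the exact class weights, focal gamma, and combination used in training are not specified here, so the values below are purely illustrative):

```python
import math

def weighted_focal_ce(probs, target, class_weights, gamma=2.0):
    """Class-weighted focal cross-entropy for one sample.
    probs: softmax output; target: true class index.
    With gamma = 0 this reduces to plain weighted cross-entropy."""
    p_t = probs[target]
    return -class_weights[target] * (1 - p_t) ** gamma * math.log(p_t)

probs = [0.7, 0.2, 0.1]    # model fairly confident in class 0
weights = [0.5, 2.0, 2.0]  # illustrative inverse-frequency weights
weighted_focal_ce(probs, 0, weights)           # focal term shrinks the loss
weighted_focal_ce(probs, 0, weights, gamma=0)  # plain weighted CE, larger
```

The `(1 - p_t) ** gamma` factor leaves hard, misclassified examples (small `p_t`) nearly untouched while suppressing the contribution of easy ones, so rare-class errors dominate the gradient.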
## Limitations
- Single-dataset evaluation: Trained and evaluated exclusively on MILK10k (~5,240 lesions). No external validation has been performed. Reported metrics should not be generalised beyond this distribution.
- Severe class imbalance: Rare classes (INF: ~11 lesions, MAL_OTH: ~15 lesions, VASC: ~15 lesions) are underrepresented. Performance on these classes is highly uncertain and may not be reproducible on different samples.
- Paired-image requirement: The model requires simultaneous dermoscopic and clinical photographs of the same lesion. Single-image inference is architecturally unsupported and was not evaluated.
- Skin tone representation: The MILK10k dataset composition with respect to Fitzpatrick phototype has not been fully characterised. Performance across darker skin tones (Fitzpatrick IVโVI) has not been validated.
- Paediatric populations: The model was not evaluated on paediatric patients.
- Device variability: Performance may degrade with imaging devices, magnifications, or lighting conditions not represented in the training data.
- No prospective validation: All reported metrics are from retrospective cross-validation. Prospective clinical validation would be required before any consideration of real-world use.
## Citation
If you use this model or the MILK10k dataset in your research, please cite:

```bibtex
@dataset{milk10k2025,
  author    = {MILK study team},
  title     = {MILK10k},
  year      = {2025},
  publisher = {ISIC Archive},
  doi       = {10.34970/648456}
}
```
## License
CC BY-NC 4.0. This model was trained on MILK10k data (CC BY-NC licensed) and is for non-commercial research use only. Any commercial application is prohibited without explicit permission.