---
tags:
- image-to-text
- image-captioning
- CLIP
- GPT-2
- dermatology
- dermlip
library_name: transformers
license: other
language:
- en
pipeline_tag: image-to-text
---

# DermLIP + GPT-2 Dermatology Captioner

A dermatology image-captioning model that combines the DermLIP vision encoder with the gpt2-medium language model. It is trained on dermatological images to generate clinical descriptions of skin lesions.

**Architecture**: DermLIP (ViT-B/16) → learnable prefix → GPT-2 (`gpt2-medium`).
Trained in two stages: Stage A (META) for generalization and Stage B (SkinCAP) for style and terminology.

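For orientation, the sketch below shows the general prefix-captioning pattern this architecture follows; it is an illustrative assumption, not the repo's actual implementation (that lives in `inference_min.py`). A CLIP image embedding (assumed 512-d for ViT-B/16) is projected to 32 prefix embeddings and prepended to the GPT-2 (`gpt2-medium`, 1024-d) prompt embeddings before decoding.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

class PrefixProjector(nn.Module):
    """Maps one CLIP image embedding to `prefix_len` GPT-2 input embeddings."""
    def __init__(self, clip_dim=512, prefix_len=32, gpt2_dim=1024):
        super().__init__()
        self.prefix_len = prefix_len
        hidden = (clip_dim + gpt2_dim * prefix_len) // 2
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt2_dim * prefix_len),
        )

    def forward(self, clip_feats):              # (B, clip_dim)
        b = clip_feats.size(0)
        return self.proj(clip_feats).view(b, self.prefix_len, -1)

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2-medium")   # hidden size 1024
tok = GPT2TokenizerFast.from_pretrained("gpt2-medium")
projector = PrefixProjector()

clip_feats = torch.randn(1, 512)                        # stand-in for a DermLIP image embedding
prefix = projector(clip_feats)                          # (1, 32, 1024)
prompt_ids = tok("Describe the skin lesion", return_tensors="pt").input_ids
prompt_emb = gpt2.transformer.wte(prompt_ids)           # (1, T, 1024)
inputs_embeds = torch.cat([prefix, prompt_emb], dim=1)  # image prefix + text prompt
logits = gpt2(inputs_embeds=inputs_embeds).logits       # (1, 32 + T, vocab)
```
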
## Metrics

| Stage | val_loss | PPL | BLEU | ROUGE-L | CIDEr-D | CLIP | BERT_F1 |
|---|---:|---:|---:|---:|---:|---:|---:|
| A (META) | 1.1070 | 3.03 | 38.6 | 0.550 | 0.17 | 24.4 | 0.565 |
| B (SkinCAP) | 1.1903 | 3.29 | 10.0 | 0.278 | 0.13 | 25.9 | 0.363 |

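The reported perplexities are consistent with PPL being exp(val_loss), i.e. perplexity derived from the token-level cross-entropy; a quick arithmetic check:

```python
import math

# exp(val_loss) reproduces the reported perplexities after rounding
print(round(math.exp(1.1070), 2))  # 3.03 (Stage A)
print(round(math.exp(1.1903), 2))  # 3.29 (Stage B)
```
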
## Inference

> The minimal example below uses `inference_min.py`, which is included in this repo.
> Requires: `pip install torch transformers open_clip_torch pillow huggingface_hub`

```python
from huggingface_hub import snapshot_download
from inference_min import load_model, generate

# 1) Download the repo snapshot (weights, config, and the inference helper)
repo_dir = snapshot_download(
    "moxeeeem/dermlip-gpt2-captioner",
    allow_patterns=["*.pt", "*.json", "inference_min.py"],
)

# 2) Load the model from the saved config/weights
model = load_model(repo_dir)  # builds the CLIP backbone + GPT-2 + prefix projector

# 3) Run generation
img_paths = ["/path/to/derma_image.jpg"]  # local test images
prompt = (
    "Describe the skin lesion concisely (morphology, color, scale, border, location) "
    "in one sentence.Conclude with the most likely diagnosis (1–3 words)."
)
caps = generate(model, img_paths, prompt=prompt)
for c in caps:
    print(c)
```

## Files

| File | Size | Checksum (sha256[:12]) |
|---|---:|---|
| `best_stageA.pt` | 2 GB | 3219636f48b0 |
| `best_stageB.pt` | 2 GB | 69bded2dcad1 |
| `final_captioner_gpt2-medium_VisionTransformer.json` | 849 B | e157402c9fe2 |
| `final_captioner_gpt2-medium_VisionTransformer.pt` | 2 GB | 536ae07811c9 |
| `loss_dermlip_vitb16.png` | 110 KB | a04b1e5832d9 |

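To verify a downloaded file against the table above, hash it locally and compare the first 12 hex characters of its SHA-256 digest; a minimal sketch (the directory path is a placeholder):

```python
import hashlib
from pathlib import Path

def sha256_prefix(path, n=12, chunk=1 << 20):
    """Return the first `n` hex characters of a file's SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()[:n]

repo_dir = "."  # or the directory returned by snapshot_download above
print(sha256_prefix(Path(repo_dir) / "best_stageA.pt"))  # expected: 3219636f48b0
```
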
## Details

- **Vision Encoder**: DermLIP (ViT-B/16)
- **Language Model**: GPT-2 (`gpt2-medium`)
- **CLIP weights**: `hf-hub:redlessone/DermLIP_ViT-B-16` (see the loading sketch after this list)
- **Prefix tokens**: 32
- **Training prompt**: `Describe the skin lesion concisely (morphology, color, scale, border, location) in one sentence.Conclude with the most likely diagnosis (1–3 words).`

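If you only need the vision backbone (for example, to extract image embeddings), the DermLIP weights can be loaded directly with `open_clip`; this is a minimal sketch assuming open_clip's standard `hf-hub:` loading path, with a placeholder image path:

```python
import torch
import open_clip
from PIL import Image

# Load the DermLIP (ViT-B/16) CLIP weights from the Hugging Face Hub
model, preprocess = open_clip.create_model_from_pretrained("hf-hub:redlessone/DermLIP_ViT-B-16")
model.eval()

image = preprocess(Image.open("/path/to/derma_image.jpg")).unsqueeze(0)  # (1, 3, H, W)
with torch.no_grad():
    feats = model.encode_image(image)  # (1, embed_dim) image embedding
print(feats.shape)
```
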
### Model Type Detection
- Detected as: `dermlip`
- Repository: `moxeeeem/dermlip-gpt2-captioner`

_Auto-generated on 2025-08-30 09:25 UTC._