---
tags:
- image-to-text
- image-captioning
- CLIP
- GPT-2
- dermatology
- dermlip
library_name: transformers
license: other
language:
- en
pipeline_tag: image-to-text
---

# DermLIP + GPT-2 Dermatology Captioner

A dermatology image-captioning model that pairs the DermLIP vision encoder with the `gpt2-medium` language model. It is trained on dermatological images to generate clinical descriptions of skin lesions.

**Architecture**: DermLIP (ViT-B/16) → learnable prefix → GPT-2 (`gpt2-medium`). Trained in two stages: Stage A (META) for generalization and Stage B (SkinCAP) for style and terminology.

## Metrics

**Stage A (META)**
val_loss=1.1070 • PPL=3.03
BLEU=38.6 • ROUGE-L=0.550 • CIDEr-D=0.17 • CLIP=24.4 • BERT_F1=0.565

**Stage B (SkinCAP)**
val_loss=1.1903 • PPL=3.29
BLEU=10.0 • ROUGE-L=0.278 • CIDEr-D=0.13 • CLIP=25.9 • BERT_F1=0.363

## Inference

> The minimal example uses `inference_min.py`, included in this repo.
> Requires: `pip install torch transformers open_clip_torch pillow huggingface_hub`

```python
from huggingface_hub import snapshot_download
from inference_min import load_model, generate

# 1) Download a repo snapshot (weights, config, and the inference helper)
repo_dir = snapshot_download(
    "moxeeeem/dermlip-gpt2-captioner",
    allow_patterns=["*.pt", "*.json", "inference_min.py"],
)

# 2) Load the model from the saved config/weights
model = load_model(repo_dir)  # builds CLIP backbone + GPT-2 + prefix projector

# 3) Run generation
img_paths = ["/path/to/derma_image.jpg"]  # local test images
caps = generate(
    model,
    img_paths,
    prompt=(
        "Describe the skin lesion concisely (morphology, color, scale, "
        "border, location) in one sentence.Conclude with the most likely "
        "diagnosis (1–3 words)."
    ),
)
for c in caps:
    print(c)
```

## Files

| File | Size | Check |
|---|---:|---|
| `best_stageA.pt` | 2 GB | sha256[:12]=3219636f48b0 |
| `best_stageB.pt` | 2 GB | sha256[:12]=69bded2dcad1 |
| `final_captioner_gpt2-medium_VisionTransformer.json` | 849 B | sha256[:12]=e157402c9fe2 |
| `final_captioner_gpt2-medium_VisionTransformer.pt` | 2 GB | sha256[:12]=536ae07811c9 |
| `loss_dermlip_vitb16.png` | 110 KB | sha256[:12]=a04b1e5832d9 |

## Details

- **Vision Encoder**: DermLIP (ViT-B/16)
- **Language Model**: GPT-2 (`gpt2-medium`)
- **CLIP weights**: `hf-hub:redlessone/DermLIP_ViT-B-16`
- **Prefix tokens**: 32
- **Training prompt**: `Describe the skin lesion concisely (morphology, color, scale, border, location) in one sentence.Conclude with the most likely diagnosis (1–3 words).`

### Model Type Detection

- Detected as: `dermlip`
- Repository: `moxeeeem/dermlip-gpt2-captioner`

_Auto-generated on 2025-08-30 09:25 UTC._
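The architecture above (image features → learnable prefix → GPT-2) follows the ClipCap-style prefix approach: a projector maps one DermLIP image embedding to 32 GPT-2 prefix embeddings that are prepended to the text tokens. A minimal sketch of such a projector, assuming a 512-d ViT-B/16 embedding, the 1024-d `gpt2-medium` hidden size, and a two-layer MLP (all assumptions; the real projector is defined in `inference_min.py` and may differ):

```python
import torch
import torch.nn as nn


class PrefixProjector(nn.Module):
    """Hypothetical sketch: map a CLIP image embedding to GPT-2 prefix embeddings."""

    def __init__(self, clip_dim=512, gpt_dim=1024, prefix_len=32):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        # Two-layer MLP, ClipCap-style: expand to prefix_len * gpt_dim values
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, image_emb):
        # image_emb: (batch, clip_dim) -> (batch, prefix_len, gpt_dim)
        out = self.mlp(image_emb)
        return out.view(-1, self.prefix_len, self.gpt_dim)
```

At generation time, the resulting `(batch, 32, 1024)` tensor would be concatenated with the embedded prompt tokens before the GPT-2 forward pass.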
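The Files table lists only the first 12 hex characters of each SHA-256 digest. A small helper to verify a downloaded snapshot against those prefixes (the file names and prefixes come from the table above; the helper itself is illustrative, not part of the repo):

```python
import hashlib
from pathlib import Path


def sha256_prefix(path, n=12):
    """Return the first n hex characters of a file's SHA-256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()[:n]


# Expected prefixes from the Files table
expected = {
    "best_stageA.pt": "3219636f48b0",
    "best_stageB.pt": "69bded2dcad1",
    "final_captioner_gpt2-medium_VisionTransformer.json": "e157402c9fe2",
    "final_captioner_gpt2-medium_VisionTransformer.pt": "536ae07811c9",
    "loss_dermlip_vitb16.png": "a04b1e5832d9",
}

for name, prefix in expected.items():
    p = Path(name)  # assumes cwd is the snapshot directory
    if p.exists():
        status = "OK" if sha256_prefix(p) == prefix else "MISMATCH"
        print(f"{name}: {status}")
```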