
	---
	tags:
	- image-to-text
	- image-captioning
	- CLIP
	- GPT-2
	- dermatology
	- dermlip
	library_name: transformers
	license: other
	language:
	- en
	pipeline_tag: image-to-text
	---

	# DermLIP + GPT-2 Dermatology Captioner

	A dermatology image captioning model that combines the DermLIP vision encoder with the `gpt2-medium` language model. It was trained on dermatological images to generate clinical descriptions of skin lesions.

	Architecture: DermLIP (ViT-B/16) → learnable prefix → GPT-2 (`gpt2-medium`).
	Trained in two stages: Stage A (META) for generalization and Stage B (SkinCAP) for style/terminology.
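
The data flow above can be sketched as simple shape bookkeeping. This is only an illustration, not the code in this repo: the 512-dimensional image embedding (the usual size for CLIP ViT-B/16) and the placement of the prefix vectors before the prompt token embeddings are assumptions, and `CaptionerShapes` is a hypothetical helper. The hidden size of 1024 is that of `gpt2-medium`, and the 32 prefix tokens come from the Details section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionerShapes:
    """Hypothetical shape bookkeeping for a prefix-based captioner."""

    image_embed_dim: int = 512  # assumption: CLIP ViT-B/16 embedding size
    lm_hidden_dim: int = 1024   # gpt2-medium hidden size (n_embd)
    prefix_tokens: int = 32     # from the Details section of this card
    max_positions: int = 1024   # GPT-2 context window (n_positions)

    def prefix_shape(self, batch: int) -> tuple[int, int, int]:
        # The projector turns one image embedding into `prefix_tokens` LM-sized vectors.
        return (batch, self.prefix_tokens, self.lm_hidden_dim)

    def lm_input_shape(self, batch: int, prompt_len: int) -> tuple[int, int, int]:
        # Assumption: prefix vectors are prepended to the prompt token embeddings.
        return (batch, self.prefix_tokens + prompt_len, self.lm_hidden_dim)

    def remaining_positions(self, prompt_len: int) -> int:
        # Positions left for the generated caption after prefix and prompt.
        return self.max_positions - self.prefix_tokens - prompt_len


shapes = CaptionerShapes()
print(shapes.prefix_shape(batch=2))
print(shapes.lm_input_shape(batch=2, prompt_len=40))
print(shapes.remaining_positions(prompt_len=40))
```

Under these assumptions, the 32 prefix vectors occupy part of GPT-2's 1024-position context, so the prompt and the generated caption share the remaining 992 positions.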


	## Metrics
	| Stage | val_loss | PPL | BLEU | ROUGE-L | CIDEr-D | CLIP | BERT_F1 |
	|---|---:|---:|---:|---:|---:|---:|---:|
	| Stage A (META) | 1.1070 | 3.03 | 38.6 | 0.550 | 0.17 | 24.4 | 0.565 |
	| Stage B (SkinCAP) | 1.1903 | 3.29 | 10.0 | 0.278 | 0.13 | 25.9 | 0.363 |
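
The reported `val_loss` and PPL values are consistent with perplexity being the exponential of the validation loss. The sketch below checks that relation; it assumes `val_loss` is a mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

# val_loss values reported on this card
# (assumption: mean per-token cross-entropy, in nats)
val_loss = {"Stage A (META)": 1.1070, "Stage B (SkinCAP)": 1.1903}


def perplexity(loss: float) -> float:
    """Perplexity as the exponential of a mean cross-entropy loss."""
    return math.exp(loss)


for stage, loss in val_loss.items():
    print(f"{stage}: PPL={perplexity(loss):.2f}")
```

Rounded to two decimals, this gives 3.03 for Stage A and 3.29 for Stage B, matching the reported figures.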

	## Inference

	> Minimal example uses `inference_min.py` included in this repo.
	> Requires: `pip install torch transformers open_clip_torch pillow huggingface_hub`

	```python
	import sys

	from huggingface_hub import snapshot_download

	# 1) download repo snapshot (glob patterns select weights, configs, and the helper script)
	repo_dir = snapshot_download(
	    "moxeeeem/dermlip-gpt2-captioner",
	    allow_patterns=["*.pt", "*.json", "inference_min.py"],
	)

	# make the downloaded inference_min.py importable before importing from it
	sys.path.insert(0, repo_dir)
	from inference_min import load_model, generate

	# 2) load model from saved config/weights
	model = load_model(repo_dir)  # builds CLIP backend + GPT-2 + prefix projector

	# 3) run generation
	img_paths = ["/path/to/derma_image.jpg"]  # local test images
	# prompt wording kept identical to the training prompt
	prompt = (
	    "Describe the skin lesion concisely (morphology, color, scale, border, location) in one sentence."
	    "Conclude with the most likely diagnosis (1\u20133 words)."
	)
	caps = generate(model, img_paths, prompt=prompt)
	for c in caps:
	    print(c)
	```


	## Files
	| File | Size | Check |
	|---|---:|---|
	| `best_stageA.pt` | 2 GB | sha256[:12]=3219636f48b0 |
	| `best_stageB.pt` | 2 GB | sha256[:12]=69bded2dcad1 |
	| `final_captioner_gpt2-medium_VisionTransformer.json` | 849 B | sha256[:12]=e157402c9fe2 |
	| `final_captioner_gpt2-medium_VisionTransformer.pt` | 2 GB | sha256[:12]=536ae07811c9 |
	| `loss_dermlip_vitb16.png` | 110 KB | sha256[:12]=a04b1e5832d9 |
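
The Check column can be used to validate a download. The sketch below assumes `sha256[:12]` means the first 12 hexadecimal characters of each file's SHA-256 digest; `sha256_prefix` and `verify` are hypothetical helpers, not part of this repo.

```python
import hashlib
from pathlib import Path

# Expected 12-character SHA-256 prefixes, copied from the Files table.
EXPECTED = {
    "best_stageA.pt": "3219636f48b0",
    "best_stageB.pt": "69bded2dcad1",
    "final_captioner_gpt2-medium_VisionTransformer.json": "e157402c9fe2",
    "final_captioner_gpt2-medium_VisionTransformer.pt": "536ae07811c9",
    "loss_dermlip_vitb16.png": "a04b1e5832d9",
}


def sha256_prefix(path, length=12, chunk_size=1 << 20):
    """Return the first `length` hex characters of a file's SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so multi-gigabyte checkpoints are never held in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def verify(repo_dir):
    """Map each listed file found in `repo_dir` to True (match) or False (mismatch)."""
    results = {}
    for name, expected in EXPECTED.items():
        path = Path(repo_dir) / name
        if path.is_file():
            results[name] = sha256_prefix(path) == expected
    return results
```

Calling `verify(repo_dir)` on the directory returned by `snapshot_download` reports only the files that were actually downloaded; files excluded by `allow_patterns` are skipped.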

	## Details

	- Vision Encoder: DermLIP (ViT-B/16)
	- Language Model: GPT-2 (`gpt2-medium`)
	- CLIP weights: `hf-hub:redlessone/DermLIP_ViT-B-16`
	- Prefix tokens: 32
	- Training prompt: `Describe the skin lesion concisely (morphology, color, scale, border, location) in one sentence.Conclude with the most likely diagnosis (1–3 words).`

	### Model Type Detection
	- Detected as: `dermlip`
	- Repository: `moxeeeem/dermlip-gpt2-captioner`

	_Auto-generated on 2025-08-30 09:25 UTC._