	---
	license: apache-2.0
	base_model: Qwen/Qwen3-VL-8B-Thinking
	tags:
	- dermatology
	- medical
	- vision-language
	- caption-generation
	- clinical-nlp
	- fine-tuned
	- qwen3-vl
	language:
	- en
	- th
	datasets:
	- SkinCAP
	pipeline_tag: image-text-to-text
	---

	<p align="center">
	<img src="HIKARI_logo.png" alt="HIKARI — Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference" width="100%"/>
	</p>

	<h1 align="center">HIKARI-Rigel-8B-SkinCaption</h1>

	<p align="center">
	<b>Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference</b><br/>
	Named after <b>Rigel</b> — blue supergiant in Orion, a first step in the caption training path
	</p>

	<p align="center">
	<img src="https://img.shields.io/badge/Base%20Model-Qwen3--VL--8B--Thinking-blue?style=flat-square"/>
	<img src="https://img.shields.io/badge/Task-Clinical%20Caption%20Generation-teal?style=flat-square"/>
	<img src="https://img.shields.io/badge/BLEU--4-9.82-yellow?style=flat-square"/>
	<img src="https://img.shields.io/badge/Init-Checkpoint--Init-gray?style=flat-square"/>
	<img src="https://img.shields.io/badge/License-Apache%202.0-orange?style=flat-square"/>
	</p>

	---

	## 📦 Model Type: Merged Full Model

	> This is a fully merged model — the LoRA adapter weights have been merged directly into the base model weights.
	>
	> ✅ No adapter loading needed. Load directly with `transformers`, `vLLM`, or `SGLang`.
	>
> 💾 Size: ~17 GB (4 safetensors shards)
	>
	> 🔌 Lightweight adapter version:
	> [E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA) (~1.1 GB)
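
To make "merged" concrete, here is a minimal sketch of folding a LoRA update into a frozen weight matrix, assuming the standard LoRA formulation `W_merged = W + (alpha / r) * (B @ A)`. The matrices, `alpha`, and `rank` are toy values chosen for illustration; they are not this model's weights or hyperparameters, and the helper names are hypothetical.

```python
# Toy LoRA merge in pure Python: W_merged = W + (alpha / r) * (B @ A).
# All shapes and values are illustrative, not taken from this model.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold the scaled low-rank update into the base weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def forward(weight, x):
    """Apply a weight matrix to an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight, 2 x 3
A = [[0.5, -0.5, 1.0]]                    # LoRA A, r x in = 1 x 3
B = [[2.0], [-1.0]]                       # LoRA B, out x r = 2 x 1
alpha, rank = 2, 1
x = [1.0, 2.0, 3.0]

# Adapter path: base output plus the scaled low-rank correction.
low_rank = forward(B, forward(A, x))
adapter_out = [b + (alpha / rank) * l for b, l in zip(forward(W, x), low_rank)]

# Merged path: a single matrix, no adapter needed at inference time.
merged_out = forward(merge_lora(W, A, B, alpha, rank), x)

print(adapter_out, merged_out)
```

Both paths give the same output, which is why a merged checkpoint can be loaded without any adapter code.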

	---

	## Overview

	HIKARI-Rigel generates clinical skin lesion captions using the checkpoint-init (Way 1) training strategy: Stage 3 caption training continues directly from the Stage 2 LoRA checkpoint, fine-tuning the existing disease adapters further on caption data.

This is an ablation baseline. For best captioning performance, use [⭐ HIKARI-Vega-8B-SkinCaption-Fused](https://huggingface.co/E27085921/HIKARI-Vega-8B-SkinCaption-Fused) (BLEU-4: 29.33, roughly 3× this model's 9.82).

| Property | Value |
|:---------|:------|
| Task | Clinical skin lesion caption generation (Stage 3) |
| Base model | `Qwen/Qwen3-VL-8B-Thinking` |
| Init strategy | Checkpoint-Init (continues from Stage 2 LoRA checkpoint) |
| BLEU-4 | 9.82 |
| ROUGE-1 | 38.90 |
| BERTScore-F | 88.12 (roberta-large) |
| Model type | Merged full model |
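
As a reading aid for the ROUGE-1 row, the sketch below computes a unigram-overlap F1 between a candidate caption and a reference. This card does not state which ROUGE implementation, variant (precision, recall, or F1), or tokenizer produced the reported score, so the whitespace tokenization and the F1 choice here are assumptions, and the two captions are made up.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 with clipped counts and whitespace tokenization (assumed)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Counter intersection keeps, per token, the smaller of the two counts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical captions, for illustration only.
reference = "a red scaly plaque on the left forearm"
candidate = "a red plaque on the forearm"

score = rouge1_f1(candidate, reference)
print(f"ROUGE-1 F1: {100 * score:.2f}")  # scaled to 0-100 like the table
```

Here every candidate word appears in the reference (precision 1.0), but two reference words are missing (recall 0.75), so the F1 lands between the two.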

	### Why Checkpoint-Init Underperforms

Checkpoint-init continues training the Stage 2 disease LoRA adapters on caption data. Because the disease knowledge is stored in those same LoRA weights, the caption learning signal overwrites it. As a result, the model loses its diagnostic ability before it has fully learned to generate captions.

| Init | BLEU-4 | ROUGE-1 | Disease knowledge |
|:-----|:------:|:-------:|:-----------------:|
| Checkpoint (this model) | 9.82 | 38.90 | ❌ Lost during training |
| Merged (Vega) | 29.33 | 53.55 | ✅ Locked in base weights |
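
The difference between the two rows can be sketched as a question of which parameters caption training is allowed to change. The toy below assumes that merged init folds the Stage 2 adapter into the frozen base and then attaches a fresh zero adapter; the exact Vega recipe is not described in this card, so treat that as an assumption. Parameter names, values, and the update rule are hypothetical, and the sketch tracks only what is frozen versus trainable. It does not model training dynamics or predict scores.

```python
# Toy sketch: which parameters Stage 3 caption training may overwrite
# under each init strategy. Names and numbers are hypothetical.

def checkpoint_init(base, stage2_delta):
    """Continue training the Stage 2 adapter itself."""
    return {"frozen": dict(base), "trainable": dict(stage2_delta)}

def merged_init(base, stage2_delta):
    """Fold the Stage 2 adapter into the base, then attach a fresh zero adapter."""
    merged = {name: w + stage2_delta.get(name, 0.0) for name, w in base.items()}
    return {"frozen": merged, "trainable": {name: 0.0 for name in stage2_delta}}

def caption_step(setup, grads, lr=0.5):
    """Apply one toy update; only the trainable adapter moves."""
    for name, grad in grads.items():
        setup["trainable"][name] -= lr * grad

base = {"q_proj": 1.0, "v_proj": -2.0}
stage2_delta = {"q_proj": 0.5, "v_proj": 0.25}  # stands in for disease knowledge
caption_grads = {"q_proj": 1.0, "v_proj": -1.0}

checkpoint_setup = checkpoint_init(base, stage2_delta)
merged_setup = merged_init(base, stage2_delta)
for setup in (checkpoint_setup, merged_setup):
    caption_step(setup, caption_grads)

# Checkpoint init: the stored Stage 2 delta has been edited in place.
delta_intact_checkpoint = checkpoint_setup["trainable"] == stage2_delta
# Merged init: base + Stage 2 delta still sits untouched in the frozen weights.
delta_intact_merged = merged_setup["frozen"] == {"q_proj": 1.5, "v_proj": -1.75}

print(delta_intact_checkpoint, delta_intact_merged)
```

Under checkpoint init the only copy of the Stage 2 delta is the thing being trained, whereas under merged init it is part of the frozen weights and caption updates land in a separate adapter.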

	---

	## 🔧 Quick Inference — `transformers`

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

model_id = "E27085921/HIKARI-Rigel-8B-SkinCaption"

# Load the processor and the merged model (no adapter loading needed).
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("skin_lesion.jpg").convert("RGB")

PROMPT = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Greedy decoding: with do_sample=False, temperature is unused, so it is omitted.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, dropping the prompt.
generated = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```
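
The base model is a "Thinking" variant, so the decoded text may contain a reasoning segment terminated by `</think>` before the caption. Whether this fine-tuned checkpoint actually emits such a segment is not stated in this card; the helper below is a hedged sketch that handles both cases, and `split_thinking` is a hypothetical name, not part of any library.

```python
def split_thinking(decoded, end_tag="</think>"):
    """Return (reasoning, answer); reasoning is empty when no end tag is present."""
    reasoning, sep, answer = decoded.rpartition(end_tag)
    if not sep:
        # No reasoning segment found: treat the whole text as the caption.
        return "", decoded.strip()
    return reasoning.replace("<think>", "").strip(), answer.strip()

# Made-up strings for illustration, not real model output.
with_reasoning = "<think>\nLooks scaly and well demarcated.\n</think>\n\nA red scaly plaque."
without_reasoning = "A red scaly plaque."

print(split_thinking(with_reasoning))
print(split_thinking(without_reasoning))
```

If a reasoning segment is present, a small `max_new_tokens` budget may be spent before the caption starts, so the limit may need raising.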

	---

	## 🔌 LoRA Adapter Version

```python
from peft import PeftModel
from transformers import Qwen3VLForConditionalGeneration
import torch

# Load the unmodified base model first.
base = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Thinking", torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base, "E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA")
```

	→ [E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)

	---

	## 📄 Citation

```bibtex
@misc{hikari2026,
  title       = {HIKARI: RAG-in-Training for Skin Disease Diagnosis
                 with Cascaded Vision-Language Models},
  author      = {Watin Promfiy and Pawitra Boonprasart},
  year        = {2026},
  institution = {King Mongkut's Institute of Technology Ladkrabang,
                 Department of Information Technology, Bangkok, Thailand}
}
```

	<p align="center">Made with ❤️ at <b>King Mongkut's Institute of Technology Ladkrabang (KMITL)</b></p>