---
license: apache-2.0
base_model: Qwen/Qwen3-VL-8B-Thinking
tags:
- dermatology
- medical
- vision-language
- caption-generation
- clinical-nlp
- fine-tuned
- qwen3-vl
language:
- en
- th
datasets:
- SkinCAP
pipeline_tag: image-text-to-text
---
<p align="center">
<img src="HIKARI_logo.png" alt="HIKARI: Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference" width="100%"/>
</p>
<h1 align="center">HIKARI-Rigel-8B-SkinCaption</h1>
<p align="center">
<b>Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference</b><br/>
Named after <b>Rigel</b>, the blue supergiant in Orion; the first step on the caption training path
</p>
<p align="center">
<img src="https://img.shields.io/badge/Base%20Model-Qwen3--VL--8B--Thinking-blue?style=flat-square"/>
<img src="https://img.shields.io/badge/Task-Clinical%20Caption%20Generation-teal?style=flat-square"/>
<img src="https://img.shields.io/badge/BLEU--4-9.82-yellow?style=flat-square"/>
<img src="https://img.shields.io/badge/Init-Checkpoint--Init-gray?style=flat-square"/>
<img src="https://img.shields.io/badge/License-Apache%202.0-orange?style=flat-square"/>
</p>
---
## 📦 Model Type: Merged Full Model
> This is a **fully merged model**: the LoRA adapter weights have been merged directly into the base model weights.
>
> ✅ **No adapter loading needed.** Load directly with `transformers`, `vLLM`, or `SGLang`.
>
> 💾 **Size:** ~17 GB (4 safetensors shards)
>
> 🔗 Lightweight adapter version:
> **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)** (~1.1 GB)
---
## Overview
**HIKARI-Rigel** generates clinical skin lesion captions using the **checkpoint-init** (Way 1) training strategy: Stage 3 caption training continues directly from the Stage 2 LoRA checkpoint, fine-tuning the existing disease adapters further on caption data.
This is an ablation baseline. For the best captioning performance, use [→ HIKARI-Vega-8B-SkinCaption-Fused](https://huggingface.co/E27085921/HIKARI-Vega-8B-SkinCaption-Fused) (BLEU-4: 29.33, roughly 3× higher).
| Property | Value |
|:---------|:------|
| **Task** | Clinical skin lesion caption generation (Stage 3) |
| **Base model** | `Qwen/Qwen3-VL-8B-Thinking` |
| **Init strategy** | Checkpoint-Init: continues from the Stage 2 LoRA checkpoint |
| **BLEU-4** | 9.82 |
| **ROUGE-1** | 38.90 |
| **BERTScore-F** | 88.12 (roberta-large) |
| **Model type** | Merged full model |
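For context, BLEU-4 is the geometric mean of clipped 1- to 4-gram precisions against the reference captions, times a brevity penalty. A minimal pure-Python sketch of the metric (simplified: sentence level, single reference, no smoothing; the reported score was computed with a standard corpus-level toolkit):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Simplified BLEU-4: geometric mean of clipped 1..4-gram
    precisions times a brevity penalty (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / 4
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_avg)

print(round(100 * bleu4("a well demarcated erythematous plaque",
                        "a well demarcated erythematous plaque"), 2))  # identical -> 100.0
```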
### Why Checkpoint-Init Underperforms
The Stage 2 disease LoRA adapters are directly continued into caption training. The caption learning signal overwrites the disease knowledge that was stored in those same LoRA weights. Result: the model loses its diagnostic ability before it fully learns to generate captions.
| Init | BLEU-4 | ROUGE-1 | Disease knowledge |
|:-----|:------:|:-------:|:-----------------:|
| **Checkpoint (this model)** | 9.82 | 38.90 | ❌ Lost during training |
| **Merged (Vega)** | **29.33** | **53.55** | ✅ Locked in base weights |
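The two init strategies can be sketched with the LoRA update rule W' = W + (α/r)·B·A. Under checkpoint-init, Stage 3 gradients keep rewriting the same low-rank factors B and A that encode the disease solution; merging first folds that update into W, so Stage 3 trains fresh adapters on top of frozen disease knowledge. A toy NumPy illustration (hypothetical dimensions, not the actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 16          # toy hidden size, LoRA rank, scaling alpha

W = rng.normal(size=(d, d))     # frozen base weight
A = rng.normal(size=(r, d))     # Stage 2 disease adapter (down-projection)
B = rng.normal(size=(d, r))     # Stage 2 disease adapter (up-projection)
scale = alpha / r

# Way 1 (checkpoint-init, this model): Stage 3 keeps updating B, A in place,
# so the disease solution stored in B @ A is overwritten by caption gradients.
B_drifted = B + 0.5 * rng.normal(size=B.shape)   # stand-in for caption updates
W_way1 = W + scale * (B_drifted @ A)

# Way 2 (merged-init, Vega): fold the disease update into the base first,
# then train *fresh* adapters for captions on top of the merged weight.
W_merged = W + scale * (B @ A)                   # disease knowledge locked in
B_new, A_new = np.zeros((d, r)), rng.normal(size=(r, d))  # standard LoRA init
W_way2 = W_merged + scale * (B_new @ A_new)      # starts exactly at W_merged

print(np.allclose(W_way2, W_merged))  # True: merged init preserves Stage 2
print(np.allclose(W_way1, W_merged))  # False: checkpoint init has drifted
```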
---
## 🔧 Quick Inference with `transformers`
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

model_id = "E27085921/HIKARI-Rigel-8B-SkinCaption"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("skin_lesion.jpg").convert("RGB")
PROMPT = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Greedy decoding; `temperature` is ignored when do_sample=False, so it is omitted.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip())
```
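Because the base model is a "Thinking" variant, raw generations may wrap chain-of-thought reasoning in `<think>…</think>` tags before the final caption. A small post-processing helper, assuming that tag format (verify against your actual outputs):

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> reasoning blocks, keeping only the caption."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Erythema, scale, sharp borders...</think>A well-demarcated erythematous plaque with silvery scale."
print(strip_thinking(raw))
```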
---
## 🔗 LoRA Adapter Version
```python
from peft import PeftModel
from transformers import Qwen3VLForConditionalGeneration
import torch

base = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Thinking",
    torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(base, "E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA")
```
→ **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)**
---
## 📖 Citation
```bibtex
@misc{hikari2026,
title = {HIKARI: RAG-in-Training for Skin Disease Diagnosis
with Cascaded Vision-Language Models},
author = {Watin Promfiy and Pawitra Boonprasart},
year = {2026},
institution = {King Mongkut's Institute of Technology Ladkrabang,
Department of Information Technology, Bangkok, Thailand}
}
```
<p align="center">Made with ❤️ at <b>King Mongkut's Institute of Technology Ladkrabang (KMITL)</b></p>