---
license: apache-2.0
base_model: Qwen/Qwen3-VL-8B-Thinking
tags:
  - dermatology
  - medical
  - vision-language
  - caption-generation
  - clinical-nlp
  - fine-tuned
  - qwen3-vl
language:
  - en
  - th
datasets:
  - SkinCAP
pipeline_tag: image-text-to-text
---

<p align="center">
  <img src="HIKARI_logo.png" alt="HIKARI — Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference" width="100%"/>
</p>

<h1 align="center">HIKARI-Rigel-8B-SkinCaption</h1>

<p align="center">
  <b>Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference</b><br/>
  Named after <b>Rigel</b>, a blue supergiant in Orion — the first step on the caption training path
</p>

<p align="center">
  <img src="https://img.shields.io/badge/Base%20Model-Qwen3--VL--8B--Thinking-blue?style=flat-square"/>
  <img src="https://img.shields.io/badge/Task-Clinical%20Caption%20Generation-teal?style=flat-square"/>
  <img src="https://img.shields.io/badge/BLEU--4-9.82-yellow?style=flat-square"/>
  <img src="https://img.shields.io/badge/Init-Checkpoint--Init-gray?style=flat-square"/>
  <img src="https://img.shields.io/badge/License-Apache%202.0-orange?style=flat-square"/>
</p>

---

## 📦 Model Type: Merged Full Model

> This is a **fully merged model** — the LoRA adapter weights have been merged directly into the base model weights.
>
> ✅ **No adapter loading needed.** Load directly with `transformers`, `vLLM`, or `SGLang`.
>
> 💾 **Size:** ~17 GB (4 safetensors shards)
>
> 🔌 Lightweight adapter version:
> **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)** (~1.1 GB)

---

## Overview

**HIKARI-Rigel** generates clinical skin lesion captions using the **checkpoint-init** (Way 1) training strategy: Stage 3 caption training continues directly from the Stage 2 LoRA checkpoint, fine-tuning the existing disease adapters further on caption data.

This is an ablation baseline. For best captioning performance, use [⭐ HIKARI-Vega-8B-SkinCaption-Fused](https://huggingface.co/E27085921/HIKARI-Vega-8B-SkinCaption-Fused) (BLEU-4: 29.33, roughly 3× higher).

| Property | Value |
|:---------|:------|
| **Task** | Clinical skin lesion caption generation (Stage 3) |
| **Base model** | `Qwen/Qwen3-VL-8B-Thinking` |
| **Init strategy** | Checkpoint-Init — continues from Stage 2 LoRA checkpoint |
| **BLEU-4** | 9.82 |
| **ROUGE-1** | 38.90 |
| **BERTScore-F** | 88.12 (roberta-large) |
| **Model type** | Merged full model |

### Why Checkpoint-Init Underperforms

Stage 3 caption training continues inside the same Stage 2 disease LoRA adapters. The caption learning signal overwrites the disease knowledge stored in those very weights, so the model loses its diagnostic ability before it has fully learned to generate captions.

| Init | BLEU-4 | ROUGE-1 | Disease knowledge |
|:-----|:------:|:-------:|:-----------------:|
| **Checkpoint (this model)** | 9.82 | 38.90 | ❌ Lost during training |
| **Merged (Vega)** | **29.33** | **53.55** | ✅ Locked in base weights |

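The mechanism can be sketched numerically. The toy below uses scalar stand-ins for the LoRA weight matrices (the real factors are low-rank matrices and the update comes from gradients, not a constant); it only illustrates why folding the Stage 2 delta into the base weights protects it, while continuing to train the same adapter does not.

```python
# Toy scalar version of a LoRA update: effective weight = W + B * A.
W_base = 1.0
A2, B2 = 0.7, 0.4          # Stage 2 "disease" adapter factors
delta2 = B2 * A2           # disease knowledge lives only in this delta

# Checkpoint-init (this model): Stage 3 keeps updating the same factors,
# so the caption-training updates rewrite delta2 itself.
B2_updated = B2 + 0.3      # stand-in for Stage 3 caption updates
W_ckpt = W_base + B2_updated * A2

# Merged-init (Vega): delta2 is first folded into the base weights,
# then a fresh adapter (initialized so B * A == 0) trains on top.
W_merged_base = W_base + delta2
A3, B3 = 0.9, 0.0          # fresh LoRA: B starts at zero
W_vega = W_merged_base + B3 * A3

print(round(W_vega - W_base, 6))   # 0.28 -> Stage 2 delta preserved exactly
print(round(W_ckpt - W_base, 6))   # 0.49 -> Stage 2 delta overwritten
```

In the merged path the Stage 2 contribution sits in `W_merged_base`, out of reach of Stage 3 gradients; in the checkpoint path every Stage 3 step moves the only copy of it.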
---

## 🔧 Quick Inference — `transformers`

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

model_id = "E27085921/HIKARI-Rigel-8B-SkinCaption"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("skin_lesion.jpg").convert("RGB")

PROMPT = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip())
```
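Because the base model is a Qwen3 *Thinking* variant, decoded outputs may begin with a reasoning trace before the final caption. A minimal post-processing helper is sketched below; it assumes the trace is closed by a `</think>` tag, which you should verify against your actual decoded outputs.

```python
def strip_thinking(text: str, close_tag: str = "</think>") -> str:
    """Return only the text after the last reasoning block, if one is present."""
    if close_tag in text:
        # Keep everything after the final closing tag; drop the trace.
        text = text.rsplit(close_tag, 1)[1]
    return text.strip()

# Hypothetical decoded output, for illustration only:
raw = "<think>The lesion is pigmented...</think>A well-demarcated pigmented macule."
print(strip_thinking(raw))  # A well-demarcated pigmented macule.
```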

---

## 🔌 LoRA Adapter Version

```python
from peft import PeftModel
from transformers import Qwen3VLForConditionalGeneration
import torch

base = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Thinking", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA")
```

→ **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)**

---

## 📄 Citation

```bibtex
@misc{hikari2026,
  title  = {HIKARI: RAG-in-Training for Skin Disease Diagnosis
            with Cascaded Vision-Language Models},
  author = {Watin Promfiy and Pawitra Boonprasart},
  year   = {2026},
  institution = {King Mongkut's Institute of Technology Ladkrabang,
                 Department of Information Technology, Bangkok, Thailand}
}
```

<p align="center">Made with ❤️ at <b>King Mongkut's Institute of Technology Ladkrabang (KMITL)</b></p>