---
license: apache-2.0
base_model: Qwen/Qwen3-VL-8B-Thinking
tags:
- dermatology
- medical
- vision-language
- caption-generation
- clinical-nlp
- fine-tuned
- qwen3-vl
language:
- en
- th
datasets:
- SkinCAP
pipeline_tag: image-text-to-text
---
<p align="center">
<img src="HIKARI_logo.png" alt="HIKARI: Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference" width="100%"/>
</p>
<h1 align="center">HIKARI-Rigel-8B-SkinCaption</h1>
<p align="center">
<b>Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference</b><br/>
Named after <b>Rigel</b>, the blue supergiant in Orion: a first step in the caption training path
</p>
<p align="center">
<img src="https://img.shields.io/badge/Base%20Model-Qwen3--VL--8B--Thinking-blue?style=flat-square"/>
<img src="https://img.shields.io/badge/Task-Clinical%20Caption%20Generation-teal?style=flat-square"/>
<img src="https://img.shields.io/badge/BLEU--4-9.82-yellow?style=flat-square"/>
<img src="https://img.shields.io/badge/Init-Checkpoint--Init-gray?style=flat-square"/>
<img src="https://img.shields.io/badge/License-Apache%202.0-orange?style=flat-square"/>
</p>
---
## 📦 Model Type: Merged Full Model
> This is a **fully merged model**: the LoRA adapter weights have been merged directly into the base model weights.
>
> ✅ **No adapter loading needed.** Load directly with `transformers`, `vLLM`, or `SGLang`.
>
> 💾 **Size:** ~17 GB (4 safetensor shards)
>
> 🔌 Lightweight adapter version:
> **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)** (~1.1 GB)
---
## Overview
**HIKARI-Rigel** generates clinical skin lesion captions using the **checkpoint-init** (Way 1) training strategy: Stage 3 caption training continues directly from the Stage 2 LoRA checkpoint, fine-tuning the existing disease adapters further on caption data.
This is an ablation baseline. For best captioning performance, use [⭐ HIKARI-Vega-8B-SkinCaption-Fused](https://huggingface.co/E27085921/HIKARI-Vega-8B-SkinCaption-Fused) (BLEU-4: 29.33, roughly 3× higher than this model's 9.82).
| Property | Value |
|:---------|:------|
| **Task** | Clinical skin lesion caption generation (Stage 3) |
| **Base model** | `Qwen/Qwen3-VL-8B-Thinking` |
| **Init strategy** | Checkpoint-Init: continues from Stage 2 LoRA checkpoint |
| **BLEU-4** | 9.82 |
| **ROUGE-1** | 38.90 |
| **BERTScore-F** | 88.12 (roberta-large) |
| **Model type** | Merged full model |
### Why Checkpoint-Init Underperforms
Caption training continues directly on the Stage 2 disease LoRA adapters, so the caption learning signal overwrites the disease knowledge stored in those same LoRA weights. As a result, the model loses its diagnostic ability before it has fully learned to generate captions.
| Init | BLEU-4 | ROUGE-1 | Disease knowledge |
|:-----|:------:|:-------:|:-----------------:|
| **Checkpoint (this model)** | 9.82 | 38.90 | ❌ Lost during training |
| **Merged (Vega)** | **29.33** | **53.55** | ✅ Locked in base weights |
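
The difference between the two init strategies can be illustrated with a toy example. The sketch below is not the HIKARI training code: it uses 2×2 matrices in pure Python, assumes that "merged" init means folding the Stage 2 LoRA delta into the frozen base before attaching a fresh adapter, and models the extreme case in which caption training fully replaces the trainable factors. All names (`W`, `B2`, `A2`, `B3`, `A3`) are hypothetical.

```python
# Toy illustration only: one 2x2 weight matrix with rank-1 LoRA factors.
# Assumption: "merged" init = fold the Stage 2 delta into the frozen base,
# then train a fresh adapter. Values are made up for the example.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def effective(frozen, lora_b, lora_a):
    """Effective weight of a LoRA layer: frozen + B @ A."""
    return add(frozen, matmul(lora_b, lora_a))

W = [[1.0, 0.0], [0.0, 1.0]]               # frozen pretrained weight
B2, A2 = [[0.5], [0.0]], [[1.0, 2.0]]      # Stage 2 (disease) LoRA factors
B3, A3 = [[0.0], [0.25]], [[1.0, -1.0]]    # factors caption training ends at (hypothetical)

# Checkpoint-init: the Stage 2 factors ARE the trainable parameters,
# so caption training moves them away from (B2, A2).
ckpt_start = effective(W, B2, A2)
ckpt_end = effective(W, B3, A3)            # extreme case: fully overwritten

# Merged-init: the Stage 2 delta is folded into the frozen weight and a
# fresh adapter (B initialised to zero) is trained for captions.
W_merged = effective(W, B2, A2)
merged_start = effective(W_merged, [[0.0], [0.0]], A3)
merged_end = effective(W_merged, B3, A3)

print("checkpoint-init:", ckpt_start, "->", ckpt_end)
print("merged-init:    ", merged_start, "->", merged_end)
```

In this toy setting both strategies start from the same effective weight, but the checkpoint-init weight ends with no trace of the Stage 2 delta, while the merged-init weight keeps it because it sits in frozen parameters. Real training only partially overwrites an adapter, so the effect is a matter of degree.
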
---
## 🔧 Quick Inference: `transformers`
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

model_id = "E27085921/HIKARI-Rigel-8B-SkinCaption"

# Load the processor and the merged model (no adapter loading needed).
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("skin_lesion.jpg").convert("RGB")
PROMPT = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT},
]}]

# Render the chat template to text, then tokenize the text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Greedy decoding: with do_sample=False, temperature is not used, so it is omitted.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, dropping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip())
```
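
Because the base model is a Thinking variant, the decoded text may contain a reasoning segment terminated by `</think>` before the caption. Whether this fine-tuned checkpoint emits such a segment is an assumption, not something this card states. `split_thinking` below is a hypothetical helper for separating the two parts of a decoded string:

```python
def split_thinking(decoded: str) -> tuple[str, str]:
    """Split decoded text into (reasoning, caption) at the last </think> tag.

    Assumption: reasoning, if present, ends with a literal "</think>" and may
    or may not start with "<think>". Text without the tag is all caption.
    """
    marker = "</think>"
    if marker not in decoded:
        return "", decoded.strip()
    reasoning, _, caption = decoded.rpartition(marker)
    # Drop an opening tag if the model emitted one.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, caption.strip()


# Example with a made-up string in place of real model output.
reasoning, caption = split_thinking("<think>scaly border</think>\n\nErythematous plaque.")
print(caption)
```

If generation stops at `max_new_tokens` before the closing tag appears, the helper returns the whole text as the caption, so a larger token budget may be needed in that case.
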
---
## 🔌 LoRA Adapter Version
```python
from peft import PeftModel
from transformers import Qwen3VLForConditionalGeneration
import torch

# Load the unmodified base model first.
base = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Thinking", torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base, "E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA")
```
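
For reference, a merged checkpoint similar to the full model can in principle be produced from the adapter with PEFT's `merge_and_unload()`. The sketch below is an assumption about the workflow, not the authors' export script: `merge_adapter` and its `output_dir` argument are hypothetical, and the result is not guaranteed to match the published weights exactly.

```python
def merge_adapter(base_id: str, adapter_id: str, output_dir: str) -> str:
    """Fold a LoRA adapter into its base model and save the merged weights.

    Heavy imports are done inside the function so that defining it does not
    require torch, transformers, or peft to be installed.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

    base = Qwen3VLForConditionalGeneration.from_pretrained(
        base_id, torch_dtype=torch.bfloat16
    )
    model = PeftModel.from_pretrained(base, adapter_id)

    # merge_and_unload() adds each LoRA delta to its base weight and
    # returns a plain transformers model without PEFT wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir, safe_serialization=True)

    # Save the processor alongside so the directory loads on its own.
    AutoProcessor.from_pretrained(base_id).save_pretrained(output_dir)
    return output_dir
```

A call would look like `merge_adapter("Qwen/Qwen3-VL-8B-Thinking", "E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA", "./rigel-merged")`, where the output path is a placeholder.
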
→ **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)**
---
## 📄 Citation
```bibtex
@misc{hikari2026,
  title       = {HIKARI: RAG-in-Training for Skin Disease Diagnosis
                 with Cascaded Vision-Language Models},
  author      = {Watin Promfiy and Pawitra Boonprasart},
  year        = {2026},
  institution = {King Mongkut's Institute of Technology Ladkrabang,
                 Department of Information Technology, Bangkok, Thailand}
}
```
<p align="center">Made with ❤️ at <b>King Mongkut's Institute of Technology Ladkrabang (KMITL)</b></p>