| --- |
| license: apache-2.0 |
| base_model: Qwen/Qwen3-VL-8B-Thinking |
| tags: |
| - dermatology |
| - medical |
| - vision-language |
| - caption-generation |
| - clinical-nlp |
| - fine-tuned |
| - qwen3-vl |
| language: |
| - en |
| - th |
| datasets: |
| - SkinCAP |
| pipeline_tag: image-text-to-text |
| --- |
| |
| <p align="center"> |
| <img src="HIKARI_logo.png" alt="HIKARI: Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference" width="100%"/> |
| </p> |
|
|
| <h1 align="center">HIKARI-Rigel-8B-SkinCaption</h1> |
|
|
| <p align="center"> |
| <b>Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference</b><br/> |
| Named after <b>Rigel</b>, the blue supergiant in Orion: a first step in the caption training path |
| </p> |
|
|
| <p align="center"> |
| <img src="https://img.shields.io/badge/Base%20Model-Qwen3--VL--8B--Thinking-blue?style=flat-square"/> |
| <img src="https://img.shields.io/badge/Task-Clinical%20Caption%20Generation-teal?style=flat-square"/> |
| <img src="https://img.shields.io/badge/BLEU--4-9.82-yellow?style=flat-square"/> |
| <img src="https://img.shields.io/badge/Init-Checkpoint--Init-gray?style=flat-square"/> |
| <img src="https://img.shields.io/badge/License-Apache%202.0-orange?style=flat-square"/> |
| </p> |
|
|
| --- |
|
|
| ## 📦 Model Type: Merged Full Model |
|
|
| > This is a **fully merged model**: the LoRA adapter weights have been merged directly into the base model weights. |
| > |
| > ✅ **No adapter loading needed.** Load directly with `transformers`, `vLLM`, or `SGLang`. |
| > |
| > 💾 **Size:** ~17 GB (4 safetensors shards) |
| > |
| > Lightweight adapter version: |
| > **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)** (~1.1 GB) |
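
Because the merged weights load directly in vLLM, one way to query the model is through vLLM's OpenAI-compatible server (for example, started with `vllm serve E27085921/HIKARI-Rigel-8B-SkinCaption`). The sketch below only builds the chat-completions request body. The helper name `build_caption_request`, the JPEG assumption, and the `max_tokens` value are illustrative choices, not settings documented for this model.

```python
import base64

MODEL_ID = "E27085921/HIKARI-Rigel-8B-SkinCaption"


def build_caption_request(image_bytes: bytes, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions body with one inline image.

    Hypothetical helper: the image is assumed to be JPEG and is embedded
    as a base64 data URL so no separate file hosting is needed.
    """
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,   # assumed value, mirrors max_new_tokens=256
        "temperature": 0.0,         # deterministic (greedy) decoding
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

The returned dictionary can be sent as JSON to the server's `/v1/chat/completions` endpoint (vLLM listens on port 8000 by default) with any HTTP client.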
|
|
| --- |
|
|
| ## Overview |
|
|
| **HIKARI-Rigel** generates clinical skin lesion captions using the **checkpoint-init** (Way 1) training strategy: Stage 3 caption training continues directly from the Stage 2 LoRA checkpoint, fine-tuning the existing disease adapters further on caption data. |
|
|
| This is an ablation baseline. For best captioning performance, use [HIKARI-Vega-8B-SkinCaption-Fused](https://huggingface.co/E27085921/HIKARI-Vega-8B-SkinCaption-Fused) (BLEU-4: 29.33, roughly 3× this model's 9.82). |
|
|
| | Property | Value | |
| |:---------|:------| |
| | **Task** | Clinical skin lesion caption generation (Stage 3) | |
| | **Base model** | `Qwen/Qwen3-VL-8B-Thinking` | |
| | **Init strategy** | Checkpoint-Init: continues from the Stage 2 LoRA checkpoint | |
| | **BLEU-4** | 9.82 | |
| | **ROUGE-1** | 38.90 | |
| | **BERTScore-F** | 88.12 (roberta-large) | |
| | **Model type** | Merged full model | |
|
|
| ### Why Checkpoint-Init Underperforms |
|
|
| Caption training continues directly on the Stage 2 disease LoRA adapters, so the caption learning signal overwrites the disease knowledge stored in those same LoRA weights. As a result, the model loses its diagnostic ability before it has fully learned to generate captions. |
|
|
| | Init | BLEU-4 | ROUGE-1 | Disease knowledge | |
| |:-----|:------:|:-------:|:-----------------:| |
| | **Checkpoint (this model)** | 9.82 | 38.90 | ❌ Lost during training | |
| | **Merged (Vega)** | **29.33** | **53.55** | ✅ Locked in base weights | |
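
The difference between the two init strategies can be illustrated with a toy, pure-Python sketch. It is not the HIKARI training code: the matrix sizes, the random "gradient" step, and all helper names are hypothetical. It also assumes that the merged strategy trains a fresh adapter on top of frozen merged weights. The sketch only shows where the Stage 2 update lives in each case and whether Stage 3 can modify it.

```python
import random

random.seed(0)
D, R = 4, 2  # toy hidden size and LoRA rank (illustrative values only)


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def fake_caption_step(m, lr=0.1):
    """Stand-in for a Stage 3 update: nudge a trainable matrix with random 'gradients'."""
    return [[v - lr * random.gauss(0.0, 1.0) for v in row] for row in m]


def max_abs_diff(x, y):
    return max(abs(a - b) for rx, ry in zip(x, y) for a, b in zip(rx, ry))


W0 = rand_matrix(D, D)                          # frozen base weight
B2, A2 = rand_matrix(D, R), rand_matrix(R, D)   # Stage 2 (disease) adapter factors
disease_delta = matmul(B2, A2)                  # what Stage 2 added to W0

# Checkpoint-init: Stage 3 keeps updating the same B2/A2, so the delta drifts.
B_ckpt, A_ckpt = fake_caption_step(B2), fake_caption_step(A2)
drift_checkpoint = max_abs_diff(matmul(B_ckpt, A_ckpt), disease_delta)

# Merged-init (assumed): fold the delta into the base, freeze it, train a new adapter.
W_merged = add(W0, disease_delta)
W_merged_before = [row[:] for row in W_merged]
B3 = [[0.0] * R for _ in range(D)]              # common LoRA init: B starts at zero
A3 = rand_matrix(R, D)
B3, A3 = fake_caption_step(B3), fake_caption_step(A3)  # only the new adapter moves
drift_merged = max_abs_diff(W_merged, W_merged_before)

print(f"checkpoint-init drift in disease delta: {drift_checkpoint:.4f}")
print(f"merged-init drift in merged base:       {drift_merged:.4f}")
```

In the checkpoint-init branch the Stage 2 product `B2 @ A2` changes as soon as Stage 3 updates its factors, whereas in the merged branch the folded-in weights are never touched, which matches the "locked in base weights" entry in the table above.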
|
|
| --- |
|
|
| ## 🔧 Quick Inference: `transformers` |
|
|
| ```python |
| from transformers import Qwen3VLForConditionalGeneration, AutoProcessor |
| import torch |
| from PIL import Image |
| |
| model_id = "E27085921/HIKARI-Rigel-8B-SkinCaption" |
| |
| # Load the processor and the merged weights in bfloat16 |
| processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True) |
| model = Qwen3VLForConditionalGeneration.from_pretrained( |
|     model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True |
| ) |
| |
| image = Image.open("skin_lesion.jpg").convert("RGB") |
| |
| PROMPT = ( |
|     "Describe this skin lesion image in detail. Include information about its " |
|     "appearance, possible diagnosis, and recommended examinations." |
| ) |
| |
| # Single-turn chat message with one image and one text instruction |
| messages = [{"role": "user", "content": [ |
|     {"type": "image", "image": image}, |
|     {"type": "text", "text": PROMPT}, |
| ]}] |
| text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
| inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device) |
| |
| with torch.no_grad(): |
|     # Greedy decoding; temperature has no effect when do_sample=False, so it is omitted |
|     out = model.generate(**inputs, max_new_tokens=256, do_sample=False) |
| |
| # Decode only the newly generated tokens (drop the prompt prefix) |
| print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip()) |
| ``` |
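
Since the base model is a "Thinking" variant, the decoded text may contain a reasoning segment that ends with `</think>` before the caption itself. This card does not state whether the fine-tuned model emits one, so the helper below is a hedged sketch: `split_reasoning` is a hypothetical name, and the tag format is an assumption based on the Qwen3 thinking convention.

```python
def split_reasoning(decoded: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded text into (reasoning, caption) at the last end_tag.

    If no end_tag is present, the whole text is treated as the caption.
    """
    head, sep, tail = decoded.rpartition(end_tag)
    if not sep:  # no reasoning segment found
        return "", decoded.strip()
    reasoning = head.replace("<think>", "").strip()
    return reasoning, tail.strip()
```

Applied to the decoded string from the example above, the second element of the returned tuple is the text to evaluate or display as the caption.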
|
|
| --- |
|
|
| ## LoRA Adapter Version |
|
|
| ```python |
| from peft import PeftModel |
| from transformers import Qwen3VLForConditionalGeneration |
| import torch |
| |
| # Load the base model, then attach the LoRA adapter on top of it |
| base = Qwen3VLForConditionalGeneration.from_pretrained( |
|     "Qwen/Qwen3-VL-8B-Thinking", torch_dtype=torch.bfloat16, device_map="auto" |
| ) |
| model = PeftModel.from_pretrained(base, "E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA") |
| ``` |
|
|
| **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)** |
|
|
| --- |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{hikari2026, |
|   title       = {HIKARI: RAG-in-Training for Skin Disease Diagnosis |
|                  with Cascaded Vision-Language Models}, |
|   author      = {Watin Promfiy and Pawitra Boonprasart}, |
|   year        = {2026}, |
|   institution = {King Mongkut's Institute of Technology Ladkrabang, |
|                  Department of Information Technology, Bangkok, Thailand} |
| } |
| ``` |
|
|
| <p align="center">Made with ❤️ at <b>King Mongkut's Institute of Technology Ladkrabang (KMITL)</b></p> |
|
|