---
license: apache-2.0
base_model: Qwen/Qwen3-VL-8B-Thinking
tags:
- dermatology
- medical
- vision-language
- caption-generation
- clinical-nlp
- fine-tuned
- qwen3-vl
language:
- en
- th
datasets:
- SkinCAP
pipeline_tag: image-text-to-text
---

HIKARI — Healthcare-oriented Intelligent Knowledge Augmented Retrieval and Inference

# HIKARI-Rigel-8B-SkinCaption

Named after Rigel, the blue supergiant in Orion: a first step in the caption training path.

---

## 📦 Model Type: Merged Full Model

> This is a **fully merged model**: the LoRA adapter weights have been merged directly into the base model weights.
>
> ✅ **No adapter loading needed.** Load directly with `transformers`, `vLLM`, or `SGLang`.
>
> 💾 **Size:** ~17 GB (4 safetensors shards)
>
> 🔌 Lightweight adapter version:
> **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)** (~1.1 GB)

---

## Overview

**HIKARI-Rigel** generates clinical skin lesion captions using the **checkpoint-init** (Way 1) training strategy: Stage 3 caption training continues directly from the Stage 2 LoRA checkpoint, further fine-tuning the existing disease adapters on caption data.

This model is an ablation baseline. For the best captioning performance, use [⭐ HIKARI-Vega-8B-SkinCaption-Fused](https://huggingface.co/E27085921/HIKARI-Vega-8B-SkinCaption-Fused) (BLEU-4: 29.33, roughly 3× this model's 9.82).

| Property | Value |
|:---------|:------|
| **Task** | Clinical skin lesion caption generation (Stage 3) |
| **Base model** | `Qwen/Qwen3-VL-8B-Thinking` |
| **Init strategy** | Checkpoint-init: continues from the Stage 2 LoRA checkpoint |
| **BLEU-4** | 9.82 |
| **ROUGE-1** | 38.90 |
| **BERTScore-F** | 88.12 (roberta-large) |
| **Model type** | Merged full model |

### Why Checkpoint-Init Underperforms

The Stage 2 disease LoRA adapters are carried directly into caption training, so the caption learning signal overwrites the disease knowledge stored in those same LoRA weights. As a result, the model loses its diagnostic ability before it fully learns to generate captions.
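The difference between the two init strategies can be sketched with a toy LoRA layer. This is an illustration only, not the HIKARI training code: the 2×2 weights, the rank-1 factors, and every variable name are hypothetical, and the "Stage 3" step simply replaces the trainable factors with new values (real training drifts gradually). It also assumes that merged-init trains a fresh adapter on top of the merged weights, which this card implies but does not state.

```python
# Toy illustration only: tiny matrices in plain Python, all values hypothetical.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def effective(frozen, B, A):
    """Effective weight of a LoRA layer: frozen base plus low-rank update B @ A."""
    return add(frozen, matmul(B, A))

W0 = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical pretrained base weight
B2, A2 = [[0.5], [0.0]], [[1.0, 2.0]]    # hypothetical Stage 2 (disease) adapter
stage2_delta = matmul(B2, A2)

# Checkpoint-init: the base stays W0 and the trainable factors START at (B2, A2),
# so the disease update lives only in parameters that Stage 3 will modify.
ckpt_frozen, ckpt_B, ckpt_A = W0, B2, A2

# Merged-init (assumed): fold B2 @ A2 into the frozen base, then start a fresh
# adapter whose B factor is zero, so it contributes nothing at step 0.
merged_frozen = effective(W0, B2, A2)
merged_B, merged_A = [[0.0], [0.0]], [[1.0, 1.0]]

# Both strategies begin Stage 3 from the same effective weights.
start_ckpt = effective(ckpt_frozen, ckpt_B, ckpt_A)
start_merged = effective(merged_frozen, merged_B, merged_A)

# Stand-in for Stage 3: the trainable factors end up at new caption-driven values.
B3, A3 = [[0.0], [0.25]], [[1.0, -1.0]]
end_ckpt = effective(ckpt_frozen, B3, A3)      # Stage 2 delta has been replaced
end_merged = effective(merged_frozen, B3, A3)  # Stage 2 delta is still in the base

print("start (both):", start_ckpt)
print("end, checkpoint-init:", end_ckpt)
print("end, merged-init:", end_merged)
```

In this toy setting the two strategies start out identical, but only the merged variant still contains `stage2_delta` after the adapter factors change, because that update sits in frozen weights rather than in the parameters being trained.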
| Init | BLEU-4 | ROUGE-1 | Disease knowledge |
|:-----|:------:|:-------:|:-----------------:|
| **Checkpoint (this model)** | 9.82 | 38.90 | ❌ Lost during training |
| **Merged (Vega)** | **29.33** | **53.55** | ✅ Locked in base weights |

---

## 🔧 Quick Inference: `transformers`

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image

model_id = "E27085921/HIKARI-Rigel-8B-SkinCaption"

# Load the processor (tokenizer + image preprocessing) and the merged model.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("skin_lesion.jpg").convert("RGB")

PROMPT = (
    "Describe this skin lesion image in detail. Include information about its "
    "appearance, possible diagnosis, and recommended examinations."
)

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT},
]}]

# Render the chat template to text, then tokenize text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Greedy decoding: do_sample=False makes temperature irrelevant, so it is omitted.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
generated = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```

---

## 🔌 LoRA Adapter Version

```python
from peft import PeftModel
from transformers import Qwen3VLForConditionalGeneration
import torch

# Load the original base model, then attach the LoRA adapter on top of it.
base = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Thinking",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA")
```

→ **[E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA](https://huggingface.co/E27085921/HIKARI-Rigel-8B-SkinCaption-LoRA)**

---

## 📄 Citation

```bibtex
@misc{hikari2026,
  title       = {HIKARI: RAG-in-Training for Skin Disease Diagnosis with Cascaded Vision-Language Models},
  author      = {Watin Promfiy and Pawitra Boonprasart},
  year        = {2026},
  institution = {King Mongkut's Institute of Technology Ladkrabang, Department of Information Technology, Bangkok, Thailand}
}
```

Made with ❤️ at King Mongkut's Institute of Technology Ladkrabang (KMITL)