PRS-Med: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging
Paper: arXiv:2505.11872
This repository contains the pretrained weights for PRS-Med, a modular framework for position reasoning segmentation in medical imaging. PRS-Med combines a vision-language model (LLaVA-Med) with a lightweight image encoder (TinySAM) and a custom mask decoder to perform context-aware medical image segmentation.
Accepted at CVPRW 2026.
PRS-Med consists of three main components:
| Component | Description | Details |
|---|---|---|
| LLaVA-Med | Vision-language model for semantic reasoning | Mistral-7B backbone, fine-tuned with LoRA (r=16, alpha=16) |
| TinySAM | Lightweight image encoder | ViT-Tiny, extracts 256-dim features at 64x64 spatial resolution |
| Prompted Mask Decoder | Cross-attention fusion + upsampling | Fuses LLM embeddings with image features to produce 1024x1024 masks |
| Classification Head | 6-class medical modality classifier | Brain, Breast, Lung CT, Lung X-ray, Skin (ISIC), Other |
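The data flow between these components can be sketched at the shape level. This is an illustrative mock only (the function names and the 4096-dim Mistral-7B hidden size are assumptions, not the repository's API); the shapes come from the table above.

```python
import numpy as np

# Shape-level sketch; the real components are TinySAM, LLaVA-Med, and the
# prompted mask decoder from this repository.
def encode_image(image):
    # TinySAM (ViT-Tiny): 256-dim features on a 64x64 grid.
    assert image.shape == (1024, 1024, 3)
    return np.zeros((256, 64, 64))

def encode_prompt(prompt):
    # LLaVA-Med (Mistral-7B + LoRA): a pooled reasoning embedding
    # (4096 = assumed Mistral-7B hidden size).
    return np.zeros((1, 4096))

def decode_mask(image_feats, text_emb):
    # Prompted mask decoder: cross-attention fusion + upsampling.
    # Placeholder output; the real decoder attends text_emb over image_feats.
    return np.zeros((1024, 1024))

image = np.zeros((1024, 1024, 3))
feats = encode_image(image)
emb = encode_prompt("Where is the tumor located?")
mask = decode_mask(feats, emb)
print(feats.shape, emb.shape, mask.shape)
```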
The downloaded checkpoint directory is organized as follows:

```
checkpoint/
├── lora_adapter/              # LoRA adapter weights for LLaVA-Med
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   ├── tokenizer.json
│   ├── tokenizer.model
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── image_encoder.pth          # TinySAM image encoder weights
├── mask_decoder.pth           # Prompted mask decoder weights
└── cls.pth                    # 6-class classification head weights
```
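After downloading, a quick sanity check that every expected file landed under `./checkpoint` can save a confusing load error later. This small helper is not part of the repository; the file list mirrors the tree above.

```python
import os

# Expected files, relative to the checkpoint root (see the tree above).
EXPECTED = [
    "lora_adapter/adapter_config.json",
    "lora_adapter/adapter_model.safetensors",
    "lora_adapter/tokenizer.json",
    "lora_adapter/tokenizer.model",
    "lora_adapter/tokenizer_config.json",
    "lora_adapter/special_tokens_map.json",
    "image_encoder.pth",
    "mask_decoder.pth",
    "cls.pth",
]

def missing_checkpoint_files(root):
    """Return the expected checkpoint files that are absent under root."""
    return [p for p in EXPECTED if not os.path.isfile(os.path.join(root, p))]

missing = missing_checkpoint_files("./checkpoint")
print("checkpoint complete" if not missing else f"missing: {missing}")
```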
Clone the repository and install the dependencies:

```shell
git clone https://github.com/huyquoctrinh/PRS-Med.git
cd PRS-Med
pip install -r requirements.txt
```
Then download the pretrained weights from the Hugging Face Hub:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="huyquoctrinh/PRS-Med",
    local_dir="./checkpoint"
)
```
Or using the CLI:
```shell
huggingface-cli download huyquoctrinh/PRS-Med --local-dir ./checkpoint
```
PRS-Med requires the LLaVA-Med base model:
```shell
huggingface-cli download microsoft/llava-med-v1.5-mistral-7b --local-dir ./weight/llava-med-v1.5-mistral-7b
```
To run inference from Python, build the model and load the checkpoint:

```python
from segment_model.model import build_llm_seg

# Build the model
model, tokenizer, image_processor, config = build_llm_seg(
    model_path="./weight/llava-med-v1.5-mistral-7b",
    device="cuda"
)

# Load the checkpoint
model.load_model("./checkpoint")
model = model.to("cuda")
```
Or run the inference script:
```shell
python infer_full.py \
    --model_path ./weight/llava-med-v1.5-mistral-7b \
    --checkpoint_path ./checkpoint \
    --device cuda
```
To train with distributed data parallelism (here across two GPUs):

```shell
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 train_ddp.py \
    --model_path ./weight/llava-med-v1.5-mistral-7b \
    --data_path /path/to/dataset/data \
    --annotation_path /path/to/dataset/annotations \
    --batch_size 4 \
    --epochs 20 \
    --save_dir ./training_results \
    --grad_accum_steps 8 \
    --grad_clip_norm 1.0
```
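Note that with gradient accumulation the effective global batch size is larger than `--batch_size` suggests: each optimizer update sees the per-GPU batch multiplied by the number of GPUs and the accumulation steps (simple arithmetic, assuming the two-GPU command above).

```python
per_gpu_batch = 4      # --batch_size
num_gpus = 2           # --nproc_per_node
grad_accum_steps = 8   # --grad_accum_steps

# The optimizer steps once per accumulation cycle, so each update
# aggregates per_gpu_batch * num_gpus * grad_accum_steps samples.
effective_batch = per_gpu_batch * num_gpus * grad_accum_steps
print(effective_batch)  # → 64
```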
Training hyperparameters:

| Hyperparameter | Value |
|---|---|
| Base Model | LLaVA-Med v1.5 (Mistral-7B) |
| LoRA Rank | 16 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.05 |
| LoRA Targets | q_proj, k_proj, v_proj, o_proj |
| Image Encoder | TinySAM (ViT-Tiny) |
| Optimizer | AdamW (lr=1e-4, weight_decay=1e-5) |
| Scheduler | CosineAnnealing (T_max=20, eta_min=1e-6) |
| Precision | bfloat16 (mixed precision) |
| Loss | 0.5 * StructureLoss + 0.5 * LLM Loss + ClassificationLoss |
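The combined objective in the last row can be sketched numerically. This is a simplified stand-in, assuming the segmentation term behaves like BCE plus a soft-IoU penalty; the per-pixel weighting of the actual StructureLoss and the exact LLM/classification terms are omitted, and the helper names are illustrative.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Per-pixel binary cross-entropy, averaged over the mask.
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def soft_iou(pred, target, eps=1e-7):
    # 1 - IoU computed on soft (probability) masks.
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return float(1 - (inter + eps) / (union + eps))

def total_loss(pred_mask, gt_mask, llm_loss, cls_loss):
    # 0.5 * StructureLoss + 0.5 * LLM loss + classification loss (table above);
    # structure term simplified to unweighted BCE + soft IoU.
    structure = bce(pred_mask, gt_mask) + soft_iou(pred_mask, gt_mask)
    return 0.5 * structure + 0.5 * llm_loss + cls_loss

pred = np.full((64, 64), 0.9)   # soft predicted mask
gt = np.ones((64, 64))          # ground-truth mask
print(total_loss(pred, gt, llm_loss=1.2, cls_loss=0.3))
```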
Training datasets by modality:

| Modality | Dataset |
|---|---|
| Brain MRI | Brain Tumor CT Scan |
| Breast Ultrasound | Breast Ultrasound |
| Lung CT | Lung CT |
| Lung X-ray | Lung X-ray |
| Skin Lesion | ISIC Skin Cancer |
| Polyp | Polyp Endoscopy |
If you use PRS-Med, please cite:

```bibtex
@article{trinh2025prs,
  title   = {PRS-Med: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging},
  author  = {Trinh, Quoc-Huy and Nguyen, Minh-Van and Zeng, Jung and Bagci, Ulas and Jha, Debesh},
  journal = {arXiv preprint arXiv:2505.11872},
  year    = {2025}
}
```
This work is built upon LLaVA, Segment Anything, and TinySAM.
This project is licensed under the MIT License.