---
license: apache-2.0
pipeline_tag: image-feature-extraction
library_name: transformers
---

# MedM-VL: What Makes a Good Medical LVLM?

[![arXiv](https://img.shields.io/badge/Arxiv-2504.04323-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2504.04323) [![hf_space](https://img.shields.io/badge/🤗-%20Open%20In%20HF-blue.svg)](https://huggingface.co/collections/shiym2000/medm-vl-67f739e50d344d712eb7b010) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](./LICENSE)

![architecture](./assets/architecture.png)

MedM-VL is a **modular**, LLaVA-based codebase for medical LVLMs, supporting flexible customization of encoders, connectors, and LLMs. MedM-VL focuses on **small-scale** medical LVLMs, designed for **direct deployment** in real-world medical scenarios or **efficient fine-tuning** on downstream tasks.

## :newspaper: News

+ **[2025.04.10]**: The model weights (v1.0) have been uploaded to Hugging Face.
  + [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en)
  + [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en)
  + [shiym2000/MedM-CLIP-CT · Hugging Face](https://huggingface.co/shiym2000/MedM-CLIP-CT)
+ **[2025.04.06]**: The technical report has been released on arXiv.
  + [[2504.04323] MedM-VL: What Makes a Good Medical LVLM?](https://arxiv.org/abs/2504.04323)
+ **[2024.12.19]**: The complete code has been released on GitHub.

## :sparkles: Features

MedM-VL (v1.0: single image input, more details on Hugging Face)

+ [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en): Trained on **2D** medical images and **English** medical texts.
+ [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en): Trained on **3D** chest CT volumes and **English** medical texts.

## :package: Installation

```bash
# 1. clone and navigate
git clone https://github.com/MSIIP/MedM-VL.git
cd MedM-VL

# 2. create a conda environment, activate it, and install packages
conda create -n medm python=3.10
conda activate medm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

## :rocket: Getting Started

If you are unsure about any parameters during usage, please refer to [Parameter Interpretation](docs/param_interpretation.md).

### 1. Train a general medical LVLM from scratch

```bash
# For 2D medical LVLMs
# 1. pre-train (annotation format: docs/example_2d_pretrain.json)
bash scripts/train/MedM-VL-2D/pretrain_en.sh
# 2. fine-tune (annotation format: docs/example_2d_finetune.json)
bash scripts/train/MedM-VL-2D/finetune_en.sh

# For 3D medical LVLMs
# 1. pre-train (annotation format: docs/example_3d_pretrain.json)
bash scripts/train/MedM-VL-CT-Chest/pretrain_en.sh
# 2. fine-tune (annotation format: docs/example_3d_finetune.json)
bash scripts/train/MedM-VL-CT-Chest/finetune_en.sh

# In fact, there is no difference in the annotation file format between
# pre-training and fine-tuning: the former comes from image-text pairs,
# while the latter comes from instruction-tuning data.
```
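Since MedM-VL is a LLaVA-based codebase, the annotation files plausibly follow the familiar LLaVA conversation layout. The sketch below is a hypothetical illustration only: the file path `data/finetune_example.json`, the field values, and the sample text are invented for this example, and `docs/example_2d_finetune.json` remains the authoritative reference for the exact format.

```bash
# Hypothetical sketch of a fine-tuning annotation file in the common
# LLaVA-style convention; consult docs/example_2d_finetune.json for the
# exact fields MedM-VL expects. The path and sample content are invented.
mkdir -p data
cat > data/finetune_example.json << 'EOF'
[
  {
    "image": "images/chest_xray_0001.png",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe the findings in this chest X-ray."},
      {"from": "gpt", "value": "The lungs are clear. No focal consolidation, effusion, or pneumothorax."}
    ]
  }
]
EOF
```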
### 2. Fine-tune a specialized medical LVLM with pre-trained weights

```bash
# For 2D medical LVLMs
# 1. download weights from Hugging Face
pip install -U huggingface_hub
huggingface-cli download --resume-download shiym2000/MedM-VL-2D-3B-en --local-dir work_dirs/MedM-VL-2D-3B-en
# 2. fine-tune using LoRA (annotation format: docs/example_2d_finetune.json)
bash scripts/train/finetune_2d.sh

# For 3D medical LVLMs
# 1. download weights from Hugging Face
pip install -U huggingface_hub
huggingface-cli download --resume-download shiym2000/MedM-VL-CT-Chest-3B-en --local-dir work_dirs/MedM-VL-CT-Chest-3B-en
# 2. fine-tune using LoRA (annotation format: docs/example_3d_finetune.json)
bash scripts/train/finetune_3d.sh

# You can choose full or LoRA fine-tuning based on available GPU memory.
```

### 3. Inference

```bash
# For 2D medical LVLMs
# inference (annotation format: docs/example_2d_inference.json)
bash scripts/eval/inference_2d.sh

# For 3D medical LVLMs
# inference (annotation format: docs/example_3d_inference.json)
bash scripts/eval/inference_3d.sh

# Compared to `finetune.json`, `conversations` in `inference.json` lacks
# the final response, which will be generated by the model.
```
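As a companion to the fine-tuning sketch above, a hypothetical inference entry under the same assumed LLaVA-style layout simply drops the final model turn. Again, the path and sample content are invented; `docs/example_2d_inference.json` is the authoritative reference.

```bash
# Hypothetical sketch of an inference annotation file: identical to the
# fine-tuning format above, except the final "gpt" response is omitted
# so that the model generates it. Path and content are invented.
mkdir -p data
cat > data/inference_example.json << 'EOF'
[
  {
    "image": "images/chest_xray_0001.png",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe the findings in this chest X-ray."}
    ]
  }
]
EOF
```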
### 4. Demo

```bash
# Launch a Gradio demo locally.
bash scripts/playground.sh
```

## :robot: Model Zoo

| Encoder | Connector | LLM |
| --- | --- | --- |
| CLIP (2021) | MLP | Phi-2 (2023) |
| SigLIP (2023) | Spatial Pooling | Phi-3 (2024) |
| M3D-CLIP (2023) | Attention Pooling | Qwen2.5 (2024) |
| MedM-CLIP | | Llama-3.2 (2024) |
## :book: Citation

```bibtex
@article{shi2025medm,
  title={MedM-VL: What Makes a Good Medical LVLM?},
  author={Shi, Yiming and Yang, Shaoshuai and Zhu, Xun and Wang, Haoyu and Li, Miao and Wu, Ji},
  journal={arXiv preprint arXiv:2504.04323},
  year={2025}
}
```

## :heart: Acknowledgements

We would like to express our gratitude to the following resources:

+ [**TinyLLaVA_Factory**](https://github.com/TinyLLaVA/TinyLLaVA_Factory) - An open-source modular codebase for small-scale large multimodal models (LMMs).

Code: https://github.com/MSIIP/MedM-VL