---
license: apache-2.0
pipeline_tag: image-feature-extraction
library_name: transformers
---

# MedM-VL: What Makes a Good Medical LVLM?

[![arXiv](https://img.shields.io/badge/Arxiv-2504.04323-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2504.04323) [![hf_space](https://img.shields.io/badge/🤗-%20Open%20In%20HF-blue.svg)](https://huggingface.co/collections/shiym2000/medm-vl-67f739e50d344d712eb7b010) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](./LICENSE)

![architecture](./assets/architecture.png)

MedM-VL is a **modular**, LLaVA-based codebase for medical LVLMs, supporting flexible customization of encoders, connectors, and LLMs.

MedM-VL focuses on **small-scale** medical LVLMs, designed for **direct deployment** in real-world medical scenarios or **efficient fine-tuning** on downstream tasks.

## :newspaper: News

+ **[2025.04.10]**: The model weights (v1.0) have been uploaded to Hugging Face.
  + [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en)
  + [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en)
  + [shiym2000/MedM-CLIP-CT · Hugging Face](https://huggingface.co/shiym2000/MedM-CLIP-CT)
+ **[2025.04.06]**: The technical report has been released on arXiv.
  + [[2504.04323] MedM-VL: What Makes a Good Medical LVLM?](https://arxiv.org/abs/2504.04323)
+ **[2024.12.19]**: The complete code has been released on GitHub.

## :sparkles: Features

MedM-VL (v1.0) currently supports single-image input; see the Hugging Face model cards below for more details.
+ [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en): Trained on **2D** medical images and **English** medical texts.
+ [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en): Trained on **3D** chest CT volumes and **English** medical texts.
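
Because the model card declares `library_name: transformers`, the checkpoints can presumably be loaded with the standard `transformers` auto classes. The snippet below is only a minimal sketch under that assumption (the exact entry point and dtype choice are not confirmed here); the scripts in the sections below remain the supported path for training and inference.

``` python
# Minimal loading sketch -- assumes the checkpoint ships custom modeling code
# loadable via trust_remote_code; see the model cards for the exact usage.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "shiym2000/MedM-VL-2D-3B-en"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit modest GPUs
    trust_remote_code=True,
).eval()
print(model.config)
```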

## :package: Installation

``` bash
# 1. clone and navigate
git clone https://github.com/MSIIP/MedM-VL.git
cd MedM-VL

# 2. create a conda environment, activate it and install packages
conda create -n medm python=3.10
conda activate medm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

## :rocket: Getting Started

If any parameters are unclear during usage, please refer to [Parameter Interpretation](docs/param_interpretation.md).

### 1. Train a general medical LVLM from scratch

``` bash
# For 2D medical LVLMs
# 1. pre-train (annotation format: docs/example_2d_pretrain.json)
bash scripts/train/MedM-VL-2D/pretrain_en.sh
# 2. fine-tune (annotation format: docs/example_2d_finetune.json)
bash scripts/train/MedM-VL-2D/finetune_en.sh

# For 3D medical LVLMs
# 1. pre-train (annotation format: docs/example_3d_pretrain.json)
bash scripts/train/MedM-VL-CT-Chest/pretrain_en.sh
# 2. fine-tune (annotation format: docs/example_3d_finetune.json)
bash scripts/train/MedM-VL-CT-Chest/finetune_en.sh

# Note: the annotation file format is identical for pre-training and
# fine-tuning; pre-training annotations come from image-text pairs, while
# fine-tuning annotations come from instruction-tuning data.
```
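
For orientation, the sketch below writes a single hypothetical annotation entry in the LLaVA-style schema commonly used by such codebases; the field names and image path are assumptions made for illustration, and `docs/example_2d_finetune.json` in the repository is the authoritative reference.

``` python
# Hypothetical annotation entry (field names are assumed, not taken from the repo).
import json

record = {
    "image": "images/chest_xray_0001.png",  # hypothetical relative path
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the findings."},
        {"from": "gpt", "value": "No acute cardiopulmonary abnormality."},
    ],
}

# Annotation files hold a list of such records.
with open("finetune_example.json", "w") as f:
    json.dump([record], f, indent=2)
```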

### 2. Fine-tune a specialized medical LVLM with pre-trained weights

``` bash
# For 2D medical LVLMs
# 1. download weights from Hugging Face
pip install -U huggingface_hub
huggingface-cli download --resume-download shiym2000/MedM-VL-2D-3B-en --local-dir work_dirs/MedM-VL-2D-3B-en
# 2. fine-tune using LoRA (annotation format: docs/example_2d_finetune.json)
bash scripts/train/finetune_2d.sh

# For 3D medical LVLMs
# 1. download weights from Hugging Face
pip install -U huggingface_hub
huggingface-cli download --resume-download shiym2000/MedM-VL-CT-Chest-3B-en --local-dir work_dirs/MedM-VL-CT-Chest-3B-en
# 2. fine-tune using LoRA (annotation format: docs/example_3d_finetune.json)
bash scripts/train/finetune_3d.sh

# You can choose full or LoRA fine-tuning based on available GPU memory.
```
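
The released scripts configure LoRA internally; purely as an illustration of what LoRA fine-tuning involves, a typical configuration with the PEFT library looks like the sketch below. The rank, alpha, and `target_modules` are assumed values, not the ones used by `scripts/train/finetune_2d.sh`.

``` python
# Illustrative LoRA setup with PEFT -- hyperparameters are assumptions.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)  # wrap an already-loaded LLM
```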

### 3. Inference

``` bash
# For 2D medical LVLMs
# inference (annotation format: docs/example_2d_inference.json)
bash scripts/eval/inference_2d.sh

# For 3D medical LVLMs
# inference (annotation format: docs/example_3d_inference.json)
bash scripts/eval/inference_3d.sh

# Compared to `finetune.json`, the `conversations` field in `inference.json`
# omits the final response, which will be generated by the model.
```
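
To make the relationship concrete, the sketch below derives an inference record from a fine-tuning record by dropping the trailing model turn; it reuses the hypothetical LLaVA-style schema from the training section, so the field names remain assumptions.

``` python
# An inference record is a finetune-style record without the final model turn.
finetune_record = {
    "image": "images/chest_xray_0001.png",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the findings."},
        {"from": "gpt", "value": "No acute cardiopulmonary abnormality."},
    ],
}

inference_record = {
    "image": finetune_record["image"],
    # drop the trailing "gpt" turn; the model generates it during inference
    "conversations": finetune_record["conversations"][:-1],
}
print(inference_record)
```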

### 4. Demo

``` bash
# Launch a Gradio demo locally.
bash scripts/playground.sh
```

## :robot: Model Zoo

<table>
  <tr align="center">
    <td><b>Encoder</b></td>
    <td><b>Connector</b></td>
    <td><b>LLM</b></td>
  </tr>
  <tr valign="top">
    <td>
      <li><a href="https://arxiv.org/abs/2103.00020"> CLIP (2021) </a></li>
      <li><a href="https://arxiv.org/abs/2303.15343"> SigLIP (2023) </a></li>
      <li><a href="https://arxiv.org/abs/2404.00578"> M3D-CLIP (2023) </a></li>
      <li><a href="https://huggingface.co/collections/shiym2000/medm-clip-67f7afd8a3dbcff656466805"> MedM-CLIP <a></li>
    </td>
    <td>
      <li> MLP </li>
      <li> Spatial Pooling </li>
      <li> Attention Pooling </li>
    </td>
    <td>
      <li><a href="https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/"> Phi-2 (2023) </a></li>
      <li><a href="https://arxiv.org/abs/2404.14219"> Phi-3 (2024) </a></li>
      <li><a href="https://arxiv.org/abs/2412.15115"> Qwen2.5 (2024) </a></li>
      <li><a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"> Llama-3.2 (2024) </a></li>
    </td>
  </tr>
</table>
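
The connector column above is the glue between encoder and LLM: it projects vision features into the LLM embedding space and, optionally, reduces the number of visual tokens. The sketch below illustrates an MLP connector and 2x2 spatial pooling in plain PyTorch; it is not the repository's implementation, and the hidden sizes and token grid are assumed values.

``` python
# Illustrative connector sketch (not MedM-VL's code); dimensions are assumptions.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1152, 2048           # assumed encoder / LLM hidden sizes
tokens = torch.randn(1, 576, vision_dim)   # assumed 24x24 grid of patch tokens

# MLP connector: project every vision token into the LLM embedding space.
mlp = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
projected = mlp(tokens)                                        # (1, 576, llm_dim)

# Spatial pooling: average 2x2 neighborhoods to cut the token count by 4x.
grid = tokens.view(1, 24, 24, vision_dim).permute(0, 3, 1, 2)  # (1, C, 24, 24)
pooled = nn.functional.avg_pool2d(grid, kernel_size=2)         # (1, C, 12, 12)
pooled_tokens = pooled.flatten(2).transpose(1, 2)              # (1, 144, C)
```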

## :book: Citation

``` bibtex
@article{shi2025medm,
  title={MedM-VL: What Makes a Good Medical LVLM?},
  author={Shi, Yiming and Yang, Shaoshuai and Zhu, Xun and Wang, Haoyu and Li, Miao and Wu, Ji},
  journal={arXiv preprint arXiv:2504.04323},
  year={2025}
}
```

## :heart: Acknowledgements

We would like to express our gratitude to the following resources:
+ [**TinyLLaVA_Factory**](https://github.com/TinyLLaVA/TinyLLaVA_Factory) - An open-source modular codebase for small-scale large multimodal models (LMMs).

Code: https://github.com/MSIIP/MedM-VL