Add pipeline tag, library name and content from GitHub README
This PR adds the `pipeline_tag` and `library_name` fields to the model card metadata, which improves discoverability and clarity. It also adds content from the GitHub README to give more information about the model: the architecture, installation instructions, a getting started guide, and the model zoo.
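One way to check the new metadata after merging is to read the keys back from the front matter of `README.md`. The sketch below is only illustrative: `front_matter_value` is a hypothetical helper name, and it assumes the file starts with a front matter block of flat `key: value` lines between two `---` delimiters, as in this PR. It is not a general YAML parser.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of one key from a model card's front matter.
# Assumes flat `key: value` lines between the first two `---` delimiter lines.
front_matter_value() {
  local key="$1" file="$2"
  awk -v key="$key" '
    /^---[[:space:]]*$/ { fences++; next }   # count the --- delimiter lines
    fences >= 2 { exit }                     # stop once the front matter is closed
    fences == 1 {                            # only look inside the front matter
      split($0, kv, ":")
      if (kv[1] == key) {
        sub(/^[^:]*:[[:space:]]*/, "")       # drop "key:" and the spaces after it
        print
        exit
      }
    }
  ' "$file"
}
```

Calling `front_matter_value pipeline_tag README.md` on the updated card should print the tag set in this PR; a key that is absent prints nothing.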
README.md
CHANGED
|
@@ -1,3 +1,158 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
---
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
pipeline_tag: image-feature-extraction
|
| 4 |
+
library_name: transformers
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# MedM-VL: What Makes a Good Medical LVLM?
|
| 8 |
+
|
| 9 |
+
[arXiv](https://arxiv.org/abs/2504.04323) [Hugging Face](https://huggingface.co/collections/shiym2000/medm-vl-67f739e50d344d712eb7b010) [License](./LICENSE)
|
| 10 |
+
|
| 11 |
+

|
| 12 |
+
|
| 13 |
+
MedM-VL is a **modular**, LLaVA-based codebase for medical LVLMs, supporting flexible customization of encoders, connectors, and LLMs.
|
| 14 |
+
|
| 15 |
+
MedM-VL focuses on **small-scale** medical LVLMs, designed for **direct deployment** in real-world medical scenarios or **efficient fine-tuning** on downstream tasks.
|
| 16 |
+
|
| 17 |
+
## :newspaper: News
|
| 18 |
+
|
| 19 |
+
+ **[2025.04.10]**: The model weights (v1.0) have been uploaded to Hugging Face.
|
| 20 |
+
+ [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en)
|
| 21 |
+
+ [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en)
|
| 22 |
+
+ [shiym2000/MedM-CLIP-CT · Hugging Face](https://huggingface.co/shiym2000/MedM-CLIP-CT)
|
| 23 |
+
+ **[2025.04.06]**: The technical report has been released on arXiv.
|
| 24 |
+
+ [[2504.04323] MedM-VL: What Makes a Good Medical LVLM?](https://arxiv.org/abs/2504.04323)
|
| 25 |
+
+ **[2024.12.19]**: The complete code has been released on GitHub.
|
| 26 |
+
|
| 27 |
+
## :sparkles: Features
|
| 28 |
+
|
| 29 |
+
MedM-VL (v1.0: single image input, more details on Hugging Face)
|
| 30 |
+
+ [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en): Trained on **2D** medical images and **English** medical texts.
|
| 31 |
+
+ [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en): Trained on **3D** chest CT volumes and **English** medical texts.
|
| 32 |
+
|
| 33 |
+
## :package: Installation
|
| 34 |
+
|
| 35 |
+
``` bash
|
| 36 |
+
# 1. clone and navigate
|
| 37 |
+
git clone https://github.com/MSIIP/MedM-VL.git
|
| 38 |
+
cd MedM-VL
|
| 39 |
+
|
| 40 |
+
# 2. create a conda environment, activate it and install packages
|
| 41 |
+
conda create -n medm python=3.10
|
| 42 |
+
conda activate medm
|
| 43 |
+
pip install -r requirements.txt
|
| 44 |
+
pip install flash-attn --no-build-isolation
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
## :rocket: Getting Started
|
| 48 |
+
|
| 49 |
+
If any parameters are unclear, see [Parameter Interpretation](docs/param_interpretation.md).
|
| 50 |
+
|
| 51 |
+
### 1. Train a general medical LVLM from scratch
|
| 52 |
+
|
| 53 |
+
``` bash
|
| 54 |
+
# For 2D medical LVLMs
|
| 55 |
+
# 1. pre-train (annotation format: docs/example_2d_pretrain.json)
|
| 56 |
+
bash scripts/train/MedM-VL-2D/pretrain_en.sh
|
| 57 |
+
# 2. fine-tune (annotation format: docs/example_2d_finetune.json)
|
| 58 |
+
bash scripts/train/MedM-VL-2D/finetune_en.sh
|
| 59 |
+
|
| 60 |
+
# For 3D medical LVLMs
|
| 61 |
+
# 1. pre-train (annotation format: docs/example_3d_pretrain.json)
|
| 62 |
+
bash scripts/train/MedM-VL-CT-Chest/pretrain_en.sh
|
| 63 |
+
# 2. fine-tune (annotation format: docs/example_3d_finetune.json)
|
| 64 |
+
bash scripts/train/MedM-VL-CT-Chest/finetune_en.sh
|
| 65 |
+
|
| 66 |
+
# The annotation file format is the same for pre-training and
|
| 67 |
+
# fine-tuning. Pre-training uses image-text pairs, while
|
| 68 |
+
# fine-tuning uses instruction-tuning data.
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
### 2. Fine-tune a specialized medical LVLM with pre-trained weights
|
| 72 |
+
|
| 73 |
+
``` bash
|
| 74 |
+
# For 2D medical LVLMs
|
| 75 |
+
# 1. download weights from Hugging Face
|
| 76 |
+
pip install -U huggingface_hub
|
| 77 |
+
huggingface-cli download --resume-download shiym2000/MedM-VL-2D-3B-en --local-dir work_dirs/MedM-VL-2D-3B-en
|
| 78 |
+
# 2. fine-tune using LoRA (annotation format: docs/example_2d_finetune.json)
|
| 79 |
+
bash scripts/train/finetune_2d.sh
|
| 80 |
+
|
| 81 |
+
# For 3D medical LVLMs
|
| 82 |
+
# 1. download weights from Hugging Face
|
| 83 |
+
pip install -U huggingface_hub
|
| 84 |
+
huggingface-cli download --resume-download shiym2000/MedM-VL-CT-Chest-3B-en --local-dir work_dirs/MedM-VL-CT-Chest-3B-en
|
| 85 |
+
# 2. fine-tune using LoRA (annotation format: docs/example_3d_finetune.json)
|
| 86 |
+
bash scripts/train/finetune_3d.sh
|
| 87 |
+
|
| 88 |
+
# You can choose full or LoRA fine-tuning based on available GPU memory.
|
| 89 |
+
```
|
| 90 |
+
|
| 91 |
+
### 3. Inference
|
| 92 |
+
|
| 93 |
+
``` bash
|
| 94 |
+
# For 2D medical LVLMs
|
| 95 |
+
# inference (annotation format: docs/example_2d_inference.json)
|
| 96 |
+
bash scripts/eval/inference_2d.sh
|
| 97 |
+
|
| 98 |
+
# For 3D medical LVLMs
|
| 99 |
+
# inference (annotation format: docs/example_3d_inference.json)
|
| 100 |
+
bash scripts/eval/inference_3d.sh
|
| 101 |
+
|
| 102 |
+
# Compared to `finetune.json`, `conversations` in `inference.json` lacks
|
| 103 |
+
# the final response, which will be generated by the model.
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
### 4. Demo
|
| 107 |
+
|
| 108 |
+
``` bash
|
| 109 |
+
# Launch a Gradio demo locally.
|
| 110 |
+
bash scripts/playground.sh
|
| 111 |
+
```
|
| 112 |
+
|
| 113 |
+
## :robot: Model Zoo
|
| 114 |
+
|
| 115 |
+
<table>
|
| 116 |
+
<tr align="center">
|
| 117 |
+
<td><b>Encoder</b></td>
|
| 118 |
+
<td><b>Connector</b></td>
|
| 119 |
+
<td><b>LLM</b></td>
|
| 120 |
+
</tr>
|
| 121 |
+
<tr valign="top">
|
| 122 |
+
<td>
|
| 123 |
+
<li><a href="https://arxiv.org/abs/2103.00020"> CLIP (2021) </a></li>
|
| 124 |
+
<li><a href="https://arxiv.org/abs/2303.15343"> SigLIP (2023) </a></li>
|
| 125 |
+
<li><a href="https://arxiv.org/abs/2404.00578"> M3D-CLIP (2024) </a></li>
|
| 126 |
+
<li><a href="https://huggingface.co/collections/shiym2000/medm-clip-67f7afd8a3dbcff656466805"> MedM-CLIP </a></li>
|
| 127 |
+
</td>
|
| 128 |
+
<td>
|
| 129 |
+
<li> MLP </li>
|
| 130 |
+
<li> Spatial Pooling </li>
|
| 131 |
+
<li> Attention Pooling </li>
|
| 132 |
+
</td>
|
| 133 |
+
<td>
|
| 134 |
+
<li><a href="https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/"> Phi-2 (2023) </a></li>
|
| 135 |
+
<li><a href="https://arxiv.org/abs/2404.14219"> Phi-3 (2024) </a></li>
|
| 136 |
+
<li><a href="https://arxiv.org/abs/2412.15115"> Qwen2.5 (2024) </a></li>
|
| 137 |
+
<li><a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"> Llama-3.2 (2024) </a></li>
|
| 138 |
+
</td>
|
| 139 |
+
</tr>
|
| 140 |
+
</table>
|
| 141 |
+
|
| 142 |
+
## :book: Citation
|
| 143 |
+
|
| 144 |
+
``` bibtex
|
| 145 |
+
@article{shi2025medm,
|
| 146 |
+
title={MedM-VL: What Makes a Good Medical LVLM?},
|
| 147 |
+
author={Shi, Yiming and Yang, Shaoshuai and Zhu, Xun and Wang, Haoyu and Li, Miao and Wu, Ji},
|
| 148 |
+
journal={arXiv preprint arXiv:2504.04323},
|
| 149 |
+
year={2025}
|
| 150 |
+
}
|
| 151 |
+
```
|
| 152 |
+
|
| 153 |
+
## :heart: Acknowledgements
|
| 154 |
+
|
| 155 |
+
We would like to express our gratitude to the following resources:
|
| 156 |
+
+ [**TinyLLaVA_Factory**](https://github.com/TinyLLaVA/TinyLLaVA_Factory) - An open-source modular codebase for small-scale large multimodal models (LMMs).
|
| 157 |
+
|
| 158 |
+
Code: https://github.com/MSIIP/MedM-VL
|