Add pipeline tag, library name, and content from the GitHub README

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +158 -3
README.md CHANGED
@@ -1,3 +1,158 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: image-feature-extraction
+ library_name: transformers
+ ---
+
+ # MedM-VL: What Makes a Good Medical LVLM?
+
+ [![arXiv](https://img.shields.io/badge/Arxiv-2504.04323-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2504.04323) [![hf_space](https://img.shields.io/badge/🤗-%20Open%20In%20HF-blue.svg)](https://huggingface.co/collections/shiym2000/medm-vl-67f739e50d344d712eb7b010) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](./LICENSE)
+
+ ![architecture](./assets/architecture.png)
+
+ MedM-VL is a **modular**, LLaVA-based codebase for medical LVLMs, supporting flexible customization of encoders, connectors, and LLMs.
+
+ MedM-VL focuses on **small-scale** medical LVLMs, designed for **direct deployment** in real-world medical scenarios or **efficient fine-tuning** on downstream tasks.
+
+ ## :newspaper: News
+
+ + **[2025.04.10]**: The model weights (v1.0) have been uploaded to Hugging Face.
+   + [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en)
+   + [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en)
+   + [shiym2000/MedM-CLIP-CT · Hugging Face](https://huggingface.co/shiym2000/MedM-CLIP-CT)
+ + **[2025.04.06]**: The technical report has been released on arXiv.
+   + [[2504.04323] MedM-VL: What Makes a Good Medical LVLM?](https://arxiv.org/abs/2504.04323)
+ + **[2024.12.19]**: The complete code has been released on GitHub.
+
+ ## :sparkles: Features
+
+ MedM-VL v1.0 currently supports single-image input (more details on Hugging Face):
+ + [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en): Trained on **2D** medical images and **English** medical texts.
+ + [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en): Trained on **3D** chest CT volumes and **English** medical texts.
+
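+ Because this repo is tagged `library_name: transformers`, the checkpoints should be loadable through the `transformers` Auto classes once the requirements below are installed. The snippet is a minimal sketch under that assumption (custom modeling code on the Hub, hence `trust_remote_code=True`); the exact preprocessing and generation interface may differ, so defer to the repository's own scripts.
+
+ ``` python
+ # Minimal loading sketch; assumes the checkpoint ships transformers-compatible
+ # custom code. Verify against the repository documentation.
+ from transformers import AutoModel, AutoTokenizer
+
+ model_id = "shiym2000/MedM-VL-2D-3B-en"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
+ print(model.config)
+ ```
+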
+ ## :package: Installation
+
+ ``` bash
+ # 1. clone and navigate
+ git clone https://github.com/MSIIP/MedM-VL.git
+ cd MedM-VL
+
+ # 2. create a conda environment, activate it and install packages
+ conda create -n medm python=3.10
+ conda activate medm
+ pip install -r requirements.txt
+ pip install flash-attn --no-build-isolation
+ ```
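+
+ A quick sanity check of the environment (a sketch; it assumes a CUDA GPU, which flash-attn requires):
+
+ ``` python
+ # Verify that torch sees the GPU and that flash-attn was built correctly.
+ import torch
+ import flash_attn
+
+ print(torch.__version__, torch.cuda.is_available())
+ print(flash_attn.__version__)
+ ```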
+
+ ## :rocket: Getting Started
+
+ If any parameters are unclear during usage, please refer to [Parameter Interpretation](docs/param_interpretation.md).
+
+ ### 1. Train a general medical LVLM from scratch
+
+ ``` bash
+ # For 2D medical LVLMs
+ # 1. pre-train (annotation format: docs/example_2d_pretrain.json)
+ bash scripts/train/MedM-VL-2D/pretrain_en.sh
+ # 2. fine-tune (annotation format: docs/example_2d_finetune.json)
+ bash scripts/train/MedM-VL-2D/finetune_en.sh
+
+ # For 3D medical LVLMs
+ # 1. pre-train (annotation format: docs/example_3d_pretrain.json)
+ bash scripts/train/MedM-VL-CT-Chest/pretrain_en.sh
+ # 2. fine-tune (annotation format: docs/example_3d_finetune.json)
+ bash scripts/train/MedM-VL-CT-Chest/finetune_en.sh
+
+ # Note: the annotation file format is identical for pre-training and
+ # fine-tuning; pre-training data comes from image-text pairs, while
+ # fine-tuning uses instruction-tuning data.
+ ```
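+
+ The authoritative annotation schema is in the `docs/example_*.json` files. Purely as an illustration of the idea, LLaVA-style codebases typically store entries like the following (field names here are assumptions, not this repo's confirmed schema):
+
+ ``` python
+ # Hypothetical LLaVA-style annotation entry; see docs/example_2d_finetune.json
+ # for the actual schema used by this repo.
+ import json
+
+ entry = {
+     "image": "images/chest_xray_0001.png",
+     "conversations": [
+         {"from": "human", "value": "<image>\nIs there evidence of pneumonia?"},
+         {"from": "gpt", "value": "No, the lung fields appear clear."},
+     ],
+ }
+ print(json.dumps([entry], indent=2))
+ ```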
+
+ ### 2. Fine-tune a specialized medical LVLM with pre-trained weights
+
+ ``` bash
+ # For 2D medical LVLMs
+ # 1. download weights from Hugging Face
+ pip install -U huggingface_hub
+ huggingface-cli download --resume-download shiym2000/MedM-VL-2D-3B-en --local-dir work_dirs/MedM-VL-2D-3B-en
+ # 2. fine-tune using LoRA (annotation format: docs/example_2d_finetune.json)
+ bash scripts/train/finetune_2d.sh
+
+ # For 3D medical LVLMs
+ # 1. download weights from Hugging Face
+ pip install -U huggingface_hub
+ huggingface-cli download --resume-download shiym2000/MedM-VL-CT-Chest-3B-en --local-dir work_dirs/MedM-VL-CT-Chest-3B-en
+ # 2. fine-tune using LoRA (annotation format: docs/example_3d_finetune.json)
+ bash scripts/train/finetune_3d.sh
+
+ # You can choose full or LoRA fine-tuning based on available GPU memory.
+ ```
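+
+ LoRA needs less GPU memory than full fine-tuning because it freezes the base weights and trains only small low-rank adapter matrices. The sketch below is a generic PEFT-style illustration of that idea, not the exact configuration used by the scripts above (the base model, rank, and target modules are placeholders):
+
+ ``` python
+ # Generic LoRA illustration with the peft library; the repo's scripts may
+ # configure this differently. All hyperparameters here are placeholders.
+ from peft import LoraConfig, get_peft_model
+ from transformers import AutoModelForCausalLM
+
+ base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
+ config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
+ model = get_peft_model(base, config)
+ model.print_trainable_parameters()  # only the adapter weights are trainable
+ ```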
+
+ ### 3. Inference
+
+ ``` bash
+ # For 2D medical LVLMs
+ # inference (annotation format: docs/example_2d_inference.json)
+ bash scripts/eval/inference_2d.sh
+
+ # For 3D medical LVLMs
+ # inference (annotation format: docs/example_3d_inference.json)
+ bash scripts/eval/inference_3d.sh
+
+ # Compared to `finetune.json`, `conversations` in `inference.json` lacks
+ # the final response, which will be generated by the model.
+ ```
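+
+ In other words, an inference entry is a fine-tuning entry with the assistant's final turn removed. A hypothetical example, with the same caveat as above (the real schema is in `docs/example_2d_inference.json`):
+
+ ``` python
+ # Hypothetical inference entry: the trailing "gpt" turn is omitted, so the
+ # model generates it. See docs/example_2d_inference.json for the real schema.
+ entry = {
+     "image": "images/chest_xray_0001.png",
+     "conversations": [
+         {"from": "human", "value": "<image>\nIs there evidence of pneumonia?"},
+         # no final {"from": "gpt", ...} turn; the model produces the response
+     ],
+ }
+ ```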
+
+ ### 4. Demo
+
+ ``` bash
+ # Launch a Gradio demo locally.
+ bash scripts/playground.sh
+ ```
+
+ ## :robot: Model Zoo
+
+ <table>
+   <tr align="center">
+     <td><b>Encoder</b></td>
+     <td><b>Connector</b></td>
+     <td><b>LLM</b></td>
+   </tr>
+   <tr valign="top">
+     <td>
+       <li><a href="https://arxiv.org/abs/2103.00020"> CLIP (2021) </a></li>
+       <li><a href="https://arxiv.org/abs/2303.15343"> SigLIP (2023) </a></li>
+       <li><a href="https://arxiv.org/abs/2404.00578"> M3D-CLIP (2024) </a></li>
+       <li><a href="https://huggingface.co/collections/shiym2000/medm-clip-67f7afd8a3dbcff656466805"> MedM-CLIP </a></li>
+     </td>
+     <td>
+       <li> MLP </li>
+       <li> Spatial Pooling </li>
+       <li> Attention Pooling </li>
+     </td>
+     <td>
+       <li><a href="https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/"> Phi-2 (2023) </a></li>
+       <li><a href="https://arxiv.org/abs/2404.14219"> Phi-3 (2024) </a></li>
+       <li><a href="https://arxiv.org/abs/2412.15115"> Qwen2.5 (2024) </a></li>
+       <li><a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"> Llama-3.2 (2024) </a></li>
+     </td>
+   </tr>
+ </table>
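+
+ The connector bridges the encoder's visual features and the LLM's embedding space. As a rough sketch of the simplest variant listed above, an MLP connector projects each visual token into the LLM hidden size (the dimensions below are placeholders, not the repo's actual configuration):
+
+ ``` python
+ # Minimal MLP connector sketch (LLaVA-style); dimensions are placeholders.
+ import torch
+ import torch.nn as nn
+
+ class MLPConnector(nn.Module):
+     def __init__(self, vision_dim: int = 1152, llm_dim: int = 2048):
+         super().__init__()
+         self.proj = nn.Sequential(
+             nn.Linear(vision_dim, llm_dim),
+             nn.GELU(),
+             nn.Linear(llm_dim, llm_dim),
+         )
+
+     def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
+         # (batch, num_tokens, vision_dim) -> (batch, num_tokens, llm_dim)
+         return self.proj(visual_tokens)
+
+ x = torch.randn(1, 729, 1152)   # e.g. SigLIP patch tokens
+ print(MLPConnector()(x).shape)  # torch.Size([1, 729, 2048])
+ ```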
+
+ ## :book: Citation
+
+ ``` bibtex
+ @article{shi2025medm,
+   title={MedM-VL: What Makes a Good Medical LVLM?},
+   author={Shi, Yiming and Yang, Shaoshuai and Zhu, Xun and Wang, Haoyu and Li, Miao and Wu, Ji},
+   journal={arXiv preprint arXiv:2504.04323},
+   year={2025}
+ }
+ ```
+
+ ## :heart: Acknowledgements
+
+ We would like to express our gratitude to the following resources:
+ + [**TinyLLaVA_Factory**](https://github.com/TinyLLaVA/TinyLLaVA_Factory) - An open-source modular codebase for small-scale large multimodal models (LMMs).
+
+ Code: https://github.com/MSIIP/MedM-VL