nielsr HF Staff committed on
Commit
18b8344
verified
1 Parent(s): d747d27

Add pipeline tag, library name and content from Github README


This PR adds the `pipeline_tag` and `library_name` to the model card metadata, improving discoverability and clarity. It also adds content from the GitHub README to provide more information about the model, including the architecture, installation instructions, getting started guide, and model zoo.

Files changed (1)
  1. README.md +158 -3
README.md CHANGED
@@ -1,3 +1,158 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: image-feature-extraction
+ library_name: transformers
+ ---
+
+ # MedM-VL: What Makes a Good Medical LVLM?
+
+ [![arXiv](https://img.shields.io/badge/Arxiv-2504.04323-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2504.04323) [![hf_space](https://img.shields.io/badge/🤗-%20Open%20In%20HF-blue.svg)](https://huggingface.co/collections/shiym2000/medm-vl-67f739e50d344d712eb7b010) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](./LICENSE)
+
+ ![architecture](./assets/architecture.png)
+
+ MedM-VL is a **modular**, LLaVA-based codebase for medical LVLMs, supporting flexible customization of encoders, connectors, and LLMs.
+
+ MedM-VL focuses on **small-scale** medical LVLMs, designed for **direct deployment** in real-world medical scenarios or **efficient fine-tuning** on downstream tasks.
+
+ ## :newspaper: News
+
+ + **[2025.04.10]**: The model weights (v1.0) have been uploaded to Hugging Face.
+   + [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en)
+   + [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en)
+   + [shiym2000/MedM-CLIP-CT · Hugging Face](https://huggingface.co/shiym2000/MedM-CLIP-CT)
+ + **[2025.04.06]**: The technical report has been released on arXiv.
+   + [[2504.04323] MedM-VL: What Makes a Good Medical LVLM?](https://arxiv.org/abs/2504.04323)
+ + **[2024.12.19]**: The complete code has been released on GitHub.
+
+ ## :sparkles: Features
+
+ MedM-VL (v1.0: single-image input; more details on Hugging Face):
+ + [shiym2000/MedM-VL-2D-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-2D-3B-en): Trained on **2D** medical images and **English** medical texts.
+ + [shiym2000/MedM-VL-CT-Chest-3B-en · Hugging Face](https://huggingface.co/shiym2000/MedM-VL-CT-Chest-3B-en): Trained on **3D** chest CT volumes and **English** medical texts.
+
+ ## :package: Installation
+
+ ``` bash
+ # 1. clone and navigate
+ git clone https://github.com/MSIIP/MedM-VL.git
+ cd MedM-VL
+
+ # 2. create a conda environment, activate it, and install packages
+ conda create -n medm python=3.10
+ conda activate medm
+ pip install -r requirements.txt
+ pip install flash-attn --no-build-isolation
+ ```
+
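After installing, it can be useful to confirm that the heavier dependencies actually resolved. The helper below is not part of MedM-VL; it is a small sketch that reports which packages are importable without actually importing them (so a broken CUDA build will not crash the check itself):

```python
# Hypothetical helper (not part of the MedM-VL repo): report which of the
# packages installed above can be found, without importing them.
import importlib.util
import sys

def check_environment(packages=("torch", "transformers", "flash_attn")):
    """Return a dict mapping package name -> whether it is installed."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

if __name__ == "__main__":
    print(f"python {sys.version_info.major}.{sys.version_info.minor}")
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```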
+ ## :rocket: Getting Started
+
+ If any of the parameters are unclear during usage, please refer to [Parameter Interpretation](docs/param_interpretation.md).
+
+ ### 1. Train a general medical LVLM from scratch
+
+ ``` bash
+ # For 2D medical LVLMs
+ # 1. pre-train (annotation format: docs/example_2d_pretrain.json)
+ bash scripts/train/MedM-VL-2D/pretrain_en.sh
+ # 2. fine-tune (annotation format: docs/example_2d_finetune.json)
+ bash scripts/train/MedM-VL-2D/finetune_en.sh
+
+ # For 3D medical LVLMs
+ # 1. pre-train (annotation format: docs/example_3d_pretrain.json)
+ bash scripts/train/MedM-VL-CT-Chest/pretrain_en.sh
+ # 2. fine-tune (annotation format: docs/example_3d_finetune.json)
+ bash scripts/train/MedM-VL-CT-Chest/finetune_en.sh
+
+ # The annotation file format is identical for pre-training and
+ # fine-tuning; the former uses image-text pairs, while the latter
+ # uses instruction-tuning data.
+ ```
+
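The referenced annotation files use a conversation-style layout common to LLaVA-derived codebases; the authoritative schema is in `docs/example_2d_finetune.json`, so treat the field names below (`image`, `conversations`, `from`, `value`) as an assumption. A minimal sketch of writing such a file:

```python
# Sketch of a conversation-style annotation file, assuming a LLaVA-like
# schema; check docs/example_2d_finetune.json for the authoritative format.
import json

records = [
    {
        "image": "images/chest_xray_0001.png",  # hypothetical path
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe this image."},
            {"from": "gpt", "value": "A frontal chest X-ray ..."},
        ],
    }
]

with open("my_finetune.json", "w") as f:
    json.dump(records, f, indent=2)
```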
+ ### 2. Fine-tune a specialized medical LVLM with pre-trained weights
+
+ ``` bash
+ # For 2D medical LVLMs
+ # 1. download weights from Hugging Face
+ pip install -U huggingface_hub
+ huggingface-cli download --resume-download shiym2000/MedM-VL-2D-3B-en --local-dir work_dirs/MedM-VL-2D-3B-en
+ # 2. fine-tune using LoRA (annotation format: docs/example_2d_finetune.json)
+ bash scripts/train/finetune_2d.sh
+
+ # For 3D medical LVLMs
+ # 1. download weights from Hugging Face
+ pip install -U huggingface_hub
+ huggingface-cli download --resume-download shiym2000/MedM-VL-CT-Chest-3B-en --local-dir work_dirs/MedM-VL-CT-Chest-3B-en
+ # 2. fine-tune using LoRA (annotation format: docs/example_3d_finetune.json)
+ bash scripts/train/finetune_3d.sh
+
+ # You can choose full or LoRA fine-tuning based on available GPU memory.
+ ```
+
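If you prefer downloading the checkpoints from Python instead of the CLI, `huggingface_hub.snapshot_download` does the same job; a sketch mirroring the repo IDs and `work_dirs/` layout of the commands above (the `download_weights` helper is hypothetical, not part of the repo):

```python
# Python alternative to the `huggingface-cli download` commands above.
def local_dir_for(repo_id: str, root: str = "work_dirs") -> str:
    """Map a Hugging Face repo id to the local checkpoint directory."""
    return f"{root}/{repo_id.split('/')[-1]}"

def download_weights(repo_id: str) -> str:
    """Download a full checkpoint snapshot; returns the local directory."""
    from huggingface_hub import snapshot_download  # pip install -U huggingface_hub
    return snapshot_download(repo_id=repo_id, local_dir=local_dir_for(repo_id))

# Example (network access required, so not run here):
# download_weights("shiym2000/MedM-VL-2D-3B-en")
```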
+ ### 3. Inference
+
+ ``` bash
+ # For 2D medical LVLMs
+ # inference (annotation format: docs/example_2d_inference.json)
+ bash scripts/eval/inference_2d.sh
+
+ # For 3D medical LVLMs
+ # inference (annotation format: docs/example_3d_inference.json)
+ bash scripts/eval/inference_3d.sh
+
+ # Compared to `finetune.json`, `conversations` in `inference.json` lacks
+ # the final response, which will be generated by the model.
+ ```
+
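Since an inference record differs from a finetune record only in its missing final response, one way to build inference annotations is to strip the trailing model turn from existing finetune records. A sketch, again assuming a LLaVA-like schema (see `docs/example_2d_inference.json` for the authoritative format):

```python
# Sketch: turn a finetune-style record into an inference-style record by
# dropping the final model response, which the model will generate.
def to_inference_record(record: dict) -> dict:
    conversations = list(record["conversations"])
    if conversations and conversations[-1]["from"] == "gpt":
        conversations = conversations[:-1]  # the model fills this in
    return {**record, "conversations": conversations}

example = {
    "image": "images/chest_xray_0001.png",  # hypothetical path
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this image."},
        {"from": "gpt", "value": "A frontal chest X-ray ..."},
    ],
}
print(to_inference_record(example)["conversations"])
```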
+ ### 4. Demo
+
+ ``` bash
+ # Launch a Gradio demo locally.
+ bash scripts/playground.sh
+ ```
+
+ ## :robot: Model Zoo
+
+ <table>
+   <tr align="center">
+     <td><b>Encoder</b></td>
+     <td><b>Connector</b></td>
+     <td><b>LLM</b></td>
+   </tr>
+   <tr valign="top">
+     <td>
+       <li><a href="https://arxiv.org/abs/2103.00020"> CLIP (2021) </a></li>
+       <li><a href="https://arxiv.org/abs/2303.15343"> SigLIP (2023) </a></li>
+       <li><a href="https://arxiv.org/abs/2404.00578"> M3D-CLIP (2024) </a></li>
+       <li><a href="https://huggingface.co/collections/shiym2000/medm-clip-67f7afd8a3dbcff656466805"> MedM-CLIP </a></li>
+     </td>
+     <td>
+       <li> MLP </li>
+       <li> Spatial Pooling </li>
+       <li> Attention Pooling </li>
+     </td>
+     <td>
+       <li><a href="https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/"> Phi-2 (2023) </a></li>
+       <li><a href="https://arxiv.org/abs/2404.14219"> Phi-3 (2024) </a></li>
+       <li><a href="https://arxiv.org/abs/2412.15115"> Qwen2.5 (2024) </a></li>
+       <li><a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"> Llama-3.2 (2024) </a></li>
+     </td>
+   </tr>
+ </table>
+
+ ## :book: Citation
+
+ ``` bibtex
+ @article{shi2025medm,
+   title={MedM-VL: What Makes a Good Medical LVLM?},
+   author={Shi, Yiming and Yang, Shaoshuai and Zhu, Xun and Wang, Haoyu and Li, Miao and Wu, Ji},
+   journal={arXiv preprint arXiv:2504.04323},
+   year={2025}
+ }
+ ```
+
+ ## :heart: Acknowledgements
+
+ We would like to express our gratitude to the following resources:
+ + [**TinyLLaVA_Factory**](https://github.com/TinyLLaVA/TinyLLaVA_Factory) - An open-source modular codebase for small-scale large multimodal models (LMMs).
+
+ Code: https://github.com/MSIIP/MedM-VL