Files changed (1)
  1. README.md +181 -3
README.md CHANGED
@@ -1,3 +1,181 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ library_name: transformers
4
+ pipeline_tag: image-text-to-text
5
+ tags:
6
+ - medical
7
+ - multimodal
8
+ - grounding
9
+ - report-generation
10
+ - radiology
11
+ - clinical-reasoning
12
+ - mri
13
+ - ct
14
+ - histopathology
15
+ - x-ray
16
+ - fundus
17
+ ---
18
+
19
+ # MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
20
+
21
+ [![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2602.06965)
22
+ [![Model](https://img.shields.io/badge/🤗-MedMO--8B-blue)](https://huggingface.co/MBZUAI/MedMO-8B)
23
+ [![Model](https://img.shields.io/badge/🤗-MedMO--4B-blue)](https://huggingface.co/MBZUAI/MedMO-4B)
24
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
25
+
26
+
27
+ <p align="center">
28
+ <img src="MedMO-logo.png" alt="MedMO Logo" width="300"/>
29
+ </p>
30
+
31
+
32
+ **MedMO** is an open-source multimodal foundation model for medical image understanding and grounding. Built on the Qwen3-VL architecture and trained on 26M+ medical samples from 45 datasets, MedMO achieves state-of-the-art performance across multiple medical imaging tasks.
33
+
34
+ ## 🎯 Capabilities
35
+
36
+ MedMO excels at a comprehensive range of medical imaging tasks:
37
+
38
+ - **Visual Question Answering (VQA)**: Answer complex questions about medical images across radiology, pathology, ophthalmology, and dermatology
39
+ - **Text-Based Medical QA**: Clinical reasoning and medical knowledge question answering
40
+ - **Radiology Report Generation**: Generate detailed, clinically accurate radiology reports from medical images
41
+ - **Disease Localization with Bounding Boxes**: Precise spatial detection and localization of pathological findings
42
+ - **Anatomical Grounding**: Spatial localization and grounding of anatomical structures
43
+ - **Clinical Reasoning**: Step-by-step diagnostic reasoning and clinical decision support
44
+ - **Diagnostic Classification**: Multi-class disease classification across diverse imaging modalities
45
+ - **Spatial Object Detection**: Fine-grained detection in microscopy, pathology slides, and cellular imaging
46
+ - **Medical Report Summarization**: Extract and summarize key clinical findings from complex medical reports
47
+
48
+ ### Supported Modalities
49
+ - Radiology (X-ray, CT, MRI, Ultrasound)
50
+ - Pathology & Microscopy
51
+ - Ophthalmology (Fundus, OCT)
52
+ - Dermatology
53
+ - Nuclear Medicine (PET, SPECT)
54
+
55
+ ## 🚀 Quick Start
56
+
57
+ ### Installation
58
+
59
+ ```bash
60
+ pip install transformers torch qwen-vl-utils
61
+ ```
62
+
63
+ ### Basic Usage
64
+
65
+ ```python
66
+ from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
67
+ from qwen_vl_utils import process_vision_info
68
+ import torch
69
+
70
+ # Load model
71
+ model = Qwen3VLForConditionalGeneration.from_pretrained(
72
+     "MBZUAI/MedMO-8B",
73
+     torch_dtype=torch.bfloat16,
74
+     attn_implementation="flash_attention_2",  # requires the flash-attn package; drop this argument to use the default attention
75
+     device_map="auto",
76
+ )
77
+
78
+ processor = AutoProcessor.from_pretrained("MBZUAI/MedMO-8B")
79
+
80
+ # Prepare your input
81
+ messages = [
82
+     {
83
+         "role": "user",
84
+         "content": [
85
+             {
86
+                 "type": "image",
87
+                 "image": "path/to/medical/image.png",
88
+             },
89
+             {"type": "text", "text": "What abnormalities are present in this chest X-ray?"},
90
+         ],
91
+     }
92
+ ]
93
+
94
+ # Process and generate
95
+ text = processor.apply_chat_template(
96
+     messages, tokenize=False, add_generation_prompt=True
97
+ )
98
+ image_inputs, video_inputs = process_vision_info(messages)
99
+ inputs = processor(
100
+     text=[text],
101
+     images=image_inputs,
102
+     videos=video_inputs,
103
+     padding=True,
104
+     return_tensors="pt",
105
+ )
106
+ inputs = inputs.to(model.device)
107
+
108
+ # Generate output
109
+ generated_ids = model.generate(**inputs, max_new_tokens=512)
110
+ generated_ids_trimmed = [
111
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
112
+ ]
113
+ output_text = processor.batch_decode(
114
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
115
+ )
116
+ print(output_text[0])
117
+ ```
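The loading snippet above requests `flash_attention_2`, which only works when the separate `flash-attn` package is installed. As a minimal sketch (not part of the original README), the attention backend can be chosen at runtime. The helper name `pick_attn_implementation` is hypothetical, and falling back to `"sdpa"` is an assumption about what suits your environment.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash-attn is importable, else "sdpa".

    Hypothetical helper: it only checks that the package can be found,
    not that a compatible GPU is available.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn = pick_attn_implementation()
print(attn)
```

The returned string can then be passed as `attn_implementation=attn` to `from_pretrained`.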
118
+
119
+ ### Example: Disease Localization with Bounding Boxes
120
+
121
+ ```python
122
+ messages = [
123
+     {
124
+         "role": "user",
125
+         "content": [
126
+             {"type": "image", "image": "chest_xray.png"},
127
+             {"type": "text", "text": "Detect and localize all abnormalities in this image."},
128
+         ],
129
+     }
130
+ ]
131
+ # Output: "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
132
+ ```
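To use the localization output downstream, the `<box>...</box>` span has to be turned into numbers. The sketch below assumes the output always follows the pattern of the example string above (a label followed by a JSON-style list of boxes). The helper name `parse_boxes` is hypothetical, and the README does not state the coordinate convention or scale, so the code leaves the values uninterpreted.

```python
import json
import re

# Assumed pattern, taken from the example output: "<label> <box>[[...], ...]</box>"
BOX_PATTERN = re.compile(r"([^<>]*?)\s*<box>(.*?)</box>", re.DOTALL)


def parse_boxes(output_text):
    """Return a list of (label, boxes) pairs found in the model output."""
    results = []
    for label, raw in BOX_PATTERN.findall(output_text):
        try:
            boxes = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip spans that are not valid JSON lists
        results.append((label.strip(), boxes))
    return results


example = "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
parsed = parse_boxes(example)
print(parsed)
```

Spans that do not parse are skipped rather than raising an error, so a malformed generation yields an empty list.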
133
+
134
+ ### Example: Report Generation
135
+
136
+ ```python
137
+ messages = [
138
+     {
139
+         "role": "user",
140
+         "content": [
141
+             {"type": "image", "image": "ct_scan.png"},
142
+             {"type": "text", "text": "Generate a detailed radiology report for this CT scan."},
143
+         ],
144
+     }
145
+ ]
146
+ # MedMO generates comprehensive clinical reports with findings and impressions
147
+ ```
148
+
149
+ ## 🏗️ Model Architecture
150
+
151
+ MedMO is built on **Qwen3-VL-8B-Instruct** and trained through a 4-stage progressive pipeline:
152
+
153
+ 1. **Stage 1 - General Medical SFT**: Large-scale training on 18.5M image-text pairs for foundational medical understanding
154
+ 2. **Stage 2 - High-Resolution & Grounding**: Training on 3M curated samples at 1280×1280 resolution for spatial localization
155
+ 3. **Stage 3 - Instruction Tuning**: Fine-tuning on 4.3M instruction-response pairs for task-specific alignment
156
+ 4. **Stage 4 - Reinforcement Learning**: GRPO training with verifiable rewards (label accuracy, bbox IoU) for enhanced grounding
157
+
158
+ **Total Training Data**: 26M+ samples from 45 medical datasets spanning diverse modalities and anatomical systems.
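Stage 4 lists bbox IoU as one of its verifiable rewards. For readers unfamiliar with the metric, here is a minimal sketch of standard intersection-over-union for two axis-aligned boxes. It is not the authors' reward implementation, and the `[x1, y1, x2, y2]` corner format is an assumption.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2] (assumed format)."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(box_iou([0, 0, 10, 10], [5, 0, 15, 10]))
```

A score of 1.0 means the predicted and reference boxes coincide, and 0.0 means they do not overlap.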
159
+
160
+
161
+
162
+ For detailed benchmark results, please refer to our paper.
163
+
164
+ ## 📄 Citation
165
+
166
+ If you use MedMO in your research, please cite our paper:
167
+
168
+ ```bibtex
169
+ @article{deria2026medmo,
170
+ title={MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images},
171
+ author={Deria, Ankan and Kumar, Komal and Dukre, Adinath Madhavrao and Segal, Eran and Khan, Salman and Razzak, Imran},
172
+ journal={arXiv preprint arXiv:2602.06965},
173
+ year={2026}
174
+ }
175
+ ```
176
+
177
+
178
+ ## 📜 License
179
+
180
+ This project is licensed under the **Apache License 2.0** - see the [LICENSE](LICENSE) file for details.
181
+