---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - medical
  - multimodal
  - grounding
  - report-generation
  - radiology
  - clinical-reasoning
  - mri
  - ct
  - histopathology
  - x-ray
  - fundus
---

# MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2602.06965)
[![Model](https://img.shields.io/badge/πŸ€—-MedMO--8B--Next-blue)](https://huggingface.co/MBZUAI/MedMO-8B-Next)
[![Model](https://img.shields.io/badge/πŸ€—-MedMO--8B-blue)](https://huggingface.co/MBZUAI/MedMO-8B)
[![Model](https://img.shields.io/badge/πŸ€—-MedMO--4B-blue)](https://huggingface.co/MBZUAI/MedMO-4B)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)


<p align="center">
  <img src="MedMO-logo.png" alt="MedMO Logo" width="300"/>
</p>


**MedMO** is an open-source multimodal foundation model for comprehensive medical image understanding and grounding. Built on the Qwen3-VL architecture and trained on 26M+ diverse medical samples across 45 datasets, MedMO achieves state-of-the-art performance across multiple medical imaging tasks.

## 🎯 Capabilities

MedMO excels at a comprehensive range of medical imaging tasks:

- **Visual Question Answering (VQA)**: Answer complex questions about medical images across radiology, pathology, ophthalmology, and dermatology
- **Text-Based Medical QA**: Clinical reasoning and medical knowledge question answering
- **Radiology Report Generation**: Generate detailed, clinically accurate radiology reports from medical images
- **Disease Localization with Bounding Boxes**: Precise spatial detection and localization of pathological findings
- **Anatomical Grounding**: Spatial localization and grounding of anatomical structures
- **Clinical Reasoning**: Step-by-step diagnostic reasoning and clinical decision support
- **Diagnostic Classification**: Multi-class disease classification across diverse imaging modalities
- **Spatial Object Detection**: Fine-grained detection in microscopy, pathology slides, and cellular imaging
- **Medical Report Summarization**: Extract and summarize key clinical findings from complex medical reports

### Supported Modalities
- Radiology (X-ray, CT, MRI, Ultrasound)
- Pathology & Microscopy
- Ophthalmology (Fundus, OCT)
- Dermatology
- Nuclear Medicine (PET, SPECT)

## πŸš€ Quick Start

### Installation

```bash
pip install transformers torch qwen-vl-utils
```

### Basic Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load model (flash_attention_2 requires the flash-attn package;
# drop the argument or use "sdpa" if it is not installed)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "MBZUAI/MedMO-4B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("MBZUAI/MedMO-4B")

# Prepare your input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/medical/image.png",
            },
            {"type": "text", "text": "What abnormalities are present in this chest X-ray?"},
        ],
    }
]

# Process and generate
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

### Example: Disease Localization with Bounding Boxes

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "chest_xray.png"},
            {"type": "text", "text": "Detect and localize all abnormalities in this image."},
        ],
    }
]
# Output: "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
```
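The `<box>` output shown above can be turned into structured detections with a few lines of Python. The `parse_boxes` helper below is illustrative only, not part of the MedMO API; it assumes the label precedes each `<box>[[x1, y1, x2, y2], ...]</box>` span, as in the example output.

```python
import json
import re

def parse_boxes(output: str):
    """Parse "label <box>[[x1, y1, x2, y2], ...]</box>" spans from model output.

    Returns a list of (label, [x1, y1, x2, y2]) pairs, one per box.
    Note: this helper is a sketch based on the example output format,
    not an official MedMO utility.
    """
    results = []
    for match in re.finditer(r"([^<]*?)\s*<box>\s*(\[\[.*?\]\])\s*</box>", output):
        label = match.group(1).strip()
        # The box list literal is valid JSON, so json.loads handles it directly
        for box in json.loads(match.group(2)):
            results.append((label, box))
    return results

print(parse_boxes("Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"))
```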

### Example: Report Generation

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "ct_scan.png"},
            {"type": "text", "text": "Generate a detailed radiology report for this CT scan."},
        ],
    }
]
# MedMO generates comprehensive clinical reports with findings and impressions
```

## πŸ—οΈ Model Architecture

MedMO is built on **Qwen3-VL-4B-Instruct** and trained through a 4-stage progressive pipeline:

1. **Stage 1 - General Medical SFT**: Large-scale training on 18.5M image-text pairs for foundational medical understanding
2. **Stage 2 - High-Resolution & Grounding**: Training on 3M curated samples at 1280Γ—1280 resolution for spatial localization
3. **Stage 3 - Instruction Tuning**: Fine-tuning on 4.3M instruction-response pairs for task-specific alignment
4. **Stage 4 - Reinforcement Learning**: GRPO training with verifiable rewards (label accuracy, bbox IoU) for enhanced grounding

**Total Training Data**: 26M+ samples from 45 medical datasets spanning diverse modalities and anatomical systems.
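The bbox-IoU reward used in Stage 4 can be illustrated with a minimal sketch. This is a generic intersection-over-union computation over `[x1, y1, x2, y2]` boxes, not MedMO's actual training code:

```python
def box_iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes.

    A generic IoU sketch of the kind of verifiable reward signal
    described for Stage 4; not MedMO's training implementation.
    """
    # Intersection rectangle (clamped to zero width/height if disjoint)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Half-overlapping boxes: intersection 50, union 150, so IoU = 1/3
print(box_iou([0, 0, 10, 10], [5, 0, 15, 10]))
```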



For detailed benchmark results, please refer to our paper.

## πŸ“„ Citation

If you use MedMO in your research, please cite our paper:

```bibtex
@article{deria2026medmo,
  title={MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images},
  author={Deria, Ankan and Kumar, Komal and Dukre, Adinath Madhavrao and Segal, Eran and Khan, Salman and Razzak, Imran},
  journal={arXiv preprint arXiv:2602.06965},
  year={2026}
}
```


## πŸ“œ License

This project is licensed under the **Apache License 2.0** - see the [LICENSE](LICENSE) file for details.