---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - medical
  - multimodal
  - grounding
  - report-generation
  - radiology
  - clinical-reasoning
  - mri
  - ct
  - histopathology
  - x-ray
  - fundus
---


# MedMO-8B-Next: Grounding and Understanding Multimodal Large Language Model for Medical Images

[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2602.06965)
[![Model](https://img.shields.io/badge/🤗-MedMO--4B-blue)](https://huggingface.co/MBZUAI/MedMO-4B)
[![Model](https://img.shields.io/badge/🤗-MedMO--4B--Next-blue)](https://huggingface.co/MBZUAI/MedMO-4B-Next)
[![Model](https://img.shields.io/badge/🤗-MedMO--8B-blue)](https://huggingface.co/MBZUAI/MedMO-8B)
[![Model](https://img.shields.io/badge/🤗-MedMO--8B--Next-blue)](https://huggingface.co/MBZUAI/MedMO-8B-Next)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

<p align="center">
  <img src="MedMO-logo.png" alt="MedMO Logo" width="300"/>
</p>

**MedMO-8B-Next** is the latest iteration of the MedMO family, an open-source multimodal foundation model purpose-built for medical image understanding and grounding. Trained on **26M+ diverse medical samples across 45 datasets**, MedMO-8B-Next covers VQA, Text QA, grounding, and report generation. Among the models compared below, it achieves the **highest average score on medical VQA benchmarks** and the second-highest on medical Text QA benchmarks.

---

## 🏆 Benchmark Performance

### VQA & Text QA Results

MedMO-8B-Next achieves the highest average score on the medical VQA benchmarks (72.7), ahead of strong baselines including Lingshu-7B and Fleming-VL-8B. On the Text QA benchmarks it has the second-highest average (60.1), behind MedMO-8B (61.3).

> OMIVQA = OmniMedVQA · MedXQA = MedXpertQA · Medbullets reported as op4/op5

#### Medical VQA Benchmarks

| Model | MMMU-Med | VQA-RAD (closed/all) | SLAKE (closed/all) | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | **Avg.** |
|---|---|---|---|---|---|---|---|---|
| Lingshu-7B | 54.0 | 77.2 / 43.0 | 82.4 / 33.2 | 41.9 | 54.2 | 82.9 | 26.9 | 55.1 |
| Fleming-VL-8B | 63.3 | 78.4 / 56.4 | <u>86.9 / 80.0</u> | 56.5 | 64.3 | 88.2 | 21.6 | 66.1 |
| MediX-R1-8B | 63.3 | 75.2 / 51.6 | 70.3 / 54.4 | 41.0 | 55.3 | 73.8 | 24.9 | 57.1 |
| MedMO-4B | 54.6 | 50.9 / 35.0 | 41.0 / 30.0 | 42.4 | 50.6 | 79.7 | 24.8 | 45.4 |
| MedMO-8B | <u>64.6</u> | 72.3 / 64.7 | 70.6 / 70.0 | 56.3 | 59.4 | 84.8 | 26.2 | 63.2 |
| MedMO-4B-Next | 58.7 | <u>79.7 / 59.6</u> | 78.0 / 74.0 | **73.3** | **75.7** | <u>90.6</u> | <u>27.0</u> | <u>68.5</u> |
| **MedMO-8B-Next** | **69.3** | **86.4 / 68.0** | **83.0 / 81.6** | <u>56.3</u> | <u>74.1</u> | **93.3** | **42.9** | **72.7** |

#### Medical Text QA Benchmarks

| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets (op4/op5) | MedXpertQA | SGPQA | **Avg.** |
|---|---|---|---|---|---|---|---|---|
| Lingshu-7B | 69.6 | 75.8 | 56.3 | 63.5 | 62.0 / 53.8 | 16.4 | 27.5 | 53.1 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5 / 37.3 | 12.1 | 24.9 | 45.7 |
| MediX-R1-8B | 79.0 | 73.4 | 60.1 | 85.8 | 55.1 / 47.0 | 14.4 | 34.3 | 56.1 |
| MedMO-4B | 75.7 | <u>78.0</u> | 58.0 | 78.5 | 57.5 / 47.7 | 16.4 | 29.4 | 55.1 |
| MedMO-8B | **81.0** | 77.6 | **65.0** | **84.3** | **66.5 / 60.2** | <u>19.9</u> | **36.0** | **61.3** |
| MedMO-4B-Next | 74.8 | **78.2** | 58.1 | 78.3 | 57.4 / 47.6 | 16.5 | 29.5 | 55.0 |
| **MedMO-8B-Next** | <u>80.2</u> | 75.6 | <u>62.0</u> | <u>83.8</u> | <u>65.2 / 57.8</u> | **20.9** | <u>35.5</u> | <u>60.1</u> |

> **Bold** = best result, <u>underline</u> = second-best result.
> * Benchmarked on AMD MI210 GPU.

---

### Supported Imaging Modalities

| Domain | Modalities |
|---|---|
| Radiology | X-ray, CT, MRI, Ultrasound |
| Pathology | Whole-slide imaging, Microscopy |
| Ophthalmology | Fundus photography, OCT |
| Dermatology | Clinical skin images |
| Nuclear Medicine | PET, SPECT |

---

## 🚀 Quick Start

### Installation

```bash
# accelerate is required for device_map="auto" in the example below
pip install transformers torch accelerate qwen-vl-utils
```

### Basic Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "MBZUAI/MedMO-8B-Next",
    torch_dtype=torch.bfloat16,
    # Requires the optional flash-attn package; remove this argument to use the default attention.
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("MBZUAI/MedMO-8B-Next")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/medical/image.png",
            },
            {"type": "text", "text": "What abnormalities are present in this chest X-ray?"},
        ],
    }
]

# Process and generate
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```

### Example: Disease Localization with Bounding Boxes

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "chest_xray.png"},
            {"type": "text", "text": "Detect and localize all abnormalities in this image."},
        ],
    }
]
# Example output:
# "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
```
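
The grounding output above is plain text, so it usually needs to be parsed before the boxes can be drawn or evaluated. The following is a minimal sketch, assuming the output always follows the `label <box>[[...], ...]</box>` pattern shown in the example. `parse_boxes` is a hypothetical helper, not part of the MedMO release, and the meaning of the four numbers (pixel versus normalized coordinates, corner order) is not stated here, so the values are returned unchanged.

```python
import json
import re

# Assumed format: "<label> <box>[[a, b, c, d], ...]</box>", as in the example output.
BOX_PATTERN = re.compile(r"([^<>]*?)\s*<box>(.*?)</box>", re.DOTALL)


def parse_boxes(output_text):
    """Return a list of (label, (a, b, c, d)) tuples found in the model output."""
    results = []
    for label, payload in BOX_PATTERN.findall(output_text):
        try:
            boxes = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
        if not isinstance(boxes, list):
            continue
        # Accept a single flat box such as [1, 2, 3, 4] as well as a list of boxes.
        if boxes and all(isinstance(v, (int, float)) for v in boxes):
            boxes = [boxes]
        for box in boxes:
            if isinstance(box, list) and len(box) == 4:
                results.append((label.strip(), tuple(box)))
    return results


sample = "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
for label, box in parse_boxes(sample):
    print(label, box)
```

Malformed or missing `<box>` spans are skipped rather than raising, which keeps batch evaluation running when the model produces free-form text.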

### Example: Radiology Report Generation

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "ct_scan.png"},
            {"type": "text", "text": "Generate a detailed radiology report for this CT scan."},
        ],
    }
]
# MedMO-8B-Next generates comprehensive clinical reports with findings and impressions
```
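
The examples above all use the same single-turn message structure and differ only in the image and the prompt. As a convenience, that structure can be wrapped in a small function. This is a sketch only: `build_messages` is a hypothetical helper name, and the result is meant to be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in Basic Usage.

```python
def build_messages(image, prompt):
    """Build a single-turn chat request with one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_messages(
    "ct_scan.png", "Generate a detailed radiology report for this CT scan."
)
```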



---

## 📦 Model Family

| Model | Parameters | Best For |
|---|---|---|
| [MedMO-8B-Next](https://huggingface.co/MBZUAI/MedMO-8B-Next) | 8B | Highest medical VQA average, all tasks (**recommended**) |
| [MedMO-4B-Next](https://huggingface.co/MBZUAI/MedMO-4B-Next) | 4B | Second-highest medical VQA average; high accuracy in resource-constrained environments |
| [MedMO-8B](https://huggingface.co/MBZUAI/MedMO-8B) | 8B | Previous generation |
| [MedMO-4B](https://huggingface.co/MBZUAI/MedMO-4B) | 4B | Resource-constrained environments |

---

## 📄 Citation

If you use MedMO in your research, please cite our paper:

```bibtex
@article{deria2026medmo,
  title={MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images},
  author={Deria, Ankan and Kumar, Komal and Dukre, Adinath Madhavrao and Segal, Eran and Khan, Salman and Razzak, Imran},
  journal={arXiv preprint arXiv:2602.06965},
  year={2026}
}
```

---

## 📜 License

This project is licensed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for details.