---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- medical
- multimodal
- grounding
- report-generation
- radiology
- clinical-reasoning
- mri
- ct
- histopathology
- x-ray
- fundus
---
# MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
[Paper (arXiv:2602.06965)](https://arxiv.org/abs/2602.06965) | [MedMO-8B-Next](https://huggingface.co/MBZUAI/MedMO-8B-Next) | [MedMO-8B](https://huggingface.co/MBZUAI/MedMO-8B) | [MedMO-4B](https://huggingface.co/MBZUAI/MedMO-4B) | [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
<p align="center">
<img src="MedMO-logo.png" alt="MedMO Logo" width="300"/>
</p>
**MedMO** is a powerful open-source multimodal foundation model designed for comprehensive medical image understanding and grounding. Built on Qwen3-VL architecture and trained on 26M+ diverse medical samples across 45 datasets, MedMO achieves state-of-the-art performance across multiple medical imaging tasks.
## Capabilities
MedMO excels at a comprehensive range of medical imaging tasks:
- **Visual Question Answering (VQA)**: Answer complex questions about medical images across radiology, pathology, ophthalmology, and dermatology
- **Text-Based Medical QA**: Clinical reasoning and medical knowledge question answering
- **Radiology Report Generation**: Generate detailed, clinically accurate radiology reports from medical images
- **Disease Localization with Bounding Boxes**: Precise spatial detection and localization of pathological findings
- **Anatomical Grounding**: Spatial localization and grounding of anatomical structures
- **Clinical Reasoning**: Step-by-step diagnostic reasoning and clinical decision support
- **Diagnostic Classification**: Multi-class disease classification across diverse imaging modalities
- **Spatial Object Detection**: Fine-grained detection in microscopy, pathology slides, and cellular imaging
- **Medical Report Summarization**: Extract and summarize key clinical findings from complex medical reports
### Supported Modalities
- Radiology (X-ray, CT, MRI, Ultrasound)
- Pathology & Microscopy
- Ophthalmology (Fundus, OCT)
- Dermatology
- Nuclear Medicine (PET, SPECT)
## Quick Start
### Installation
```bash
pip install transformers torch qwen-vl-utils
# Optional: only needed if you load the model with attn_implementation="flash_attention_2"
pip install flash-attn --no-build-isolation
```
### Basic Usage
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
"MBZUAI/MedMO-4B",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
)
processor = AutoProcessor.from_pretrained("MBZUAI/MedMO-4B")
# Prepare your input
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "path/to/medical/image.png",
},
{"type": "text", "text": "What abnormalities are present in this chest X-ray?"},
],
}
]
# Process and generate
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)
# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
### Example: Disease Localization with Bounding Boxes
```python
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "chest_xray.png"},
{"type": "text", "text": "Detect and localize all abnormalities in this image."},
],
}
]
# Output: "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
```
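The `<box>` coordinates in the output can be parsed back into usable bounding boxes for plotting or evaluation. A minimal sketch, assuming the tag format shown in the example above (the `parse_boxes` helper name is illustrative, not part of the MedMO API):

```python
import re
import ast

def parse_boxes(output_text: str):
    """Extract (label, (x1, y1, x2, y2)) pairs from '<box>[[...]]</box>'-style model output."""
    boxes = []
    # Capture the label preceding each <box>...</box> span and the coordinate list inside it
    for match in re.finditer(r"(\w[\w\s]*?)\s*<box>(\[\[.*?\]\])</box>", output_text):
        label = match.group(1).strip()
        coords = ast.literal_eval(match.group(2))  # e.g. [[156, 516, 231, 607], ...]
        boxes.extend((label, tuple(box)) for box in coords)
    return boxes

example = "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
print(parse_boxes(example))
# [('Fractures', (156, 516, 231, 607)), ('Fractures', (240, 529, 296, 581))]
```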
### Example: Report Generation
```python
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "ct_scan.png"},
{"type": "text", "text": "Generate a detailed radiology report for this CT scan."},
],
}
]
# MedMO generates comprehensive clinical reports with findings and impressions
```
## Model Architecture
MedMO is built on **Qwen3-VL-4B-Instruct** and trained through a 4-stage progressive pipeline:
1. **Stage 1 - General Medical SFT**: Large-scale training on 18.5M image-text pairs for foundational medical understanding
2. **Stage 2 - High-Resolution & Grounding**: Training on 3M curated samples at 1280Γ1280 resolution for spatial localization
3. **Stage 3 - Instruction Tuning**: Fine-tuning on 4.3M instruction-response pairs for task-specific alignment
4. **Stage 4 - Reinforcement Learning**: GRPO training with verifiable rewards (label accuracy, bbox IoU) for enhanced grounding
**Total Training Data**: 26M+ samples from 45 medical datasets spanning diverse modalities and anatomical systems.
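The bbox IoU reward used in Stage 4 is the standard intersection-over-union between a predicted and a ground-truth box. A minimal sketch of that computation (the exact reward shaping used during GRPO training may differ; see the paper):

```python
def bbox_iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero width/height if the boxes are disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(bbox_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```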
For detailed benchmark results, please refer to our paper.
## Citation
If you use MedMO in your research, please cite our paper:
```bibtex
@article{deria2026medmo,
title={MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images},
author={Deria, Ankan and Kumar, Komal and Dukre, Adinath Madhavrao and Segal, Eran and Khan, Salman and Razzak, Imran},
journal={arXiv preprint arXiv:2602.06965},
year={2026}
}
```
## License
This project is licensed under the **Apache License 2.0** - see the [LICENSE](LICENSE) file for details.