---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - medical
  - multimodal
  - grounding
  - report-generation
  - radiology
  - clinical-reasoning
  - mri
  - ct
  - histopathology
  - x-ray
  - fundus
---


# MedMO-8B-Next: Grounding and Understanding Multimodal Large Language Model for Medical Images

[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2602.06965)
[![Model](https://img.shields.io/badge/🤗-MedMO--4B-blue)](https://huggingface.co/MBZUAI/MedMO-4B)
[![Model](https://img.shields.io/badge/🤗-MedMO--4B--Next-blue)](https://huggingface.co/MBZUAI/MedMO-4B-Next)
[![Model](https://img.shields.io/badge/🤗-MedMO--8B-blue)](https://huggingface.co/MBZUAI/MedMO-8B)
[![Model](https://img.shields.io/badge/🤗-MedMO--8B--Next-blue)](https://huggingface.co/MBZUAI/MedMO-8B-Next)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

<p align="center">
  <img src="MedMO-logo.png" alt="MedMO Logo" width="300"/>
</p>

**MedMO-8B-Next** is the latest iteration of the MedMO family, an open-source multimodal foundation model built for medical image understanding and grounding. Trained on **26M+ diverse medical samples across 45 datasets**, MedMO-8B-Next achieves the **highest average score on the medical VQA benchmarks reported below** and the second-highest average on the Text QA benchmarks (behind MedMO-8B). It supports VQA, Text QA, grounding, and report generation tasks.

---

## 🏆 Benchmark Performance

### VQA & Text QA Results

MedMO-8B-Next achieves the highest average score on the medical VQA benchmarks (72.7), ahead of strong baselines including Lingshu-7B and Fleming-VL-8B. On the Text QA benchmarks it has the second-highest average (60.1), behind MedMO-8B (61.3).

> VQA-RAD and SLAKE are reported as closed / all · Medbullets is reported as op4 / op5

#### Medical VQA Benchmarks

| Model | MMMU-Med | VQA-RAD (closed/all) | SLAKE (closed/all) | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | **Avg.** |
|---|---|---|---|---|---|---|---|---|
| Lingshu-7B | 54.0 | 77.2 / 43.0 | 82.4 / 33.2 | 41.9 | 54.2 | 82.9 | 26.9 | 55.1 |
| Fleming-VL-8B | 63.3 | 78.4 / 56.4 | <u>86.9 / 80.0</u> | 56.5 | 64.3 | 88.2 | 21.6 | 66.1 |
| MediX-R1-8B | 63.3 | 75.2 / 51.6 | 70.3 / 54.4 | 41.0 | 55.3 | 73.8 | 24.9 | 57.1 |
| MedMO-4B | 54.6 | 50.9 / 35.0 | 41.0 / 30.0 | 42.4 | 50.6 | 79.7 | 24.8 | 45.4 |
| MedMO-8B | <u>64.6</u> | 72.3 / 64.7 | 70.6 / 70.0 | 56.3 | 59.4 | 84.8 | 26.2 | 63.2 |
| MedMO-4B-Next | 58.7 | <u>79.7 / 59.6</u> | 78.0 / 74.0 | **73.3** | **75.7** | <u>90.6</u> | <u>27.0</u> | <u>68.5</u> |
| **MedMO-8B-Next** | **69.3** | **86.4 / 68.0** | **83.0 / 81.6** | <u>56.3</u> | <u>74.1</u> | **93.3** | **42.9** | **72.7** |

#### Medical Text QA Benchmarks

| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets (op4/op5) | MedXpertQA | SGPQA | **Avg.** |
|---|---|---|---|---|---|---|---|---|
| Lingshu-7B | 69.6 | 75.8 | 56.3 | 63.5 | 62.0 / 53.8 | 16.4 | 27.5 | 53.1 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5 / 37.3 | 12.1 | 24.9 | 45.7 |
| MediX-R1-8B | 79.0 | 73.4 | 60.1 | 85.8 | 55.1 / 47.0 | 14.4 | 34.3 | 56.1 |
| MedMO-4B | 75.7 | <u>78.0</u> | 58.0 | 78.5 | 57.5 / 47.7 | 16.4 | 29.4 | 55.1 |
| MedMO-8B | **81.0** | 77.6 | **65.0** | **84.3** | **66.5 / 60.2** | <u>19.9</u> | **36.0** | **61.3** |
| MedMO-4B-Next | 74.8 | **78.2** | 58.1 | 78.3 | 57.4 / 47.6 | 16.5 | 29.5 | 55.0 |
| **MedMO-8B-Next** | <u>80.2</u> | 75.6 | <u>62.0</u> | <u>83.8</u> | <u>65.2 / 57.8</u> | **20.9** | <u>35.5</u> | <u>60.1</u> |

> **Bold** = best result, <u>underline</u> = second-best result.
> * Benchmarked on AMD MI210 GPU.

---

### Supported Imaging Modalities

| Domain | Modalities |
|---|---|
| Radiology | X-ray, CT, MRI, Ultrasound |
| Pathology | Whole-slide imaging, Microscopy |
| Ophthalmology | Fundus photography, OCT |
| Dermatology | Clinical skin images |
| Nuclear Medicine | PET, SPECT |

---

## 🚀 Quick Start

### Installation

```bash
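pip install transformers torch qwen-vl-utils accelerate  # accelerate is needed for device_map="auto"
pip install flash-attn  # optional: only needed for attn_implementation="flash_attention_2"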
```

### Basic Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

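# Load model in bfloat16; device_map="auto" places the weights on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "MBZUAI/MedMO-8B-Next",
    torch_dtype=torch.bfloat16,
    # Requires the flash-attn package; remove this argument to use the default attention
    attn_implementation="flash_attention_2",
    device_map="auto",
)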

processor = AutoProcessor.from_pretrained("MBZUAI/MedMO-8B-Next")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/medical/image.png",
            },
            {"type": "text", "text": "What abnormalities are present in this chest X-ray?"},
        ],
    }
]

# Process and generate
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
# generate() returns prompt + new tokens; drop the prompt so only the answer is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
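
The trimming step in the snippet above exists because `generate` returns the prompt tokens followed by the newly generated tokens. The toy sketch below shows the same slicing with plain Python lists standing in for the token-ID tensors. The ID values are made up for illustration and do not correspond to real tokens.

```python
# Toy stand-ins for tensors; the IDs are invented for illustration only.
prompt_ids = [[101, 7, 8, 9]]            # plays the role of inputs.input_ids
full_ids = [[101, 7, 8, 9, 42, 43, 2]]   # plays the role of model.generate(...) output

# Same slicing as in Basic Usage: skip len(prompt) tokens, keep the rest.
trimmed = [out[len(inp):] for inp, out in zip(prompt_ids, full_ids)]
print(trimmed)  # [[42, 43, 2]]
```

Without this step, the decoded string would begin with the chat template and question text before the model's answer.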

### Example: Disease Localization with Bounding Boxes

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "chest_xray.png"},
            {"type": "text", "text": "Detect and localize all abnormalities in this image."},
        ],
    }
]
# Example output:
# "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
```
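
To use the localization output downstream, the boxes need to be pulled out of the generated string. The following is a minimal sketch that assumes the output always follows the `label <box>[[...], ...]</box>` shape of the example above. `parse_boxes` and `BOX_PATTERN` are hypothetical names, not part of MedMO or `transformers`. The coordinate order and scale (pixels or normalized) are not stated here, so the sketch returns the four numbers unchanged.

```python
import json
import re

# Assumed pattern, inferred only from the single example output above.
BOX_PATTERN = re.compile(r"(?P<label>[^<>]*?)\s*<box>(?P<boxes>.*?)</box>", re.DOTALL)


def parse_boxes(text):
    """Return (label, [four numbers]) pairs for every <box>...</box> span in text."""
    results = []
    for match in BOX_PATTERN.finditer(text):
        label = match.group("label").strip()
        try:
            boxes = json.loads(match.group("boxes"))
        except json.JSONDecodeError:
            continue  # skip malformed spans rather than guessing at their content
        if not isinstance(boxes, list):
            continue
        for box in boxes:
            # Keep only well-formed boxes: a list of exactly four numbers.
            if (
                isinstance(box, list)
                and len(box) == 4
                and all(isinstance(v, (int, float)) for v in box)
            ):
                results.append((label, box))
    return results


example = "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
parsed = parse_boxes(example)
for label, box in parsed:
    print(label, box)
```

Malformed or missing box spans yield an empty list, so callers can treat "no detections" and "unparseable output" the same way or add stricter checks as needed.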

### Example: Radiology Report Generation

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "ct_scan.png"},
            {"type": "text", "text": "Generate a detailed radiology report for this CT scan."},
        ],
    }
]
# MedMO-8B-Next generates comprehensive clinical reports with findings and impressions
```
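
The examples above differ only in the image path and the prompt text, so the message structure can be factored into a small helper. This is a convenience sketch; `build_messages` is a hypothetical name and not part of the MedMO release.

```python
def build_messages(image_path, prompt):
    """Build a single-turn, single-image chat message list (hypothetical helper)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Same request as the report-generation example above.
messages = build_messages(
    "ct_scan.png",
    "Generate a detailed radiology report for this CT scan.",
)
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in Basic Usage.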



---

## 📦 Model Family

| Model | Parameters | Best For |
|---|---|---|
| [MedMO-8B-Next](https://huggingface.co/MBZUAI/MedMO-8B-Next) | 8B | Highest VQA average and second-highest Text QA average; **recommended** |
| [MedMO-4B-Next](https://huggingface.co/MBZUAI/MedMO-4B-Next) | 4B | Second-highest VQA average; high accuracy in resource-constrained environments |
| [MedMO-8B](https://huggingface.co/MBZUAI/MedMO-8B) | 8B | Previous generation |
| [MedMO-4B](https://huggingface.co/MBZUAI/MedMO-4B) | 4B | Resource-constrained environments |

---

## 📄 Citation

If you use MedMO in your research, please cite our paper:

```bibtex
@article{deria2026medmo,
  title={MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images},
  author={Deria, Ankan and Kumar, Komal and Dukre, Adinath Madhavrao and Segal, Eran and Khan, Salman and Razzak, Imran},
  journal={arXiv preprint arXiv:2602.06965},
  year={2026}
}
```

---

## 📜 License

This project is licensed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for details.