	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: image-text-to-text
	tags:
	- medical
	- multimodal
	- grounding
	- report-generation
	- radiology
	- clinical-reasoning
	- mri
	- ct
	- histopathology
	- x-ray
	- fundus
	---


	# MedMO-8B-Next: Grounding and Understanding Multimodal Large Language Model for Medical Images

	[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2602.06965)
	[![Model](https://img.shields.io/badge/🤗-MedMO--8B--Next-blue)](https://huggingface.co/MBZUAI/MedMO-8B-Next)
	[![Model](https://img.shields.io/badge/🤗-MedMO--8B-blue)](https://huggingface.co/MBZUAI/MedMO-8B)
	[![Model](https://img.shields.io/badge/🤗-MedMO--4B-blue)](https://huggingface.co/MBZUAI/MedMO-4B)
	[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

	<p align="center">
	<img src="MedMO-logo.png" alt="MedMO Logo" width="300"/>
	</p>

MedMO-8B-Next is the latest and strongest model in the MedMO family, an open-source multimodal foundation model built for medical image understanding and grounding. Trained on 26M+ medical samples drawn from 45 datasets, it achieves the highest average scores on the medical VQA and Text QA benchmarks reported below, and is reported to outperform both open-source and closed-source competitors on grounding and report generation tasks.

	---

	## 🏆 Benchmark Performance

	### VQA & Text QA Results

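MedMO-8B-Next achieves the highest average score on both the medical VQA and Text QA benchmarks, ahead of strong baselines including Lingshu-7B and Fleming-VL-8B. It leads on most individual benchmarks; the exceptions are SLAKE and PathVQA, where Fleming-VL-8B scores higher, and PubMedQA, where Lingshu-7B is ahead by 0.2 points.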

> OMIVQA = OmniMedVQA · MedXQA = MedXpertQA · VQA-RAD and SLAKE reported as closed / all · Medbullets reported as op4 / op5

	#### Medical VQA Benchmarks

| Model | MMMU-Med | VQA-RAD (closed / all) | SLAKE (closed / all) | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Lingshu-7B | 54.0 | 77.2 / 43.0 | 82.4 / 33.2 | 61.9 | 54.2 | 82.9 | 26.9 | 57.3 |
| Fleming-VL-8B | 63.3 | 78.4 / 56.0 | **86.9** / **80.0** | **62.9** | 64.3 | 88.2 | 21.6 | 66.8 |
| MedMO-8B-Next | **65.3** | **80.4** / **65.0** | 75.5 / 74.7 | 57.3 | **70.3** | **88.8** | **48.9** | **69.6** |
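
The text does not define how the Avg. column is computed. The VQA values are consistent with an unweighted mean over the nine reported scores, counting the closed and all splits of VQA-RAD and SLAKE as separate entries. That reading is an inference from the numbers, not a documented procedure; the sketch below simply checks the arithmetic.

```python
# Scores copied from the Medical VQA table above.
# Assumption: closed and all splits each count as one entry in the average.
vqa_scores = {
    "Lingshu-7B": [54.0, 77.2, 43.0, 82.4, 33.2, 61.9, 54.2, 82.9, 26.9],
    "Fleming-VL-8B": [63.3, 78.4, 56.0, 86.9, 80.0, 62.9, 64.3, 88.2, 21.6],
    "MedMO-8B-Next": [65.3, 80.4, 65.0, 75.5, 74.7, 57.3, 70.3, 88.8, 48.9],
}


def unweighted_mean(scores):
    """Plain arithmetic mean, rounded to one decimal place as in the table."""
    return round(sum(scores) / len(scores), 1)


averages = {model: unweighted_mean(scores) for model, scores in vqa_scores.items()}
print(averages)
# {'Lingshu-7B': 57.3, 'Fleming-VL-8B': 66.8, 'MedMO-8B-Next': 69.6}
```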

	#### Medical Text QA Benchmarks

| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets (op4 / op5) | MedXpertQA | SGPQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Lingshu-7B | 69.6 | **75.8** | 56.3 | 63.5 | 62.0 / 53.8 | 16.4 | 27.5 | 51.1 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5 | 12.1 | 24.9 | 46.9 |
| MedMO-8B-Next | **80.2** | 75.6 | **62.0** | **83.8** | **65.2** / **57.8** | **20.9** | **35.5** | **60.1** |

> Bold = best result. MedMO-8B-Next achieves the highest average on both the VQA (69.6) and Text QA (60.1) benchmarks.
>
> Benchmarked on an AMD MI210 GPU.

---


	### Supported Imaging Modalities

| Domain | Modalities |
|---|---|
| Radiology | X-ray, CT, MRI, Ultrasound |
| Pathology | Whole-slide imaging, Microscopy |
| Ophthalmology | Fundus photography, OCT |
| Dermatology | Clinical skin images |
| Nuclear Medicine | PET, SPECT |

	---

	## 🚀 Quick Start

	### Installation

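```bash
pip install "transformers>=4.57.0" torch accelerate qwen-vl-utils
# Optional: pip install flash-attn (needed only for attn_implementation="flash_attention_2")
```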

	### Basic Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "MBZUAI/MedMO-8B-Next",
    torch_dtype=torch.bfloat16,
    # Requires the flash-attn package; drop this argument to use the default attention.
    attn_implementation="flash_attention_2",
    device_map="auto",  # requires the accelerate package
)

processor = AutoProcessor.from_pretrained("MBZUAI/MedMO-8B-Next")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/medical/image.png",
            },
            {"type": "text", "text": "What abnormalities are present in this chest X-ray?"},
        ],
    }
]

# Process and generate
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated answer is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
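
The loading call above requests `attn_implementation="flash_attention_2"`, which fails if the `flash-attn` package is not installed. A minimal sketch of one way to fall back is shown below. `pick_attn_implementation` is a hypothetical helper, not part of MedMO or transformers, and `"sdpa"` is the PyTorch scaled-dot-product attention backend that transformers accepts for this argument.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" when flash-attn is importable, else "sdpa".

    Hypothetical helper for illustration; it only checks that the package is
    installed, not that the current GPU actually supports it.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(f"Using attention backend: {attn_implementation}")
```

The result can then be passed as `attn_implementation=attn_implementation` in the `from_pretrained` call.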

	### Example: Disease Localization with Bounding Boxes

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "chest_xray.png"},
            {"type": "text", "text": "Detect and localize all abnormalities in this image."},
        ],
    }
]
# Pass `messages` through the same processing and generation steps as in Basic Usage.
# Example output:
# "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
```
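
To use the boxes downstream, the output string has to be parsed. The sketch below assumes every detection follows the pattern of the single example output above, a label followed by `<box>[[...], ...]</box>`; other output shapes are not covered. `parse_boxes` is a hypothetical helper. The coordinate convention (corner order, pixel versus normalized units) is not stated here, so the values are returned unchanged.

```python
import json
import re

# Assumed from the one example output: "<label> <box>[[a, b, c, d], ...]</box>"
BOX_PATTERN = re.compile(r"([^<>]*?)\s*<box>(.*?)</box>", re.DOTALL)


def parse_boxes(output_text):
    """Return a list of (label, boxes) pairs; each box is a list of 4 numbers."""
    results = []
    for label, payload in BOX_PATTERN.findall(output_text):
        try:
            boxes = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip spans whose payload is not valid JSON
        if not isinstance(boxes, list):
            continue
        # Keep only well-formed 4-element boxes
        boxes = [b for b in boxes if isinstance(b, list) and len(b) == 4]
        results.append((label.strip(), boxes))
    return results


example = "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
for label, boxes in parse_boxes(example):
    print(label, boxes)
# Fractures [[156, 516, 231, 607], [240, 529, 296, 581]]
```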

	### Example: Radiology Report Generation

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "ct_scan.png"},
            {"type": "text", "text": "Generate a detailed radiology report for this CT scan."},
        ],
    }
]
# Pass `messages` through the same processing and generation steps as in Basic Usage.
# MedMO-8B-Next generates comprehensive clinical reports with findings and impressions
```
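
The examples above share one message structure and differ only in the image path and the prompt. A small convenience wrapper might look like the sketch below; `build_messages` is a hypothetical name, not part of the MedMO or transformers API.

```python
def build_messages(image_path, prompt):
    """Build a single-turn user message with one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# The two task examples above, expressed through the helper
localization = build_messages(
    "chest_xray.png", "Detect and localize all abnormalities in this image."
)
report = build_messages(
    "ct_scan.png", "Generate a detailed radiology report for this CT scan."
)
```

Either list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in Basic Usage.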


	---

	## 📦 Model Family

| Model | Parameters | Best For |
|---|---|---|
| [MedMO-8B-Next](https://huggingface.co/MBZUAI/MedMO-8B-Next) | 8B | Highest accuracy across all tasks (recommended) |
| [MedMO-8B](https://huggingface.co/MBZUAI/MedMO-8B) | 8B | Previous generation |
| [MedMO-4B](https://huggingface.co/MBZUAI/MedMO-4B) | 4B | Resource-constrained environments |

	---

	## 📄 Citation

	If you use MedMO in your research, please cite our paper:

```bibtex
@article{deria2026medmo,
  title={MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images},
  author={Deria, Ankan and Kumar, Komal and Dukre, Adinath Madhavrao and Segal, Eran and Khan, Salman and Razzak, Imran},
  journal={arXiv preprint arXiv:2602.06965},
  year={2026}
}
```

	---

	## 📜 License

	This project is licensed under the Apache License 2.0 — see the [LICENSE](LICENSE) file for details.