---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
tags:
- medical
- chest-x-ray
- radiology
- multi-modal
- multi-task
- vision-language
- report-generation
- visual-grounding
- vqa
---
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
# M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation [IEEE TNNLS]
<p align="center">
<a href="https://arxiv.org/abs/2408.16213" target="_blank">arXiv</a> •
<a href="https://ieeexplore.ieee.org/abstract/document/11106750" target="_blank">IEEE TNNLS</a> •
🤗 <a href="https://huggingface.co/Deepnoid/M4CXR-TNNLS" target="_blank">Model</a> •
🧩 <a href="https://github.com/deepnoid-ai/M4CXR-TNNLS" target="_blank">Codes</a>
</p>
## Introduction
**M4CXR** is a multi-modal large language model (MLLM) designed for **chest X-ray (CXR) interpretation**, capable of handling **multiple tasks** in a unified conversational framework. It is trained on a visual instruction-following dataset assembled from diverse CXR tasks, and supports:
- **Medical Report Generation (MRG)**: single-image, multi-image, and multi-study (with prior reports) scenarios, powered by a **chain-of-thought (CoT)** prompting strategy for state-of-the-art clinical accuracy.
- 🎯 **Visual Grounding**: localizing anatomical regions or findings described in free-text phrases.
- 💬 **Visual Question Answering (VQA)**: answering open-ended questions about CXR images, including difference VQA across studies.
## Abstract
> The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the LLMs' capability for multitask learning or lacking clinical accuracy. This article presents M4CXR, a multimodal LLM designed to enhance CXR interpretation. The model is trained on a visual instruction-following dataset that integrates various task-specific datasets in a conversational format. As a result, the model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA). M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought (CoT) prompting strategy, in which it identifies findings in CXR images and subsequently generates corresponding reports. The model is adaptable to various MRG scenarios depending on the available inputs, such as single-image, multiimage, and multistudy contexts. In addition to MRG, M4CXR performs visual grounding at a level comparable to specialized models and demonstrates outstanding performance in VQA. Both quantitative and qualitative assessments reveal M4CXR's versatility in MRG, visual grounding, and VQA, while consistently maintaining clinical accuracy.
## Get Started
### Install dependencies
```bash
pip install -r requirements.txt
```
### Basic Inference
A minimal example: load the model, feed it a chest X-ray with a text question, and get a response.
The full runnable script is available as [interface.py](./interface.py).
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from interface import do_generate, load_image_from_url

# Setup
device = torch.device("cuda")
dtype = torch.bfloat16

# Load processor, model, and generation config
processor = AutoProcessor.from_pretrained("Deepnoid/M4CXR-TNNLS", trust_remote_code=True)
generation_config = GenerationConfig.from_pretrained("Deepnoid/M4CXR-TNNLS")
model = AutoModelForCausalLM.from_pretrained(
    "Deepnoid/M4CXR-TNNLS",
    trust_remote_code=True,
    torch_dtype=dtype,
    device_map=device,
)

# Prepare a batch of images and questions (one image per question)
images = [
    load_image_from_url(
        "https://upload.wikimedia.org/wikipedia/commons/a/a1/Normal_posteroanterior_%28PA%29_chest_radiograph_%28X-ray%29.jpg"
    ),
    load_image_from_url(
        "https://upload.wikimedia.org/wikipedia/commons/a/a1/Normal_posteroanterior_%28PA%29_chest_radiograph_%28X-ray%29.jpg"
    ),
]
questions = [
    "radiology image: <image> What is the view of this chest X-ray?",
    "radiology image: <image> Provide a description of the findings in the radiology image.",
]

# Build prompts with the chat template
prompts = [
    processor.apply_chat_template([{"role": "user", "content": q}], tokenize=False)
    for q in questions
]

# Generate with greedy decoding (sampling disabled)
generation_config.do_sample = False
outputs = do_generate(prompts, images, model, processor, generation_config)
print(outputs)
```
## Task-specific Usage
M4CXR supports diverse CXR interpretation tasks through single- or multi-turn conversations. Full runnable examples are provided in [task_examples.py](./task_examples.py).
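The examples below use the helpers from [interface.py](./interface.py) and the multi-turn driver (`do_generate_multi_turn`) defined in [task_examples.py](./task_examples.py). They assume that `model`, `processor`, and `generation_config` are loaded as in Basic Inference, and that each image variable (`image`, `image_pa`, and so on) already holds a loaded image. The report-generation examples share the following list of candidate findings: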
```python
findings = (
    "enlarged cardiomediastinum, cardiomegaly, lung opacity, lung lesion, edema, "
    "consolidation, pneumonia, atelectasis, pneumothorax, pleural Effusion, "
    "pleural other, fracture, support devices"
)
```
### 1. Single-image Medical Report Generation (CoT)
The model first predicts findings from a list of candidates, then writes the report conditioned on its own predictions.
```python
images = [image]
questions = [
    f"radiology image: <image> Which of the following findings are present in the radiology image? Findings: {findings}",
    "Based on the previous conversation, provide a description of the findings in the radiology image.",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```
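The first-turn answer can be turned into a structured list of predicted findings. The sketch below is only an illustration: the helper `extract_findings` and the sample answer string are hypothetical, and it assumes the model replies with finding names taken from the candidate list. It matches whole phrases case-insensitively and does not handle negation (for example, "no pneumothorax" would still match).

```python
import re

# Candidate names copied from the shared `findings` string above.
CANDIDATES = [
    "enlarged cardiomediastinum", "cardiomegaly", "lung opacity", "lung lesion",
    "edema", "consolidation", "pneumonia", "atelectasis", "pneumothorax",
    "pleural Effusion", "pleural other", "fracture", "support devices",
]


def extract_findings(answer, candidates=CANDIDATES):
    """Return the candidates mentioned in `answer`, in candidate-list order."""
    found = []
    for name in candidates:
        # Whole-phrase, case-insensitive match; negation is NOT detected.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, answer, flags=re.IGNORECASE):
            found.append(name)
    return found


# Hypothetical first-turn answer, for illustration only.
sample_answer = "Cardiomegaly, pleural effusion, support devices"
print(extract_findings(sample_answer))
```

If the actual answers use a different format, the matching rule would need to be adapted accordingly.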
### 2. Multi-image Medical Report Generation (CoT)
Multiple views of the same study can be provided in a single prompt.
```python
images = [image_pa, image_lat]  # e.g., PA + lateral
image_tokens = " ".join("<image>" for _ in images)
questions = [
    f"radiology images: {image_tokens} Which of the following findings are present in the radiology images? Findings: {findings}",
    "Based on the previous conversation, provide a description of the findings in the radiology images.",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```
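The single-image and multi-image prompts differ only in the number of `<image>` placeholders and in the singular or plural wording. A small helper can build both CoT turns from the image count; `build_cot_questions` is a hypothetical name used here for illustration and is not part of `interface.py` or `task_examples.py`.

```python
def build_cot_questions(num_images, findings):
    """Build the two CoT turns (finding prediction, then report) for one study."""
    if num_images < 1:
        raise ValueError("at least one image is required")
    # Match the wording of the single-image and multi-image examples above.
    noun = "image" if num_images == 1 else "images"
    image_tokens = " ".join("<image>" for _ in range(num_images))
    return [
        f"radiology {noun}: {image_tokens} Which of the following findings are "
        f"present in the radiology {noun}? Findings: {findings}",
        "Based on the previous conversation, provide a description of the "
        f"findings in the radiology {noun}.",
    ]


# Two views (e.g., PA + lateral) with a shortened candidate list.
for turn in build_cot_questions(2, "cardiomegaly, edema"):
    print(turn)
```

The returned list can be passed as `questions` to `do_generate_multi_turn`, together with an `images` list of the same length as `num_images`.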
### 3. Multi-study Medical Report Generation (CoT)
Condition on prior images and the prior report to generate a follow-up report that references temporal changes.
```python
prior_images = [prior_pa, prior_lat]
prior_report = "The lungs are clear. There is no pneumothorax."
follow_up_images = [current_pa, current_lat]

# Images are passed in chronological order: prior study first, then follow-up
images = prior_images + follow_up_images
prior_tokens = " ".join("<image>" for _ in prior_images)
current_tokens = " ".join("<image>" for _ in follow_up_images)
questions = [
    (
        f"prior radiology images: {prior_tokens}, prior radiology report: {prior_report} "
        f"follow-up images: {current_tokens}, The radiology studies are given in chronological order. "
        f"Which of the following findings are present in the current follow-up radiology images? "
        f"Findings: {findings}"
    ),
    "Based on the previous conversation, provide a description of the findings in the current follow-up radiology images.",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```
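Because the multi-study prompt mixes two groups of placeholders, it is easy to pass a number of images that does not match the number of `<image>` tokens. The sketch below wraps the first-turn prompt from the example above in a hypothetical helper, `build_multi_study_question`, and checks the placeholder count. It assumes the prior report itself does not contain the literal text `<image>`.

```python
def build_multi_study_question(num_prior, prior_report, num_current, findings):
    """Build the first CoT turn for a prior study plus a follow-up study."""
    prior_tokens = " ".join("<image>" for _ in range(num_prior))
    current_tokens = " ".join("<image>" for _ in range(num_current))
    question = (
        f"prior radiology images: {prior_tokens}, prior radiology report: {prior_report} "
        f"follow-up images: {current_tokens}, The radiology studies are given in chronological order. "
        "Which of the following findings are present in the current follow-up radiology images? "
        f"Findings: {findings}"
    )
    # One placeholder per image; the image list should be prior + follow-up, in that order.
    if question.count("<image>") != num_prior + num_current:
        raise ValueError("placeholder count does not match the number of images")
    return question


question = build_multi_study_question(
    num_prior=2,
    prior_report="The lungs are clear. There is no pneumothorax.",
    num_current=2,
    findings="cardiomegaly, edema",
)
print(question)
```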
### 4. Visual Grounding
Given a phrase, the model returns the bounding box of the region it describes.
```python
images = [image]
phrase = "right lower lobe"
questions = [
    f"radiology image: <image> Provide the bounding box coordinate of the region this phrase describes: {phrase}",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```
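The output format of the grounding response is not specified here. The sketch below assumes the response text contains four numbers in `(x1, y1, x2, y2)` order, normalized to `[0, 1]`; both the order and the normalization are assumptions to verify against real model output, and `parse_bbox`, `to_pixels`, and the sample response are hypothetical.

```python
import re


def parse_bbox(response):
    """Return the first four numbers in `response` as a tuple, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if len(numbers) < 4:
        return None
    x1, y1, x2, y2 = (float(n) for n in numbers[:4])
    return (x1, y1, x2, y2)


def to_pixels(box, width, height):
    """Scale a box ASSUMED to be normalized to [0, 1] into pixel coordinates."""
    x1, y1, x2, y2 = box
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))


# Hypothetical response text, for illustration only.
box = parse_bbox("[0.12, 0.55, 0.48, 0.93]")
if box is not None:
    print(to_pixels(box, width=1000, height=1000))
```

Note that any digits appearing before the coordinates in the response would be picked up by this simple pattern, so a stricter parser may be needed.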
### 5. Report Summarization
Chain MRG with a follow-up summarization turn to obtain a concise one-sentence summary.
```python
images = [image]
questions = [
    f"radiology image: <image> Which of the following findings are present in the radiology image? Findings: {findings}",
    "Based on the previous conversation, provide a description of the findings in the radiology image.",
    "Summarize the description in one concise sentence.",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```
## Citation
If you use M4CXR in your research, please cite:
```bibtex
@article{park2025m4cxr,
  author={Park, Jonggwon and Kim, Soobum and Yoon, Byungmu and Hyun, Jihun and Choi, Kyoyun},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  title={M4CXR: Exploring Multitask Potentials of Multimodal Large Language Models for Chest X-Ray Interpretation},
  year={2025},
  volume={36},
  number={10},
  pages={17841-17855},
  doi={10.1109/TNNLS.2025.3587687}
}
```
## References
- **Pretrained models**
  - **Vision encoder**: [RAD-DINO](https://huggingface.co/microsoft/rad-dino)
  - **Language model**: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- **Visual projector**
  - **C-Abstractor** from [Honeybee (CVPR 2024)](https://github.com/khanrc/honeybee)
## Acknowledgments
This work was supported by the Technology Innovation Program (RS-2025-02221011, Development of Medical-Specialized Multimodal Hyperscale Generative AI Technology for Global Integration) funded by the Ministry of Trade, Industry & Energy (MOTIE, South Korea).
## License
Released under [**CC BY-NC 4.0**](https://creativecommons.org/licenses/by-nc/4.0/). The model and its outputs are provided **for research purposes only** and are **not intended for clinical use or medical decision-making**.