---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
tags:
- medical
- chest-x-ray
- radiology
- multi-modal
- multi-task
- vision-language
- report-generation
- visual-grounding
- vqa
---

<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->

# M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation [IEEE TNNLS]

<p align="center">
πŸ“ <a href="https://arxiv.org/abs/2408.16213" target="_blank">arXiv</a> β€’
πŸ“– <a href="https://ieeexplore.ieee.org/abstract/document/11106750" target="_blank">IEEE TNNLS</a> β€’
πŸ€— <a href="https://huggingface.co/Deepnoid/M4CXR-TNNLS" target="_blank">Model</a> β€’
🧩 <a href="https://github.com/deepnoid-ai/M4CXR-TNNLS" target="_blank">Codes</a>
</p>

## Introduction

**M4CXR** is a multi-modal large language model (MLLM) designed for **chest X-ray (CXR) interpretation**, capable of handling **multiple tasks** in a unified conversational framework. It is trained on a visual instruction-following dataset assembled from diverse CXR tasks, and supports:

- πŸ“ **Medical Report Generation (MRG)** β€” single-image, multi-image, and multi-study (with prior reports) scenarios, powered by a **chain-of-thought (CoT)** prompting strategy for state-of-the-art clinical accuracy.
- 🎯 **Visual Grounding** β€” localizing anatomical regions or findings described in free-text phrases.
- πŸ’¬ **Visual Question Answering (VQA)** β€” answering open-ended questions about CXR images, including difference VQA across studies.

## Abstract

> The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the LLMs' capability for multitask learning or lacking clinical accuracy. This article presents M4CXR, a multimodal LLM designed to enhance CXR interpretation. The model is trained on a visual instruction-following dataset that integrates various task-specific datasets in a conversational format. As a result, the model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA). M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought (CoT) prompting strategy, in which it identifies findings in CXR images and subsequently generates corresponding reports. The model is adaptable to various MRG scenarios depending on the available inputs, such as single-image, multiimage, and multistudy contexts. In addition to MRG, M4CXR performs visual grounding at a level comparable to specialized models and demonstrates outstanding performance in VQA. Both quantitative and qualitative assessments reveal M4CXR's versatility in MRG, visual grounding, and VQA, while consistently maintaining clinical accuracy.

## Get Started

### Install dependencies

```bash
pip install -r requirements.txt
```

### Basic Inference

A minimal example: load the model, feed it a chest X-ray with a text question, and get a response.
The full runnable script is available as [interface.py](./interface.py).

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from interface import do_generate, load_image_from_url


# Setup
device = torch.device("cuda")
dtype = torch.bfloat16

# Load processor, model, and generation config
processor = AutoProcessor.from_pretrained("Deepnoid/M4CXR-TNNLS", trust_remote_code=True)
generation_config = GenerationConfig.from_pretrained("Deepnoid/M4CXR-TNNLS")
model = AutoModelForCausalLM.from_pretrained(
    "Deepnoid/M4CXR-TNNLS",
    trust_remote_code=True,
    torch_dtype=dtype,
    device_map=device,
)

# Prepare a batch of images and questions
images = [
    load_image_from_url(
        "https://upload.wikimedia.org/wikipedia/commons/a/a1/Normal_posteroanterior_%28PA%29_chest_radiograph_%28X-ray%29.jpg"
    ),
    load_image_from_url(
        "https://upload.wikimedia.org/wikipedia/commons/a/a1/Normal_posteroanterior_%28PA%29_chest_radiograph_%28X-ray%29.jpg"
    ),
]
questions = [
    "radiology image: <image> What is the view of this chest X-ray?",
    "radiology image: <image> Provide a description of the findings in the radiology image.",
]

# Build prompts with the chat template
prompts = [
    processor.apply_chat_template([{"role": "user", "content": q}], tokenize=False)
    for q in questions
]

# Generate
generation_config.do_sample = False
outputs = do_generate(prompts, images, model, processor, generation_config)
print(outputs)
```

## Task-specific Usage

M4CXR supports diverse CXR interpretation tasks through single- or multi-turn conversations. Full runnable examples are provided in [task_examples.py](./task_examples.py).

The examples below use the helpers from [interface.py](./interface.py) and the multi-turn driver defined in [task_examples.py](./task_examples.py). They share the following list of candidate findings:

```python
findings = (
    "enlarged cardiomediastinum, cardiomegaly, lung opacity, lung lesion, edema, "
    "consolidation, pneumonia, atelectasis, pneumothorax, pleural Effusion, "
    "pleural other, fracture, support devices"
)
```
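The candidate list is a plain comma-separated string, so it can be split into labels and compared against a first-turn answer. The sketch below is only an illustration: `split_findings` and `match_findings` are hypothetical helpers that are not part of this repository, the answer format is an assumption, and the naive substring match does not handle negation (for example "no pneumothorax").

```python
from typing import List


def split_findings(findings: str) -> List[str]:
    """Split a comma-separated findings string into lowercase labels."""
    return [label.strip().lower() for label in findings.split(",") if label.strip()]


def match_findings(answer: str, candidates: List[str]) -> List[str]:
    """Return the candidate labels that literally appear in the answer text.

    Assumption: the model repeats label names verbatim. Negated mentions
    are NOT filtered out by this simple check.
    """
    text = answer.lower()
    return [label for label in candidates if label in text]


# Shortened candidate list, for illustration only.
candidates = split_findings("cardiomegaly, edema, pleural Effusion")
matched = match_findings("Cardiomegaly and pleural effusion.", candidates)
print(matched)
```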

### 1. Single-image Medical Report Generation (CoT)

The model first predicts findings from a list of candidates, then writes the report conditioned on its own predictions.

```python
images = [image]
questions = [
    f"radiology image: <image> Which of the following findings are present in the radiology image? Findings: {findings}",
    "Based on the previous conversation, provide a description of the findings in the radiology image.",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```
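The two-turn pattern above depends on the second question seeing the first answer. The implementation of `do_generate_multi_turn` is not shown in this card, so the following is only a minimal sketch of the general idea: `run_multi_turn` and `fake_generate` are hypothetical names, and `fake_generate` stands in for the real model call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_multi_turn(
    questions: List[str],
    generate_fn: Callable[[List[Message]], str],
) -> List[Message]:
    """Ask each question in order, feeding earlier answers back as context."""
    messages: List[Message] = []
    for question in questions:
        messages.append({"role": "user", "content": question})
        # generate_fn receives the whole history, so a later turn is
        # conditioned on the answers produced in earlier turns.
        answer = generate_fn(messages)
        messages.append({"role": "assistant", "content": answer})
    return messages


def fake_generate(messages: List[Message]) -> str:
    """Stand-in for the model call; it only reports how many messages it saw."""
    return f"answer after {len(messages)} message(s)"


chat = run_multi_turn(
    ["Which findings are present?", "Describe the findings."],
    fake_generate,
)
for message in chat:
    print(f"{message['role']}: {message['content']}")
```

In a real run, `generate_fn` would presumably apply the chat template to the accumulated messages and call the model, as in the Basic Inference example.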

### 2. Multi-image Medical Report Generation (CoT)

Multiple views of the same study can be provided in a single prompt.

```python
images = [image_pa, image_lat]  # e.g., PA + lateral
image_tokens = " ".join("<image>" for _ in images)
questions = [
    f"radiology images: {image_tokens} Which of the following findings are present in the radiology images? Findings: {findings}",
    "Based on the previous conversation, provide a description of the findings in the radiology images.",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```
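When prompts are assembled by hand, it is easy to end up with a different number of `<image>` placeholders than images. The sketch below builds the image prefix in the same style as the examples and checks the count before generation. Both helper names are hypothetical, and the requirement that the counts match is an assumption rather than something stated in this card.

```python
from typing import List


def build_image_prompt(num_images: int, instruction: str) -> str:
    """Prefix an instruction with one <image> placeholder per input image."""
    if num_images < 1:
        raise ValueError("at least one image is required")
    noun = "image" if num_images == 1 else "images"
    tokens = " ".join("<image>" for _ in range(num_images))
    return f"radiology {noun}: {tokens} {instruction}"


def check_placeholders(prompt: str, images: List[object]) -> None:
    """Raise if the number of <image> placeholders differs from the image count."""
    found = prompt.count("<image>")
    if found != len(images):
        raise ValueError(f"{found} placeholder(s) but {len(images)} image(s)")


prompt = build_image_prompt(2, "What is the view of these chest X-rays?")
check_placeholders(prompt, ["pa_view", "lateral_view"])  # passes silently
print(prompt)
```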

### 3. Multi-study Medical Report Generation (CoT)

Condition on prior images and the prior report to generate a follow-up report that references temporal changes.

```python
prior_images = [prior_pa, prior_lat]
prior_report = "The lungs are clear. There is no pneumothorax."
follow_up_images = [current_pa, current_lat]
images = prior_images + follow_up_images

prior_tokens = " ".join("<image>" for _ in prior_images)
current_tokens = " ".join("<image>" for _ in follow_up_images)

questions = [
    (
        f"prior radiology images: {prior_tokens}, prior radiology report: {prior_report} "
        f"follow-up images: {current_tokens}, The radiology studies are given in chronological order. "
        f"Which of the following findings are present in the current follow-up radiology images? "
        f"Findings: {findings}"
    ),
    "Based on the previous conversation, provide a description of the findings in the current follow-up radiology images.",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```

### 4. Visual Grounding

Given a phrase, the model returns the bounding box of the region it describes.

```python
images = [image]
phrase = "right lower lobe"
questions = [
    f"radiology image: <image> Provide the bounding box coordinate of the region this phrase describes: {phrase}",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```
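The grounding answer comes back as text, so downstream code typically needs to turn it into numbers. This card does not document the exact output format, so the sketch below simply assumes the response contains at least four numeric values and takes the first four. `parse_bbox` is a hypothetical helper, and the coordinate order and scale (pixels or normalized) are not specified here.

```python
import re
from typing import Optional, Tuple

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_bbox(response: str) -> Optional[Tuple[float, float, float, float]]:
    """Return the first four numbers in the response, or None if fewer are found."""
    values = [float(match) for match in _NUMBER.findall(response)]
    if len(values) < 4:
        return None
    return (values[0], values[1], values[2], values[3])


# Made-up response string, for illustration only.
box = parse_bbox("[0.12, 0.40, 0.55, 0.91]")
print(box)
```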

### 5. Report Summarization

Chain MRG with a follow-up summarization turn to obtain a concise one-sentence summary.

```python
images = [image]
questions = [
    f"radiology image: <image> Which of the following findings are present in the radiology image? Findings: {findings}",
    "Based on the previous conversation, provide a description of the findings in the radiology image.",
    "Summarize the description in one concise sentence.",
]
chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
```

## Citation

If you use M4CXR in your research, please cite:

```bibtex
@article{park2025m4cxr,
  author={Park, Jonggwon and Kim, Soobum and Yoon, Byungmu and Hyun, Jihun and Choi, Kyoyun},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  title={M4CXR: Exploring Multitask Potentials of Multimodal Large Language Models for Chest X-Ray Interpretation},
  year={2025},
  volume={36},
  number={10},
  pages={17841-17855},
  doi={10.1109/TNNLS.2025.3587687}
}
```

## References

- **Pretrained models**
  - **Vision encoder**: [RAD-DINO](https://huggingface.co/microsoft/rad-dino)
  - **Language model**: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- **Visual projector**
  - **C-Abstractor** from [Honeybee (CVPR 2024)](https://github.com/khanrc/honeybee)

## Acknowledgments

This work was supported by the Technology Innovation Program (RS-2025-02221011, Development of Medical-Specialized Multimodal Hyperscale Generative AI Technology for Global Integration) funded by the Ministry of Trade, Industry & Energy (MOTIE, South Korea).

## License

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)

Released under **CC BY-NC 4.0**. The model and its outputs are provided **for research purposes only** and are **not intended for clinical use or medical decision-making**.