	---
	license: cc-by-nc-4.0
	pipeline_tag: image-text-to-text
	tags:
	- medical
	- chest-x-ray
	- radiology
	- multi-modal
	- multi-task
	- vision-language
	- report-generation
	- visual-grounding
	- vqa
	---

	<!-- markdownlint-disable first-line-h1 -->
	<!-- markdownlint-disable html -->

	# M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation [IEEE TNNLS]

	<p align="center">
	📝 <a href="https://arxiv.org/abs/2408.16213" target="_blank">arXiv</a> •
	📖 <a href="https://ieeexplore.ieee.org/abstract/document/11106750" target="_blank">IEEE TNNLS</a> •
	🤗 <a href="https://huggingface.co/Deepnoid/M4CXR-TNNLS" target="_blank">Model</a> •
	🧩 <a href="https://github.com/deepnoid-ai/M4CXR-TNNLS" target="_blank">Code</a>
	</p>

	## Introduction

	M4CXR is a multi-modal large language model (MLLM) designed for chest X-ray (CXR) interpretation, capable of handling multiple tasks in a unified conversational framework. It is trained on a visual instruction-following dataset assembled from diverse CXR tasks, and supports:

	- 📝 Medical Report Generation (MRG) — single-image, multi-image, and multi-study (with prior reports) scenarios, powered by a chain-of-thought (CoT) prompting strategy for state-of-the-art clinical accuracy.
	- 🎯 Visual Grounding — localizing anatomical regions or findings described in free-text phrases.
	- 💬 Visual Question Answering (VQA) — answering open-ended questions about CXR images, including difference VQA across studies.

	## Abstract

	> The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the LLMs' capability for multitask learning or lacking clinical accuracy. This article presents M4CXR, a multimodal LLM designed to enhance CXR interpretation. The model is trained on a visual instruction-following dataset that integrates various task-specific datasets in a conversational format. As a result, the model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA). M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought (CoT) prompting strategy, in which it identifies findings in CXR images and subsequently generates corresponding reports. The model is adaptable to various MRG scenarios depending on the available inputs, such as single-image, multiimage, and multistudy contexts. In addition to MRG, M4CXR performs visual grounding at a level comparable to specialized models and demonstrates outstanding performance in VQA. Both quantitative and qualitative assessments reveal M4CXR's versatility in MRG, visual grounding, and VQA, while consistently maintaining clinical accuracy.

	## Get Started

	### Install dependencies

	```bash
	pip install -r requirements.txt
	```

	### Basic Inference

	A minimal example — load the model, feed a chest X-ray with a text question, and get a response.
	The full runnable script is available as [interface.py](./interface.py).

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

	from interface import do_generate, load_image_from_url


	# Setup
	device = torch.device("cuda")
	dtype = torch.bfloat16

	# Load processor, model, and generation config
	processor = AutoProcessor.from_pretrained("Deepnoid/M4CXR-TNNLS", trust_remote_code=True)
	generation_config = GenerationConfig.from_pretrained("Deepnoid/M4CXR-TNNLS")
	model = AutoModelForCausalLM.from_pretrained(
	    "Deepnoid/M4CXR-TNNLS",
	    trust_remote_code=True,
	    torch_dtype=dtype,
	    device_map=device,
	)

	# Prepare a batch of images and questions
	images = [
	    load_image_from_url(
	        "https://upload.wikimedia.org/wikipedia/commons/a/a1/Normal_posteroanterior_%28PA%29_chest_radiograph_%28X-ray%29.jpg"
	    ),
	    load_image_from_url(
	        "https://upload.wikimedia.org/wikipedia/commons/a/a1/Normal_posteroanterior_%28PA%29_chest_radiograph_%28X-ray%29.jpg"
	    ),
	]
	questions = [
	    "radiology image: <image> What is the view of this chest X-ray?",
	    "radiology image: <image> Provide a description of the findings in the radiology image.",
	]

	# Build prompts with the chat template
	prompts = [
	    processor.apply_chat_template([{"role": "user", "content": q}], tokenize=False)
	    for q in questions
	]

	# Generate
	generation_config.do_sample = False
	outputs = do_generate(prompts, images, model, processor, generation_config)
	print(outputs)
	```

	## Task-specific Usage

	M4CXR supports diverse CXR interpretation tasks through single- or multi-turn conversations. Full runnable examples are provided in [task_examples.py](./task_examples.py).

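	The examples below use the helpers from [interface.py](./interface.py) and the multi-turn driver (`do_generate_multi_turn`) defined in [task_examples.py](./task_examples.py). They reuse `model`, `processor`, and `generation_config` from Basic Inference, and variables such as `image`, `image_pa`, and `image_lat` are assumed to hold already-loaded images. The chain-of-thought examples share this list of candidate findings: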

	```python
	findings = (
	    "enlarged cardiomediastinum, cardiomegaly, lung opacity, lung lesion, edema, "
	    "consolidation, pneumonia, atelectasis, pneumothorax, pleural Effusion, "
	    "pleural other, fracture, support devices"
	)
	```

	### 1. Single-image Medical Report Generation (CoT)

	The model first predicts findings from a list of candidates, then writes the report conditioned on its own predictions.

	```python
	images = [image]
	questions = [
	    f"radiology image: <image> Which of the following findings are present in the radiology image? Findings: {findings}",
	    "Based on the previous conversation, provide a description of the findings in the radiology image.",
	]
	chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
	```
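The two-turn flow above can be pictured as a loop that feeds each answer back into the conversation before the next question is asked. The sketch below is not the repository's `do_generate_multi_turn`; `run_multi_turn` and `generate_fn` are hypothetical names, and `generate_fn` stands in for whatever call produces a reply from the accumulated messages.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_multi_turn(
    questions: List[str],
    generate_fn: Callable[[List[Message]], str],
) -> List[Message]:
    """Ask the questions in order, keeping every earlier turn in the history."""
    messages: List[Message] = []
    for question in questions:
        messages.append({"role": "user", "content": question})
        # The reply is conditioned on all earlier questions and answers.
        answer = generate_fn(list(messages))
        messages.append({"role": "assistant", "content": answer})
    return messages


def fake_generate(messages: List[Message]) -> str:
    # Stand-in for the model: reports how many messages it was shown.
    return f"reply after {len(messages)} message(s)"


chat = run_multi_turn(["first question", "second question"], fake_generate)
for message in chat:
    print(f"{message['role']}: {message['content']}")
```

In this picture, the second question ("Based on the previous conversation, ...") only works because the first answer, the predicted findings, is already part of the history when the report is requested.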

	### 2. Multi-image Medical Report Generation (CoT)

	Multiple views of the same study can be provided in a single prompt.

	```python
	images = [image_pa, image_lat]  # e.g., PA + lateral
	image_tokens = " ".join("<image>" for _ in images)
	questions = [
	    f"radiology images: {image_tokens} Which of the following findings are present in the radiology images? Findings: {findings}",
	    "Based on the previous conversation, provide a description of the findings in the radiology images.",
	]
	chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
	```

	### 3. Multi-study Medical Report Generation (CoT)

	Condition on prior images and the prior report to generate a follow-up report that references temporal changes.

	```python
	prior_images = [prior_pa, prior_lat]
	prior_report = "The lungs are clear. There is no pneumothorax."
	follow_up_images = [current_pa, current_lat]
	images = prior_images + follow_up_images

	prior_tokens = " ".join("<image>" for _ in prior_images)
	current_tokens = " ".join("<image>" for _ in follow_up_images)

	questions = [
	    (
	        f"prior radiology images: {prior_tokens}, prior radiology report: {prior_report} "
	        f"follow-up images: {current_tokens}, The radiology studies are given in chronological order. "
	        f"Which of the following findings are present in the current follow-up radiology images? "
	        f"Findings: {findings}"
	    ),
	    "Based on the previous conversation, provide a description of the findings in the current follow-up radiology images.",
	]
	chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
	```
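The example above uses one `<image>` placeholder per image, so it can help to build the first-turn prompt in a small function and check the placeholder count before generating. The helper below is a hypothetical sketch (`build_multi_study_prompt` is not part of the repository). It reproduces the prompt wording shown above and assumes placeholders are matched to the `images` list in order, prior study first.

```python
IMAGE_TOKEN = "<image>"


def build_multi_study_prompt(
    num_prior_images: int,
    prior_report: str,
    num_follow_up_images: int,
    findings: str,
) -> str:
    """Build the first-turn multi-study question with one placeholder per image."""
    prior_tokens = " ".join(IMAGE_TOKEN for _ in range(num_prior_images))
    current_tokens = " ".join(IMAGE_TOKEN for _ in range(num_follow_up_images))
    return (
        f"prior radiology images: {prior_tokens}, prior radiology report: {prior_report} "
        f"follow-up images: {current_tokens}, The radiology studies are given in chronological order. "
        "Which of the following findings are present in the current follow-up radiology images? "
        f"Findings: {findings}"
    )


prompt = build_multi_study_prompt(
    num_prior_images=2,
    prior_report="The lungs are clear. There is no pneumothorax.",
    num_follow_up_images=1,
    findings="cardiomegaly, edema",
)
# Sanity check: the number of placeholders should equal the number of images passed in.
assert prompt.count(IMAGE_TOKEN) == 3
print(prompt)
```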

	### 4. Visual Grounding

	Given a phrase, the model returns the bounding box of the region it describes.

	```python
	images = [image]
	phrase = "right lower lobe"
	questions = [
	    f"radiology image: <image> Provide the bounding box coordinate of the region this phrase describes: {phrase}",
	]
	chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
	```
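This README does not describe the text format of the returned bounding box, so any post-processing has to start from an assumption. The sketch below assumes, purely for illustration, that the reply contains four numbers in `x1, y1, x2, y2` order normalized to the range 0 to 1. `parse_box` and `to_pixels` are hypothetical helpers, and the sample reply is hand-written rather than real model output. Check an actual response before relying on either assumption.

```python
import re
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]


def parse_box(reply: str) -> Optional[Box]:
    """Return the first four numbers found in the reply, or None if there are fewer."""
    numbers = re.findall(r"\d+(?:\.\d+)?", reply)
    if len(numbers) < 4:
        return None
    x1, y1, x2, y2 = (float(n) for n in numbers[:4])
    return (x1, y1, x2, y2)


def to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale a box assumed to be normalized to [0, 1] to integer pixel coordinates."""
    x1, y1, x2, y2 = box
    return (round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height))


# Hypothetical reply text, written by hand for this example (not real model output).
reply = "[0.55, 0.60, 0.90, 0.95]"
box = parse_box(reply)
if box is not None:
    print(to_pixels(box, width=1024, height=1024))
```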

	### 5. Report Summarization

	Chain MRG with a follow-up summarization turn to obtain a concise one-sentence summary.

	```python
	images = [image]
	questions = [
	    f"radiology image: <image> Which of the following findings are present in the radiology image? Findings: {findings}",
	    "Based on the previous conversation, provide a description of the findings in the radiology image.",
	    "Summarize the description in one concise sentence.",
	]
	chats = do_generate_multi_turn(questions, images, model, processor, generation_config)
	```

	## Citation

	If you use M4CXR in your research, please cite:

	```bibtex
	@article{park2025m4cxr,
	  author={Park, Jonggwon and Kim, Soobum and Yoon, Byungmu and Hyun, Jihun and Choi, Kyoyun},
	  journal={IEEE Transactions on Neural Networks and Learning Systems},
	  title={M4CXR: Exploring Multitask Potentials of Multimodal Large Language Models for Chest X-Ray Interpretation},
	  year={2025},
	  volume={36},
	  number={10},
	  pages={17841-17855},
	  doi={10.1109/TNNLS.2025.3587687}
	}
	```

	## References

	- Pretrained models
	  - Vision encoder: [RAD-DINO](https://huggingface.co/microsoft/rad-dino)
	  - Language model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
	- Visual projector
	  - C-Abstractor from [Honeybee (CVPR 2024)](https://github.com/khanrc/honeybee)

	## Acknowledgments

	This work was supported by the Technology Innovation Program (RS-2025-02221011, Development of Medical-Specialized Multimodal Hyperscale Generative AI Technology for Global Integration) funded by the Ministry of Trade, Industry & Energy (MOTIE, South Korea).

	## License

	[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)

	Released under CC BY-NC 4.0. The model and its outputs are provided for research purposes only and are not intended for clinical use or medical decision-making.