	---
	license: apache-2.0
	library_name: transformers
	---
	# RadVLM Model Card
	A Multitask Conversational Vision-Language Model for Radiology (paper: https://arxiv.org/abs/2502.03333).

	This card links to the RadVLM GitHub repository and provides the inference code for running RadVLM once it has been trained following the repository's instructions.

	New: The instruction dataset (PhysioNet datasets only) is published on the PhysioNet platform: https://physionet.org/content/radvlm-instruction-dataset/1.0.0/

	New: The model weights are published on the PhysioNet platform: https://physionet.org/content/radvlm-model/1.0.0/



	## GitHub Repository
	The code for data curation, fine-tuning, and evaluation is available in the following GitHub repository: https://github.com/uzh-dqbm-cmi/RadVLM.git


	## Model Development

	- Developed by: KrauthammerLab, University of Zurich, ETH Zurich, Kyoto University of Applied Science, Kobe University, Swiss AI Initiative
	- Contributors: Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer

	## Model Overview

	RadVLM is a compact, multitask vision-language model designed for conversational Chest X-ray (CXR) interpretation. Unlike traditional models focused solely on report generation, RadVLM supports interactive, multi-turn diagnostic conversations. It has been fine-tuned on a large-scale instruction dataset containing over 1 million image-instruction pairs, covering tasks such as abnormality classification, visual grounding, and structured conversations.

	## Intended Use
	- Primary Use Cases
	  - Medical Education: Supporting radiology trainees in learning CXR interpretation through interactive Q&A.
	  - Preliminary Findings: Generating structured observations from CXRs to complement radiology reports.
	- Out-of-Scope Uses
	  - Clinical Decision Making: RadVLM is not a replacement for a licensed radiologist and should not be used as the sole basis for medical decisions.
	  - Automated Diagnosis: The model does not provide definitive diagnoses and should be used as a supplementary tool.
	  - Use Outside of CXR Interpretation: The model has been trained specifically for Chest X-rays and is not designed for other medical imaging modalities.

	## Inputs and Outputs
	- Input:
	  - Image: A frontal Chest X-ray (PIL Image or NumPy array).
	  - Text: A user prompt (free-text query about the image).
	  - Chat History (optional): Multi-turn interaction history.
	- Output:
	  - Text Response: A natural language answer to the user's query.
	  - Bounding Boxes (if applicable): Coordinates indicating the location of anatomical structures or abnormalities.
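
This card does not specify how bounding boxes are written in the text response. As a minimal sketch, assuming boxes appear as bracketed lists of four values in `[x1, y1, x2, y2]` order, normalized to the range 0 to 1 (an assumption to verify against real model outputs), the hypothetical helper `extract_boxes` below converts them to pixel coordinates. The example sentence is made up for illustration.

```python
import re

# ASSUMPTION: boxes look like "[x1, y1, x2, y2]" with values normalized to [0, 1].
# Check this against actual RadVLM outputs before relying on it.
BOX_PATTERN = re.compile(
    r"\[\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*\]"
)


def extract_boxes(response, width, height):
    """Return pixel-space (x1, y1, x2, y2) tuples found in a text response."""
    boxes = []
    for match in BOX_PATTERN.finditer(response):
        x1, y1, x2, y2 = (float(value) for value in match.groups())
        boxes.append((
            round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height),
        ))
    return boxes


# Hypothetical response text, not a real model output.
example = "The cardiac silhouette is located at [0.25, 0.5, 0.75, 0.875]."
print(extract_boxes(example, width=1024, height=1024))
```

Scaling by the image width and height is only needed if the coordinates are indeed normalized; if the model returns pixel values, the multiplication should be dropped.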

	## Model Architecture
	- Backbone: LLaVA-OneVision-7B (https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-si-hf), a vision-language model adapted for medical tasks.
	- Vision Encoder: SigLIP, used for image feature extraction.
	- Instruction Tuning: Fine-tuned with multi-task objectives, covering report generation, abnormality detection, and multi-turn Q&A.

	## Training Data
	RadVLM was trained on a large-scale instruction dataset derived from publicly available medical sources:

	- MIMIC-CXR: Radiology reports paired with images.
	- CheXpert: Abnormality classification labels.
	- VinDr-CXR: Manually annotated abnormality locations.
	- Chest Imagenome: Bounding boxes for anatomical regions.
	- MS-CXR & PadChest-GR: Phrase grounding data.

	All data sources were de-identified and anonymized prior to use.


	## Dependencies

	```
	pip install torch torchvision
	pip install transformers==4.46.0
	```
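
Because `transformers` is pinned to an exact version, it can help to confirm the environment before loading the model. The following is a minimal sketch using only the standard library. `check_pins` and `PINNED` are hypothetical names introduced here, not part of the RadVLM repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pin taken from the install command above.
PINNED = {"transformers": "4.46.0"}


def check_pins(pins, get_version=version):
    """Return {package: problem} for every package that does not match its pin."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


# An empty dict means the environment matches the pins.
print(check_pins(PINNED))
```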


	## Inference function

	The `inference_radvlm` function below handles both single-turn and multi-turn conversations, carrying the chat history forward so that context is preserved across exchanges.

	```python
	import re

	import torch


	def inference_radvlm(model, processor, image, prompt, chat_history=None, max_new_tokens=1500):
	    """
	    Generate a response using RadVLM in either single-turn or multi-turn mode.

	    Args:
	        model: The RadVLM model.
	        processor: The processor for RadVLM (provides apply_chat_template and tokenization).
	        image: A PIL Image or NumPy array representing the input image.
	        prompt: The user prompt for this turn.
	        chat_history: A list of (user_msg, assistant_msg) tuples representing the conversation so far.
	            If None or empty, single-turn mode is used. Even in single-turn mode,
	            this function returns chat_history so that you can continue in subsequent turns.
	        max_new_tokens: The maximum number of new tokens to generate.

	    Returns:
	        response (str): The assistant's response for this turn.
	        chat_history (list): The updated chat_history including this turn's (prompt, response).
	    """

	    # Initialize chat history if not provided
	    if chat_history is None:
	        chat_history = []

	    # Rebuild the previous turns; only the first user message carries the image
	    conversation = []
	    for idx, (user_text, assistant_text) in enumerate(chat_history):
	        user_content = [{"type": "text", "text": user_text}]
	        if idx == 0:
	            user_content.append({"type": "image"})
	        conversation.append({"role": "user", "content": user_content})
	        conversation.append({
	            "role": "assistant",
	            "content": [{"type": "text", "text": assistant_text}],
	        })

	    # Add the current user prompt (with the image only on the first turn)
	    current_content = [{"type": "text", "text": prompt}]
	    if len(chat_history) == 0:
	        current_content.append({"type": "image"})
	    conversation.append({"role": "user", "content": current_content})

	    # Apply the chat template to create the full prompt
	    full_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

	    # Prepare model inputs (floating-point tensors are cast to float16)
	    inputs = processor(images=image, text=full_prompt, return_tensors="pt", padding=True).to(
	        model.device, torch.float16
	    )

	    # Generate the response with greedy decoding
	    with torch.inference_mode():
	        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

	    # Decode the output and keep the text after the last role marker.
	    # Note: this also splits on the words "user" or "assistant" inside the reply.
	    full_response = processor.decode(output[0], skip_special_tokens=True)
	    response = re.split(r"(user|assistant)", full_response)[-1].strip()

	    # Update chat history (the list passed in is modified in place)
	    chat_history.append((prompt, response))

	    return response, chat_history
	```
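
To make the message layout concrete without loading the model, the sketch below rebuilds the `conversation` list that the function above passes to `processor.apply_chat_template`. `build_conversation` is a hypothetical helper written for illustration, and the assistant reply in the example is a placeholder rather than a real model output.

```python
def build_conversation(chat_history, prompt):
    """Mirror the message layout used by inference_radvlm (illustrative only)."""
    # Treat the new prompt as a final turn that has no assistant reply yet.
    turns = list(chat_history) + [(prompt, None)]
    conversation = []
    for idx, (user_text, assistant_text) in enumerate(turns):
        content = [{"type": "text", "text": user_text}]
        if idx == 0:
            # Only the first user message carries the image placeholder.
            content.append({"type": "image"})
        conversation.append({"role": "user", "content": content})
        if assistant_text is not None:
            conversation.append({
                "role": "assistant",
                "content": [{"type": "text", "text": assistant_text}],
            })
    return conversation


history = [("What can you say about this X-ray?", "Placeholder answer.")]
messages = build_conversation(history, "Is the cardiac silhouette normal?")
for message in messages:
    print(message["role"], [part["type"] for part in message["content"]])
```

The image is passed to the processor on every call, but the `{"type": "image"}` placeholder appears only once, in the first user message, so the prompt always refers to a single image regardless of how many turns follow.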

	## Quick-Start: Multi-turn Demo
	The example below uses the `inference_radvlm` function in a multi-turn conversation.
	Before running it, set the variable `model_id` to the local path containing the model weights.

	```python
	# Assumes inference_radvlm from the previous section is already defined
	import requests
	import torch
	from PIL import Image
	from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

	# Initialize the model and processor
	model_id = "your/local/folder/with/RadVLM/weights"
	model = LlavaOnevisionForConditionalGeneration.from_pretrained(
	    model_id,
	    torch_dtype=torch.float16,
	    low_cpu_mem_usage=True,
	).to("cuda")  # Use "cuda" if a GPU is available, else "cpu"

	processor = AutoProcessor.from_pretrained(model_id)

	image_url = "https://prod-images-static.radiopaedia.org/images/29923576/fed73420497c8622734f21ce20fc91_gallery.jpeg"
	image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

	# Initialize chat history
	chat_history = []

	# First user prompt with image from URL
	user_prompt_1 = "What can you say about this X-ray?"
	response_1, chat_history = inference_radvlm(model, processor, image, user_prompt_1, chat_history)

	print("RadVLM:", response_1)

	# Second user prompt, continuing the conversation
	user_prompt_2 = "Is there something concerning in the lungs area?"
	response_2, chat_history = inference_radvlm(model, processor, image, user_prompt_2, chat_history)

	print("RadVLM:", response_2)

	# Third user prompt
	user_prompt_3 = "What about the cardiac silhouette? Is it normal?"
	response_3, chat_history = inference_radvlm(model, processor, image, user_prompt_3, chat_history)

	print("RadVLM:", response_3)
	```
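
The three calls above follow the same pattern, so they can be folded into a loop that threads `chat_history` through each turn. The sketch below is one way to do that. `run_conversation` and `echo_infer` are hypothetical names, and `echo_infer` is a stand-in that returns the same shape as `inference_radvlm` so the loop can be tried without model weights.

```python
def run_conversation(infer, prompts):
    """Send prompts in order, passing the updated chat_history to each call."""
    chat_history = []
    for prompt in prompts:
        response, chat_history = infer(prompt=prompt, chat_history=chat_history)
        print(f"User: {prompt}\nRadVLM: {response}\n")
    return chat_history


# Stand-in for the real model call. With the model loaded, use instead:
#   from functools import partial
#   infer = partial(inference_radvlm, model, processor, image)
def echo_infer(prompt, chat_history=None):
    chat_history = list(chat_history or [])
    response = f"(stub reply to turn {len(chat_history) + 1})"
    chat_history.append((prompt, response))
    return response, chat_history


history = run_conversation(
    echo_infer,
    ["What can you say about this X-ray?", "Is the cardiac silhouette normal?"],
)
```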

	## References

	To cite RadVLM, please use the following BibTeX entry:

	```bibtex
	@misc{deperrois2025radvlmmultitaskconversationalvisionlanguage,
	  title={RadVLM: A Multitask Conversational Vision-Language Model for Radiology},
	  author={Nicolas Deperrois and Hidetoshi Matsuo and Samuel Ruipérez-Campillo and Moritz Vandenhirtz and Sonia Laguna and Alain Ryser and Koji Fujimoto and Mizuho Nishio and Thomas M. Sutter and Julia E. Vogt and Jonas Kluckert and Thomas Frauenfelder and Christian Blüthgen and Farhad Nooralahzadeh and Michael Krauthammer},
	  year={2025},
	  eprint={2502.03333},
	  archivePrefix={arXiv},
	  primaryClass={cs.CV},
	  url={https://arxiv.org/abs/2502.03333},
	}
	```