--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- multimodal |
|
|
library_name: transformers |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
# <img src="assets/OctoMed.svg" alt="OctoMed Logo" width="100" style="vertical-align:bottom; margin-right:0px;" /> OctoMed-7B |
|
|
|
|
|
## Introduction |
|
|
|
|
|
OctoMed-7B is a high-performance multimodal medical reasoning model created through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o, producing the largest multimodal medical reasoning dataset to date: more than 8 million traces and 6.8 billion response tokens.
|
|
|
|
|
Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks. |
|
|
|
|
|
OctoMed-7B produces an internal reasoning trace inside \<think>...\</think> tags before writing out its final answer. In general, the model tends to reason longer on harder or ill-defined questions and keeps its reasoning traces short for easier queries.
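After decoding, the reasoning trace can be separated from the final answer with a simple string split. Below is a minimal sketch (not part of our released code), assuming the decoded text contains the literal \<think>...\</think> tags; the example response string is illustrative:

```python
import re

def split_reasoning(output_text: str):
    """Split a decoded OctoMed-7B response into (reasoning, answer)."""
    # Assumes the reasoning trace is wrapped in literal <think>...</think> tags;
    # if no tags are found, the whole response is treated as the final answer.
    match = re.search(r"<think>(.*?)</think>", output_text, flags=re.DOTALL)
    if match is None:
        return "", output_text.strip()
    return match.group(1).strip(), output_text[match.end():].strip()

# Hypothetical decoded output, e.g. output_text[0] from the Transformers example below
response = "<think>Image B shows a side view of the spine...</think>\nThe view is sagittal: \\boxed{C}"
reasoning, answer = split_reasoning(response)
print(answer)  # -> "The view is sagittal: \boxed{C}"
```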
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Medical Benchmark Performances |
|
|
|
|
|
<p align="center"> |
|
|
<img src="assets/performances.svg" alt="Medical Benchmark Performances" width="100%" /> |
|
|
</p> |
|
|
|
|
|
**Notes:** |
|
|
- Green = smaller open-source models (<10B), Cyan = large proprietary models.
|
|
- † = result of a 10-sample majority-vote ensemble (see the sketch below these notes).
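The † results follow a standard majority-vote recipe: sample several responses per question (e.g. with the sampling settings suggested below) and keep the most common extracted answer. A minimal sketch, assuming the per-sample answers have already been extracted from the final \\boxed{}; the values are illustrative:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among the sampled responses.

    `answers` is assumed to hold the option letter extracted from the final
    \\boxed{} of each sampled response; ties resolve to the first-seen answer.
    """
    return Counter(answers).most_common(1)[0][0]

# Illustrative values for one question with 10 sampled responses
print(majority_vote(["C", "C", "A", "C", "C", "B", "C", "C", "A", "C"]))  # -> "C"
```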
|
|
|
|
|
### Legacy Medical Benchmark Performance |
|
|
|
|
|
| Dataset | Setting | Performance | |
|
|
|----------|---------|--------------| |
|
|
| VQA-RAD | Open (Token F1) | 64.23 | |
|
|
| VQA-RAD | Closed (Accuracy) | 85.66 | |
|
|
| SLAKE | Open (Token F1) | 84.96 | |
|
|
| SLAKE | Closed (Accuracy) | 89.66 | |
|
|
|
|
|
We also train on the train splits of the VQA-RAD and SLAKE datasets and report the resulting performance here. For these results, we apply a **direct** prompt by appending the phrase **Answer in a short word or phrase.** to each sample. Following prior work, GPT-2 is used as the tokenizer to compute Token F1 for open-ended questions.
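For reference, here is a minimal sketch of an overlap-based Token F1 using the GPT-2 tokenizer. It is an approximation of the metric described above; the exact text normalization (e.g. lower-casing) used in our evaluation may differ:

```python
from collections import Counter
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def token_f1(prediction: str, reference: str) -> float:
    """Overlap-based F1 between the GPT-2 token multisets of prediction and reference."""
    pred_tokens = tokenizer.tokenize(prediction.lower())
    ref_tokens = tokenizer.tokenize(reference.lower())
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("left lower lobe", "the left lower lobe"))
```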
|
|
|
|
|
|
|
|
## Requirements |
|
|
We recommend installing the transformers version used in our experiments, along with the other dependencies, with this command:
|
|
``` |
|
|
pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14 |
|
|
``` |
|
|
|
|
|
## Quickstart |
|
|
|
|
|
Below, we provide some examples of how to use OctoMed-7B with 🤗 Transformers or vLLM.
|
|
|
|
|
<details> |
|
|
<summary>Inference with HF Transformers 🤗</summary> |
|
|
Here is a code snippet showing how to chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor |
|
|
from qwen_vl_utils import process_vision_info |
|
|
|
|
|
# default: Load the model on the available device(s) |
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
"OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto" |
|
|
) |
|
|
|
|
|
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios. |
|
|
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
# "OctoMed/OctoMed-7B", |
|
|
# dtype=torch.bfloat16, |
|
|
# attn_implementation="flash_attention_2", |
|
|
# device_map="auto", |
|
|
# ) |
|
|
|
|
|
# The default range for the number of visual tokens per image in the model is 4-16384. |
|
|
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost. |
|
|
min_pixels = 262144 |
|
|
max_pixels = 262144 |
|
|
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels) |
|
|
|
|
|
# Text-Only Query |
|
|
# messages = [ |
|
|
# { |
|
|
# "role": "user", |
|
|
# "content": [ |
|
|
# {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"}, |
|
|
# ], |
|
|
# } |
|
|
# ] |
|
|
|
|
|
# General Query |
|
|
# messages = [ |
|
|
# { |
|
|
# "role": "user", |
|
|
# "content": [ |
|
|
# { |
|
|
# "type": "image", |
|
|
# "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg", |
|
|
# }, |
|
|
# {"type": "text", "text": "Describe this image."}, |
|
|
# ], |
|
|
# } |
|
|
# ] |
|
|
|
|
|
# Multiple Choice Query |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "image", |
|
|
"image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg", |
|
|
}, |
|
|
{"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."}, |
|
|
], |
|
|
} |
|
|
] |
|
|
|
|
|
# Preparation for inference |
|
|
text = processor.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True |
|
|
) |
|
|
image_inputs, video_inputs = process_vision_info(messages) |
|
|
inputs = processor( |
|
|
text=[text], |
|
|
images=image_inputs, |
|
|
videos=video_inputs, |
|
|
padding=True, |
|
|
return_tensors="pt", |
|
|
) |
|
|
|
|
|
|
|
|
inputs = inputs.to(device="cuda") |
|
|
|
|
|
# Inference: Generation of the output |
|
|
generated_ids = model.generate(**inputs, max_new_tokens=8192) |
|
|
generated_ids_trimmed = [ |
|
|
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
|
] |
|
|
output_text = processor.batch_decode( |
|
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
|
) |
|
|
print(output_text) |
|
|
|
|
|
``` |
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Inference with vLLM</summary> |
|
|
|
|
|
Here we show an example of how to use OctoMed-7B with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):
|
|
|
|
|
```python |
|
|
from vllm import LLM, SamplingParams |
|
|
from transformers import AutoProcessor |
|
|
|
|
|
min_pixels = 262144 |
|
|
max_pixels = 262144 |
|
|
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels) |
|
|
|
|
|
llm = LLM( |
|
|
model="OctoMed/OctoMed-7B", |
|
|
trust_remote_code=True, |
|
|
dtype="bfloat16", |
|
|
max_model_len=8192, |
|
|
tensor_parallel_size=4, |
|
|
gpu_memory_utilization=0.8, |
|
|
limit_mm_per_prompt={"image": 1} |
|
|
) |
|
|
|
|
|
# Set up sampling parameters |
|
|
sampling_params = SamplingParams( |
|
|
temperature=0.6, |
|
|
top_p=0.95, |
|
|
max_tokens=8192, |
|
|
) |
|
|
|
|
|
image_data = [] |
|
|
|
|
|
# Text-Only Query |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."}, |
|
|
], |
|
|
} |
|
|
] |
|
|
|
|
|
# General Query |
|
|
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg'] |
|
|
# messages = [ |
|
|
# { |
|
|
# "role": "user", |
|
|
# "content": [ |
|
|
# { |
|
|
# "type": "image", |
|
|
# "image": image_data[0], |
|
|
# }, |
|
|
# {"type": "text", "text": "Describe this image."}, |
|
|
# ], |
|
|
# } |
|
|
# ] |
|
|
|
|
|
# Multiple Choice Query |
|
|
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg'] |
|
|
# messages = [ |
|
|
# { |
|
|
# "role": "user", |
|
|
# "content": [ |
|
|
# { |
|
|
# "type": "image", |
|
|
# "image": image_data[0], |
|
|
# }, |
|
|
# {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."}, |
|
|
# ], |
|
|
# } |
|
|
# ] |
|
|
|
|
|
prompt = processor.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True) |
|
|
|
|
|
if image_data: |
|
|
mm_prompt = { |
|
|
"prompt": prompt, |
|
|
"multi_modal_data": {"image": image_data} |
|
|
} |
|
|
else: |
|
|
mm_prompt = {"prompt": prompt} |
|
|
|
|
|
# Generate response |
|
|
outputs = llm.generate([mm_prompt], sampling_params) |
|
|
|
|
|
# Print the generated response |
|
|
for output in outputs: |
|
|
prompt = output.prompt |
|
|
generated_text = output.outputs[0].text |
|
|
print(f"Prompt: {prompt}") |
|
|
print(f"Generated text: {generated_text}") |
|
|
print("-" * 50) |
|
|
``` |
|
|
</details> |
|
|
|
|
|
|
|
|
|
|
|
### Suggested Hyperparameters |
|
|
To reproduce our results, we suggest using the same settings we used in evaluation:
|
|
|
|
|
Format multiple choice questions with the following template: |
|
|
``` |
|
|
{optional image(s)} |
|
|
{question} |
|
|
{options, 1 on each line} |
|
|
|
|
|
Please reason step-by-step, and put your final answer within \boxed{}.
|
|
``` |
|
|
|
|
|
Example Prompt: |
|
|
``` |
|
|
{image(s)} |
|
|
What orientation was the MRI in image B taken in? |
|
|
A. Axial


B. Coronal


C. Sagittal


D. Oblique
|
|
|
|
|
Please reason step-by-step, and put your final answer within \boxed{}.
|
|
``` |
|
|
- Use the default system prompt ("You are a helpful assistant.") |
|
|
- Extract the answer from the content of the last \\boxed{} in the response (see the sketch after this list).
|
|
- Temperature of 0.6 |
|
|
- Top-p of 0.95 |
|
|
- min_pixels = 262144 |
|
|
- max_pixels = 262144 |
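Below is a minimal sketch that builds the multiple-choice prompt from this template and reads off the final answer from the last \\boxed{} in the response. The helper names and example strings are illustrative, not taken from our evaluation code:

```python
import re
from typing import Optional

def build_mc_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question with the evaluation template above."""
    return "\n".join([question, *options, "",
                      "Please reason step-by-step, and put your final answer within \\boxed{}."])

def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

prompt = build_mc_prompt(
    "What orientation was the MRI in image B taken in?",
    ["A. Axial", "B. Coronal", "C. Sagittal", "D. Oblique"],
)
print(prompt)
print(extract_boxed_answer("... so the view is sagittal. \\boxed{C}"))  # -> "C"
```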
|
|
|
|
|
|
|
|
### Known Issues |
|
|
* The model is sensitive to the system prompt; we recommend keeping the default one.
|
|
* The model is fine-tuned for multiple-choice VQA. It may follow instructions for other tasks but has not been extensively tested or post-trained to do so.
|
|
|
|
|
We hope to address these issues in future iterations!
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our work helpful, feel free to cite us.
|
|
|
|
|
```bibtex
|
|
@article{ossowski2025octomed, |
|
|
title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning}, |
|
|
author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, Guanghui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung}, |
|
|
journal={arXiv preprint arXiv:2511.23269}, |
|
|
year={2025} |
|
|
} |
|
|
``` |