--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- multimodal |
|
|
library_name: transformers |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
# <img src="assets/OctoMed.svg" alt="OctoMed Logo" width="100" style="vertical-align:bottom; margin-right:0px;" /> OctoMed-7B |
|
|
|
|
|
## Introduction |
|
|
|
|
|
OctoMed-7B is a high-performance multimodal medical reasoning model created through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o, producing the largest multimodal medical reasoning dataset to date: more than 8 million traces and 6.8 billion response tokens.
|
|
|
|
|
Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks. |
|
|
|
|
|
OctoMed-7B produces an internal reasoning trace inside \<think>...\</think> tags before writing out its final answer. In general, the model tends to reason longer on harder or ill-defined questions and keeps its reasoning traces short for easier queries.
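After decoding, the reasoning trace can be separated from the final answer with a simple string split. Below is a minimal sketch (not part of our released code), assuming the decoded text contains the literal \<think>...\</think> tags; the example response string is illustrative:

```python
import re

def split_reasoning(output_text: str):
    """Split a decoded OctoMed-7B response into (reasoning, answer)."""
    # Assumes the reasoning trace is wrapped in literal <think>...</think> tags;
    # if no tags are found, the whole response is treated as the final answer.
    match = re.search(r"<think>(.*?)</think>", output_text, flags=re.DOTALL)
    if match is None:
        return "", output_text.strip()
    return match.group(1).strip(), output_text[match.end():].strip()

# Hypothetical decoded output, e.g. output_text[0] from the Transformers example below
response = "<think>Image B shows a side view of the spine...</think>\nThe view is sagittal: \\boxed{C}"
reasoning, answer = split_reasoning(response)
print(answer)  # -> "The view is sagittal: \boxed{C}"
```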
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Medical Benchmark Performances |
|
|
|
|
|
<p align="center"> |
|
|
<img src="assets/performances.svg" alt="Medical Benchmark Performances" width="100%" /> |
|
|
</p> |
|
|
|
|
|
**Notes:** |
|
|
- Green = smaller open-source models (<10B), Cyan = large proprietary models.
|
|
- † = result of a 10-sample majority-vote ensemble (see the sketch below these notes).
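The † results follow a standard majority-vote recipe: sample several responses per question (e.g. with the sampling settings suggested below) and keep the most common extracted answer. A minimal sketch, assuming the per-sample answers have already been extracted from the final \\boxed{}; the values are illustrative:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among the sampled responses.

    `answers` is assumed to hold the option letter extracted from the final
    \\boxed{} of each sampled response; ties resolve to the first-seen answer.
    """
    return Counter(answers).most_common(1)[0][0]

# Illustrative values for one question with 10 sampled responses
print(majority_vote(["C", "C", "A", "C", "C", "B", "C", "C", "A", "C"]))  # -> "C"
```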
|
|
|
|
|
### Legacy Medical Benchmark Performance |
|
|
|
|
|
| Dataset | Setting | Performance | |
|
|
|----------|---------|--------------| |
|
|
| VQA-RAD | Open (Token F1) | 64.23 | |
|
|
| VQA-RAD | Closed (Accuracy) | 85.66 | |
|
|
| SLAKE | Open (Token F1) | 84.96 | |
|
|
| SLAKE | Closed (Accuracy) | 89.66 | |
|
|
|
|
|
We also train on the train splits of the VQA-RAD and SLAKE datasets and report the resulting performance here. For these results, we apply a **direct** prompt by appending the phrase **Answer in a short word or phrase.** to each sample. Following prior work, GPT-2 is used as the tokenizer to compute Token F1 for open-ended questions.
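For reference, here is a minimal sketch of an overlap-based Token F1 using the GPT-2 tokenizer. It is an approximation of the metric described above; the exact text normalization (e.g. lower-casing) used in our evaluation may differ:

```python
from collections import Counter
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def token_f1(prediction: str, reference: str) -> float:
    """Overlap-based F1 between the GPT-2 token multisets of prediction and reference."""
    pred_tokens = tokenizer.tokenize(prediction.lower())
    ref_tokens = tokenizer.tokenize(reference.lower())
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("left lower lobe", "the left lower lobe"))
```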
|
|
|
|
|
|
|
|
## Requirements |
|
|
We recommend installing the transformers version used in our experiments, along with the other dependencies, with this command:
|
|
``` |
|
|
pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14 |
|
|
``` |
|
|
|
|
|
## Quickstart |
|
|
|
|
|
Below, we provide some examples of how to use OctoMed-7B with 🤗 Transformers or vLLM.
|
|
|
|
|
<details> |
|
|
<summary>Inference with HF Transformers 🤗</summary> |
|
|
Here is a code snippet showing how to chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor |
|
|
from qwen_vl_utils import process_vision_info |
|
|
|
|
|
# default: Load the model on the available device(s) |
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
"OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto" |
|
|
) |
|
|
|
|
|
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios. |
|
|
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
|
# "OctoMed/OctoMed-7B", |
|
|
# dtype=torch.bfloat16, |
|
|
# attn_implementation="flash_attention_2", |
|
|
# device_map="auto", |
|
|
# ) |
|
|
|
|
|
# The default range for the number of visual tokens per image in the model is 4-16384. |
|
|
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost. |
|
|
min_pixels = 262144 |
|
|
max_pixels = 262144 |
|
|
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels) |
|
|
|
|
|
# Text-Only Query |
|
|
# messages = [ |
|
|
# { |
|
|
# "role": "user", |
|
|
# "content": [ |
|
|
# {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"}, |
|
|
# ], |
|
|
# } |
|
|
# ] |
|
|
|
|
|
# General Query |
|
|
# messages = [ |
|
|
# { |
|
|
# "role": "user", |
|
|
# "content": [ |
|
|
# { |
|
|
# "type": "image", |
|
|
# "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg", |
|
|
# }, |
|
|
# {"type": "text", "text": "Describe this image."}, |
|
|
# ], |
|
|
# } |
|
|
# ] |
|
|
|
|
|
# Multiple Choice Query |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{ |
|
|
"type": "image", |
|
|
"image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg", |
|
|
}, |
|
|
{"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."}, |
|
|
], |
|
|
} |
|
|
] |
|
|
|
|
|
# Preparation for inference |
|
|
text = processor.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True |
|
|
) |
|
|
image_inputs, video_inputs = process_vision_info(messages) |
|
|
inputs = processor( |
|
|
text=[text], |
|
|
images=image_inputs, |
|
|
videos=video_inputs, |
|
|
padding=True, |
|
|
return_tensors="pt", |
|
|
) |
|
|
|
|
|
|
|
|
inputs = inputs.to(device="cuda") |
|
|
|
|
|
# Inference: Generation of the output |
|
|
generated_ids = model.generate(**inputs, max_new_tokens=8192) |
|
|
generated_ids_trimmed = [ |
|
|
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
|
] |
|
|
output_text = processor.batch_decode( |
|
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
|
) |
|
|
print(output_text) |
|
|
|
|
|
``` |
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Inference with vLLM</summary> |
|
|
|
|
|
Here we show an example of how to use OctoMed-7B with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):
|
|
|
|
|
```python |
|
|
from vllm import LLM, SamplingParams |
|
|
from transformers import AutoProcessor |
|
|
|
|
|
min_pixels = 262144 |
|
|
max_pixels = 262144 |
|
|
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels) |
|
|
|
|
|
llm = LLM( |
|
|
model="OctoMed/OctoMed-7B", |
|
|
trust_remote_code=True, |
|
|
dtype="bfloat16", |
|
|
max_model_len=8192, |
|
|
tensor_parallel_size=4, |
|
|
gpu_memory_utilization=0.8, |
|
|
limit_mm_per_prompt={"image": 1} |
|
|
) |
|
|
|
|
|
# Set up sampling parameters |
|
|
sampling_params = SamplingParams( |
|
|
temperature=0.6, |
|
|
top_p=0.95, |
|
|
max_tokens=8192, |
|
|
) |
|
|
|
|
|
image_data = [] |
|
|
|
|
|
# Text-Only Query |
|
|
messages = [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."}, |
|
|
], |
|
|
} |
|
|
] |
|
|
|
|
|
# General Query |
|
|
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg'] |
|
|
# messages = [ |
|
|
# { |
|
|
# "role": "user", |
|
|
# "content": [ |
|
|
# { |
|
|
# "type": "image", |
|
|
# "image": image_data[0], |
|
|
# }, |
|
|
# {"type": "text", "text": "Describe this image."}, |
|
|
# ], |
|
|
# } |
|
|
# ] |
|
|
|
|
|
# Multiple Choice Query |
|
|
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg'] |
|
|
# messages = [ |
|
|
# { |
|
|
# "role": "user", |
|
|
# "content": [ |
|
|
# { |
|
|
# "type": "image", |
|
|
# "image": image_data[0], |
|
|
# }, |
|
|
# {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."}, |
|
|
# ], |
|
|
# } |
|
|
# ] |
|
|
|
|
|
prompt = processor.apply_chat_template( |
|
|
messages, tokenize=False, add_generation_prompt=True) |
|
|
|
|
|
if image_data: |
|
|
mm_prompt = { |
|
|
"prompt": prompt, |
|
|
"multi_modal_data": {"image": image_data} |
|
|
} |
|
|
else: |
|
|
mm_prompt = {"prompt": prompt} |
|
|
|
|
|
# Generate response |
|
|
outputs = llm.generate([mm_prompt], sampling_params) |
|
|
|
|
|
# Print the generated response |
|
|
for output in outputs: |
|
|
prompt = output.prompt |
|
|
generated_text = output.outputs[0].text |
|
|
print(f"Prompt: {prompt}") |
|
|
print(f"Generated text: {generated_text}") |
|
|
print("-" * 50) |
|
|
``` |
|
|
</details> |
|
|
|
|
|
|
|
|
|
|
|
### Suggested Hyperparameters |
|
|
To reproduce our results, we suggest using the same settings we used in evaluation:
|
|
|
|
|
Format multiple choice questions with the following template: |
|
|
``` |
|
|
{optional image(s)} |
|
|
{question} |
|
|
{options, 1 on each line} |
|
|
|
|
|
Please reason step-by-step, and put your final answer within \boxed{}.
|
|
``` |
|
|
|
|
|
Example Prompt: |
|
|
``` |
|
|
{image(s)} |
|
|
What orientation was the MRI in image B taken in? |
|
|
A. Axial


B. Coronal


C. Sagittal


D. Oblique
|
|
|
|
|
Please reason step-by-step, and put your final answer within \boxed{}.
|
|
``` |
|
|
- Use the default system prompt ("You are a helpful assistant.") |
|
|
- Extract the answer from the content of the last \\boxed{} in the response (see the sketch after this list).
|
|
- Temperature of 0.6 |
|
|
- Top-p of 0.95 |
|
|
- min_pixels = 262144 |
|
|
- max_pixels = 262144 |
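Below is a minimal sketch that builds the multiple-choice prompt from this template and reads off the final answer from the last \\boxed{} in the response. The helper names and example strings are illustrative, not taken from our evaluation code:

```python
import re
from typing import Optional

def build_mc_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question with the evaluation template above."""
    return "\n".join([question, *options, "",
                      "Please reason step-by-step, and put your final answer within \\boxed{}."])

def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

prompt = build_mc_prompt(
    "What orientation was the MRI in image B taken in?",
    ["A. Axial", "B. Coronal", "C. Sagittal", "D. Oblique"],
)
print(prompt)
print(extract_boxed_answer("... so the view is sagittal. \\boxed{C}"))  # -> "C"
```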
|
|
|
|
|
|
|
|
### Known Issues |
|
|
* The model is sensitive to the system prompt; we recommend keeping the default one.
|
|
* The model is fine-tuned for multiple-choice VQA. It may follow instructions for other tasks but has not been extensively tested or post-trained to do so.
|
|
|
|
|
We hope to address these issues in future iterations!
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our work helpful, feel free to cite us.
|
|
|
|
|
```bibtex
|
|
@article{ossowski2025octomed, |
|
|
title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning}, |
|
|
author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, Guanghui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung}, |
|
|
journal={arXiv preprint arXiv:2511.23269}, |
|
|
year={2025} |
|
|
} |
|
|
``` |