Instructions to use OctoMed/OctoMed-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use OctoMed/OctoMed-7B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="OctoMed/OctoMed-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B")
model = AutoModelForMultimodalLM.from_pretrained("OctoMed/OctoMed-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use OctoMed/OctoMed-7B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "OctoMed/OctoMed-7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OctoMed/OctoMed-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/OctoMed/OctoMed-7B

SGLang

How to use OctoMed/OctoMed-7B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "OctoMed/OctoMed-7B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OctoMed/OctoMed-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OctoMed/OctoMed-7B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OctoMed/OctoMed-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use OctoMed/OctoMed-7B with Docker Model Runner:
```
docker model run hf.co/OctoMed/OctoMed-7B
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

OctoMed-7B

Introduction

OctoMed-7B is a high-performance multimodal medical reasoning model created through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o and produced the largest multimodal medical reasoning dataset to date with more than 8 million traces and 6.8 billion response tokens.

Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks.

OctoMed-7B produces internal reasoning traces in <think>...</think> tokens before writing out its final answer. In general, the model has a tendency to think longer for harder or ill-defined questions, while sticking to shorter reasoning traces for easier queries.

Evaluation

Medical Benchmark Performances

Notes:

Green = OSS smaller models (<10B), Cyan = large proprietary models.
† = 10-sample majority vote ensemble result.

Legacy Medical Benchmark Performance

Dataset	Setting	Performance
VQA-RAD	Open (Token F1)	64.23
VQA-RAD	Closed (Accuracy)	85.66
SLAKE	Open (Token F1)	84.96
SLAKE	Closed (Accuracy)	89.66

We also train on the train splits of the VQA-RAD and SLAKE datasets and report the performances here. For these results, we apply a direct prompt by including the phrase Answer in a short word or phrase. at the end of each sample. GPT2 is used as the tokenizer to compute Token F1 for open-ended questions following prior work.

Requirements

We recommend installing the transformers version used in our experiments and other dependencies with this command:

pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14

Quickstart

Below, we provide a some examples to show how to use OctoMed-7B with 🤗 Transformers or vLLM.

Inference with HF Transformers 🤗

Here we show a code snippet to show you how chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "OctoMed/OctoMed-7B",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
min_pixels = 262144
max_pixels = 262144
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)

# Text-Only Query
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"},
#         ],
#     }
# ]

# General Query
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
#             },
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }
# ]

# Multiple Choice Query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
            },
            {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

        
inputs = inputs.to(device="cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Inference with vLLM

Here we show an example of how to use OctoMed with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):

from vllm import LLM, SamplingParams
from transformers import AutoProcessor

min_pixels = 262144
max_pixels = 262144
processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)

llm = LLM(
    model="OctoMed/OctoMed-7B",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.8,
    limit_mm_per_prompt={"image": 1}
)

# Set up sampling parameters
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
)

image_data = []

# Text-Only Query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."},
        ],
    }
]

# General Query
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": image_data[0],
#             },
#             {"type": "text", "text": "Describe this image."},
#         ],
#     }
# ]

# Multiple Choice Query
# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": image_data[0],
#             },
#             {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
#         ],
#     }
# ]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)

if image_data:
    mm_prompt = {
        "prompt": prompt,
        "multi_modal_data": {"image": image_data}
    }
else:
    mm_prompt = {"prompt": prompt}

# Generate response
outputs = llm.generate([mm_prompt], sampling_params)

# Print the generated response
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated text: {generated_text}")
    print("-" * 50)

Suggested Hyperparameters

We suggest using the same settings used in evaluation to reproduce results:

Format multiple choice questions with the following template:

{optional image(s)}
{question}
{options, 1 on each line}

Please reason step-by-step, and put your final answer within \\boxed{}.

Example Prompt:

{image(s)}
What orientation was the MRI in image B taken in?
A: Axial
B: Coronal
C: Sagittal
D: Oblique

Please reason step-by-step, and put your final answer within \\boxed{}.

Use the default system prompt ("You are a helpful assistant.")
Extract the answer by looking at the content within the last \boxed{}.
Temperature of 0.6
Top-p of 0.95
min_pixels = 262144
max_pixels = 262144

Known Issues

Model is sensitive to system prompt. We recommend using the default one.
The model is finetuned for multiple-choice VQA. The model may follow instructions for other tasks but is not extensively tested or post-trained to do so.

We hope to address these concerns moving forward in future iterations!

Citation

If you find our work helpful, feel free to give us a cite.

@article{ossowski2025octomed,
  title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning},
  author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, Guanghui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung},
  journal={arXiv preprint arXiv:2511.23269},
  year={2025}
}