@@ -1,301 +1,301 @@
----
-license: apache-2.0
-language:
-- en
-pipeline_tag: image-text-to-text
-tags:
-- multimodal
-library_name: transformers
-base_model:
-- Qwen/Qwen2.5-VL-7B-Instruct
----
-# <img src="assets/OctoMed.svg" alt="OctoMed Logo" width="100" style="vertical-align:bottom; margin-right:0px;" /> OctoMed-7B
-## Introduction
-OctoMed-7B is a high-performance multimodal medical reasoning model created through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o and produced the largest multimodal medical reasoning dataset to date with more than 8 million traces and 6.8 billion response tokens.
-Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks.
-OctoMed-7B produces internal reasoning traces in \<think>...\</think> tokens before writing out its final answer. In general, the model has a tendency to think longer for harder or ill-defined questions, while sticking to shorter reasoning traces for easier queries.
-## Evaluation
-### Medical Benchmark Performances
-<p align="center">
-    <img src="assets/performances.svg" alt="Medical Benchmark Performances" width="100%" />
-</p>
-**Notes:**
-- Green = OSS smaller models (<10B), Cyan = large proprietary models.
-- † = 10-sample majority vote ensemble result.
-### Legacy Medical Benchmark Performance
-| Dataset  | Setting | Performance |
-|----------|---------|--------------|
-| VQA-RAD  | Open (Token F1)    | 64.23        |
-| VQA-RAD  | Closed (Accuracy)  | 85.66        |
-| SLAKE    | Open (Token F1)   | 84.96        |
-| SLAKE    | Closed (Accuracy) | 89.66        |
-We also train on the train splits of the VQA-RAD and SLAKE datasets and report the performances here. For these results, we apply a **direct** prompt by including the phrase **Answer in a short word or phrase.** at the end of each sample. GPT2 is used as the tokenizer to compute Token F1 for open-ended questions following prior work.
-## Requirements
-We recommend installing the transformers version used in our experiments and other dependencies with this command:
-```
-pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14
-```
-## Quickstart
-Below, we provide a some examples to show how to use OctoMed-7B with 🤗 Transformers or vLLM.
-<details>
-<summary>Inference with HF Transformers 🤗</summary>
-Here we show a code snippet to show you how chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:
-```python
-import torch
-from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
-from qwen_vl_utils import process_vision_info
-# default: Load the model on the available device(s)
-model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-    "OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto"
-)
-# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
-# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-#     "OctoMed/OctoMed-7B",
-#     dtype=torch.bfloat16,
-#     attn_implementation="flash_attention_2",
-#     device_map="auto",
-# )
-# The default range for the number of visual tokens per image in the model is 4-16384.
-# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
-min_pixels = 262144
-max_pixels = 262144
-processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)
-# Text-Only Query
-# messages = [
-#     {
-#         "role": "user",
-#         "content": [
-#             {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"},
-#         ],
-#     }
-# ]
-# General Query
-# messages = [
-#     {
-#         "role": "user",
-#         "content": [
-#             {
-#                 "type": "image",
-#                 "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
-#             },
-#             {"type": "text", "text": "Describe this image."},
-#         ],
-#     }
-# ]
-# Multiple Choice Query
-messages = [
-    {
-        "role": "user",
-        "content": [
-            {
-                "type": "image",
-                "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
-            },
-            {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
-        ],
-    }
-]
-# Preparation for inference
-text = processor.apply_chat_template(
-    messages, tokenize=False, add_generation_prompt=True
-)
-image_inputs, video_inputs = process_vision_info(messages)
-inputs = processor(
-    text=[text],
-    images=image_inputs,
-    videos=video_inputs,
-    padding=True,
-    return_tensors="pt",
-)
-inputs = inputs.to(device="cuda")
-# Inference: Generation of the output
-generated_ids = model.generate(**inputs, max_new_tokens=8192)
-generated_ids_trimmed = [
-    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
-]
-output_text = processor.batch_decode(
-    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
-)
-print(output_text)
-```
-</details>
-<details>
-<summary>Inference with vLLM</summary>
-Here we show an example of how to use OctoMed with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):
-```python
-from vllm import LLM, SamplingParams
-from transformers import AutoProcessor
-min_pixels = 262144
-max_pixels = 262144
-processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)
-llm = LLM(
-    model="OctoMed/OctoMed-7B",
-    trust_remote_code=True,
-    dtype="bfloat16",
-    max_model_len=8192,
-    tensor_parallel_size=4,
-    gpu_memory_utilization=0.8,
-    limit_mm_per_prompt={"image": 1}
-)
-# Set up sampling parameters
-sampling_params = SamplingParams(
-    temperature=0.6,
-    top_p=0.95,
-    max_tokens=8192,
-)
-image_data = []
-# Text-Only Query
-messages = [
-    {
-        "role": "user",
-        "content": [
-            {"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."},
-        ],
-    }
-]
-# General Query
-# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
-# messages = [
-#     {
-#         "role": "user",
-#         "content": [
-#             {
-#                 "type": "image",
-#                 "image": image_data[0],
-#             },
-#             {"type": "text", "text": "Describe this image."},
-#         ],
-#     }
-# ]
-# Multiple Choice Query
-# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
-# messages = [
-#     {
-#         "role": "user",
-#         "content": [
-#             {
-#                 "type": "image",
-#                 "image": image_data[0],
-#             },
-#             {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
-#         ],
-#     }
-# ]
-prompt = processor.apply_chat_template(
-    messages, tokenize=False, add_generation_prompt=True)
-if image_data:
-    mm_prompt = {
-        "prompt": prompt,
-        "multi_modal_data": {"image": image_data}
-    }
-else:
-    mm_prompt = {"prompt": prompt}
-# Generate response
-outputs = llm.generate([mm_prompt], sampling_params)
-# Print the generated response
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt}")
-    print(f"Generated text: {generated_text}")
-    print("-" * 50)
-```
-</details>
-### Suggested Hyperparameters
-We suggest using the same settings used in evaluation to reproduce results:
-Format multiple choice questions with the following template:
-```
-{optional image(s)}
-{question}
-{options, 1 on each line}
-Please reason step-by-step, and put your final answer within \\boxed{}.
-```
-Example Prompt:
-```
-{image(s)}
-What orientation was the MRI in image B taken in?
-A: Axial
-B: Coronal
-C: Sagittal
-D: Oblique
-Please reason step-by-step, and put your final answer within \\boxed{}.
-```
-- Use the default system prompt ("You are a helpful assistant.")
-- Extract the answer by looking at the content within the last \\boxed{}.
-- Temperature of 0.6
-- Top-p of 0.95
-- min_pixels = 262144
-- max_pixels = 262144
-### Known Issues
-* Model is sensitive to system prompt. We recommend using the default one.
-* The model is finetuned for multiple-choice VQA. The model may follow instructions for other tasks but is not extensively tested or post-trained to do so.
-* Multi-turn conversation tasks are not part of the SFT training, and therefore may not be logically coherent.
-## Citation
-If you find our work helpful, feel free to give us a cite.
-```
-@article{OctoMed,
-  title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning},
-  author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, GuangHui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung},
-  journal={arXiv preprint arXiv:2409.12191},
-  year={2025}
-}
 ```

+---
+license: apache-2.0
+language:
+- en
+pipeline_tag: image-text-to-text
+tags:
+- multimodal
+library_name: transformers
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
+---
+# <img src="assets/OctoMed.svg" alt="OctoMed Logo" width="100" style="vertical-align:bottom; margin-right:0px;" /> OctoMed-7B
+## Introduction
+OctoMed-7B is a high-performance multimodal medical reasoning model created through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o and produced the largest multimodal medical reasoning dataset to date with more than 8 million traces and 6.8 billion response tokens.
+Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks.
+OctoMed-7B produces internal reasoning traces in \<think>...\</think> tokens before writing out its final answer. In general, the model has a tendency to think longer for harder or ill-defined questions, while sticking to shorter reasoning traces for easier queries.
+## Evaluation
+### Medical Benchmark Performances
+<p align="center">
+    <img src="assets/performances.svg" alt="Medical Benchmark Performances" width="100%" />
+</p>
+**Notes:**
+- Green = OSS smaller models (<10B), Cyan = large proprietary models.
+- † = 10-sample majority vote ensemble result.
+### Legacy Medical Benchmark Performance
+| Dataset  | Setting | Performance |
+|----------|---------|--------------|
+| VQA-RAD  | Open (Token F1)    | 64.23        |
+| VQA-RAD  | Closed (Accuracy)  | 85.66        |
+| SLAKE    | Open (Token F1)   | 84.96        |
+| SLAKE    | Closed (Accuracy) | 89.66        |
+We also train on the train splits of the VQA-RAD and SLAKE datasets and report the performances here. For these results, we apply a **direct** prompt by including the phrase **Answer in a short word or phrase.** at the end of each sample. GPT2 is used as the tokenizer to compute Token F1 for open-ended questions following prior work.
+## Requirements
+We recommend installing the transformers version used in our experiments and other dependencies with this command:
+```
+pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14
+```
+## Quickstart
+Below, we provide some examples showing how to use OctoMed-7B with 🤗 Transformers or vLLM.
+<details>
+<summary>Inference with HF Transformers 🤗</summary>
+The following code snippet shows how to chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:
+```python
+import torch
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+# default: Load the model on the available device(s)
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    "OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto"
+)
+# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
+# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+#     "OctoMed/OctoMed-7B",
+#     dtype=torch.bfloat16,
+#     attn_implementation="flash_attention_2",
+#     device_map="auto",
+# )
+# The default range for the number of visual tokens per image in the model is 4-16384.
+# You can set min_pixels and max_pixels according to your needs to balance performance and cost.
+# Here both are fixed to 262144 pixels (512 x 512), the value listed under Suggested Hyperparameters.
+min_pixels = 262144
+max_pixels = 262144
+processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)
+# Text-Only Query
+# messages = [
+#     {
+#         "role": "user",
+#         "content": [
+#             {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"},
+#         ],
+#     }
+# ]
+# General Query
+# messages = [
+#     {
+#         "role": "user",
+#         "content": [
+#             {
+#                 "type": "image",
+#                 "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
+#             },
+#             {"type": "text", "text": "Describe this image."},
+#         ],
+#     }
+# ]
+# Multiple Choice Query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
+            },
+            {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
+        ],
+    }
+]
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+# Move the inputs to the device the model was loaded on (device_map="auto" decides this)
+inputs = inputs.to(model.device)
+# Inference: Generation of the output
+generated_ids = model.generate(**inputs, max_new_tokens=8192)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
+</details>
+<details>
+<summary>Inference with vLLM</summary>
+Here we show an example of how to use OctoMed with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):
+```python
+from vllm import LLM, SamplingParams
+from transformers import AutoProcessor
+min_pixels = 262144
+max_pixels = 262144
+processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)
+llm = LLM(
+    model="OctoMed/OctoMed-7B",
+    trust_remote_code=True,
+    dtype="bfloat16",
+    max_model_len=8192,
+    tensor_parallel_size=4,
+    gpu_memory_utilization=0.8,
+    limit_mm_per_prompt={"image": 1}
+)
+# Set up sampling parameters
+sampling_params = SamplingParams(
+    temperature=0.6,
+    top_p=0.95,
+    max_tokens=8192,
+)
+image_data = []
+# Text-Only Query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."},
+        ],
+    }
+]
+# General Query
+# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
+# messages = [
+#     {
+#         "role": "user",
+#         "content": [
+#             {
+#                 "type": "image",
+#                 "image": image_data[0],
+#             },
+#             {"type": "text", "text": "Describe this image."},
+#         ],
+#     }
+# ]
+# Multiple Choice Query
+# image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
+# messages = [
+#     {
+#         "role": "user",
+#         "content": [
+#             {
+#                 "type": "image",
+#                 "image": image_data[0],
+#             },
+#             {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
+#         ],
+#     }
+# ]
+prompt = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True)
+if image_data:
+    mm_prompt = {
+        "prompt": prompt,
+        "multi_modal_data": {"image": image_data}
+    }
+else:
+    mm_prompt = {"prompt": prompt}
+# Generate response
+outputs = llm.generate([mm_prompt], sampling_params)
+# Print the generated response
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt}")
+    print(f"Generated text: {generated_text}")
+    print("-" * 50)
+```
+</details>
+### Suggested Hyperparameters
+We suggest using the same settings used in evaluation to reproduce results:
+Format multiple choice questions with the following template:
+```
+{optional image(s)}
+{question}
+{options, 1 on each line}
+Please reason step-by-step, and put your final answer within \\boxed{}.
+```
+Example Prompt:
+```
+{image(s)}
+What orientation was the MRI in image B taken in?
+A: Axial
+B: Coronal
+C: Sagittal
+D: Oblique
+Please reason step-by-step, and put your final answer within \\boxed{}.
+```
+- Use the default system prompt ("You are a helpful assistant.")
+- Extract the answer by looking at the content within the last \\boxed{}.
+- Temperature of 0.6
+- Top-p of 0.95
+- min_pixels = 262144
+- max_pixels = 262144
+### Known Issues
+* The model is sensitive to the system prompt. We recommend using the default one.
+* The model is finetuned for multiple-choice VQA. The model may follow instructions for other tasks but is not extensively tested or post-trained to do so.
+* Multi-turn conversations were not part of SFT training, so responses in multi-turn settings may not be logically coherent.
+## Citation
+If you find our work helpful, please consider citing it.
+```
+@article{OctoMed,
+  title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning},
+  author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, GuangHui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung},
+  journal={arXiv preprint arXiv:2511.23269},
+  year={2025}
+}
 ```
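
The suggested settings in the model card say to take the answer from the content of the last `\boxed{}`, and the benchmark notes mention a 10-sample majority vote. The sketch below shows one way these two steps might be implemented. It is not taken from the OctoMed evaluation code: the helper names `extract_boxed` and `majority_vote` are hypothetical, and the tie-breaking rule (first answer seen wins) is an assumption.

```python
from collections import Counter


def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never closed


def majority_vote(responses):
    """Pick the most common boxed answer across sampled responses.

    Responses without a boxed answer are ignored. Ties go to the answer
    that appeared first (an assumption, not the authors' stated rule).
    """
    answers = [a for a in (extract_boxed(r) for r in responses) if a]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    samples = [
        "<think>Could be \\boxed{A}...</think> The answer is \\boxed{C}.",
        "The answer is \\boxed{C}.",
        "The answer is \\boxed{B}.",
    ]
    print(majority_vote(samples))
```

Scanning for the matching brace rather than using a simple regular expression keeps nested braces such as `\boxed{\text{B}}` intact.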