OctoMed committed on
Commit 460cdf1 · verified · 1 Parent(s): c77ba93

Update README.md

Update citation info

Files changed (1)
  1. README.md +300 -300
README.md CHANGED
@@ -1,301 +1,301 @@
1
-
2
- ---
3
- license: apache-2.0
4
- language:
5
- - en
6
- pipeline_tag: image-text-to-text
7
- tags:
8
- - multimodal
9
- library_name: transformers
10
- base_model:
11
- - Qwen/Qwen2.5-VL-7B-Instruct
12
- ---
13
-
14
-
15
-
16
- # <img src="assets/OctoMed.svg" alt="OctoMed Logo" width="100" style="vertical-align:bottom; margin-right:0px;" /> OctoMed-7B
17
-
18
- ## Introduction
19
-
20
- OctoMed-7B is a high-performance multimodal medical reasoning model created through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o and produced the largest multimodal medical reasoning dataset to date with more than 8 million traces and 6.8 billion response tokens.
21
-
22
- Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks.
23
-
24
- OctoMed-7B produces internal reasoning traces in \<think>...\</think> tokens before writing out its final answer. In general, the model has a tendency to think longer for harder or ill-defined questions, while sticking to shorter reasoning traces for easier queries.
25
-
26
- ## Evaluation
27
-
28
- ### Medical Benchmark Performances
29
-
30
- <p align="center">
31
- <img src="assets/performances.svg" alt="Medical Benchmark Performances" width="100%" />
32
- </p>
33
-
34
- **Notes:**
35
- - Green = OSS smaller models (<10B), Cyan = large proprietary models.
36
- - † = 10-sample majority vote ensemble result.
37
-
38
- ### Legacy Medical Benchmark Performance
39
-
40
- | Dataset | Setting | Performance |
41
- |----------|---------|--------------|
42
- | VQA-RAD | Open (Token F1) | 64.23 |
43
- | VQA-RAD | Closed (Accuracy) | 85.66 |
44
- | SLAKE | Open (Token F1) | 84.96 |
45
- | SLAKE | Closed (Accuracy) | 89.66 |
46
-
47
- We also train on the train splits of the VQA-RAD and SLAKE datasets and report the performances here. For these results, we apply a **direct** prompt by including the phrase **Answer in a short word or phrase.** at the end of each sample. GPT2 is used as the tokenizer to compute Token F1 for open-ended questions following prior work.
48
-
49
-
50
- ## Requirements
51
- We recommend installing the transformers version used in our experiments and other dependencies with this command:
52
- ```
53
- pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14
54
- ```
55
-
56
- ## Quickstart
57
-
58
- Below, we provide a some examples to show how to use OctoMed-7B with 🤗 Transformers or vLLM.
59
-
60
- <details>
61
- <summary>Inference with HF Transformers 🤗</summary>
62
- Here we show a code snippet to show you how chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:
63
-
64
- ```python
65
- import torch
66
- from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
67
- from qwen_vl_utils import process_vision_info
68
-
69
- # default: Load the model on the available device(s)
70
- model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
71
- "OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto"
72
- )
73
-
74
- # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
75
- # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
76
- # "OctoMed/OctoMed-7B",
77
- # dtype=torch.bfloat16,
78
- # attn_implementation="flash_attention_2",
79
- # device_map="auto",
80
- # )
81
-
82
- # The default range for the number of visual tokens per image in the model is 4-16384.
83
- # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
84
- min_pixels = 262144
85
- max_pixels = 262144
86
- processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)
87
-
88
- # Text-Only Query
89
- # messages = [
90
- # {
91
- # "role": "user",
92
- # "content": [
93
- # {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"},
94
- # ],
95
- # }
96
- # ]
97
-
98
- # General Query
99
- # messages = [
100
- # {
101
- # "role": "user",
102
- # "content": [
103
- # {
104
- # "type": "image",
105
- # "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
106
- # },
107
- # {"type": "text", "text": "Describe this image."},
108
- # ],
109
- # }
110
- # ]
111
-
112
- # Multiple Choice Query
113
- messages = [
114
- {
115
- "role": "user",
116
- "content": [
117
- {
118
- "type": "image",
119
- "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
120
- },
121
- {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
122
- ],
123
- }
124
- ]
125
-
126
- # Preparation for inference
127
- text = processor.apply_chat_template(
128
- messages, tokenize=False, add_generation_prompt=True
129
- )
130
- image_inputs, video_inputs = process_vision_info(messages)
131
- inputs = processor(
132
- text=[text],
133
- images=image_inputs,
134
- videos=video_inputs,
135
- padding=True,
136
- return_tensors="pt",
137
- )
138
-
139
-
140
- inputs = inputs.to(device="cuda")
141
-
142
- # Inference: Generation of the output
143
- generated_ids = model.generate(**inputs, max_new_tokens=8192)
144
- generated_ids_trimmed = [
145
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
146
- ]
147
- output_text = processor.batch_decode(
148
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
149
- )
150
- print(output_text)
151
-
152
- ```
153
- </details>
154
-
155
- <details>
156
- <summary>Inference with vLLM</summary>
157
-
158
- Here we show an example of how to use OctoMed with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):
159
-
160
- ```python
161
- from vllm import LLM, SamplingParams
162
- from transformers import AutoProcessor
163
-
164
- min_pixels = 262144
165
- max_pixels = 262144
166
- processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)
167
-
168
- llm = LLM(
169
- model="OctoMed/OctoMed-7B",
170
- trust_remote_code=True,
171
- dtype="bfloat16",
172
- max_model_len=8192,
173
- tensor_parallel_size=4,
174
- gpu_memory_utilization=0.8,
175
- limit_mm_per_prompt={"image": 1}
176
- )
177
-
178
- # Set up sampling parameters
179
- sampling_params = SamplingParams(
180
- temperature=0.6,
181
- top_p=0.95,
182
- max_tokens=8192,
183
- )
184
-
185
- image_data = []
186
-
187
- # Text-Only Query
188
- messages = [
189
- {
190
- "role": "user",
191
- "content": [
192
- {"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."},
193
- ],
194
- }
195
- ]
196
-
197
- # General Query
198
- # image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
199
- # messages = [
200
- # {
201
- # "role": "user",
202
- # "content": [
203
- # {
204
- # "type": "image",
205
- # "image": image_data[0],
206
- # },
207
- # {"type": "text", "text": "Describe this image."},
208
- # ],
209
- # }
210
- # ]
211
-
212
- # Multiple Choice Query
213
- # image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
214
- # messages = [
215
- # {
216
- # "role": "user",
217
- # "content": [
218
- # {
219
- # "type": "image",
220
- # "image": image_data[0],
221
- # },
222
- # {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
223
- # ],
224
- # }
225
- # ]
226
-
227
- prompt = processor.apply_chat_template(
228
- messages, tokenize=False, add_generation_prompt=True)
229
-
230
- if image_data:
231
- mm_prompt = {
232
- "prompt": prompt,
233
- "multi_modal_data": {"image": image_data}
234
- }
235
- else:
236
- mm_prompt = {"prompt": prompt}
237
-
238
- # Generate response
239
- outputs = llm.generate([mm_prompt], sampling_params)
240
-
241
- # Print the generated response
242
- for output in outputs:
243
- prompt = output.prompt
244
- generated_text = output.outputs[0].text
245
- print(f"Prompt: {prompt}")
246
- print(f"Generated text: {generated_text}")
247
- print("-" * 50)
248
- ```
249
- </details>
250
-
251
-
252
-
253
- ### Suggested Hyperparameters
254
- We suggest using the same settings used in evaluation to reproduce results:
255
-
256
- Format multiple choice questions with the following template:
257
- ```
258
- {optional image(s)}
259
- {question}
260
- {options, 1 on each line}
261
-
262
- Please reason step-by-step, and put your final answer within \\boxed{}.
263
- ```
264
-
265
- Example Prompt:
266
- ```
267
- {image(s)}
268
- What orientation was the MRI in image B taken in?
269
- A: Axial
270
- B: Coronal
271
- C: Sagittal
272
- D: Oblique
273
-
274
- Please reason step-by-step, and put your final answer within \\boxed{}.
275
- ```
276
- - Use the default system prompt ("You are a helpful assistant.")
277
- - Extract the answer by looking at the content within the last \\boxed{}.
278
- - Temperature of 0.6
279
- - Top-p of 0.95
280
- - min_pixels = 262144
281
- - max_pixels = 262144
282
-
283
-
284
- ### Known Issues
285
- * Model is sensitive to system prompt. We recommend using the default one.
286
- * The model is finetuned for multiple-choice VQA. The model may follow instructions for other tasks but is not extensively tested or post-trained to do so.
287
- * Multi-turn conversation tasks are not part of the SFT training, and therefore may not be logically coherent.
288
-
289
-
290
- ## Citation
291
-
292
- If you find our work helpful, feel free to give us a cite.
293
-
294
- ```
295
- @article{OctoMed,
296
- title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning},
297
- author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, GuangHui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung},
298
- journal={arXiv preprint arXiv:2409.12191},
299
- year={2025}
300
- }
301
  ```
 
1
+
2
+ ---
3
+ license: apache-2.0
4
+ language:
5
+ - en
6
+ pipeline_tag: image-text-to-text
7
+ tags:
8
+ - multimodal
9
+ library_name: transformers
10
+ base_model:
11
+ - Qwen/Qwen2.5-VL-7B-Instruct
12
+ ---
13
+
14
+
15
+
16
+ # <img src="assets/OctoMed.svg" alt="OctoMed Logo" width="100" style="vertical-align:bottom; margin-right:0px;" /> OctoMed-7B
17
+
18
+ ## Introduction
19
+
20
+ OctoMed-7B is a high-performance multimodal medical reasoning model created through large-scale data curation and supervised fine-tuning (SFT). To support reliable clinical reasoning, we developed a scalable data pipeline that distills structured reasoning traces from DeepSeek-R1 and GPT-4o and produced the largest multimodal medical reasoning dataset to date with more than 8 million traces and 6.8 billion response tokens.
21
+
22
+ Using Qwen2.5-VL-7B-Instruct as the base model, OctoMed-7B is trained on this curated corpus and achieves strong, robust performance on a wide range of out-of-distribution medical benchmarks.
23
+
24
+ OctoMed-7B produces internal reasoning traces in \<think>...\</think> tokens before writing out its final answer. In general, the model has a tendency to think longer for harder or ill-defined questions, while sticking to shorter reasoning traces for easier queries.
25
+
26
+ ## Evaluation
27
+
28
+ ### Medical Benchmark Performances
29
+
30
+ <p align="center">
31
+ <img src="assets/performances.svg" alt="Medical Benchmark Performances" width="100%" />
32
+ </p>
33
+
34
+ **Notes:**
35
+ - Green = OSS smaller models (<10B), Cyan = large proprietary models.
36
+ - † = 10-sample majority vote ensemble result.
37
+
38
+ ### Legacy Medical Benchmark Performance
39
+
40
+ | Dataset | Setting | Performance |
41
+ |----------|---------|--------------|
42
+ | VQA-RAD | Open (Token F1) | 64.23 |
43
+ | VQA-RAD | Closed (Accuracy) | 85.66 |
44
+ | SLAKE | Open (Token F1) | 84.96 |
45
+ | SLAKE | Closed (Accuracy) | 89.66 |
46
+
47
+ We also train on the train splits of the VQA-RAD and SLAKE datasets and report the performances here. For these results, we apply a **direct** prompt by including the phrase **Answer in a short word or phrase.** at the end of each sample. GPT2 is used as the tokenizer to compute Token F1 for open-ended questions following prior work.
48
+
49
+
50
+ ## Requirements
51
+ We recommend installing the transformers version used in our experiments and other dependencies with this command:
52
+ ```
53
+ pip install transformers==4.57.1 accelerate==1.12.0 torchvision==0.24.1 qwen-vl-utils==0.0.14
54
+ ```
55
+
56
+ ## Quickstart
57
+
58
+ Below, we provide a some examples to show how to use OctoMed-7B with 🤗 Transformers or vLLM.
59
+
60
+ <details>
61
+ <summary>Inference with HF Transformers 🤗</summary>
62
+ Here we show a code snippet to show you how chat with OctoMed-7B using `transformers` and `qwen_vl_utils`:
63
+
64
+ ```python
65
+ import torch
66
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
67
+ from qwen_vl_utils import process_vision_info
68
+
69
+ # default: Load the model on the available device(s)
70
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
71
+ "OctoMed/OctoMed-7B", dtype=torch.bfloat16, device_map="auto"
72
+ )
73
+
74
+ # We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
75
+ # model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
76
+ # "OctoMed/OctoMed-7B",
77
+ # dtype=torch.bfloat16,
78
+ # attn_implementation="flash_attention_2",
79
+ # device_map="auto",
80
+ # )
81
+
82
+ # The default range for the number of visual tokens per image in the model is 4-16384.
83
+ # You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
84
+ min_pixels = 262144
85
+ max_pixels = 262144
86
+ processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)
87
+
88
+ # Text-Only Query
89
+ # messages = [
90
+ # {
91
+ # "role": "user",
92
+ # "content": [
93
+ # {"type": "text", "text": "I've had a persistent dry cough for two weeks but no fever. Could this be allergies, and when should I see a doctor?"},
94
+ # ],
95
+ # }
96
+ # ]
97
+
98
+ # General Query
99
+ # messages = [
100
+ # {
101
+ # "role": "user",
102
+ # "content": [
103
+ # {
104
+ # "type": "image",
105
+ # "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
106
+ # },
107
+ # {"type": "text", "text": "Describe this image."},
108
+ # ],
109
+ # }
110
+ # ]
111
+
112
+ # Multiple Choice Query
113
+ messages = [
114
+ {
115
+ "role": "user",
116
+ "content": [
117
+ {
118
+ "type": "image",
119
+ "image": "https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg",
120
+ },
121
+ {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
122
+ ],
123
+ }
124
+ ]
125
+
126
+ # Preparation for inference
127
+ text = processor.apply_chat_template(
128
+ messages, tokenize=False, add_generation_prompt=True
129
+ )
130
+ image_inputs, video_inputs = process_vision_info(messages)
131
+ inputs = processor(
132
+ text=[text],
133
+ images=image_inputs,
134
+ videos=video_inputs,
135
+ padding=True,
136
+ return_tensors="pt",
137
+ )
138
+
139
+
140
+ inputs = inputs.to(device="cuda")
141
+
142
+ # Inference: Generation of the output
143
+ generated_ids = model.generate(**inputs, max_new_tokens=8192)
144
+ generated_ids_trimmed = [
145
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
146
+ ]
147
+ output_text = processor.batch_decode(
148
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
149
+ )
150
+ print(output_text)
151
+
152
+ ```
153
+ </details>
154
+
155
+ <details>
156
+ <summary>Inference with vLLM</summary>
157
+
158
+ Here we show an example of how to use OctoMed with vLLM (tested with vLLM==0.11.2 and transformers==4.57.1):
159
+
160
+ ```python
161
+ from vllm import LLM, SamplingParams
162
+ from transformers import AutoProcessor
163
+
164
+ min_pixels = 262144
165
+ max_pixels = 262144
166
+ processor = AutoProcessor.from_pretrained("OctoMed/OctoMed-7B", min_pixels=min_pixels, max_pixels=max_pixels)
167
+
168
+ llm = LLM(
169
+ model="OctoMed/OctoMed-7B",
170
+ trust_remote_code=True,
171
+ dtype="bfloat16",
172
+ max_model_len=8192,
173
+ tensor_parallel_size=4,
174
+ gpu_memory_utilization=0.8,
175
+ limit_mm_per_prompt={"image": 1}
176
+ )
177
+
178
+ # Set up sampling parameters
179
+ sampling_params = SamplingParams(
180
+ temperature=0.6,
181
+ top_p=0.95,
182
+ max_tokens=8192,
183
+ )
184
+
185
+ image_data = []
186
+
187
+ # Text-Only Query
188
+ messages = [
189
+ {
190
+ "role": "user",
191
+ "content": [
192
+ {"type": "text", "text": "Explain the difference between type 1 and type 2 diabetes."},
193
+ ],
194
+ }
195
+ ]
196
+
197
+ # General Query
198
+ # image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
199
+ # messages = [
200
+ # {
201
+ # "role": "user",
202
+ # "content": [
203
+ # {
204
+ # "type": "image",
205
+ # "image": image_data[0],
206
+ # },
207
+ # {"type": "text", "text": "Describe this image."},
208
+ # ],
209
+ # }
210
+ # ]
211
+
212
+ # Multiple Choice Query
213
+ # image_data = ['https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51b2/10835941/13323b55fbb5/13256_2024_4349_Fig1_HTML.jpg']
214
+ # messages = [
215
+ # {
216
+ # "role": "user",
217
+ # "content": [
218
+ # {
219
+ # "type": "image",
220
+ # "image": image_data[0],
221
+ # },
222
+ # {"type": "text", "text": "What orientation was the MRI in image B taken in?\nA. Axial\nB. Coronal\nC. Sagittal\nD. Oblique\n\nPlease reason step-by-step, and put your final answer within \\boxed{}."},
223
+ # ],
224
+ # }
225
+ # ]
226
+
227
+ prompt = processor.apply_chat_template(
228
+ messages, tokenize=False, add_generation_prompt=True)
229
+
230
+ if image_data:
231
+ mm_prompt = {
232
+ "prompt": prompt,
233
+ "multi_modal_data": {"image": image_data}
234
+ }
235
+ else:
236
+ mm_prompt = {"prompt": prompt}
237
+
238
+ # Generate response
239
+ outputs = llm.generate([mm_prompt], sampling_params)
240
+
241
+ # Print the generated response
242
+ for output in outputs:
243
+ prompt = output.prompt
244
+ generated_text = output.outputs[0].text
245
+ print(f"Prompt: {prompt}")
246
+ print(f"Generated text: {generated_text}")
247
+ print("-" * 50)
248
+ ```
249
+ </details>
250
+
251
+
252
+
253
+ ### Suggested Hyperparameters
254
+ We suggest using the same settings used in evaluation to reproduce results:
255
+
256
+ Format multiple choice questions with the following template:
257
+ ```
258
+ {optional image(s)}
259
+ {question}
260
+ {options, 1 on each line}
261
+
262
+ Please reason step-by-step, and put your final answer within \\boxed{}.
263
+ ```
264
+
265
+ Example Prompt:
266
+ ```
267
+ {image(s)}
268
+ What orientation was the MRI in image B taken in?
269
+ A: Axial
270
+ B: Coronal
271
+ C: Sagittal
272
+ D: Oblique
273
+
274
+ Please reason step-by-step, and put your final answer within \\boxed{}.
275
+ ```
276
+ - Use the default system prompt ("You are a helpful assistant.")
277
+ - Extract the answer by looking at the content within the last \\boxed{}.
278
+ - Temperature of 0.6
279
+ - Top-p of 0.95
280
+ - min_pixels = 262144
281
+ - max_pixels = 262144
282
+
283
+
284
+ ### Known Issues
285
+ * Model is sensitive to system prompt. We recommend using the default one.
286
+ * The model is finetuned for multiple-choice VQA. The model may follow instructions for other tasks but is not extensively tested or post-trained to do so.
287
+ * Multi-turn conversation tasks are not part of the SFT training, and therefore may not be logically coherent.
288
+
289
+
290
+ ## Citation
291
+
292
+ If you find our work helpful, feel free to give us a cite.
293
+
294
+ ```
295
+ @article{OctoMed,
296
+ title={OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning},
297
+ author={Ossowski, Timothy and Zhang, Sheng and Liu, Qianchu and Qin, GuangHui and Tan, Reuben and Naumann, Tristan and Hu, Junjie and Poon, Hoifung},
298
+ journal={arXiv preprint arXiv:2511.23269},
299
+ year={2025}
300
+ }
301
  ```
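
The README in this diff recommends extracting the answer from the content of the last `\boxed{}` and reports some results as a 10-sample majority vote. The sketch below shows one way these two steps might be implemented. The helper names `extract_last_boxed` and `majority_vote` are hypothetical, and the tie-breaking and handling of unparseable responses are assumptions, not the authors' evaluation code.

```python
from collections import Counter
from typing import Iterable, Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in `text`, or None if absent.

    Braces are matched by depth so that content such as \\boxed{\\text{A}}
    is returned whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i].strip()
    return None  # unbalanced braces: treat as no answer


def majority_vote(responses: Iterable[str]) -> Optional[str]:
    """Most common boxed answer across sampled responses.

    Assumption: responses without a parseable answer are skipped, and ties
    go to the answer that appeared first.
    """
    answers = [a for a in map(extract_last_boxed, responses) if a]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    samples = [
        "<think>...</think> The plane is \\boxed{A}",
        "Maybe \\boxed{B}, but on reflection \\boxed{A}",
        "\\boxed{C}",
    ]
    print(majority_vote(samples))
```

With the README's sampling settings (temperature 0.6, top-p 0.95), `majority_vote` would be applied to the generated texts for one question; the three strings above are made-up stand-ins for model outputs.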