chaoyinshe committed · Commit 8609fde (verified) · Parent: b78ccff

Update README.md

Files changed (1): README.md (+180 −3)
---
license: apache-2.0
language:
- zh
- en
metrics:
- bertscore
- bleu
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- medical
---

# EchoVLM (paper implementation)

Official PyTorch implementation of the model described in
**"[EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence](https://arxiv.org/abs/2509.14977)"**.

## 🤖 Model Details

| Item    | Value                                                 |
|---------|-------------------------------------------------------|
| Paper   | [arXiv:2509.14977](https://arxiv.org/abs/2509.14977)  |
| Authors | Chaoyin She¹, Ruifang Lu²                             |
| Code    | [GitHub repo](https://github.com/Asunatan/EchoVLM)    |

## 🚀 Quick Start
### Using 🤗 Transformers to Chat

The following code snippet shows how to chat with the model using `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2VLMOEForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# ===== 1. Load model & processor =====
model = Qwen2VLMOEForConditionalGeneration.from_pretrained(
    "chaoyinshe/EchoVLM",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # faster & memory-efficient
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("chaoyinshe/EchoVLM")
# The default range for the number of visual tokens per image is 4-16384.
# You can set min_pixels and max_pixels according to your needs, e.g. a
# token range of 256-1280, to balance performance and cost:
# min_pixels = 256 * 28 * 28
# max_pixels = 1280 * 28 * 28
# processor = AutoProcessor.from_pretrained(
#     "chaoyinshe/EchoVLM", min_pixels=min_pixels, max_pixels=max_pixels
# )
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/ultrasound_image.jpg",  # local path or URL
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
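
The `image` value above is a placeholder. Assuming `qwen_vl_utils` behaves as it does for the Qwen2-VL family EchoVLM is built on, the field accepts a local path, a `file://` URI, an `http(s)` URL, or a `PIL.Image` object. A minimal sketch with a PIL image (the file path is a placeholder):

```python
from PIL import Image

# Load a local ultrasound frame and pass the PIL object directly
# instead of a path string.
frame = Image.open("/path/to/ultrasound_image.jpg").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": frame},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```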
<details>
<summary>Multi image inference</summary>

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/ultrasound_image_1.jpg"},
            {"type": "image", "image": "file:///path/to/ultrasound_image_2.jpg"},
            {"type": "text", "text": "帮我给出超声报告"},  # "Please write the ultrasound report for me."
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</details>
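
A complete ultrasound report will usually not fit in 128 new tokens. `max_new_tokens` and the sampling controls are standard `generate()` arguments, so they can be raised as in the sketch below; the specific values are illustrative, not tuned for EchoVLM:

```python
# Allow a longer report; mild nucleus sampling adds variety, while
# do_sample=False (greedy decoding) is the more reproducible choice.
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
```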
<details>
<summary>Batch inference</summary>

```python
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "This patient has a hypoechoic nodule in the left breast. What is the next step in treatment?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
</details>
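
For interactive use, tokens can be streamed as they are generated instead of waiting for the full output. This is a sketch using `TextStreamer`, which is stock `transformers` rather than anything EchoVLM-specific, and it applies to the single-request snippets above (streaming assumes batch size 1):

```python
from transformers import TextStreamer

# Print tokens to stdout as they arrive, skipping the echoed prompt.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=512)
```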