chaoyinshe committed
Commit ca9f586 · verified · 1 Parent(s): 4aff295

Update README.md

Files changed (1)
  1. README.md +5 -150
README.md CHANGED
@@ -30,159 +30,14 @@ Official PyTorch implementation of the model described in
  | Model Hub | [Hugging Face](https://huggingface.co/chaoyinshe/EchoVLM) |
 
  ## 🔄 Updates
 
  - **Sep 19, 2025**: Released model weights on [Hugging Face](https://huggingface.co/chaoyinshe/EchoVLM).
- - **Sep 17, 2025**: Paper published on [arXiv](https://arxiv.org/abs/2509.14977).
- - **Coming soon**: V2 with Chain-of-Thought reasoning and reinforcement learning enhancements.
 
  ## 🚀 Quick Start
- ### Using 🤗 Transformers to Chat
-
- Below is a code snippet showing how to chat with the model using `transformers` and `qwen_vl_utils`:
-
- ```python
- from transformers import Qwen2VLMOEForConditionalGeneration, AutoProcessor
- from qwen_vl_utils import process_vision_info
- import torch
-
- # ===== 1. Load model & processor =====
- model = Qwen2VLMOEForConditionalGeneration.from_pretrained(
-     "chaoyinshe/EchoVLM",
-     torch_dtype=torch.bfloat16,
-     attn_implementation="flash_attention_2",  # faster & memory-efficient
-     device_map="auto",
- )
- processor = AutoProcessor.from_pretrained("chaoyinshe/EchoVLM")
- # The default range for the number of visual tokens per image is 4-16384.
- # You can set min_pixels and max_pixels according to your needs, e.g. a token
- # range of 256-1280, to balance performance and cost:
- # min_pixels = 256*28*28
- # max_pixels = 1280*28*28
- # processor = AutoProcessor.from_pretrained("chaoyinshe/EchoVLM", min_pixels=min_pixels, max_pixels=max_pixels)
- messages = [
-     {
-         "role": "user",
-         "content": [
-             {
-                 "type": "image",
-                 "image": "An ultrasound image",
-             },
-             {"type": "text", "text": "Describe this image."},
-         ],
-     }
- ]
-
- # Preparation for inference
- text = processor.apply_chat_template(
-     messages, tokenize=False, add_generation_prompt=True
- )
- image_inputs, video_inputs = process_vision_info(messages)
- inputs = processor(
-     text=[text],
-     images=image_inputs,
-     videos=video_inputs,
-     padding=True,
-     return_tensors="pt",
- )
- inputs = inputs.to("cuda")
-
- # Inference: generate the output
- generated_ids = model.generate(**inputs, max_new_tokens=128)
- generated_ids_trimmed = [
-     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
- ]
- output_text = processor.batch_decode(
-     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
- )
- print(output_text)
- ```
- <details>
- <summary>Multi-image inference</summary>
-
- ```python
- # Messages containing multiple images and a text query
- messages = [
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "image": "ultrasound image 1"},
-             {"type": "image", "image": "ultrasound image 2"},
-             {"type": "text", "text": "Please write an ultrasound report for me."},
-         ],
-     }
- ]
-
- # Preparation for inference
- text = processor.apply_chat_template(
-     messages, tokenize=False, add_generation_prompt=True
- )
- image_inputs, video_inputs = process_vision_info(messages)
- inputs = processor(
-     text=[text],
-     images=image_inputs,
-     videos=video_inputs,
-     padding=True,
-     return_tensors="pt",
- )
- inputs = inputs.to("cuda")
-
- # Inference
- generated_ids = model.generate(**inputs, max_new_tokens=128)
- generated_ids_trimmed = [
-     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
- ]
- output_text = processor.batch_decode(
-     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
- )
- print(output_text)
- ```
- </details>
- <details>
- <summary>Batch inference</summary>
-
- ```python
- # Sample messages for batch inference
- messages1 = [
-     {
-         "role": "user",
-         "content": [
-             {"type": "image", "image": "file:///path/to/image1.jpg"},
-             {"type": "image", "image": "file:///path/to/image2.jpg"},
-             {"type": "text", "text": "This patient has a hypoechoic nodule in the left breast. What is the next step in treatment?"},
-         ],
-     }
- ]
- messages2 = [
-     {"role": "system", "content": "You are a helpful assistant."},
-     {"role": "user", "content": "Who are you?"},
- ]
- # Combine messages for batch processing
- messages = [messages1, messages2]
-
- # Preparation for batch inference
- texts = [
-     processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
-     for msg in messages
- ]
- image_inputs, video_inputs = process_vision_info(messages)
- inputs = processor(
-     text=texts,
-     images=image_inputs,
-     videos=video_inputs,
-     padding=True,
-     return_tensors="pt",
- )
- inputs = inputs.to("cuda")
-
- # Batch inference
- generated_ids = model.generate(**inputs, max_new_tokens=128)
- generated_ids_trimmed = [
-     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
- ]
- output_texts = processor.batch_decode(
-     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
- )
- print(output_texts)
- ```
- </details>
 
  ## 📌 Citation
 
  | Model Hub | [Hugging Face](https://huggingface.co/chaoyinshe/EchoVLM) |
 
  ## 🔄 Updates
+ - **Coming soon**: V2 with Chain-of-Thought reasoning and reinforcement-learning enhancements; the full training and inference code, along with the benchmark test set, will be fully open-sourced.
+ - **Dec 1, 2025**: To help advance work in this field, we have open-sourced our latest instruction-fine-tuned model, built on Lingshu-7B. Since Lingshu-7B is itself based on Qwen2.5-VL, the model inherits a mature ecosystem; for example, it can seamlessly use vLLM for accelerated inference (see the vLLM sketch below). Released model weights on [Hugging Face](https://huggingface.co/chaoyinshe/EchoVLM_V2_lingshu_base_7b_instruct_preview).
+ - **Sep 21, 2025**: The full (uncleaned) model codebase is now open-sourced on GitHub!
  - **Sep 19, 2025**: Released model weights on [Hugging Face](https://huggingface.co/chaoyinshe/EchoVLM).
+ - **Sep 17, 2025**: Paper published on [arXiv](https://arxiv.org/abs/2509.14977).
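
For illustration, here is a minimal vLLM sketch for the V2 preview checkpoint linked above. It assumes vLLM's stock Qwen2.5-VL support applies unchanged to this fine-tune; the image path is a placeholder, and `max_model_len` and `limit_mm_per_prompt` are illustrative settings, not values confirmed by the repo.

```python
# Sketch: offline multimodal inference with vLLM,
# assuming standard Qwen2.5-VL support covers this checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="chaoyinshe/EchoVLM_V2_lingshu_base_7b_instruct_preview",
    max_model_len=8192,                # bound KV-cache memory (illustrative)
    limit_mm_per_prompt={"image": 2},  # allow up to two ultrasound images
)

# OpenAI-style chat messages; the file URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///path/to/ultrasound.png"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

outputs = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0].outputs[0].text)
```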
 
  ## 🚀 Quick Start
+ See [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for detailed usage; a minimal adaptation is sketched below.
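
Since the new Quick Start defers to the Qwen2.5-VL-7B-Instruct instructions, a minimal adaptation of that recipe to this checkpoint might look as follows. The `Qwen2_5_VLForConditionalGeneration` class is an assumption based on the Qwen2.5-VL lineage noted in the updates, and the image path is a placeholder; treat this as a sketch, not verified usage.

```python
# Sketch: chatting with the V2 preview checkpoint via transformers,
# assuming it loads through the standard Qwen2.5-VL classes.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_id = "chaoyinshe/EchoVLM_V2_lingshu_base_7b_instruct_preview"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Single ultrasound image plus a text query; the path is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/ultrasound.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```
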
  ## 📌 Citation