TongkunGuan committed on
Commit
a07d3da
·
verified ·
1 Parent(s): cf532cf

Update README.md

  1. README.md +7 -6
README.md CHANGED
@@ -149,10 +149,11 @@ features are aligned within the same semantic space. This “image-as-text” al
149
  applications, including text segmentation, retrieval, and visual question answering.
150
 
151
  <div align="center">
152
- <img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/QTsvWxFJFTnISdhvbfZhD.png">
153
  </div>
154
 
155
 
 
156
  ### Evaluation on Vision Capability
157
 
158
  We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks.
@@ -168,16 +169,15 @@ Please refer to our technical report for more details.
168
  #### text retrial
169
 
170
  <div align="left">
171
- <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/b2b2g23o9GMmPe1PiCn0f.png">
172
  </div>
173
 
174
-
175
  <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/b2b2g23o9GMmPe1PiCn0f.png) -->
176
 
177
  #### image segmentation
178
 
179
  <div align="left">
180
- <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/C15-Ica6XVfX6y_MgiVds.png">
181
  </div>
182
 
183
  <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/C15-Ica6XVfX6y_MgiVds.png) -->
@@ -185,7 +185,7 @@ Please refer to our technical report for more details.
185
  #### visual question answering
186
 
187
  <div align="left">
188
- <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/IbLZ0CxCxDkTaHAMe7M0Q.png">
189
  </div>
190
 
191
  <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/IbLZ0CxCxDkTaHAMe7M0Q.png)
@@ -200,9 +200,10 @@ Following the previous training paradigm, TokenVL also includes two stages:
200
  **Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
201
 
202
  <div align="center">
203
- <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/gDr1fQg7I1nTIsiRWNHTr.png">
204
  </div>
205
 
 
206
  The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial-wise text perception capabilities by integrating localization prompts to predict coordinates. However, this implicit
207
  method makes it difficult for these models to have a precise understanding.
208
  In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with corresponding pixels in the input image, enhancing the MLLM’s localization awareness.
 
149
  applications, including text segmentation, retrieval, and visual question answering.
150
 
151
  <div align="center">
152
+ <img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/6vNkEPzolBWVM--beoxLI.png">
153
  </div>
154
 
155
 
156
+
157
  ### Evaluation on Vision Capability
158
 
159
  We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks.
 
169
  #### text retrial
170
 
171
  <div align="left">
172
+ <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/wlLcdB0hpC666PrEQSDaM.png">
173
  </div>
174
 
 
175
  <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/b2b2g23o9GMmPe1PiCn0f.png) -->
176
 
177
  #### image segmentation
178
 
179
  <div align="left">
180
+ <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/0HqFXP8OC2tLH4d7scdMt.png">
181
  </div>
182
 
183
  <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/C15-Ica6XVfX6y_MgiVds.png) -->
 
185
  #### visual question answering
186
 
187
  <div align="left">
188
+ <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/3PBP0akDiMbupu_Gr7lzP.png">
189
  </div>
190
 
191
  <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/IbLZ0CxCxDkTaHAMe7M0Q.png)
 
200
  **Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
201
 
202
  <div align="center">
203
+ <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/5ZCzz1tYy0bnIIZFxgTPN.png">
204
  </div>
205
 
206
+
207
  The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial-wise text perception capabilities by integrating localization prompts to predict coordinates. However, this implicit
208
  method makes it difficult for these models to have a precise understanding.
209
  In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with corresponding pixels in the input image, enhancing the MLLM’s localization awareness.