Update README.md
README.md
|
@@ -149,10 +149,11 @@ features are aligned within the same semantic space. This “image-as-text” al
|
|
| 149 |
applications, including text segmentation, retrieval, and visual question answering.
|
| 150 |
|
| 151 |
<div align="center">
|
| 152 |
-
<img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/
|
| 153 |
</div>
|
| 154 |
|
| 155 |
|
|
|
|
| 156 |
### Evaluation on Vision Capability
|
| 157 |
|
| 158 |
We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks.
|
|
@@ -168,16 +169,15 @@ Please refer to our technical report for more details.
|
|
| 168 |
#### text retrieval
|
| 169 |
|
| 170 |
<div align="left">
|
| 171 |
-
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/
|
| 172 |
</div>
|
| 173 |
|
| 174 |
-
|
| 175 |
<!--  -->
|
| 176 |
|
| 177 |
#### image segmentation
|
| 178 |
|
| 179 |
<div align="left">
|
| 180 |
-
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/
|
| 181 |
</div>
|
| 182 |
|
| 183 |
<!--  -->
|
|
@@ -185,7 +185,7 @@ Please refer to our technical report for more details.
|
|
| 185 |
#### visual question answering
|
| 186 |
|
| 187 |
<div align="left">
|
| 188 |
-
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/
|
| 189 |
</div>
|
| 190 |
|
| 191 |
<!-- 
|
|
@@ -200,9 +200,10 @@ Following the previous training paradigm, TokenVL also includes two stages:
|
|
| 200 |
**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
|
| 201 |
|
| 202 |
<div align="center">
|
| 203 |
-
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/
|
| 204 |
</div>
|
| 205 |
|
|
|
|
| 206 |
The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial-wise text perception capabilities by integrating localization prompts to predict coordinates. However, this implicit
|
| 207 |
method makes it difficult for these models to have a precise understanding.
|
| 208 |
In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with corresponding pixels in the input image, enhancing the MLLM’s localization awareness.
|
|
|
|
| 149 |
applications, including text segmentation, retrieval, and visual question answering.
|
| 150 |
|
| 151 |
<div align="center">
|
| 152 |
+
<img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/6vNkEPzolBWVM--beoxLI.png">
|
| 153 |
</div>
|
| 154 |
|
| 155 |
|
| 156 |
+
|
| 157 |
### Evaluation on Vision Capability
|
| 158 |
|
| 159 |
We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks.
|
|
|
|
| 169 |
#### text retrieval
|
| 170 |
|
| 171 |
<div align="left">
|
| 172 |
+
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/wlLcdB0hpC666PrEQSDaM.png">
|
| 173 |
</div>
|
| 174 |
|
|
|
|
| 175 |
<!--  -->
|
| 176 |
|
| 177 |
#### image segmentation
|
| 178 |
|
| 179 |
<div align="left">
|
| 180 |
+
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/0HqFXP8OC2tLH4d7scdMt.png">
|
| 181 |
</div>
|
| 182 |
|
| 183 |
<!--  -->
|
|
|
|
| 185 |
#### visual question answering
|
| 186 |
|
| 187 |
<div align="left">
|
| 188 |
+
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/3PBP0akDiMbupu_Gr7lzP.png">
|
| 189 |
</div>
|
| 190 |
|
| 191 |
<!-- 
|
|
|
|
| 200 |
**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
|
| 201 |
|
| 202 |
<div align="center">
|
| 203 |
+
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/5ZCzz1tYy0bnIIZFxgTPN.png">
|
| 204 |
</div>
|
| 205 |
|
| 206 |
+
|
| 207 |
The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial-wise text perception capabilities by integrating localization prompts to predict coordinates. However, this implicit
|
| 208 |
method makes it difficult for these models to have a precise understanding.
|
| 209 |
In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with corresponding pixels in the input image, enhancing the MLLM’s localization awareness.
|