TongkunGuan/TokenFD

Image-to-Text

TongkunGuan committed on Mar 8, 2025

Commit a07d3da (verified) · 1 parent: cf532cf

Update README.md

Files changed (1)

README.md +7 -6

@@ -149,10 +149,11 @@ features are aligned within the same semantic space. This “image-as-text” al
 applications, including text segmentation, retrieval, and visual question answering.
 <div align="center">
-  <img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/QTsvWxFJFTnISdhvbfZhD.png">
+  <img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/6vNkEPzolBWVM--beoxLI.png">
 </div>
 ### Evaluation on Vision Capability
 We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks.
@@ -168,16 +169,15 @@ Please refer to our technical report for more details.
 #### text retrial
 <div align="left">
-  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/b2b2g23o9GMmPe1PiCn0f.png">
+  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/wlLcdB0hpC666PrEQSDaM.png">
 </div>
 <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/b2b2g23o9GMmPe1PiCn0f.png) -->
 #### image segmentation
 <div align="left">
-  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/C15-Ica6XVfX6y_MgiVds.png">
+  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/0HqFXP8OC2tLH4d7scdMt.png">
 </div>
 <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/C15-Ica6XVfX6y_MgiVds.png) -->
@@ -185,7 +185,7 @@ Please refer to our technical report for more details.
 #### visual question answering
 <div align="left">
-  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/IbLZ0CxCxDkTaHAMe7M0Q.png">
+  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/3PBP0akDiMbupu_Gr7lzP.png">
 </div>
 <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/IbLZ0CxCxDkTaHAMe7M0Q.png)
@@ -200,9 +200,10 @@ Following the previous training paradigm, TokenVL also includes two stages:
 **Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
 <div align="center">
-  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/gDr1fQg7I1nTIsiRWNHTr.png">
+  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/5ZCzz1tYy0bnIIZFxgTPN.png">
 </div>
 The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial-wise text perception capabilities by integrating localization prompts to predict coordinates. However, this implicit
 method makes it difficult for these models to have a precise understanding.
 In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with corresponding pixels in the input image, enhancing the MLLM’s localization awareness.
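
The final context paragraph of the README says that BPE token masks are used to align text explicitly with the corresponding pixels. A minimal sketch of what such an alignment could look like is given below. It assumes masked average pooling of pixel features followed by cosine similarity against a text-token embedding; both are illustrative choices and are not confirmed as the method TokenFD or TokenVL actually uses. The names `masked_mean`, `cosine`, `features`, `token_mask`, and `token_embedding` are hypothetical.

```python
import math


def masked_mean(features, mask):
    """Average the feature vectors of the pixels selected by a token mask.

    features: H x W grid of D-dimensional vectors (nested lists).
    mask:     H x W grid of 0/1 values marking the pixels of one BPE token.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, selected in zip(feat_row, mask_row):
            if selected:
                count += 1
                for i, value in enumerate(vec):
                    total[i] += value
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2 x 2 feature map with 2-dimensional pixel features (made-up values).
features = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
# Hypothetical mask: the token covers the top row of pixels only.
token_mask = [
    [1, 1],
    [0, 0],
]
# Hypothetical embedding of the same BPE token on the language side.
token_embedding = [1.0, 0.0]

pooled = masked_mean(features, token_mask)
similarity = cosine(pooled, token_embedding)
alignment_loss = 1.0 - similarity  # lower means better pixel-to-token agreement
```

In this toy setup the pooled visual feature points in the same direction as the token embedding, so the loss is zero. A training loop would push real features toward that state for every token mask in the image.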