TongkunGuan committed on
Commit 8b89f0e · verified · 1 Parent(s): c6425c8

Update README.md

Files changed (1):
  1. README.md +29 -7
README.md CHANGED
@@ -101,21 +101,43 @@ outputs = model(pixel_values)
 
 ### Evaluation on Vision Capability
 
-We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks. The evaluation is divided into two key categories: (1) image classification, representing global-view semantic quality, and (2) semantic segmentation, capturing local-view semantic quality. This approach allows us to assess the representation quality of InternViT across its successive version updates. Please refer to our technical report for more details.
-
-#### Image Classification
-
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/0Zx1JWB-2kHEfLbboiVy1.png)
-
-**Image classification performance across different versions of InternViT.** We use IN-1K for training and evaluate on the IN-1K validation set as well as multiple ImageNet variants, including IN-ReaL, IN-V2, IN-A, IN-R, and IN-Sketch. Results are reported for both linear probing and attention pooling probing methods, with average accuracy for each method. ∆ represents the performance gap between attention pooling probing and linear probing, where a larger ∆ suggests a shift from learning simple linear features to capturing more complex, nonlinear semantic representations.
-
-#### Semantic Segmentation Performance
-
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/XjJx5WSIXjsaQGLPCsQuP.png)
-
-**Semantic segmentation performance across different versions of InternViT.** The models are evaluated on ADE20K and COCO-Stuff-164K using three configurations: linear probing, head tuning, and full tuning. The table shows the mIoU scores for each configuration and their averages. ∆1 represents the gap between head tuning and linear probing, while ∆2 shows the gap between full tuning and linear probing. A larger ∆ value indicates a shift from simple linear features to more complex, nonlinear representations.
+We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks.
+The evaluation is divided into three key categories:
+
+(1) text retrieval;
+(2) image segmentation;
+(3) visual question answering.
+
+This approach allows us to assess the representation quality of TokenOCR.
+Please refer to our technical report for more details.
+
+#### Text Retrieval
+
+<div align="center">
+<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/b2b2g23o9GMmPe1PiCn0f.png">
+</div>
+
+#### Image Segmentation
+
+<div align="center">
+<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/C15-Ica6XVfX6y_MgiVds.png">
+</div>
+
+#### Visual Question Answering
+
+<div align="center">
+<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/IbLZ0CxCxDkTaHAMe7M0Q.png">
+</div>
 
 ## TokenVL
 
 We employ TokenOCR as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.