TongkunGuan committed on
Commit 8b89f0e · verified · 1 Parent(s): c6425c8

Update README.md

Files changed (1):
  1. README.md +29 -7
README.md CHANGED
@@ -101,21 +101,43 @@ outputs = model(pixel_values)
 
 ### Evaluation on Vision Capability
 
-We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks. The evaluation is divided into two key categories: (1) image classification, representing global-view semantic quality, and (2) semantic segmentation, capturing local-view semantic quality. This approach allows us to assess the representation quality of InternViT across its successive version updates. Please refer to our technical report for more details.
-
-#### Image Classification
-
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/0Zx1JWB-2kHEfLbboiVy1.png)
-
-**Image classification performance across different versions of InternViT.** We use IN-1K for training and evaluate on the IN-1K validation set as well as multiple ImageNet variants, including IN-ReaL, IN-V2, IN-A, IN-R, and IN-Sketch. Results are reported for both linear probing and attention pooling probing methods, with average accuracy for each method. ∆ represents the performance gap between attention pooling probing and linear probing, where a larger ∆ suggests a shift from learning simple linear features to capturing more complex, nonlinear semantic representations.
-
-#### Semantic Segmentation Performance
-
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/XjJx5WSIXjsaQGLPCsQuP.png)
-
-**Semantic segmentation performance across different versions of InternViT.** The models are evaluated on ADE20K and COCO-Stuff-164K using three configurations: linear probing, head tuning, and full tuning. The table shows the mIoU scores for each configuration and their averages. ∆1 represents the gap between head tuning and linear probing, while ∆2 shows the gap between full tuning and linear probing. A larger ∆ value indicates a shift from simple linear features to more complex, nonlinear representations.
+We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks.
+The evaluation is divided into three key categories:
+
+(1) text retrieval;
+(2) image segmentation;
+(3) visual question answering.
+
+This approach allows us to assess the representation quality of TokenOCR.
+Please refer to our technical report for more details.
+
+#### Text Retrieval
+
+<div align="center">
+<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/b2b2g23o9GMmPe1PiCn0f.png">
+</div>
+
+#### Image Segmentation
+
+<div align="center">
+<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/C15-Ica6XVfX6y_MgiVds.png">
+</div>
+
+#### Visual Question Answering
+
+<div align="center">
+<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/IbLZ0CxCxDkTaHAMe7M0Q.png">
+</div>
 
 ## TokenVL
 
 We employ TokenOCR as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.