TongkunGuan/TokenFD

TongkunGuan committed on Feb 21, 2025

Commit c2573da (verified) · 1 Parent(s): 7fd4955

Update README.md

Files changed (1)

README.md +22 -62

@@ -99,88 +99,48 @@ pixel_values = pixel_values.to(torch.bfloat16).cuda()
 outputs = model(pixel_values)
 ```
-## TokenVL
-We employ TokenOCR as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
-Following the previous training paradigm, TokenVL also includes two stages:
-**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
-**Stage 2: Supervised Instruction Tuning for VQA tasks.**
-## Model Architecture
-As shown in the following figure, InternVL 2.5 retains the same model architecture as its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/BiiyXN6NOk0p-3rl3ueyL.png)
-As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. Besides, we adopted a similar dynamic resolution strategy as InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data.
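The one-quarter reduction described in the removed paragraph above can be checked with a little arithmetic. This is a minimal sketch, not code from the repository: the patch size of 14 and the unshuffle factor of 2 are assumptions chosen to match a 448×448 tile and the stated one-quarter ratio, and `visual_tokens_per_tile` is a hypothetical helper name.

```python
def visual_tokens_per_tile(tile_size=448, patch_size=14, unshuffle_factor=2):
    """Return (tokens_before, tokens_after) for one square tile.

    patch_size and unshuffle_factor are illustrative assumptions,
    not values stated in the README text.
    """
    if tile_size % patch_size != 0:
        raise ValueError("tile_size must be divisible by patch_size")
    patches_per_side = tile_size // patch_size
    if patches_per_side % unshuffle_factor != 0:
        raise ValueError("patch grid must be divisible by unshuffle_factor")

    # One visual token per ViT patch before pixel unshuffle.
    tokens_before = patches_per_side ** 2
    # Pixel unshuffle folds each factor x factor block of tokens into one
    # token with more channels, so the count drops by factor squared.
    tokens_after = tokens_before // (unshuffle_factor ** 2)
    return tokens_before, tokens_after


before, after = visual_tokens_per_tile()
print(before, after, after / before)
```

Under these assumptions the token count per tile falls to exactly one-quarter of the original, which is the ratio the text reports.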
-## Training Strategy
-### Dynamic High-Resolution for Multimodal Data
-In InternVL 2.0 and 2.5, we extend the dynamic high-resolution training approach, enhancing its capabilities to handle multi-image and video datasets.
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/xoMY6rwRrNxbAGYPNyU8g.png)
-- For single-image datasets, the total number of tiles `n_max` is allocated to a single image for maximum resolution. Visual tokens are enclosed in `<img>` and `</img>` tags.
-- For multi-image datasets, the total number of tiles `n_max` is distributed across all images in a sample. Each image is labeled with auxiliary tags like `Image-1` and enclosed in `<img>` and `</img>` tags.
-- For videos, each frame is resized to 448×448. Frames are labeled with tags like `Frame-1` and enclosed in `<img>` and `</img>` tags, similar to images.
-### Single Model Training Pipeline
-The training pipeline for a single model in InternVL 2.5 is structured across three stages, designed to enhance the model's visual perception and multimodal capabilities.
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/5NduZeCPLgPJTFr0RGTq3.png)
-- **Stage 1: MLP Warmup.** In this stage, only the MLP projector is trained while the vision encoder and language model are frozen. A dynamic high-resolution training strategy is applied for better performance, despite increased cost. This phase ensures robust cross-modal alignment and prepares the model for stable multimodal training.
-- **Stage 1.5: ViT Incremental Learning (Optional).** This stage allows incremental training of the vision encoder and MLP projector using the same data as Stage 1. It enhances the encoder’s ability to handle rare domains like multilingual OCR and mathematical charts. Once trained, the encoder can be reused across LLMs without retraining, making this stage optional unless new domains are introduced.
-- **Stage 2: Full Model Instruction Tuning.** The entire model is trained on high-quality multimodal instruction datasets. Strict data quality controls are enforced to prevent degradation of the LLM, as noisy data can cause issues like repetitive or incorrect outputs. After this stage, the training process is complete.
-## Evaluation on Vision Capability
 We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks. The evaluation is divided into two key categories: (1) image classification, representing global-view semantic quality, and (2) semantic segmentation, capturing local-view semantic quality. This approach allows us to assess the representation quality of InternViT across its successive version updates. Please refer to our technical report for more details.
-## Image Classification
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/0Zx1JWB-2kHEfLbboiVy1.png)
 **Image classification performance across different versions of InternViT.** We use IN-1K for training and evaluate on the IN-1K validation set as well as multiple ImageNet variants, including IN-ReaL, IN-V2, IN-A, IN-R, and IN-Sketch. Results are reported for both linear probing and attention pooling probing methods, with average accuracy for each method. ∆ represents the performance gap between attention pooling probing and linear probing, where a larger ∆ suggests a shift from learning simple linear features to capturing more complex, nonlinear semantic representations.
-## Semantic Segmentation Performance
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/XjJx5WSIXjsaQGLPCsQuP.png)
 **Semantic segmentation performance across different versions of InternViT.** The models are evaluated on ADE20K and COCO-Stuff-164K using three configurations: linear probing, head tuning, and full tuning. The table shows the mIoU scores for each configuration and their averages. ∆1 represents the gap between head tuning and linear probing, while ∆2 shows the gap between full tuning and linear probing. A larger ∆ value indicates a shift from simple linear features to more complex, nonlinear representations.
-## Quick Start
-> \[!Warning\]
-> 🚨 Note: In our experience, the InternViT V2.5 series is better suited for building MLLMs than traditional computer vision tasks.
-```python
-import torch
-from PIL import Image
-from transformers import AutoModel, CLIPImageProcessor
-model = AutoModel.from_pretrained(
-    'OpenGVLab/InternViT-300M-448px-V2_5',
-    torch_dtype=torch.bfloat16,
-    low_cpu_mem_usage=True,
-    trust_remote_code=True).cuda().eval()
-image = Image.open('./examples/image1.jpg').convert('RGB')
-image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-300M-448px-V2_5')
-pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
-pixel_values = pixel_values.to(torch.bfloat16).cuda()
-outputs = model(pixel_values)
-```
 ## License

 outputs = model(pixel_values)
 ```
+### Evaluation on Vision Capability
 We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks. The evaluation is divided into two key categories: (1) image classification, representing global-view semantic quality, and (2) semantic segmentation, capturing local-view semantic quality. This approach allows us to assess the representation quality of InternViT across its successive version updates. Please refer to our technical report for more details.
+#### Image Classification
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/0Zx1JWB-2kHEfLbboiVy1.png)
 **Image classification performance across different versions of InternViT.** We use IN-1K for training and evaluate on the IN-1K validation set as well as multiple ImageNet variants, including IN-ReaL, IN-V2, IN-A, IN-R, and IN-Sketch. Results are reported for both linear probing and attention pooling probing methods, with average accuracy for each method. ∆ represents the performance gap between attention pooling probing and linear probing, where a larger ∆ suggests a shift from learning simple linear features to capturing more complex, nonlinear semantic representations.
+#### Semantic Segmentation Performance
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/XjJx5WSIXjsaQGLPCsQuP.png)
 **Semantic segmentation performance across different versions of InternViT.** The models are evaluated on ADE20K and COCO-Stuff-164K using three configurations: linear probing, head tuning, and full tuning. The table shows the mIoU scores for each configuration and their averages. ∆1 represents the gap between head tuning and linear probing, while ∆2 shows the gap between full tuning and linear probing. A larger ∆ value indicates a shift from simple linear features to more complex, nonlinear representations.
+## TokenVL
+We employ TokenOCR as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
+Following the previous training paradigm, TokenVL also includes two stages:
+**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
+<div align="center">
+  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/gDr1fQg7I1nTIsiRWNHTr.png">
+</div>
+The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily improve spatial text perception by adding localization prompts that ask the model to predict coordinates. However, this implicit
+method makes it difficult for these models to develop a precise understanding of the text in the image.
+In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with corresponding pixels in the input image, enhancing the MLLM’s localization awareness.
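To make the idea of aligning text tokens with pixels concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the authors' training code: it assumes the alignment is done by averaging visual features under each BPE token mask and comparing the result to that token's language embedding with cosine similarity. The function names, the toy shapes, and the loss form are all hypothetical.

```python
import math


def masked_mean(features, mask):
    """Average the feature vectors of all pixels where the mask is 1."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for row_feats, row_mask in zip(features, mask):
        for vec, selected in zip(row_feats, row_mask):
            if selected:
                count += 1
                for i, value in enumerate(vec):
                    total[i] += value
    if count == 0:
        raise ValueError("token mask selects no pixels")
    return [t / count for t in total]


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def alignment_loss(features, token_masks, token_embeddings):
    """Mean of (1 - cosine similarity) over all (mask, embedding) pairs."""
    losses = []
    for mask, embedding in zip(token_masks, token_embeddings):
        pooled = masked_mean(features, mask)
        losses.append(1.0 - cosine_similarity(pooled, embedding))
    return sum(losses) / len(losses)


# Toy 2x2 feature map with 2-dimensional features (assumed values).
features = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
# One binary mask per BPE token: top row for token 0, bottom row for token 1.
token_masks = [
    [[1, 1], [0, 0]],
    [[0, 0], [1, 1]],
]
# Hypothetical language embeddings for the two tokens.
token_embeddings = [[2.0, 0.0], [0.0, 3.0]]

loss = alignment_loss(features, token_masks, token_embeddings)
print(loss)
```

In this toy setup the pooled visual vectors point in the same direction as their token embeddings, so the loss is zero; swapping the two embeddings makes every pair orthogonal and raises the loss to one.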
+**Stage 2: Supervised Instruction Tuning for VQA tasks.**
+During the Supervised Instruction Tuning stage, we remove the token alignment branch, because for some reasoning tasks the answer may not appear in the image
+(e.g., How much taller is the red bar compared to the green bar?). Removing the branch also means the improved document understanding adds no computational overhead at inference. Finally, we inherit the
+remaining weights from the LLM-guided Token Alignment stage and unfreeze all parameters to facilitate comprehensive parameter updates.
+### OCRBench Results
 ## License