---
datasets:
- pierreguillou/DocLayNet-base
metrics:
- accuracy
base_model:
- google/vit-base-patch16-224-in21k
library_name: transformers
tags:
- vision
- document-layout-analysis
- document-classification
- vit
- doclaynet
---

# Vision Transformer (ViT) for Document Classification (DocLayNet)

This model is a Vision Transformer (ViT) fine-tuned for document layout classification on the DocLayNet dataset.

It was trained on page images from the DocLayNet document categories. The categories and their label indices are:

```python
{'financial_reports': 0,
 'government_tenders': 1,
 'laws_and_regulations': 2,
 'manuals': 3,
 'patents': 4,
 'scientific_articles': 5}
```

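
The mapping above can be inverted to turn a predicted class index back into a category name. This is a minimal sketch; the names `label2id` and `id2label` are illustrative choices here and are not confirmed to match the model's configuration.

```python
# Category-to-index mapping as listed in this model card.
label2id = {
    "financial_reports": 0,
    "government_tenders": 1,
    "laws_and_regulations": 2,
    "manuals": 3,
    "patents": 4,
    "scientific_articles": 5,
}

# Invert the mapping so a predicted index can be decoded to a name.
id2label = {index: name for name, index in label2id.items()}

print(id2label[4])  # patents
```
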
## Model description

This model is built on the `google/vit-base-patch16-224-in21k` Vision Transformer and fine-tuned for document layout classification. The base ViT model uses 16x16-pixel patches and was pre-trained on ImageNet-21k. Fine-tuning adapts it to classify page images into the six DocLayNet document categories.

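
As a quick illustration of what the 16x16 patch size implies, the sketch below computes the input sequence length for a 224x224 image, the resolution given in the base model's name. The variable names are illustrative only.

```python
image_size = 224   # input resolution, taken from the base model name
patch_size = 16    # patch side length in pixels

patches_per_side = image_size // patch_size   # patches along each image edge
num_patches = patches_per_side ** 2           # total patches per image
sequence_length = num_patches + 1             # plus one [CLS] token

print(num_patches, sequence_length)  # 196 197
```
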
## Training data

The model was trained on the DocLayNet-base dataset, which is available on the Hugging Face Hub: [pierreguillou/DocLayNet-base](https://huggingface.co/datasets/pierreguillou/DocLayNet-base)

DocLayNet is a dataset for document layout analysis that contains pages from several document types together with their layout annotations.

## Training procedure

The model was trained for 10 epochs on a single GPU, which took about 10 minutes.

Training hyperparameters:

```python
{
    'batch_size': 64,
    'num_epochs': 20,
    'learning_rate': 1e-4,
    'weight_decay': 0.05,
    'warmup_ratio': 0.2,
    'gradient_clip': 0.1,
    'dropout_rate': 0.1,
    'label_smoothing': 0.1,
    'optimizer': 'AdamW'
}
```

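
To make `warmup_ratio` concrete, here is a rough sketch of how the number of warmup steps follows from the batch size and epoch count in the configuration above. The training-set size is not stated in this card, so `num_train_examples` below is a placeholder chosen only for illustration, and `warmup_steps` is a hypothetical helper rather than part of the original training code.

```python
import math


def warmup_steps(num_train_examples, batch_size=64, num_epochs=20, warmup_ratio=0.2):
    """Return (total_steps, warmup_steps), assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_train_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return total_steps, round(total_steps * warmup_ratio)


# Placeholder dataset size (an assumption, not taken from the model card).
num_train_examples = 6400
total, warmup = warmup_steps(num_train_examples)
print(total, warmup)  # 2000 400
```
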
## Evaluation results

The model achieved the following performance metrics on the test set:

- Test Loss: 0.8622
- Test Accuracy: 81.36%

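
For reference, accuracy here is the fraction of test images whose predicted category matches the true one. The sketch below uses made-up label indices, not the model's actual predictions, and `accuracy` is an illustrative helper.

```python
def accuracy(predictions, references):
    """Fraction of positions where the predicted index equals the true index."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy example using the category indices from this card.
y_true = [0, 1, 2, 3, 4, 5, 5, 2]
y_pred = [0, 1, 2, 3, 4, 5, 2, 2]
print(f"{accuracy(y_pred, y_true):.2%}")  # 87.50%
```
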
## Usage

```python
from transformers import pipeline

# Load the model using the image-classification pipeline
pipe = pipeline("image-classification", model="kaixkhazaki/vit_doclaynet_base")

# Test it with an image
result = pipe("path_to_image.jpg")
print(result)
```
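
The image-classification pipeline returns a list of dictionaries, each with a `label` and a `score`. The sketch below picks the top prediction from such a list. The sample values are invented for illustration, and the label strings assume the model's configuration exposes the category names listed earlier in this card.

```python
# Hypothetical pipeline output (values invented for illustration).
result = [
    {"label": "scientific_articles", "score": 0.91},
    {"label": "patents", "score": 0.05},
    {"label": "manuals", "score": 0.02},
]

# Select the entry with the highest score.
top = max(result, key=lambda item: item["score"])
print(f"{top['label']} ({top['score']:.2f})")  # scientific_articles (0.91)
```
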