---
language: multilingual
tags:
- document-classification
- text-classification
- multilingual
- doclaynet
- e5
pipeline_tag: text-classification
base_model: intfloat/multilingual-e5-large
datasets:
- pierreguillou/DocLayNet-base
metrics:
- accuracy
model-index:
- name: multilingual-e5-doclaynet
  results:
  - task:
      type: text-classification
      name: Document Classification
    dataset:
      name: DocLayNet
      type: pierreguillou/DocLayNet-base
    metrics:
    - type: accuracy
      value: 0.9719
      name: Test Accuracy
    - type: loss
      value: 0.5192
      name: Test Loss
library_name: transformers
---

# Multilingual E5 for Document Classification (DocLayNet)

This model is a fine-tuned version of intfloat/multilingual-e5-large for document text classification based on the DocLayNet dataset.

## Evaluation results

- Test Loss: 0.5192, Test Acc: 0.9719

## Usage

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="kaixkhazaki/multilingual-e5-doclaynet")

prediction = pipe("This is some text from a financial report")
print(prediction)
```

## Model description

- Base model: intfloat/multilingual-e5-large
- Task: Document text classification
- Languages: Multilingual

## Training data

- Dataset: DocLayNet-base
- Source: https://huggingface.co/datasets/pierreguillou/DocLayNet-base
- Categories:

```python
{
    'financial_reports': 0,
    'government_tenders': 1,
    'laws_and_regulations': 2,
    'manuals': 3,
    'patents': 4,
    'scientific_articles': 5
}
```
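
The category mapping above can be inverted to turn a predicted class index back into a category name. The following is a minimal sketch, not part of the released code: `CATEGORY_TO_ID` copies the mapping from this card, while `ID_TO_CATEGORY` and `decode_label` are hypothetical helper names. The `LABEL_<n>` format is an assumption about what the pipeline may return if the model config carries no custom label names.

```python
# Category mapping copied from the model card.
CATEGORY_TO_ID = {
    'financial_reports': 0,
    'government_tenders': 1,
    'laws_and_regulations': 2,
    'manuals': 3,
    'patents': 4,
    'scientific_articles': 5,
}

# Inverse mapping: class index -> category name.
ID_TO_CATEGORY = {idx: name for name, idx in CATEGORY_TO_ID.items()}


def decode_label(label):
    """Return the category name for a predicted label (hypothetical helper).

    Accepts an integer class index, a generic 'LABEL_<n>' string (assumed
    default naming), or a category name, which is returned unchanged.
    """
    if isinstance(label, int):
        return ID_TO_CATEGORY[label]
    if label in CATEGORY_TO_ID:
        return label
    if label.startswith('LABEL_'):
        return ID_TO_CATEGORY[int(label.split('_', 1)[1])]
    raise ValueError(f'Unrecognized label: {label!r}')


# Hand-written prediction in the pipeline's output format
# (illustrative values, not real model output).
example_prediction = [{'label': 'LABEL_0', 'score': 0.98}]
print(decode_label(example_prediction[0]['label']))
```

If the pipeline already returns category names, `decode_label` passes them through unchanged, so the helper is safe to apply in either case.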

## Training procedure

Trained on a single GPU for 2 epochs, taking approximately 20 minutes.

Hyperparameters:

```python
{
    'batch_size': 8,
    'num_epochs': 10,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'warmup_ratio': 0.1,
    'gradient_clip': 1.0,
    'label_smoothing': 0.1,
    'optimizer': 'AdamW',
    'scheduler': 'cosine_with_warmup'
}
```
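
To make the `warmup_ratio` and `cosine_with_warmup` settings concrete, here is a minimal sketch of such a schedule in plain Python. It assumes linear warmup from zero followed by a half-cosine decay to zero, which is the shape produced by `get_cosine_schedule_with_warmup` in transformers. The card does not state which implementation was actually used, and the function name `lr_at_step` and the step count of 1000 are hypothetical.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` under linear warmup plus cosine decay.

    Assumed shape: ramp linearly from 0 to base_lr over the first
    warmup_ratio * total_steps steps, then follow a half cosine to 0.
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Hypothetical run length; the real number of optimizer steps is not given.
total_steps = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps))
```

Under these assumptions the rate peaks at `learning_rate` once warmup ends (10% of the run) and falls back to zero at the final step.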