jvdzwaan/ocrpostcorrection-task-1

jvdzwaan committed on Oct 3, 2022

Commit 8b7fca7 · 1 Parent(s): dfb587e

Add model card (README.md)

Files changed (1)

README.md +79 -0

README.md ADDED

	@@ -0,0 +1,79 @@

+---
+language:
+  - bg
+  - cs
+  - de
+  - en
+  - es
+  - fi
+  - fr
+  - nl
+  - pl
+  - sl
+tags:
+- "post-ocr correction"
+- "ocr postcorrection"
+metrics:
+- loss
+- F1
+---
+# OCR postcorrection task 1
+This is a `BertForTokenClassification` model that predicts whether a token is an OCR
+mistake or not. It is based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
+and fine-tuned on the dataset of the
+[2019 ICDAR competition on post-OCR correction](https://sites.google.com/view/icdar2019-postcorrectionocr).
+The dataset contains texts in the following languages:
+- BG
+- CZ
+- DE
+- EN
+- ES
+- FI
+- FR
+- NL
+- PL
+- SL
+10% of the texts (stratified by language) were selected for validation. The test set is used as provided.
+The training data consists of (partially overlapping) sequences of 150 tokens. Only
+sequences with a normalized edit distance below 0.3 were included in the training and
+validation sets. The test set was not filtered on edit distance.
+There are 3 classes in the data:
+- 0: No OCR mistake
+- 1: Start token of an OCR mistake
+- 2: Inside token of an OCR mistake
+## Results
+Loss per data split, followed by the F1 measure per language.
+| Set | Loss |
+| -- | -- |
+| Train | 0.224500 |
+| Val | 0.285791 |
+| Test | 0.417836 |
+Average F1 by language:
+| BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
+| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
+| 0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.80 | 0.69 |
+## Demo
+[Space for this model.](https://huggingface.co/spaces/jvdzwaan/ocrpostcorrection-task1-demo)
+## Code
+* [OCR post correction package](https://github.com/jvdzwaan/ocrpostcorrection)
+* [Notebooks](https://github.com/jvdzwaan/ocrpostcorrection-notebooks)
+  - [Jupyter notebook used for generating the training data](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/local/icdar-create-hf-dataset.ipynb)
+  - [Jupyter notebook used for training the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-train.ipynb)
+  - [Jupyter notebook used for evaluating the model](https://github.com/jvdzwaan/ocrpostcorrection-notebooks/blob/main/colab/icdar-task1-hf-evaluation.ipynb)
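
As a reading aid for the three-class scheme described in the model card (0: no OCR mistake, 1: start token of an OCR mistake, 2: inside token of an OCR mistake), here is a minimal sketch of how per-token labels could be grouped into mistake spans. The function name `mistake_spans` and the handling of a stray label 2 without a preceding 1 are assumptions made for illustration; this is not code from the `ocrpostcorrection` package.

```python
from typing import List, Tuple


def mistake_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Group per-token labels into (start, end) token spans, end exclusive.

    Labels follow the model card: 0 = no OCR mistake, 1 = start token of
    an OCR mistake, 2 = inside token of an OCR mistake.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open mistake began, if any
    for i, label in enumerate(labels):
        if label == 1:
            # A start label closes any open mistake and opens a new one.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == 2:
            # Assumption: a stray "inside" label with no open mistake
            # is treated as the start of a new mistake.
            if start is None:
                start = i
        else:
            # A 0 label closes the open mistake, if there is one.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


print(mistake_spans([0, 1, 2, 0, 0, 1, 1, 2]))  # [(1, 3), (5, 6), (6, 8)]
```

In this sketch two consecutive start labels yield two separate mistakes, which is one plausible reading of the scheme; a real pipeline would also need to map subword tokens back to the original OCR tokens.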