Pclanglais committed on
Commit 344dfe3 · verified · 1 Parent(s): 36477ad

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -6,7 +6,7 @@ language:
  - de
  ---
 
- **OCRonos** is a series of specialized language models for the correction of badly digitized texts part of PleIAs' *Bad Data Toolbox*.
+ **OCRonos** is a series of specialized language models for the correction of badly digitized texts.
 
  OCROnos models are versatile tools supporting the correction of OCR errors, wrong word cut/merge and overall broken text structures. They have been trained by PleIAs on a highly diverse set of ocrized texts in multiple languages from PleIAs open pre-training corpus, drawn from cultural heritage sources (Common Corpus) and financial and administrative documents in open data (Finance Commons).
 
@@ -14,7 +14,7 @@ This release currently features a model based on llama-3-8b that has been the mo
 
  OCRonos is generally faithful to what the original material, provides sensible restitution of deteriorated text and will rarely rewrite correct words. On highly deteriorated content, OCRonos can act as a synthetic rewriting tool rather than a strict correction tool.
 
- Along with the other models of PleIAs *Bad Data Toolbox*, OCRonos contributes to make challenging resources usable for LLM applications and, more broadly, search retrieval. It is especially fitting in situation where the original PDF sources is too damaged for correct OCRization or even non-existent/complex to retrieve.
+ Along with the other models of PleIAs **Bad Data Toolbox**, OCRonos contributes to make challenging resources usable for LLM applications and, more broadly, search retrieval. It is especially fitting in situation where the original PDF sources is too damaged for correct OCRization or even non-existent/complex to retrieve.
 
  OCRonos can be tested on a free demo along with [Segmentext](https://huggingface.co/PleIAs/Segmentext), another model trained by PleIAs for the text segmentation of broken PDFs.
 