aimgo committed · Commit 3f5715d · verified · 1 Parent(s): 3c2348c

Update README.md

  1. README.md +21 -2
README.md CHANGED
@@ -8,14 +8,33 @@ pipeline_tag: text-generation
 
 **Emendator** is a [byt5-xl](https://huggingface.co/google/byt5-xl) model finetuned to correct OCR artifacts in Latin text.
 
-**This model cannot provide completely faithful reconstruction for all orthographies - on a large scale, it will shift the distribution of tokens towards what that which it has been trained on.**
-As such, use it only in circumstances when the primary concern is only to recover intelligble Latin, not to recover intelligble Latin of a *particular* style.
+**This model cannot provide completely faithful reconstruction for all orthographies - on a large scale, it will shift the distribution of tokens towards what that which it has been trained on.**
 
+This is to say **Emendator will take editorial liberties with your data.** It is fond of introducing abbreviations.
+
+
+The model is intended to be used on segments of **250** characters. Anything else will compromise performance.
+As such, use it only in circumstances when the primary concern is only to recover intelligible Latin, not to recover intelligble Latin of a *particular* style.
+
+### Lightly Corrupted Text
 Original: "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto",
 
 OCR: "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n",
 
 Emendator Reconstruction: "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto",
+### Heavily Corrupted Text
+Orig: "Aegidius Lusitanus Olysipone natus Sacrae Theologiae Magister scripsit Commentarium in libros Sententiarum."
+
+OCR: "/Egidinslufitanusolufiponc natusS.Thcol.Mag.fcripfitCorn rncntariurninlibros>Scntcntiarurn«"
+
+Corrected: "Egidius Lusitanus olusipone natus S. Theol. Mag. scripsit Commentarium in libros Sententiarum"
+### Severely Corrupted Text
+
+Orig: "Diligam te Domine fortitudo mea. Dominus firmamentum meum et refugium meum et liberator meus."
+
+OCR: "Diligamtc Do rnincfortitudomca:Dominusfirma rncntumrncurn&rcfugiurnmcurn&libcratorrnc9«"
+
+Emendator Reconstruction: "Diligam te Domine fortitudo mea Dominus firmamentum meum refugium meum liberator meus"
 
 If you use this in your work, please cite:
 ```