Update README.md
README.md
CHANGED
|
@@ -8,14 +8,33 @@ pipeline_tag: text-generation
|
|
| 8 |
|
| 9 |
**Emendator** is a [byt5-xl](https://huggingface.co/google/byt5-xl) model finetuned to correct OCR artifacts in Latin text.
|
| 10 |
|
| 11 |
-
**This model cannot provide completely faithful reconstruction for all orthographies - on a large scale, it will shift the distribution of tokens towards what that which it has been trained on.**
|
| 12 |
-
As such, use it only in circumstances when the primary concern is only to recover intelligble Latin, not to recover intelligble Latin of a *particular* style.
|
| 13 |
| 14 |
Original: "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto",
|
| 15 |
|
| 16 |
OCR: "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n",
|
| 17 |
|
| 18 |
Emendator Reconstruction: "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto",
|
| 19 |
|
| 20 |
If you use this in your work, please cite:
|
| 21 |
```
|
|
|
|
| 8 |
|
| 9 |
**Emendator** is a [byt5-xl](https://huggingface.co/google/byt5-xl) model finetuned to correct OCR artifacts in Latin text.
|
| 10 |
|
| 11 |
+
**This model cannot provide a completely faithful reconstruction for all orthographies: at scale, it will shift the distribution of tokens towards that of the data it was trained on.**
|
|
|
|
| 12 |
|
| 13 |
+
That is to say, **Emendator will take editorial liberties with your data.** It is fond of introducing abbreviations.
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
The model is intended to be used on segments of **250** characters. Other segment lengths will compromise performance.
|
| 17 |
+
As such, use it only when the primary concern is to recover intelligible Latin, not intelligible Latin of a *particular* style.
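The 250-character guidance implies that longer documents need to be segmented before correction. The README does not say how to do this, so the following is only a minimal sketch under stated assumptions: the helper name `split_into_segments` is hypothetical, and breaking at the last space inside each window (rather than cutting at exactly 250 characters) is an illustrative choice, not the author's documented method.

```python
def split_into_segments(text: str, size: int = 250) -> list[str]:
    """Split text into consecutive segments of at most `size` characters.

    Hypothetical helper (not part of Emendator). It prefers to break at the
    last space inside each window so that words are not cut in half; if a
    window contains no usable space, it falls back to a hard cut at `size`.
    """
    segments = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + size, n)
        if end < n:
            # Look for the last space in the current window.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the left-hand segment
        segments.append(text[start:end])
        start = end
    return segments


if __name__ == "__main__":
    sample = "Quod si id crimen homini novo esse deberet, profecto " * 12
    for segment in split_into_segments(sample):
        print(len(segment), repr(segment[:40]))
```

Because every character is kept, concatenating the segments reproduces the input exactly. Segments produced this way can be shorter than 250 characters; if that turns out to hurt output quality, a hard cut at exactly 250 characters is the other obvious option.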
|
| 18 |
+
|
| 19 |
+
### Lightly Corrupted Text
|
| 20 |
Original: "atque optimo viro, peterem; superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto",
|
| 21 |
|
| 22 |
OCR: "atqu optimo viro, peterem, superavi tamen dignitate Catilinam, qatia Gdbam. Quod si id crimen homini novo esse deberet, protecto\n",
|
| 23 |
|
| 24 |
Emendator Reconstruction: "atque optimo viro, peterem, superavi tamen dignitate Catilinam, gratia Galbam. Quod si id crimen homini novo esse deberet, profecto",
|
| 25 |
+
### Heavily Corrupted Text
|
| 26 |
+
Original: "Aegidius Lusitanus Olysipone natus Sacrae Theologiae Magister scripsit Commentarium in libros Sententiarum."
|
| 27 |
+
|
| 28 |
+
OCR: "/Egidinslufitanusolufiponc natusS.Thcol.Mag.fcripfitCorn rncntariurninlibros>Scntcntiarurn«"
|
| 29 |
+
|
| 30 |
+
Emendator Reconstruction: "Egidius Lusitanus olusipone natus S. Theol. Mag. scripsit Commentarium in libros Sententiarum"
|
| 31 |
+
### Severely Corrupted Text
|
| 32 |
+
|
| 33 |
+
Original: "Diligam te Domine fortitudo mea. Dominus firmamentum meum et refugium meum et liberator meus."
|
| 34 |
+
|
| 35 |
+
OCR: "Diligamtc Do rnincfortitudomca:Dominusfirma rncntumrncurn&rcfugiurnmcurn&libcratorrnc9«"
|
| 36 |
+
|
| 37 |
+
Emendator Reconstruction: "Diligam te Domine fortitudo mea Dominus firmamentum meum refugium meum liberator meus"
|
| 38 |
|
| 39 |
If you use this in your work, please cite:
|
| 40 |
```
|