---
license: cc-by-nc-sa-4.0
language:
- cs
- de
pipeline_tag: translation
tags:
- code
- wikidata
- biography
- semantic_annotation
- entity_linking
base_model: google/mt5-small
inference:
  parameters:
    max_new_tokens: 100
---

A model for annotating entries in biographical dictionaries using Wikidata entities. Based on <a href="https://huggingface.co/google/mt5-small">Google's mT5</a>.

Example input text:

<code>Anschiringer, Anton, Publizist, * 1812 Wien, † 17. 12. 1873 Reichenberg (Liberec). Erzieher im Hause des Großindustriellen...</code>

Example output text:

<code>{{WD|label|Anschiringer, Anton}}, {{WD|<a href="https://www.wikidata.org/entity/P106">P106</a>|<a href="https://www.wikidata.org/entity/Q6051619">Q6051619</a>|Publizist}}, * {{WD|<a href="https://www.wikidata.org/entity/P569">P569</a>|1812}} {{WD|<a href="https://www.wikidata.org/entity/P19">P19</a>|<a href="https://www.wikidata.org/entity/Q1741">Q1741</a>|Wien}}, † {{WD|<a href="https://www.wikidata.org/entity/P570">P570</a>|1873-12-17|17. 12. 1873}} {{WD|<a href="https://www.wikidata.org/entity/P20">P20</a>|<a href="https://www.wikidata.org/entity/Q146351">Q146351</a>|Reichenberg (Liberec)}}. Erzieher im Hause des Großindustriellen...</code>
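The annotated output can be turned back into structured statements with a small parser. The sketch below is only an illustration: it assumes the raw model output contains plain identifiers (for example `{{WD|P106|Q6051619|Publizist}}`) without the HTML links shown above, and that a template has either two fields (property, text) or three (property, value, text). The names `parse_annotations` and `strip_annotations` are hypothetical helpers, not part of the model.

```python
import re

# Matches one {{WD|...}} template; assumes templates are not nested.
WD_TEMPLATE = re.compile(r"\{\{WD\|([^{}]*)\}\}")


def parse_annotations(text):
    """Return a list of (property, value, surface_text) tuples."""
    statements = []
    for match in WD_TEMPLATE.finditer(text):
        # Split into at most three fields: property, value, surface text.
        parts = match.group(1).split("|", 2)
        prop = parts[0]
        surface = parts[-1]
        # Two-field templates carry no separate value, so reuse the text.
        value = parts[1] if len(parts) == 3 else surface
        statements.append((prop, value, surface))
    return statements


def strip_annotations(text):
    """Replace every template with its surface text to recover the input."""
    return WD_TEMPLATE.sub(lambda m: m.group(1).split("|", 2)[-1], text)


sample = (
    "{{WD|label|Anschiringer, Anton}}, {{WD|P106|Q6051619|Publizist}}, "
    "* {{WD|P569|1812}} {{WD|P19|Q1741|Wien}}, "
    "† {{WD|P570|1873-12-17|17. 12. 1873}} "
    "{{WD|P20|Q146351|Reichenberg (Liberec)}}."
)

statements = parse_annotations(sample)
plain = strip_annotations(sample)

for prop, value, surface in statements:
    print(prop, value, surface, sep="\t")
```

Stripping the templates and comparing the result with the original entry is a cheap sanity check that the model did not alter the text while annotating it.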

<h2>Evaluation</h2>

After training on the dataset of BLGBL, vol. I, the model reaches a loss of **0.3878**.

A more relevant measure is how many valid statements the model extracts from the input. The evaluation was performed on 100 unseen entries from BLGBL, vol. II.

| | Basic statements | Qualifier statements | Total |
|-|------------------|----------------------|-------|
| Ground truth | 1,209 | 572 | 1,781 |
| Valid statements by the model | 714 | 120 | 834 |
| Accuracy | 0.5906 | 0.2098 | 0.4683 |
| **Loss (1 − accuracy)** | **0.4094** | 0.7902 | **0.5317** |

In other words, the model correctly retrieves about 59% of the basic statements and 21% of the qualifiers, or about 47% of all statements combined.
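The ratios in the table can be recomputed from the raw counts. The following sketch uses only the numbers reported above; the variable and function names are illustrative, and "accuracy" here simply means valid statements divided by ground-truth statements.

```python
# Counts taken from the evaluation table above.
ground_truth = {"basic": 1209, "qualifier": 572}
valid = {"basic": 714, "qualifier": 120}


def accuracy(valid_count, truth_count):
    """Share of ground-truth statements the model reproduced, to 4 places."""
    return round(valid_count / truth_count, 4)


scores = {kind: accuracy(valid[kind], ground_truth[kind]) for kind in ground_truth}
scores["total"] = accuracy(sum(valid.values()), sum(ground_truth.values()))

# The "Loss" row of the table is the complement of each accuracy value.
losses = {kind: round(1 - score, 4) for kind, score in scores.items()}

print(scores)
print(losses)
```

Note that the total is weighted by the number of statements in each category, so it is not the plain average of the two per-category ratios.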

<h2>Acknowledgement</h2>

The model is a result of the project "Wikimedia versus traditional biographical encyclopedias. Overlaps, gaps, quality and future possibilities", funded by the <a href="https://meta.wikimedia.org/wiki/Grants:Programs/Wikimedia_Research_Fund/Wikimedia_versus_traditional_biographical_encyclopedias._Overlaps,_gaps,_quality_and_future_possibilities">Wikimedia Research Fund</a>.

Computational resources were provided by the <a href="https://www.e-infra.cz/">e-INFRA CZ project</a> (ID: 90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.