biography2wikidata / README.md
metadata
license: cc-by-nc-sa-4.0
language:
  - cs
  - de
pipeline_tag: translation
tags:
  - code
  - wikidata
  - biography
  - semantic_annotation
  - entity_linking
base_model: google/mt5-small
inference:
  parameters:
    max_new_tokens: 100

A model for annotating entries in biographical dictionaries using Wikidata entities. Based on Google's mT5.
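A minimal sketch of running the model with the Hugging Face `transformers` library, which also needs `torch` and `sentencepiece` installed. The repository id `daelba/biography2wikidata` is an assumption that this README does not state explicitly and should be checked against the actual model page; `annotate` is a hypothetical helper name. The generation limit mirrors `max_new_tokens: 100` from the metadata.

```python
# Assumed repository id; verify it on the model page before use.
MODEL_ID = "daelba/biography2wikidata"

# Taken from the `inference.parameters` entry in the model card metadata.
MAX_NEW_TOKENS = 100


def annotate(text: str, model_id: str = MODEL_ID) -> str:
    """Return the model's annotated version of one dictionary entry."""
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Encode the raw entry and let the seq2seq model generate the annotated text.
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the model on first use):
# print(annotate("Anschiringer, Anton, Publizist, * 1812 Wien, ..."))
```

With a limit of 100 new tokens, the output for long entries may be cut off, so raising the limit or splitting entries may be worth trying.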

Example input text:

Anschiringer, Anton, Publizist, * 1812 Wien, † 17. 12. 1873 Reichenberg (Liberec). Erzieher im Hause des Großindustriellen...

Example output text:

{{WD|label|Anschiringer, Anton}}, {{WD|P106|Q6051619|Publizist}}, * {{WD|P569|1812}} {{WD|P19|Q1741|Wien}}, † {{WD|P570|1873-12-17|17. 12. 1873}} {{WD|P20|Q146351|Reichenberg (Liberec)}}. Erzieher im Hause des Großindustriellen...
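The output wraps each recognized span in a `{{WD|...}}` template. The sketch below reads such output back into structured records. It assumes that a two-field template such as `{{WD|P569|1812}}` uses the same string as both value and surface text, and that the last field of a three-field template is the original surface text. These rules are inferred from the single example above, not from a documented specification, and `parse_annotations` and `strip_annotations` are hypothetical helper names.

```python
import re
from typing import List, NamedTuple

# Matches one {{WD|...}} template without nested braces.
WD_TEMPLATE = re.compile(r"\{\{WD\|([^{}]*)\}\}")


class Annotation(NamedTuple):
    prop: str   # Wikidata property id, or "label" for the entry headword
    value: str  # Wikidata item id or literal value (e.g. a date)
    text: str   # surface text as it appears in the dictionary entry


def parse_annotations(annotated: str) -> List[Annotation]:
    """Extract all well-formed WD templates from the model output."""
    result = []
    for match in WD_TEMPLATE.finditer(annotated):
        fields = match.group(1).split("|")
        if len(fields) == 2:
            prop, value = fields
            text = value  # assumed: value and surface text coincide
        elif len(fields) == 3:
            prop, value, text = fields
        else:
            continue  # skip malformed templates
        result.append(Annotation(prop, value, text))
    return result


def strip_annotations(annotated: str) -> str:
    """Replace each template with its last field to recover plain text."""
    return WD_TEMPLATE.sub(lambda m: m.group(1).split("|")[-1], annotated)
```

Applied to the example output, `strip_annotations` gives back the example input, which is a cheap sanity check that the model has not altered the source text.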

Evaluation

After training on the BLGBL, vol. I dataset, the model reaches a loss of 0.3878.

A more informative measure is how many valid statements the model extracts from the input. This was evaluated on 100 unseen entries from BLGBL, vol. II.

| | Basic statements | Qualifier statements | Total |
|---|---|---|---|
| Ground truth | 1,209 | 572 | 1,781 |
| Valid statements by the model | 714 | 120 | 834 |
| Accuracy | 0.5906 | 0.2098 | 0.4683 |
| Loss | 0.4094 | 0.7902 | 0.5317 |

In other words, the model correctly retrieves about 59% of the basic statements and 21% of the qualifiers, or about 47% of all statements combined.
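The accuracy and loss rows follow directly from the counts in the table: accuracy is the number of valid statements divided by the ground-truth count, and the loss row is its complement. A short sketch of that arithmetic, with the variable and function names chosen here only for illustration:

```python
# Counts from the evaluation table: (ground truth, valid statements by the model).
counts = {
    "basic": (1209, 714),
    "qualifier": (572, 120),
}
# The total column is the sum of the two statement types.
counts["total"] = (
    sum(truth for truth, _ in counts.values()),
    sum(valid for _, valid in counts.values()),
)


def accuracy(truth: int, valid: int) -> float:
    """Share of ground-truth statements the model reproduced, to 4 decimals."""
    return round(valid / truth, 4)


def loss(truth: int, valid: int) -> float:
    """Share of ground-truth statements the model missed, to 4 decimals."""
    return round(1 - valid / truth, 4)


for name, (truth, valid) in counts.items():
    print(f"{name}: accuracy={accuracy(truth, valid)} loss={loss(truth, valid)}")
```

Note that this "loss" is a miss rate on extracted statements and is unrelated to the training loss of 0.3878 reported above.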

Acknowledgement

The model is the result of the project "Wikimedia versus traditional biographical encyclopedias. Overlaps, gaps, quality and future possibilities", funded by the Wikimedia Research Fund.

Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.