Paper: Multilingual Autoregressive Entity Linking (arXiv:2103.12528)
impresso-project/nel-mgenre-multilingual-light
The Impresso multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval), proposed by De Cao et al., a sequence-to-sequence architecture for entity disambiguation built on mBART. It uses constrained generation to output entity names that map to Wikidata QIDs.
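To make the idea of constrained generation concrete, here is a minimal sketch of a prefix tree (trie) that restricts each decoding step to tokens that continue a known entity name. This is a toy illustration, not the model's actual implementation: the entity list, the whitespace tokenisation, the `<eos>` marker, and the function names `build_trie` and `allowed_next_tokens` are all assumptions made for the example.

```python
from typing import Dict, List, Sequence

# Toy list of allowed entity names (illustrative only). A real system would
# store subword token IDs for a very large set of entity titles.
ALLOWED_NAMES = [
    "Alfred Dreyfus >> fr",
    "Alfred Nobel >> fr",
    "Paris >> fr",
]

END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(names: Sequence[str]) -> Dict:
    """Build a nested-dict prefix tree over whitespace-separated tokens."""
    trie: Dict = {}
    for name in names:
        node = trie
        for token in name.split() + [END]:
            node = node.setdefault(token, {})
    return trie


def allowed_next_tokens(trie: Dict, prefix: Sequence[str]) -> List[str]:
    """Return the tokens that may follow `prefix` (empty if the prefix is invalid)."""
    node = trie
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return sorted(node.keys())


trie = build_trie(ALLOWED_NAMES)
print(allowed_next_tokens(trie, []))          # ['Alfred', 'Paris']
print(allowed_next_tokens(trie, ["Alfred"]))  # ['Dreyfus', 'Nobel']
```

At each step the decoder may only pick from the returned tokens, so every finished sequence is guaranteed to be one of the allowed names. In the `transformers` library, a lookup of this kind is typically plugged into `generate` through its `prefix_allowed_tokens_fn` argument, operating on token IDs rather than strings.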
This model was adapted for historical texts and fine-tuned on the HIPE-2022 dataset, which includes a variety of historical document types and languages.
mGENRE checkpoint: facebook/mgenre-wiki

The model was trained on the following datasets:
| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---|---|---|---|---|---|---|
| ajmc | link | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | AjMC | |
| hipe2020 | link | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | CLEF-HIPE-2020 | |
| topres19th | link | historical newspapers | en | NERC-Coarse, EL | Living with Machines | |
| newseye | link | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | NewsEye | |
| sonar | link | historical newspapers | de | NERC-Coarse, EL | SoNAR | |
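```python
from transformers import AutoTokenizer, pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual-light"

# Load the tokenizer and the custom "generic-nel" pipeline defined in the model repository.
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)
nel_pipeline = pipeline(
    "generic-nel",
    model=NEL_MODEL_NAME,
    tokenizer=nel_tokenizer,
    trust_remote_code=True,  # needed because the pipeline code is loaded from the model repo
    device="cpu",
)

# The mention to link is wrapped in [START] ... [END]; the misspellings presumably mimic OCR noise.
sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris, accusé d'espionnage pour l'Allemagne — un événement qui déch1ra la société fr4nçaise pendant des années."
print(nel_pipeline(sentence))
```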
```
[
  {'surface': 'Dreyfvs', 'wkd_pred': 'Alfred Dreyfus >> fr ', 'type': 'UNK', 'confidence_nel': 100.0, 'lOffset': 24, 'rOffset': 33},
  {'surface': 'Dreyfvs', 'wkd_pred': 'Alfred Dreyfus >> fr', 'type': 'UNK', 'confidence_nel': 41.0, 'lOffset': 24, 'rOffset': 33},
  {'surface': 'Dreyfvs', 'wkd_pred': 'Alfred Dreyfuss >> fr ', 'type': 'UNK', 'confidence_nel': 38.0, 'lOffset': 24, 'rOffset': 33},
  {'surface': 'Dreyfvs', 'wkd_pred': 'Alfred Dreyfus >> fr ', 'type': 'UNK', 'confidence_nel': 26.0, 'lOffset': 24, 'rOffset': 33},
  {'surface': 'Dreyfvs', 'wkd_pred': 'Alfred Dreyfw >> fr ', 'type': 'UNK', 'confidence_nel': 24.0, 'lOffset': 24, 'rOffset': 33}
]
```
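The pipeline returns several candidate predictions for the same mention. The entity type is UNK because this model performs linking only and was not trained to predict entity types. The confidence_nel score indicates the model's confidence in each prediction; in the output above it ranges from 24.0 to 100.0, which suggests a percentage scale.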
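As a small post-processing sketch, the candidates can be reduced to the single highest-confidence prediction and the `wkd_pred` string split into a title and a language code. The helper names `parse_wkd_pred` and `best_prediction` are hypothetical, the `Title >> lang` format is inferred from the sample output rather than from a documented schema, and the sentence is shortened here for readability. Only the first three candidates from the sample output are reproduced.

```python
from typing import Dict, List, Tuple

# Shortened version of the example sentence (same prefix, so offsets still apply).
sentence = "Le 0ctobre 1894, [START] Dreyfvs [END] est arrêté à Paris"

# First three candidates copied from the sample output.
predictions: List[Dict] = [
    {"surface": "Dreyfvs", "wkd_pred": "Alfred Dreyfus >> fr ", "type": "UNK",
     "confidence_nel": 100.0, "lOffset": 24, "rOffset": 33},
    {"surface": "Dreyfvs", "wkd_pred": "Alfred Dreyfus >> fr", "type": "UNK",
     "confidence_nel": 41.0, "lOffset": 24, "rOffset": 33},
    {"surface": "Dreyfvs", "wkd_pred": "Alfred Dreyfuss >> fr ", "type": "UNK",
     "confidence_nel": 38.0, "lOffset": 24, "rOffset": 33},
]


def parse_wkd_pred(wkd_pred: str) -> Tuple[str, str]:
    """Split a 'Title >> lang' string into (title, language code)."""
    title, _, lang = wkd_pred.partition(">>")
    return title.strip(), lang.strip()


def best_prediction(preds: List[Dict]) -> Dict:
    """Return the candidate with the highest confidence_nel score."""
    return max(preds, key=lambda p: p["confidence_nel"])


best = best_prediction(predictions)
title, lang = parse_wkd_pred(best["wkd_pred"])
print(title, lang, best["confidence_nel"])  # Alfred Dreyfus fr 100.0

# In this example lOffset/rOffset delimit the text between the two markers.
span = sentence[best["lOffset"]:best["rOffset"]]
print(repr(span.strip()))  # 'Dreyfvs'
```

Stripping whitespace before comparing titles matters here, because the sample output contains both `'Alfred Dreyfus >> fr '` and `'Alfred Dreyfus >> fr'` for what is evidently the same entity.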