OldBERTur: Named Entity Recognition for Normalised Old Icelandic

This model performs Named Entity Recognition (NER) on normalised Old Icelandic texts sourced from medieval manuscripts, identifying Person and Location entities. Please note that while the model is fully functional, this model card will be updated with supplementary information in the near future.

Model Description

  • Model type: Token classification (NER)
  • Base model: mideind/IceBERT
  • Language: Old Icelandic (normalised transcription level)
  • Entity types: Person, Location (BIO tagging scheme)
  • F1 Score: 0.93

This model is fine-tuned from IceBERT for NER, designed for normalised Old Icelandic texts as defined in the Menota normalised transcription level description.

For diplomatic transcriptions of Old Icelandic texts, a companion model (coming soon) should be used instead.

Intended Uses

  • Named entity recognition in normalised Old Icelandic texts
  • Digital humanities research on Medieval Icelandic literature
  • Semi-automatic annotation of historical Icelandic documents
  • Information extraction from saga literature and historical texts

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Riksarkivet/oldbertur-normalised-old-icelandic-ner")
model = AutoModelForTokenClassification.from_pretrained("Riksarkivet/oldbertur-normalised-old-icelandic-ner")

# Use aggregation_strategy="first" to properly combine subword tokens
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = "Í þann tíma var hǫfðingi ágǽtr á Íslandi í Ísafirði, er Vermundr hét"
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.3f})")

Expected output:

Íslandi: Location (1.000)
Ísafirði,: Location (1.000)
Vermundr: Person (1.000)
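The aggregation_strategy="first" setting merges subword predictions at the word level: each word receives the label predicted for its first subword, and the labels of later subwords are discarded. The toy sketch below (hypothetical data, not real model output) illustrates that merging rule:

```python
def aggregate_first(subwords, word_ids, labels):
    """Merge subword-level labels into word-level labels, keeping
    only the label of each word's FIRST subword."""
    words, word_labels = [], []
    prev_id = None
    for sw, wid, lab in zip(subwords, word_ids, labels):
        if wid != prev_id:                 # first subword of a new word
            words.append(sw.lstrip("#"))
            word_labels.append(lab)
            prev_id = wid
        else:                              # continuation subword: merge text, drop label
            words[-1] += sw.lstrip("#")
    return list(zip(words, word_labels))

# "Vermundr" split into two subwords; only the first subword's label counts.
subwords = ["Ver", "##mundr", "hét"]
word_ids = [0, 0, 1]
labels   = ["B-Person", "I-Person", "O"]
print(aggregate_first(subwords, word_ids, labels))
# [('Vermundr', 'B-Person'), ('hét', 'O')]
```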

Training Data

This model uses the (M + I)R + MIM training configuration, combining:

| Source | Description |
|---|---|
| Menota (M) | Normalised Old Icelandic texts from the Medieval Nordic Text Archive |
| IcePaHC (I) | Icelandic Parsed Historical Corpus (normalised Old Icelandic texts) |
| MIM-GOLD-NER (MIM) | Modern Icelandic NER data for data augmentation |

The superscript R means that sentence-level class resampling (sCR) was applied to both Menota and IcePaHC data. Since both sources use normalised orthography, both are resampled to address entity class imbalance.
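As an illustration only (the exact sCR procedure will be described in the forthcoming paper), sentence-level resampling can be sketched as oversampling sentences that contain the minority entity class; the `factor` parameter here is hypothetical:

```python
def resample_minority(sentences, minority_tag="Location", factor=3):
    """Duplicate sentences containing minority-class entities.
    Each sentence is a list of (token, BIO-tag) pairs."""
    resampled = []
    for sent in sentences:
        resampled.append(sent)
        # Oversample sentences that mention the minority class at all.
        if any(tag.endswith(minority_tag) for _, tag in sent):
            resampled.extend([sent] * (factor - 1))
    return resampled

toy = [[("var", "O")], [("Ísafirði", "B-Location")]]
print(len(resample_minority(toy, factor=3)))  # 4: the Location sentence appears 3 times
```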

Training set statistics:

| Configuration | Person | Location | Total |
|---|---|---|---|
| (M + I)R + MIM | 37,917 | 11,931 | 49,848 |

Entity breakdown by source:

| Source | Person | Location | Total |
|---|---|---|---|
| Menota (M) | 1,486 | 180 | 1,666 |
| IcePaHC (I) | 2,797 | 362 | 3,159 |
| (M + I)R after resampling | 22,330 | 2,929 | 25,259 |
| MIM-GOLD-NER | 15,587 | 9,002 | 24,589 |

Evaluation sets:

  • Dev: 26,419 tokens, 1,301 entities (1,036 Person; 265 Location)
  • Test: 25,893 tokens, 1,260 entities (997 Person; 263 Location)

The dev and test sets consist exclusively of Old Icelandic texts in order to reflect our target domain.

Evaluation Results

| Metric | Score |
|---|---|
| F1 | 0.93 |
| Precision | 0.90 |
| Recall | 0.95 |

Labels

The model uses BIO tagging with the following labels:

| Label | Description |
|---|---|
| O | Outside any entity |
| B-Person | Beginning of a person name |
| I-Person | Inside/continuation of a person name |
| B-Location | Beginning of a location name |
| I-Location | Inside/continuation of a location name |
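When working with raw token-level predictions rather than the pipeline, the BIO labels above can be decoded into entity spans; a minimal sketch:

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token and BIO-tag sequences into
    (entity_text, entity_type) spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):           # start of a new entity
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and ctype == tag[2:]:
            current.append(tok)            # continuation of the same entity
        else:                              # O tag (or type mismatch): close any open span
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

tokens = ["á", "Íslandi", "í", "Ísafirði", "er", "Vermundr", "hét"]
tags   = ["O", "B-Location", "O", "B-Location", "O", "B-Person", "O"]
print(bio_to_spans(tokens, tags))
# [('Íslandi', 'Location'), ('Ísafirði', 'Location'), ('Vermundr', 'Person')]
```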

Limitations

  • Orthography: This model is trained on normalised texts. For diplomatic transcriptions, use the diplomatic variant of this model.
  • Entity types: Only Person and Location entities are supported. Other entity types (organisations, dates, etc.) are not recognised due to scarcity in the training data.
  • Time period: Primarily trained on texts from 1250-1400 CE. Performance may vary on texts from other periods.
  • Domain: Optimised for saga literature and historical texts. May perform differently on other text types.

Training Procedure

NER is framed as a token classification task, with a classification head added on top of IceBERT.

Hyperparameters:

  • Base model: mideind/IceBERT
  • Epochs: 5
  • Learning rate: 2e-5
  • Batch size: 16
  • Max sequence length: 256 tokens
  • Warm-up ratio: 10%
  • Weight decay: 0.01
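The hyperparameters above map straightforwardly onto transformers.TrainingArguments; a sketch of the corresponding keyword arguments (argument names assume a recent transformers version; the maximum sequence length of 256 is applied at tokenisation time, not here):

```python
# Hyperparameters from this card expressed as TrainingArguments kwargs.
training_kwargs = dict(
    num_train_epochs=5,               # Epochs: 5
    learning_rate=2e-5,               # Learning rate: 2e-5
    per_device_train_batch_size=16,   # Batch size: 16
    warmup_ratio=0.1,                 # Warm-up ratio: 10%
    weight_decay=0.01,                # Weight decay: 0.01
)
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="oldbertur-ner", **training_kwargs)
```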

Class imbalance handling:

  • Weighted cross-entropy loss with class weights: 0.1 for the non-entity class (O) and 30.0 for each entity class.
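A minimal PyTorch sketch of this weighted loss, assuming the label order [O, B-Person, I-Person, B-Location, I-Location]:

```python
import torch
import torch.nn as nn

# Class weights from this card: 0.1 for O, 30.0 for each entity class.
class_weights = torch.tensor([0.1, 30.0, 30.0, 30.0, 30.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights, reduction="none")

logits = torch.zeros(2, 5)        # uniform (uninformative) predictions for 2 tokens
targets = torch.tensor([0, 1])    # one O token, one B-Person token
losses = loss_fn(logits, targets)
# With identical predictions, mislabelling an entity token costs
# 300x more (30.0 / 0.1) than mislabelling an O token.
```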

Citation

If you use this model, please cite:

@misc{coming_soon,
}

Resources

This model card is due to be updated in the near future with more information, along with the diplomatic NER model. For more information on our code and data, see our GitHub repository. For more information about the work in general, see our paper (coming soon).

Acknowledgments

We are grateful for the great work carried out by the projects below, and for making it possible for us to use their data in order to conduct our academic research and develop NER models for Medieval Icelandic. We thank developers, annotators, scholars, project managers, and anyone else who has contributed to these projects. We also express our sincerest gratitude to the students from Uppsala University who assisted in marking and annotating entities in the two Menota works Codex Wormianus (AM 242 fol) and Vǫluspá in Hauksbók (AM 544 4to).

License

This model is released under the GNU General Public License v3.0 (GPL-3.0).

Contact

For questions or issues, please open an issue on the GitHub repository or contact: phenningsson@me.com
