OldBERTur: Named Entity Recognition for Normalised Old Icelandic

This model performs Named Entity Recognition (NER) on normalised Old Icelandic texts sourced from medieval manuscripts, identifying Person and Location entities. Please note that while the model is fully functional, this model card will be updated with supplementary information in the near future.

Model Description

  • Model type: Token classification (NER)
  • Base model: mideind/IceBERT
  • Language: Old Icelandic (normalised transcription level)
  • Entity types: Person, Location (BIO tagging scheme)
  • F1 Score: 0.93

This model is fine-tuned from IceBERT for NER, designed for normalised Old Icelandic texts as defined in the Menota normalised transcription level description.

For diplomatic transcriptions of Old Icelandic texts, a companion model (coming soon) should be used instead.

Intended Uses

  • Named entity recognition in normalised Old Icelandic texts
  • Digital humanities research on Medieval Icelandic literature
  • Semi-automatic annotation of historical Icelandic documents
  • Information extraction from saga literature and historical texts

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Riksarkivet/oldbertur-normalised-old-icelandic-ner")
model = AutoModelForTokenClassification.from_pretrained("Riksarkivet/oldbertur-normalised-old-icelandic-ner")

# Use aggregation_strategy="first" to properly combine subword tokens
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = "Í þann tíma var hǫfðingi ágǽtr á Íslandi í Ísafirði, er Vermundr hét"
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.3f})")

Expected output:

Íslandi: Location (1.000)
Ísafirði,: Location (1.000)
Vermundr: Person (1.000)
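The aggregation_strategy="first" setting merges subword predictions at the word level: each word receives the label predicted for its first subword, and the labels of later subwords are discarded. The toy sketch below (hypothetical data, not real model output) illustrates that merging rule:

```python
def aggregate_first(subwords, word_ids, labels):
    """Merge subword-level labels into word-level labels, keeping
    only the label of each word's FIRST subword."""
    words, word_labels = [], []
    prev_id = None
    for sw, wid, lab in zip(subwords, word_ids, labels):
        if wid != prev_id:                 # first subword of a new word
            words.append(sw.lstrip("#"))
            word_labels.append(lab)
            prev_id = wid
        else:                              # continuation subword: merge text, drop label
            words[-1] += sw.lstrip("#")
    return list(zip(words, word_labels))

# "Vermundr" split into two subwords; only the first subword's label counts.
subwords = ["Ver", "##mundr", "hét"]
word_ids = [0, 0, 1]
labels   = ["B-Person", "I-Person", "O"]
print(aggregate_first(subwords, word_ids, labels))
# [('Vermundr', 'B-Person'), ('hét', 'O')]
```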

Training Data

This model uses the (M + I)R + MIM training configuration, combining:

| Source | Description |
|---|---|
| Menota (M) | Normalised Old Icelandic texts from the Medieval Nordic Text Archive |
| IcePaHC (I) | Icelandic Parsed Historical Corpus (normalised Old Icelandic texts) |
| MIM-GOLD-NER (MIM) | Modern Icelandic NER data for data augmentation |

The superscript R means that sentence-level class resampling (sCR) was applied to both Menota and IcePaHC data. Since both sources use normalised orthography, both are resampled to address entity class imbalance.
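As an illustration only (the exact sCR procedure will be described in the forthcoming paper), sentence-level resampling can be sketched as oversampling sentences that contain the minority entity class; the `factor` parameter here is hypothetical:

```python
def resample_minority(sentences, minority_tag="Location", factor=3):
    """Duplicate sentences containing minority-class entities.
    Each sentence is a list of (token, BIO-tag) pairs."""
    resampled = []
    for sent in sentences:
        resampled.append(sent)
        # Oversample sentences that mention the minority class at all.
        if any(tag.endswith(minority_tag) for _, tag in sent):
            resampled.extend([sent] * (factor - 1))
    return resampled

toy = [[("var", "O")], [("Ísafirði", "B-Location")]]
print(len(resample_minority(toy, factor=3)))  # 4: the Location sentence appears 3 times
```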

Training set statistics:

| Configuration | Person | Location | Total |
|---|---|---|---|
| (M + I)R + MIM | 37,917 | 11,931 | 49,848 |

Entity breakdown by source:

| Source | Person | Location | Total |
|---|---|---|---|
| Menota (M) | 1,486 | 180 | 1,666 |
| IcePaHC (I) | 2,797 | 362 | 3,159 |
| (M + I)R after resampling | 22,330 | 2,929 | 25,259 |
| MIM-GOLD-NER | 15,587 | 9,002 | 24,589 |

Evaluation sets:

  • Dev: 26,419 tokens, 1,301 entities (1,036 Person; 265 Location)
  • Test: 25,893 tokens, 1,260 entities (997 Person; 263 Location)

The dev and test sets consist exclusively of Old Icelandic texts in order to reflect our target domain.

Evaluation Results

| Metric | Score |
|---|---|
| F1 | 0.93 |
| Precision | 0.90 |
| Recall | 0.95 |

Labels

The model uses BIO tagging with the following labels:

| Label | Description |
|---|---|
| O | Outside any entity |
| B-Person | Beginning of a person name |
| I-Person | Inside/continuation of a person name |
| B-Location | Beginning of a location name |
| I-Location | Inside/continuation of a location name |
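When working with raw token-level predictions rather than the pipeline, the BIO labels above can be decoded into entity spans; a minimal sketch:

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token and BIO-tag sequences into
    (entity_text, entity_type) spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):           # start of a new entity
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and ctype == tag[2:]:
            current.append(tok)            # continuation of the same entity
        else:                              # O tag (or type mismatch): close any open span
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

tokens = ["á", "Íslandi", "í", "Ísafirði", "er", "Vermundr", "hét"]
tags   = ["O", "B-Location", "O", "B-Location", "O", "B-Person", "O"]
print(bio_to_spans(tokens, tags))
# [('Íslandi', 'Location'), ('Ísafirði', 'Location'), ('Vermundr', 'Person')]
```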

Limitations

  • Orthography: This model is trained on normalised texts. For diplomatic transcriptions, use the diplomatic variant of this model.
  • Entity types: Only Person and Location entities are supported. Other entity types (organisations, dates, etc.) are not recognised due to scarcity in the training data.
  • Time period: Primarily trained on texts from 1250-1400 CE. Performance may vary on texts from other periods.
  • Domain: Optimised for saga literature and historical texts. May perform differently on other text types.

Training Procedure

NER is framed as a token classification task, with a classification head added on top of IceBERT.

Hyperparameters:

  • Base model: mideind/IceBERT
  • Epochs: 5
  • Learning rate: 2e-5
  • Batch size: 16
  • Max sequence length: 256 tokens
  • Warm-up ratio: 10%
  • Weight decay: 0.01
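The hyperparameters above map straightforwardly onto transformers.TrainingArguments; a sketch of the corresponding keyword arguments (argument names assume a recent transformers version; the maximum sequence length of 256 is applied at tokenisation time, not here):

```python
# Hyperparameters from this card expressed as TrainingArguments kwargs.
training_kwargs = dict(
    num_train_epochs=5,               # Epochs: 5
    learning_rate=2e-5,               # Learning rate: 2e-5
    per_device_train_batch_size=16,   # Batch size: 16
    warmup_ratio=0.1,                 # Warm-up ratio: 10%
    weight_decay=0.01,                # Weight decay: 0.01
)
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="oldbertur-ner", **training_kwargs)
```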

Class imbalance handling:

  • Weighted cross-entropy loss with class weights: 0.1 for the non-entity class (O) and 30.0 for each entity class.
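A minimal PyTorch sketch of this weighted loss, assuming the label order [O, B-Person, I-Person, B-Location, I-Location]:

```python
import torch
import torch.nn as nn

# Class weights from this card: 0.1 for O, 30.0 for each entity class.
class_weights = torch.tensor([0.1, 30.0, 30.0, 30.0, 30.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights, reduction="none")

logits = torch.zeros(2, 5)        # uniform (uninformative) predictions for 2 tokens
targets = torch.tensor([0, 1])    # one O token, one B-Person token
losses = loss_fn(logits, targets)
# With identical predictions, mislabelling an entity token costs
# 300x more (30.0 / 0.1) than mislabelling an O token.
```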

Citation

If you use this model, please cite:

@misc{coming_soon,
}

Resources

This model card is due to be updated in the near future with more information, along with the diplomatic NER model. For more information on our code and data, see our GitHub repository. For more information about the work in general, see our paper (coming soon).

Acknowledgments

We are grateful for the great work carried out by the projects below, and for making it possible for us to use their data in order to conduct our academic research and develop NER models for Medieval Icelandic. We thank developers, annotators, scholars, project managers, and anyone else who has contributed to these projects. We also express our sincerest gratitude to the students from Uppsala University who assisted in marking and annotating entities in the two Menota works Codex Wormianus (AM 242 fol) and Vǫluspá in Hauksbók (AM 544 4to).

License

This model is released under the GNU General Public License v3.0 (GPL-3.0).

Contact

For questions or issues, please open an issue on the GitHub repository or contact: phenningsson@me.com
