Medieval Latin NER (Student Model)

1. Model Description

This model is a fine-tuned XLM-RoBERTa-base specialized for Named Entity Recognition (NER) on Medieval Latin historical texts. It was trained to recognize 19 distinct historical, legal, and geographic entity types commonly found in medieval documents.

The model was developed using knowledge distillation. It is a lightweight "student" model distilled from a larger "teacher" SpanNER model (ERCDiDip/medieval-latin-span-ner, trained on 20 charters), making it faster and more efficient for large-scale processing while maintaining high accuracy.

  • Organization: ERCDiDip
  • Model Type: Token Classification
  • Base Model: xlm-roberta-base
  • Language: Latin (Medieval)

2. Entity Types (Labels)

The model follows the BIO (Begin, Inside, Outside) tagging scheme for the following categories:

| Tag | Description |
|-------|-------------|
| PER | Individual person names (given or family names). |
| ACTOR | Person names including titles, professions, or social status. |
| TITLE | Social rank, noble titles, or ecclesiastical offices (e.g., comes, episcopus). |
| REL | Kinship or social relationships (e.g., filius, uxor). |
| LOC | Geographical places, cities, or settlements. |
| INS | Corporate bodies like monasteries, abbeys, or churches. |
| NAT | Natural features (rivers, forests, mountains). |
| EST | Physical plots of land, farms, or meadows. |
| PROP | Detailed boundary descriptions of properties. |
| LEG | Legal clauses, penalties, and commands. |
| TRANS | Core transaction verbs (e.g., dedit, confirmavit). |
| TIM | General time periods or indictions. |
| DAT | Specific calendar dates or liturgical feasts. |
| MON | Currencies and monetary values (e.g., libra, solidus). |
| TAX | Tolls, tithes, or taxes. |
| COM | Commodities, crops, or animals. |
| NUM | Numbers and Roman numerals. |
| MEA | Units of measurement (e.g., mansus, aratrum). |
| RELIC | Holy relics and sacred objects. |
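Under the BIO scheme, each token is tagged `B-X` (begins an entity of type X), `I-X` (continues one), or `O` (outside any entity). A minimal decoder sketch (a hypothetical helper, not shipped with the model) that groups a token/tag sequence back into entity spans:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A B- tag (or a stray I- with a new type) starts a fresh span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Jacobus", "filius", "Nicolai", "dedit"]
tags = ["B-PER", "B-REL", "B-PER", "B-TRANS"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Jacobus'), ('REL', 'filius'), ('PER', 'Nicolai'), ('TRANS', 'dedit')]
```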

3. Evaluation Results

The model was evaluated on a held-out test set. Performance is high on frequent entity types such as PER, LOC, and TITLE, and noticeably lower on rare classes such as ACTOR.

| Entity | Precision | Recall | F1-Score | Support |
|-----------|-----------|--------|----------|---------|
| ACTOR | 0.10 | 0.18 | 0.13 | 11 |
| COM | 0.63 | 0.92 | 0.75 | 13 |
| DAT | 0.60 | 0.71 | 0.65 | 34 |
| EST | 0.80 | 0.92 | 0.85 | 190 |
| INS | 0.76 | 0.89 | 0.82 | 247 |
| LEG | 0.39 | 0.50 | 0.44 | 42 |
| LOC | 0.91 | 0.94 | 0.93 | 1099 |
| MEA | 1.00 | 0.57 | 0.73 | 7 |
| MON | 0.53 | 1.00 | 0.70 | 8 |
| NAT | 0.57 | 0.74 | 0.64 | 34 |
| NUM | 0.76 | 0.94 | 0.84 | 144 |
| PER | 0.93 | 0.97 | 0.95 | 1148 |
| REL | 0.87 | 0.96 | 0.91 | 264 |
| TAX | 0.90 | 0.93 | 0.92 | 29 |
| TIM | 0.43 | 0.58 | 0.50 | 96 |
| TITLE | 0.88 | 0.94 | 0.91 | 1019 |
| TRANS | 0.51 | 0.67 | 0.58 | 27 |
| Micro Avg | 0.85 | 0.92 | 0.89 | 4412 |
| Macro Avg | 0.68 | 0.79 | 0.72 | 4412 |
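The F1-Score column is the harmonic mean of precision and recall; any row can be reproduced from the other two columns. For example, the PER row:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# PER row: precision 0.93, recall 0.97
print(round(f1_score(0.93, 0.97), 2))  # 0.95
```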

4. How to Use

You can use this model with the Hugging Face pipeline:

```python
from transformers import pipeline

# "simple" aggregation merges word pieces back into whole-entity spans
ner_pipeline = pipeline(
    "ner",
    model="ERCDiDip/medieval-latin-ner",
    aggregation_strategy="simple",
)

text = "Jacobus filius Nicolai de villa Sancta Maria dedit unam marcam."
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2f})")
```
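Given the weaker scores on rare classes, filtering predictions by confidence is often worthwhile downstream. A sketch assuming results in the pipeline's aggregated output format (the entries and scores below are illustrative, not real model output):

```python
# Example output in the pipeline's aggregated format (scores illustrative)
results = [
    {"entity_group": "PER", "word": "Jacobus", "score": 0.99},
    {"entity_group": "REL", "word": "filius", "score": 0.97},
    {"entity_group": "PER", "word": "Nicolai", "score": 0.98},
    {"entity_group": "TRANS", "word": "dedit", "score": 0.61},
    {"entity_group": "MON", "word": "unam marcam", "score": 0.55},
]

# Keep only confident predictions; the threshold is a tunable assumption
confident = [e for e in results if e["score"] >= 0.90]
print([e["word"] for e in confident])  # ['Jacobus', 'filius', 'Nicolai']
```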

5. Training Details

  • Distillation: Pseudo-labels generated by ERCDiDip/medieval-latin-span-ner.
  • Loss: Cross-Entropy with class weights (O-class weight: 0.05) to handle label imbalance.
  • Optimizer: AdamW with learning rate 2e-5.
  • Epochs: 20.
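The effect of the class weights can be illustrated in a few lines. The sketch below computes weighted cross-entropy for a single token with the O class down-weighted to 0.05, as used in training (a pure-Python illustration, not the actual training code; the probability values are made up):

```python
import math

# Per-class weights: O is down-weighted so the dominant non-entity
# class does not drown out the rare entity classes.
weights = {"O": 0.05, "B-PER": 1.0, "I-PER": 1.0}  # remaining BIO tags would also be 1.0

def weighted_ce(probs, gold_label):
    """Weighted cross-entropy for one token: -w[y] * log p(y)."""
    return -weights[gold_label] * math.log(probs[gold_label])

probs = {"O": 0.7, "B-PER": 0.2, "I-PER": 0.1}
# A missed entity token costs far more than a missed O token
print(round(weighted_ce(probs, "B-PER"), 3))  # 1.609
print(round(weighted_ce(probs, "O"), 3))      # 0.018
```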

6. Limitations

  • Does not support nested or overlapping entities (flat NER only).
  • Performance is lower on very rare classes (e.g., ACTOR).
  • Abbreviated Latin text should be expanded for best results.

7. Citation

If you use this model in your research, please cite the ERCDiDip project.
