Medieval Latin NER (Student Model)

1. Model Description

This model is a fine-tuned XLM-RoBERTa-base specialized for Named Entity Recognition (NER) on Medieval Latin historical texts. It was trained to recognize 19 distinct historical, legal, and geographic entity types commonly found in medieval documents.

The model was developed using knowledge distillation. It is a lightweight "student" model distilled from a larger "teacher" SpanNER model (ERCDiDip/medieval-latin-span-ner, trained on 20 charters), making it faster and more efficient for large-scale processing while maintaining high accuracy.

  • Organization: ERCDiDip
  • Model Type: Token Classification
  • Base Model: xlm-roberta-base
  • Language: Latin (Medieval)

2. Entity Types (Labels)

The model follows the BIO (Begin, Inside, Outside) tagging scheme for the following categories:

| Tag | Description |
|-------|-------------|
| PER | Individual person names (given or family names). |
| ACTOR | Person names including titles, professions, or social status. |
| TITLE | Social rank, noble titles, or ecclesiastical offices (e.g., comes, episcopus). |
| REL | Kinship or social relationships (e.g., filius, uxor). |
| LOC | Geographical places, cities, or settlements. |
| INS | Corporate bodies like monasteries, abbeys, or churches. |
| NAT | Natural features (rivers, forests, mountains). |
| EST | Physical plots of land, farms, or meadows. |
| PROP | Detailed boundary descriptions of properties. |
| LEG | Legal clauses, penalties, and commands. |
| TRANS | Core transaction verbs (e.g., dedit, confirmavit). |
| TIM | General time periods or indictions. |
| DAT | Specific calendar dates or liturgical feasts. |
| MON | Currencies and monetary values (e.g., libra, solidus). |
| TAX | Tolls, tithes, or taxes. |
| COM | Commodities, crops, or animals. |
| NUM | Numbers and Roman numerals. |
| MEA | Units of measurement (e.g., mansus, aratrum). |
| RELIC | Holy relics and sacred objects. |
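Under the BIO scheme, each token is tagged `B-X` (begins an entity of type X), `I-X` (continues one), or `O` (outside any entity). A minimal decoder sketch (a hypothetical helper, not shipped with the model) that groups a token/tag sequence back into entity spans:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A B- tag (or a stray I- with a new type) starts a fresh span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Jacobus", "filius", "Nicolai", "dedit"]
tags = ["B-PER", "B-REL", "B-PER", "B-TRANS"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Jacobus'), ('REL', 'filius'), ('PER', 'Nicolai'), ('TRANS', 'dedit')]
```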

3. Evaluation Results

The model was evaluated on a held-out test set. Performance is high on frequent entity types such as PER, LOC, and TITLE, and noticeably lower on rare classes such as ACTOR.

| Entity | Precision | Recall | F1-Score | Support |
|-----------|-----------|--------|----------|---------|
| ACTOR | 0.10 | 0.18 | 0.13 | 11 |
| COM | 0.63 | 0.92 | 0.75 | 13 |
| DAT | 0.60 | 0.71 | 0.65 | 34 |
| EST | 0.80 | 0.92 | 0.85 | 190 |
| INS | 0.76 | 0.89 | 0.82 | 247 |
| LEG | 0.39 | 0.50 | 0.44 | 42 |
| LOC | 0.91 | 0.94 | 0.93 | 1099 |
| MEA | 1.00 | 0.57 | 0.73 | 7 |
| MON | 0.53 | 1.00 | 0.70 | 8 |
| NAT | 0.57 | 0.74 | 0.64 | 34 |
| NUM | 0.76 | 0.94 | 0.84 | 144 |
| PER | 0.93 | 0.97 | 0.95 | 1148 |
| REL | 0.87 | 0.96 | 0.91 | 264 |
| TAX | 0.90 | 0.93 | 0.92 | 29 |
| TIM | 0.43 | 0.58 | 0.50 | 96 |
| TITLE | 0.88 | 0.94 | 0.91 | 1019 |
| TRANS | 0.51 | 0.67 | 0.58 | 27 |
| Micro Avg | 0.85 | 0.92 | 0.89 | 4412 |
| Macro Avg | 0.68 | 0.79 | 0.72 | 4412 |
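The F1-Score column is the harmonic mean of precision and recall; any row can be reproduced from the other two columns. For example, the PER row:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# PER row: precision 0.93, recall 0.97
print(round(f1_score(0.93, 0.97), 2))  # 0.95
```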

4. How to Use

You can use this model with the Hugging Face pipeline:

```python
from transformers import pipeline

# "simple" aggregation merges word pieces back into whole-entity spans
ner_pipeline = pipeline(
    "ner",
    model="ERCDiDip/medieval-latin-ner",
    aggregation_strategy="simple",
)

text = "Jacobus filius Nicolai de villa Sancta Maria dedit unam marcam."
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2f})")
```
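Given the weaker scores on rare classes, filtering predictions by confidence is often worthwhile downstream. A sketch assuming results in the pipeline's aggregated output format (the entries and scores below are illustrative, not real model output):

```python
# Example output in the pipeline's aggregated format (scores illustrative)
results = [
    {"entity_group": "PER", "word": "Jacobus", "score": 0.99},
    {"entity_group": "REL", "word": "filius", "score": 0.97},
    {"entity_group": "PER", "word": "Nicolai", "score": 0.98},
    {"entity_group": "TRANS", "word": "dedit", "score": 0.61},
    {"entity_group": "MON", "word": "unam marcam", "score": 0.55},
]

# Keep only confident predictions; the threshold is a tunable assumption
confident = [e for e in results if e["score"] >= 0.90]
print([e["word"] for e in confident])  # ['Jacobus', 'filius', 'Nicolai']
```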

5. Training Details

  • Distillation: Pseudo-labels generated by ERCDiDip/medieval-latin-span-ner.
  • Loss: Cross-Entropy with class weights (O-class weight: 0.05) to handle label imbalance.
  • Optimizer: AdamW with learning rate 2e-5.
  • Epochs: 20.
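The effect of the class weights can be illustrated in a few lines. The sketch below computes weighted cross-entropy for a single token with the O class down-weighted to 0.05, as used in training (a pure-Python illustration, not the actual training code; the probability values are made up):

```python
import math

# Per-class weights: O is down-weighted so the dominant non-entity
# class does not drown out the rare entity classes.
weights = {"O": 0.05, "B-PER": 1.0, "I-PER": 1.0}  # remaining BIO tags would also be 1.0

def weighted_ce(probs, gold_label):
    """Weighted cross-entropy for one token: -w[y] * log p(y)."""
    return -weights[gold_label] * math.log(probs[gold_label])

probs = {"O": 0.7, "B-PER": 0.2, "I-PER": 0.1}
# A missed entity token costs far more than a missed O token
print(round(weighted_ce(probs, "B-PER"), 3))  # 1.609
print(round(weighted_ce(probs, "O"), 3))      # 0.018
```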

6. Limitations

  • Does not support nested or overlapping entities (flat NER only).
  • Performance is lower on very rare classes (e.g., ACTOR).
  • Abbreviated Latin text should be expanded for best results.

7. Citation

If you use this model in your research, please cite the ERCDiDip project.
