Medieval Latin NER (Student Model)
1. Model Description
This model is a fine-tuned XLM-RoBERTa-base specialized for Named Entity Recognition (NER) on Medieval Latin historical texts. It was trained to recognize 19 distinct historical, legal, and geographic entity types commonly found in medieval documents.
The model was developed using Knowledge Distillation. It is a lightweight "Student" model, distilled from a larger "Teacher" SpanNER model (ERCDiDip/medieval-latin-span-ner, trained on 20 charters), making it faster and more efficient for large-scale processing while maintaining high accuracy.
- Organization: ERCDiDip
- Model Type: Token Classification
- Base Model: xlm-roberta-base
- Language: Latin (Medieval)
2. Entity Types (Labels)
The model follows the BIO (Begin, Inside, Outside) tagging scheme for the following categories:
| Tag | Description |
|---|---|
| PER | Individual person names (given or family names). |
| ACTOR | Person names including titles, professions, or social status. |
| TITLE | Social rank, noble titles, or ecclesiastical offices (e.g., comes, episcopus). |
| REL | Kinship or social relationships (e.g., filius, uxor). |
| LOC | Geographical places, cities, or settlements. |
| INS | Corporate bodies like monasteries, abbeys, or churches. |
| NAT | Natural features (rivers, forests, mountains). |
| EST | Physical plots of land, farms, or meadows. |
| PROP | Detailed boundary descriptions of properties. |
| LEG | Legal clauses, penalties, and commands. |
| TRANS | Core transaction verbs (e.g., dedit, confirmavit). |
| TIM | General time periods or indictions. |
| DAT | Specific calendar dates or liturgical feasts. |
| MON | Currencies and monetary values (e.g., libra, solidus). |
| TAX | Tolls, tithes, or taxes. |
| COM | Commodities, crops, or animals. |
| NUM | Numbers and Roman numerals. |
| MEA | Units of measurement (e.g., mansus, aratrum). |
| RELIC | Holy relics and sacred objects. |
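Under the BIO scheme, each token is tagged `B-<TYPE>` at the start of an entity, `I-<TYPE>` inside one, or `O` outside any entity. The sketch below shows hypothetical BIO labels for a short charter phrase and how they collapse into entity spans; the token/tag pairs are illustrative, not actual model output.

```python
# Hypothetical BIO labeling of a short charter phrase (illustrative only).
tokens = ["Jacobus", "filius", "Nicolai", "de", "villa", "Sancta", "Maria"]
tags   = ["B-PER",   "B-REL",  "B-PER",   "O",  "O",     "B-LOC",  "I-LOC"]

def bio_to_spans(tokens, tags):
    """Collapse BIO tags into (entity_type, text) spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], [tok]]          # start a new entity
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)              # continue the open entity
        else:
            if current:
                spans.append(current)
            current = None                      # "O" (or malformed I-) closes it
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

print(bio_to_spans(tokens, tags))
# → [('PER', 'Jacobus'), ('REL', 'filius'), ('PER', 'Nicolai'), ('LOC', 'Sancta Maria')]
```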
3. Evaluation Results
The model was evaluated on a held-out test set. It shows high performance on frequent entities such as people, locations, and titles.
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| ACTOR | 0.10 | 0.18 | 0.13 | 11 |
| COM | 0.63 | 0.92 | 0.75 | 13 |
| DAT | 0.60 | 0.71 | 0.65 | 34 |
| EST | 0.80 | 0.92 | 0.85 | 190 |
| INS | 0.76 | 0.89 | 0.82 | 247 |
| LEG | 0.39 | 0.50 | 0.44 | 42 |
| LOC | 0.91 | 0.94 | 0.93 | 1099 |
| MEA | 1.00 | 0.57 | 0.73 | 7 |
| MON | 0.53 | 1.00 | 0.70 | 8 |
| NAT | 0.57 | 0.74 | 0.64 | 34 |
| NUM | 0.76 | 0.94 | 0.84 | 144 |
| PER | 0.93 | 0.97 | 0.95 | 1148 |
| REL | 0.87 | 0.96 | 0.91 | 264 |
| TAX | 0.90 | 0.93 | 0.92 | 29 |
| TIM | 0.43 | 0.58 | 0.50 | 96 |
| TITLE | 0.88 | 0.94 | 0.91 | 1019 |
| TRANS | 0.51 | 0.67 | 0.58 | 27 |
| Micro Avg | 0.85 | 0.92 | 0.89 | 4412 |
| Macro Avg | 0.68 | 0.79 | 0.72 | 4412 |
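The micro average weights every prediction equally (so frequent classes like PER and LOC dominate), while the macro average is the unweighted mean of the per-class scores. The reported macro F1 of 0.72 can be reproduced directly from the F1 column above:

```python
# Per-class F1 scores copied from the evaluation table above.
per_class_f1 = {
    "ACTOR": 0.13, "COM": 0.75, "DAT": 0.65, "EST": 0.85, "INS": 0.82,
    "LEG": 0.44, "LOC": 0.93, "MEA": 0.73, "MON": 0.70, "NAT": 0.64,
    "NUM": 0.84, "PER": 0.95, "REL": 0.91, "TAX": 0.92, "TIM": 0.50,
    "TITLE": 0.91, "TRANS": 0.58,
}

# Macro average: every class counts equally, so rare classes such as
# ACTOR (support 11) drag it well below the micro average of 0.89.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 2))  # → 0.72
```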
4. How to Use
You can use this model with the Hugging Face pipeline:
```python
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="ERCDiDip/medieval-latin-ner",
    aggregation_strategy="simple",
)

text = "Jacobus filius Nicolai de villa Sancta Maria dedit unam marcam."
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.2f})")
```
5. Training Details
- Distillation: Pseudo-labels generated by ERCDiDip/medieval-latin-span-ner.
- Loss: Cross-Entropy with class weights (O-class weight: 0.05) to handle label imbalance.
- Optimizer: AdamW with learning rate 2e-5.
- Epochs: 20.
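Because almost every token in a charter is tagged `O`, the loss down-weights that class. The sketch below shows class-weighted cross-entropy in plain Python; only the O-class weight of 0.05 comes from the details above, while the label inventory, probabilities, and targets are illustrative.

```python
import math

# Illustrative label inventory; index 0 is the dominant "O" class.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

# Down-weight "O" as described above; entity classes keep weight 1.0.
class_weights = [0.05 if lab == "O" else 1.0 for lab in labels]

def weighted_cross_entropy(probs, targets, weights):
    """Class-weighted cross-entropy over a batch of tokens,
    normalized by the total target weight (as PyTorch's weighted
    CrossEntropyLoss does with reduction='mean')."""
    losses = [-weights[t] * math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(losses) / sum(weights[t] for t in targets)

# Two toy tokens: one confidently-"O" token, one "B-PER" token.
probs = [
    [0.90, 0.02, 0.02, 0.03, 0.03],  # mostly "O"
    [0.10, 0.70, 0.05, 0.10, 0.05],  # mostly "B-PER"
]
targets = [0, 1]  # gold labels: "O", "B-PER"

loss = weighted_cross_entropy(probs, targets, class_weights)
```

With the 0.05 weight, an error on an `O` token contributes far less to the gradient than an error on any entity token, which pushes the model toward recall on the rare entity classes.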
6. Limitations
- Does not support nested or overlapping entities (flat NER only).
- Performance is lower on very rare classes (e.g., ACTOR).
- Abbreviated Latin text should be expanded for best results.
7. Citation
If you use this model in your research, please cite the ERCDiDip project.