Medieval Latin Span-NER (Bi-Encoder Architecture)

This repository contains a custom span-based Named Entity Recognition (NER) model designed for Medieval Latin text such as historical charters and legal documents. Unlike standard token-level NER models, it handles overlapping entities and highly variable span lengths by using a custom bi-encoder approach.

Model Architecture

The model deviates from standard Hugging Face pipelines (AutoModelForTokenClassification) to effectively capture both short entities and long, descriptive boundary definitions (e.g., properties and legal clauses).
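To make the contrast with token-level tagging concrete, the following minimal sketch enumerates every candidate span up to a maximum width. Because each candidate is scored on its own, a short name and a longer phrase containing it can both be kept. The function name `enumerate_spans`, the `max_width` parameter, and the example tokens are illustrative assumptions, not taken from the repository.

```python
from typing import List, Tuple


def enumerate_spans(num_tokens: int, max_width: int) -> List[Tuple[int, int]]:
    """Return all (start, end) token spans, end inclusive, up to max_width tokens."""
    spans = []
    for start in range(num_tokens):
        # Clamp the last end index so spans never run past the sequence.
        last_end = min(start + max_width, num_tokens)
        for end in range(start, last_end):
            spans.append((start, end))
    return spans


# Hypothetical charter fragment, pre-split into tokens for illustration.
tokens = ["Heinricus", "abbas", "monasterii", "sancti", "Petri"]
spans = enumerate_spans(len(tokens), max_width=3)

# A token-level tagger assigns one tag per token; span candidates may overlap,
# so (0, 0) "Heinricus" and (0, 1) "Heinricus abbas" can both be scored.
for start, end in spans[:3]:
    print(start, end, " ".join(tokens[start:end + 1]))
```

The number of candidates grows with both sequence length and `max_width`, so the width limit is the main knob for trading coverage of long spans against compute.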

Key architectural features include:

Text Encoder: FacebookAI/xlm-roberta-large serves as the primary contextual sequence encoder.
Label Encoder: BAAI/bge-m3 is utilized as a frozen semantic label encoder to map rich textual label descriptions into a dense semantic space.
Span Representation Layer: Uses Multi-Head Attention to pool sequence outputs across generated token spans, supplemented by learned span-width embeddings.
Hybrid Loss Engine: Combines a tamed Dynamic Focal Loss (to suppress inlier majority classes without gradient starvation) and Dice Loss (for boundary smoothing).
Contrastive Learning: Utilizes an InfoNCE loss branch with Hard Negative Mining (20% ratio) to push semantic representations of spans toward their corresponding label embeddings in the latent space.
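The bi-encoder idea behind the components above can be sketched in a few lines: each candidate span is pooled into one vector and compared against one vector per label. This is a simplified illustration under stated assumptions, not the repository's implementation. Mean pooling stands in for the multi-head attention pooling, span-width embeddings and all loss terms are omitted, and the toy 2-d vectors replace real encoder outputs. The names `mean_pool`, `cosine`, `score_spans`, `max_width`, and `threshold` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def mean_pool(token_vectors: List[Vector], start: int, end: int) -> Vector:
    """Average token vectors over the inclusive span [start, end]."""
    window = token_vectors[start:end + 1]
    dim = len(window[0])
    return [sum(vec[i] for vec in window) / len(window) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_spans(
    token_vectors: List[Vector],
    label_vectors: Dict[str, Vector],
    max_width: int,
    threshold: float,
) -> List[Tuple[int, int, str, float]]:
    """Return (start, end, label, score) for spans whose best label clears threshold."""
    results = []
    n = len(token_vectors)
    for start in range(n):
        for end in range(start, min(start + max_width, n)):
            span_vec = mean_pool(token_vectors, start, end)
            # Pick the label whose embedding is closest to the span embedding.
            label, score = max(
                ((name, cosine(span_vec, vec)) for name, vec in label_vectors.items()),
                key=lambda pair: pair[1],
            )
            if score >= threshold:
                results.append((start, end, label, score))
    return results


# Toy 2-d vectors standing in for text-encoder and label-encoder outputs.
token_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
label_vectors = {"PER": [1.0, 0.0], "LOC": [0.0, 1.0]}
predictions = score_spans(token_vectors, label_vectors, max_width=2, threshold=0.95)
for start, end, label, score in predictions:
    print(start, end, label, round(score, 3))
```

Because the label vectors are computed once from the label descriptions and can be cached, scoring a span against all labels reduces to a similarity lookup. In this toy run several overlapping spans pass the threshold, so a real system would still need a decoding step to choose among competing candidates.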

Evaluation and Ablation Results

The model was evaluated on a custom Medieval Latin dataset using two metrics that capture different failure modes:

Overlap F1: Measures span-level semantic coverage (does the model find the core entity?).
Exact F1: Measures strict boundary precision (does the model correctly identify the exact start and end tokens?).
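The difference between the two metrics is easiest to see on a small example. The sketch below is one plausible reading of the definitions above, not the evaluation script used for the reported numbers. It assumes spans are `(start, end, label)` tuples with inclusive token indices, and that a prediction counts as an overlap hit when it shares at least one token with a gold span of the same label. The function names are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end inclusive, label)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, 0.0 when both are zero."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def exact_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """A prediction is correct only if start, end, and label all match."""
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return f1(precision, recall)


def overlaps(a: Span, b: Span) -> bool:
    """True if the spans share a label and at least one token."""
    return a[2] == b[2] and a[0] <= b[1] and b[0] <= a[1]


def overlap_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """A prediction is correct if it overlaps any same-label gold span."""
    pred_hits = sum(any(overlaps(p, g) for g in gold) for p in pred)
    gold_hits = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = pred_hits / len(pred) if pred else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    return f1(precision, recall)


# The ACTOR prediction is one token too long; the LOC prediction is exact.
gold = {(0, 1, "ACTOR"), (4, 4, "LOC")}
pred = {(0, 2, "ACTOR"), (4, 4, "LOC")}
print(round(exact_f1(gold, pred), 2), round(overlap_f1(gold, pred), 2))  # prints: 0.5 1.0
```

A boundary error of a single token costs a full hit under Exact F1 but nothing under Overlap F1, which is why the two scores can diverge widely on long spans.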

Full Model Performance:

Overlap F1: 83.4%
Exact F1: 67.7%

Label Dictionary

The model is trained to recognize 19 distinct entity classes relevant to medieval diplomatics:

PER: Individual person name.
ACTOR: Full noun phrase referring to a person (name, title, profession, origin).
TITLE: Social rank, noble title, or ecclesiastical office.
REL: Word or phrase indicating family, kinship, or social relationship.
LOC: Geographical place, settlement, city, or diocese.
INS: Monastery, abbey, church, or religious order.
NAT: Natural landscape feature (river, mountain, forest).
EST: Physical plot of land, estate, farm, or vineyard.
PROP: Detailed boundary description of a property.
LEG: Legal clause declaring rights, conditions, or penalties.
TRANS: Verb or phrase denoting a core transaction or donation.
TIM: Time period, duration, or regnal year.
DAT: Specific calendar date or liturgical feast day.
MON: Money, currency, coin, or monetary value.
TAX: Customary toll, legal tax, or tribute.
COM: Harvested crops, physical goods, or traded animals.
NUM: Number written as a word or Roman numeral.
MEA: Unit of measurement for land, volume, or weight.
RELIC: Holy relic, cross, altar, or sacred object.

How to Use

Because this model uses a custom architecture, it cannot be loaded using the standard pipeline() API. You must download the architecture script (span_ner_model.py) alongside the model weights.
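A minimal sketch of that download step, assuming the `huggingface_hub` package is installed and that `span_ner_model.py` sits at the top level of the model repository. The repository ID is the one shown on the model page; the helper names `fetch_repo` and `load_architecture_module` are hypothetical. Since this card does not name the classes defined in the script or the weight file, the sketch only fetches the files and imports the module.

```python
import importlib
import sys

# Repository id as shown on the model page.
REPO_ID = "ERCDiDip/medieval-latin-span-ner"


def fetch_repo(repo_id: str = REPO_ID) -> str:
    """Download all files in the model repo and return the local folder path."""
    # Imported here so the helper below stays usable without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


def load_architecture_module(local_dir: str, module_name: str = "span_ner_model"):
    """Import the architecture script (span_ner_model.py) from local_dir."""
    if local_dir not in sys.path:
        sys.path.insert(0, local_dir)
    return importlib.import_module(module_name)


# Example usage (needs network access):
#   folder = fetch_repo()
#   arch = load_architecture_module(folder)
#   print([name for name in dir(arch) if not name.startswith("_")])
```

Listing the module's public names is a quick way to find the model class to instantiate. Importing the script executes it, so it is worth reading the file before loading it.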

Requirements

pip install torch transformers huggingface_hub

Dataset

Named Entity Recognition (NER) Dataset for Medieval Latin Charters from Monasterium.net: https://zenodo.org/records/19009431


Model tree for ERCDiDip/medieval-latin-span-ner

Base model: FacebookAI/xlm-roberta-large
Fine-tuned: this model