SpanBERT-Coreference: Coreference Resolution for Portuguese Municipal Meeting Minutes

This model is a fine-tuned SpanBERT for coreference resolution in the minutes of Portuguese municipal meetings. It identifies when different expressions in a document refer to the same real-world entity.

Model Description

SpanBERT-PT-Coreference uses a mention-ranking architecture (Lee et al., 2017) on top of a SpanBERT encoder pre-trained on Portuguese text, and is optimized for the linguistic and formal structure of administrative minutes in Portugal. The model detects clusters of coreferent mentions across 17 entity categories, which enables consistent pseudonymization of sensitive personal data throughout long documents.

Key Features

🏛️ Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes
🔗 Coreference-Aware Pseudonymization: Links multiple mentions of the same entity, ensuring consistent ID assignment across the document
📄 Long Document Support: Handles documents of up to ~28,000 tokens via sliding-window encoding with overlapping windows
⚙️ Mention-Ranking Architecture: Scores all antecedent candidates for each mention span and clusters them via Union-Find

Model Details

Base Model: SpanBERT (Portuguese continued pre-training)
Architecture: Mention-ranking coreference (Lee et al., 2017)
Parameters: ~110M (encoder) + ~25M (coreference heads)
Max Window Size: 512 tokens (stride: 128)
Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities, ~1.3M tokens)
Evaluation Metrics: MUC, B³, CEAF-e, CoNLL F1, LEA
Training Framework: PyTorch + Transformers
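
The 512-token window and 128-token stride listed above can be illustrated with a small sketch of how window boundaries might be computed. This is not the model's actual preprocessing code. It assumes "stride" means the number of tokens shared by consecutive windows (the convention used by the `stride` argument of Hugging Face tokenizers), so each window starts 384 tokens after the previous one. The function name `sliding_windows` and the parameter name `overlap` are hypothetical.

```python
def sliding_windows(n_tokens, window=512, overlap=128):
    """Return (start, end) token offsets, end exclusive, covering n_tokens.

    Assumption: `overlap` is the number of tokens shared by two
    consecutive windows, so the step between window starts is
    window - overlap.
    """
    if n_tokens <= 0:
        return []
    step = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # last window reaches the end of the document
            break
        start += step
    return spans


# A 1,000-token document is covered by three overlapping windows.
print(sliding_windows(1000))
```

If the stride is instead meant as the step between window starts, the same loop applies with `step` set to 128 directly.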

Entity Types

The model resolves coreference across 17 entity categories:

Entity Type	Description	Example
`PERSONAL-NAME`	Proper names of individuals	João Silva → João → ele
`PERSONAL-ADMIN`	Administrative identifiers and process numbers	5597/2023 → o processo
`PERSONAL-POSITION`	Professional roles and political positions	Presidente da Câmara → o Presidente
`PERSONAL-ADDRESS`	Addresses and street names	Rua das Flores n.º 10 → Rua das Flores
`PERSONAL-LOCATION`	Cities, parishes, or geographic locations	Porto → o concelho
`PERSONAL-DATE`	Dates of events or decisions	20/05/2023
`PERSONAL-COMPANY`	Companies or private legal entities	Construções & Filho, Lda
`PERSONAL-INFO`	Biographical or sensitive personal data	NIF, número de contribuinte
`PERSONAL-DEGREE`	Academic titles or professional degrees	Licenciatura de Psicologia
`PERSONAL-TIME`	References to specific times	14:30h
`PERSONAL-LICENSE`	License plates or registration numbers	48-RF-99
`PERSONAL-JOB`	Person's profession or occupation	Professor
`PERSONAL-VEHICLE`	Vehicle identification and models	Mercedes-Benz Classe S
`PERSONAL-FACULTY`	Higher education institutions	Faculdade de Economia
`PERSONAL-FAMILY`	Kinship or family relationships	Marido
`PERSONAL-ARTISTIC`	Artistic names or pseudonyms	Pintura
`PERSONAL-OTHER`	Miscellaneous personal information	Dados diversos

Cluster Distribution (Training Set)

Entity Class	Number of Clusters
`PERSONAL-ADMIN`	273
`PERSONAL-NAME`	259
`PERSONAL-POSITION`	174
`PERSONAL-ADDRESS`	115
`PERSONAL-LOCATION`	64
`PERSONAL-DATE`	49
`PERSONAL-OTHER`	16
`PERSONAL-INFO`	12
`PERSONAL-COMPANY`	10
`PERSONAL-DEGREE`	6
`PERSONAL-LICENSE`	6
`PERSONAL-VEHICLE`	4
`PERSONAL-JOB`	4
`PERSONAL-TIME`	3
`PERSONAL-FAMILY`	2
`PERSONAL-ARTISTIC`	1
Total	998

How It Works

The model uses a two-stage pipeline. First, entity spans are identified by a NER model (liaad/CitiLink-XLMR-Anonymization-pt). Then, the coreference model groups spans that refer to the same entity, assigning consistent IDs throughout the document.

INPUT:

João Paulo Rosinha Daniel apresentou o requerimento n.º 2683/20020627.
O pedido de João Paulo, referente ao processo 2683/20020627, foi aprovado.

OUTPUT:

<NAME-1> apresentou o requerimento n.º <ADMIN-1>.
O pedido de <NAME-1>, referente ao processo <ADMIN-1>, foi aprovado.
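
The replacement step shown above can be sketched as follows, assuming the coreference stage has already produced clusters. The input format (a list of `(entity_class, mentions)` pairs with inclusive token indices) and the function name `pseudonymize` are hypothetical, chosen only for illustration.

```python
def pseudonymize(tokens, clusters):
    """Replace every mention of a cluster with one shared placeholder.

    clusters: list of (entity_class, [(start, end), ...]) where start/end
    are inclusive token indices (hypothetical format).
    """
    counters = {}      # entity class -> number of clusters seen so far
    replacement = {}   # mention start index -> (end index, placeholder)
    for entity_class, mentions in clusters:
        counters[entity_class] = counters.get(entity_class, 0) + 1
        tag = f"<{entity_class}-{counters[entity_class]}>"
        for start, end in mentions:
            replacement[start] = (end, tag)

    out, i = [], 0
    while i < len(tokens):
        if i in replacement:
            end, tag = replacement[i]
            out.append(tag)
            i = end + 1   # skip the remaining tokens of the mention
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = ["João", "Paulo", "Rosinha", "Daniel", "apresentou", "o",
          "requerimento", "n.º", "2683/20020627", ".",
          "O", "pedido", "de", "João", "Paulo", ",", "referente", "ao",
          "processo", "2683/20020627", ",", "foi", "aprovado", "."]
clusters = [("NAME", [(0, 3), (13, 14)]), ("ADMIN", [(8, 8), (19, 19)])]
print(" ".join(pseudonymize(tokens, clusters)))
```

Because the placeholder is assigned per cluster rather than per mention, the full name and its shortened form receive the same ID.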

Architecture Details

The mention-ranking model represents each span as:

span_repr = [h_start ; h_end ; head_attn ; width_emb]
           = [768 ; 768 ; 768 ; 20] = 2324 dims
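
A minimal PyTorch sketch of this span representation is given below, to make the 2324-dimensional concatenation concrete. It is not the released implementation: the class name `SpanRepresentation`, the single linear layer used for head attention, and `max_width=30` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SpanRepresentation(nn.Module):
    """Builds [h_start ; h_end ; head_attn ; width_emb] for each span."""

    def __init__(self, hidden=768, width_dim=20, max_width=30):
        super().__init__()
        self.attn = nn.Linear(hidden, 1)                     # assumed head-attention scorer
        self.width_emb = nn.Embedding(max_width, width_dim)  # assumed width vocabulary
        self.max_width = max_width

    def forward(self, hidden_states, spans):
        # hidden_states: (seq_len, hidden); spans: inclusive (start, end) pairs
        reprs = []
        for start, end in spans:
            toks = hidden_states[start:end + 1]                # (width, hidden)
            weights = torch.softmax(self.attn(toks).squeeze(-1), dim=0)
            head = weights @ toks                              # attention-weighted sum
            width = min(end - start, self.max_width - 1)       # clip long spans
            w = self.width_emb(torch.tensor(width))            # (width_dim,)
            reprs.append(torch.cat([toks[0], toks[-1], head, w]))
        return torch.stack(reprs)                              # (n_spans, 2324)


hidden_states = torch.randn(23, 768)   # stand-in for encoder output
span_reprs = SpanRepresentation()(hidden_states, [(0, 3), (8, 8)])
print(span_reprs.shape)
```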

Antecedent scoring uses:

score(i,j) = mention_score(i) + mention_score(j) + antecedent_score([vi; vj; vi*vj; dist_emb])
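
Here `vi` and `vj` are the span representations of mention i and candidate antecedent j, and `vi*vj` is read as their element-wise product. The sketch below shows one way the three terms could be combined in PyTorch. The layer sizes, the number of distance buckets, and the clipping-based bucketing are assumptions, not values taken from the released model (Lee et al., 2017 use logarithmic distance buckets).

```python
import torch
import torch.nn as nn


class PairScorer(nn.Module):
    """score(i, j) = mention(i) + mention(j) + antecedent([vi; vj; vi*vj; dist])."""

    def __init__(self, span_dim=2324, dist_dim=20, n_buckets=10, hidden=150):
        super().__init__()
        self.mention = nn.Sequential(
            nn.Linear(span_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.dist_emb = nn.Embedding(n_buckets, dist_dim)
        self.antecedent = nn.Sequential(
            nn.Linear(3 * span_dim + dist_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.n_buckets = n_buckets

    def forward(self, v_i, v_j, distance):
        # Assumed bucketing: clip the mention distance to the last bucket.
        bucket = torch.tensor(min(distance, self.n_buckets - 1))
        pair = torch.cat([v_i, v_j, v_i * v_j, self.dist_emb(bucket)])
        total = self.mention(v_i) + self.mention(v_j) + self.antecedent(pair)
        return total.squeeze(-1)   # scalar score


scorer = PairScorer()
v_i, v_j = torch.randn(2324), torch.randn(2324)
score = scorer(v_i, v_j, distance=3)
print(float(score))
```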

Clusters are built via Union-Find: two spans are merged only if they belong to the same NER class and their pairwise score exceeds a threshold.
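
The clustering rule can be sketched in plain Python as follows. The function name `cluster_mentions`, the dictionary format for pairwise scores, and the default threshold of 0.0 are assumptions for illustration, and the scores in the example are made up.

```python
def cluster_mentions(spans, pair_scores, threshold=0.0):
    """Group spans with Union-Find.

    spans: list of (start, end, ner_class)
    pair_scores: {(i, j): score} over span indices (hypothetical format)
    """
    parent = list(range(len(spans)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for (i, j), score in pair_scores.items():
        # Merge only same-class pairs whose score is above the threshold.
        if spans[i][2] == spans[j][2] and score > threshold:
            union(i, j)

    clusters = {}
    for idx in range(len(spans)):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())


spans = [(0, 3, "NAME"), (8, 8, "ADMIN"), (13, 14, "NAME"), (21, 21, "ADMIN")]
scores = {(0, 2): 3.1, (1, 3): 2.4, (0, 1): 5.0, (1, 2): -1.0}  # illustrative values
print(cluster_mentions(spans, scores))
```

The pair `(0, 1)` has a high score but is never merged, because the two spans belong to different NER classes.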

Results

Overall Performance (Test Set)

Metric	Precision (%)	Recall (%)	F1 Score (%)
MUC	90.20	92.80	91.50
B³	86.70	92.50	89.50
CEAF-e	89.20	74.40	81.20
CoNLL F1	—	—	87.40
LEA	64.80	88.20	74.80
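
The CoNLL F1 row has no precision or recall because it is, by the standard definition of the metric, the unweighted mean of the MUC, B³, and CEAF-e F1 scores. The short check below reproduces the reported 87.40 from the three F1 values in the table.

```python
# F1 scores copied from the results table
f1 = {"MUC": 91.50, "B3": 89.50, "CEAF-e": 81.20}

# CoNLL F1 is the unweighted mean of the three F1 scores
conll_f1 = sum(f1.values()) / len(f1)
print(f"CoNLL F1: {conll_f1:.2f}")
```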

Usage

Quick Start

The snippet below loads the tokenizer, the encoder, and the coreference heads, then runs the encoder on a pre-tokenized example:

import torch
import json
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import hf_hub_download

model_id = "liaad/CitiLink-SpanBERT-Coreference-pt"

# Load tokenizer and encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder   = AutoModel.from_pretrained(model_id)

# Load coreference heads and config
cabecas_path   = hf_hub_download(repo_id=model_id, filename="cabecas_coref.pt")
coref_cfg_path = hf_hub_download(repo_id=model_id, filename="coref_config.json")

with open(coref_cfg_path) as f:
    coref_cfg = json.load(f)

# weights_only=False unpickles arbitrary Python objects; only load files you trust
cabecas = torch.load(cabecas_path, map_location="cpu", weights_only=False)

# Input text, already split into word-level tokens (punctuation is separate)
tokens = [
    "João", "Paulo", "Rosinha", "Daniel", "apresentou", "o",
    "requerimento", "n.º", "2683/20020627", ".",
    "O", "pedido", "de", "João", "Paulo", "foi", "aprovado",
    "ao", "abrigo", "do", "processo", "2683/20020627", "."
]

# NER spans already extracted by a NER model, as (start, end, class) with inclusive token indices
ner_spans = [
    (0,  3,  "NAME"),   # "João Paulo Rosinha Daniel"
    (8,  8,  "ADMIN"),  # "2683/20020627"
    (13, 14, "NAME"),   # "João Paulo"
    (21, 21, "ADMIN"),  # "2683/20020627" (second occurrence)
]

# Tokenize for encoder
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

# Run encoder
with torch.no_grad():
    outputs = encoder(**inputs)

print("Coreference model loaded successfully.")
print(f"Hidden states shape: {outputs.last_hidden_state.shape}")
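
The encoder output above is indexed by subword, while `ner_spans` is indexed by word, so the spans need to be mapped before span representations can be built. A possible sketch of that mapping is shown below. It assumes a fast tokenizer, for which `inputs.word_ids()` returns one word index per subword position (and `None` for special tokens). The helper name `word_spans_to_subword_spans` is hypothetical, and the `word_ids` list in the example is hand-written rather than produced by the model's tokenizer.

```python
def word_spans_to_subword_spans(word_ids, spans):
    """Map inclusive word-level spans to inclusive subword-level spans.

    word_ids: list as returned by BatchEncoding.word_ids(); None marks
              special tokens such as [CLS] and [SEP].
    spans: list of (start_word, end_word, label)
    """
    first, last = {}, {}
    for pos, wid in enumerate(word_ids):
        if wid is None:
            continue
        first.setdefault(wid, pos)   # first subword of the word
        last[wid] = pos              # last subword of the word
    return [(first[s], last[e], label) for s, e, label in spans]


# Hand-written example: word 0 has two subwords, word 2 has three.
example_word_ids = [None, 0, 0, 1, 2, 2, 2, None]
example_spans = [(0, 1, "NAME"), (2, 2, "ADMIN")]
print(word_spans_to_subword_spans(example_word_ids, example_spans))
```

With the quick-start variables, the call would be `word_spans_to_subword_spans(inputs.word_ids(), ner_spans)`.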

Limitations

Domain Specificity: Best performance on administrative/governmental meeting minutes
Language: Trained exclusively on European Portuguese (PT-PT)
Window Size: Processes documents via 512-token sliding windows; very long-range coreference (>5,000 tokens apart) may be missed
NER Dependency: Requires entity spans from a NER model as input; it does not detect spans on its own
Same-class Restriction: Only links mentions of the same NER entity class (e.g., NAME to NAME, never NAME to POSITION)
Dataset Size: Fine-tuned on 61 training documents; performance on highly specific sub-domains may vary

Version: 1.0
Last Updated: 2026-06-25

License

This project uses a custom dual license based on AGPL v3.

See the full license terms here: LICENSE

Model repository: liaad/CitiLink-SpanBERT-Coreference-pt
Model size: 0.1B params (Safetensors, tensor type F32)
Base model: neuralmind/bert-base-portuguese-cased