SpanBERT-Coreference: Coreference Resolution for Portuguese Municipal Meeting Minutes

This model is a SpanBERT encoder fine-tuned for coreference resolution in the minutes of Portuguese municipal meetings. It identifies when different expressions in a document refer to the same real-world entity.

Model Description

SpanBERT-PT-Coreference uses a mention-ranking architecture (Lee et al., 2017) on top of a SpanBERT encoder with continued pre-training on Portuguese, and is fine-tuned for the linguistic and formal structure of Portuguese administrative minutes. It detects clusters of coreferent mentions across 17 entity categories, which enables consistent pseudonymization of sensitive personal data throughout long documents.

Key Features

  • 🏛️ Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes
  • 🔗 Coreference-Aware Pseudonymization: Links multiple mentions of the same entity, ensuring consistent ID assignment across the document
  • 📄 Long Document Support: Handles documents of up to ~28,000 tokens via sliding window encoding with stride overlap
  • ⚙️ Mention-Ranking Architecture: Scores all antecedent candidates for each mention span and clusters them via Union-Find

Model Details

  • Base Model: SpanBERT (Portuguese continued pre-training)
  • Architecture: Mention-ranking coreference (Lee et al., 2017)
  • Parameters: ~110M (encoder) + ~25M (coreference heads)
  • Max Window Size: 512 tokens (stride: 128)
  • Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities, ~1.3M tokens)
  • Evaluation Metrics: MUC, B³, CEAF-e, CoNLL F1, LEA
  • Training Framework: PyTorch + Transformers
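
The window and stride settings above can be illustrated with a small sketch. It assumes that "stride: 128" means 128 tokens of overlap between consecutive windows (the convention of the `stride` argument in Hugging Face tokenizers); the card does not say this explicitly, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(n_tokens, window=512, overlap=128):
    """Return (start, end) token offsets (end exclusive) that cover n_tokens.

    Assumption: consecutive windows share `overlap` tokens, so each new
    window starts `window - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return windows


# A 1,000-token document needs three overlapping windows.
print(sliding_windows(1000))
# A document near the stated ~28,000-token limit.
print(len(sliding_windows(28_000)))
```

Under this reading, a token near a window boundary is encoded in two windows, and how the two encodings are merged is a separate design choice that the card does not describe.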

Entity Types

The model resolves coreference across 17 entity categories:

| Entity Type | Description | Example |
|---|---|---|
| PERSONAL-NAME | Proper names of individuals | João Silva → João → ele |
| PERSONAL-ADMIN | Administrative identifiers and process numbers | 5597/2023 → o processo |
| PERSONAL-POSITION | Professional roles and political positions | Presidente da Câmara → o Presidente |
| PERSONAL-ADDRESS | Addresses and street names | Rua das Flores n.º 10 → Rua das Flores |
| PERSONAL-LOCATION | Cities, parishes, or geographic locations | Porto → o concelho |
| PERSONAL-DATE | Dates of events or decisions | 20/05/2023 |
| PERSONAL-COMPANY | Companies or private legal entities | Construções & Filho, Lda |
| PERSONAL-INFO | Biographical or sensitive personal data | NIF, número de contribuinte |
| PERSONAL-DEGREE | Academic titles or professional degrees | Licenciatura de Psicologia |
| PERSONAL-TIME | References to specific times | 14:30h |
| PERSONAL-LICENSE | License plates or registration numbers | 48-RF-99 |
| PERSONAL-JOB | Person's profession or occupation | Professor |
| PERSONAL-VEHICLE | Vehicle identification and models | Mercedes-Benz Classe S |
| PERSONAL-FACULTY | Higher education institutions | Faculdade de Economia |
| PERSONAL-FAMILY | Kinship or family relationships | Marido |
| PERSONAL-ARTISTIC | Artistic names or pseudonyms | Pintura |
| PERSONAL-OTHER | Miscellaneous personal information | Dados diversos |

Cluster Distribution (Training Set)

| Entity Class | Number of Clusters |
|---|---|
| PERSONAL-ADMIN | 273 |
| PERSONAL-NAME | 259 |
| PERSONAL-POSITION | 174 |
| PERSONAL-ADDRESS | 115 |
| PERSONAL-LOCATION | 64 |
| PERSONAL-DATE | 49 |
| PERSONAL-OTHER | 16 |
| PERSONAL-INFO | 12 |
| PERSONAL-COMPANY | 10 |
| PERSONAL-DEGREE | 6 |
| PERSONAL-LICENSE | 6 |
| PERSONAL-VEHICLE | 4 |
| PERSONAL-JOB | 4 |
| PERSONAL-TIME | 3 |
| PERSONAL-FAMILY | 2 |
| PERSONAL-ARTISTIC | 1 |
| Total | 998 |

How It Works

The model uses a two-stage pipeline. First, entity spans are identified by a NER model (liaad/CitiLink-XLMR-Anonymization-pt). Then, the coreference model groups spans that refer to the same entity, assigning consistent IDs throughout the document.

INPUT:

João Paulo Rosinha Daniel apresentou o requerimento n.º 2683/20020627.
O pedido de João Paulo, referente ao processo 2683/20020627, foi aprovado.

OUTPUT:

<NAME-1> apresentou o requerimento n.º <ADMIN-1>.
O pedido de <NAME-1>, referente ao processo <ADMIN-1>, foi aprovado.
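
The second stage of this pipeline, turning clusters into consistent placeholders, can be sketched as follows. This is an illustrative sketch only: the function `pseudonymize`, the inclusive `(start, end, label)` span format, and the hand-written cluster ids are assumptions made for the example, not the project's actual code.

```python
def pseudonymize(tokens, spans, cluster_ids):
    """Replace each entity span with a placeholder such as <NAME-1>.

    tokens      : list of word-level tokens
    spans       : list of (start, end, label) with inclusive word indices
    cluster_ids : one cluster id per span; equal ids mean "same entity"
    """
    numbering = {}   # (label, cluster id) -> running number within the label
    counters = {}    # label -> how many distinct entities seen so far
    replace_at = {}  # span start -> (span end, placeholder)

    # Walk spans in document order so numbering follows first appearance.
    for (start, end, label), cluster in sorted(zip(spans, cluster_ids)):
        key = (label, cluster)
        if key not in numbering:
            counters[label] = counters.get(label, 0) + 1
            numbering[key] = counters[label]
        replace_at[start] = (end, f"<{label}-{numbering[key]}>")

    out, i = [], 0
    while i < len(tokens):
        if i in replace_at:
            end, placeholder = replace_at[i]
            out.append(placeholder)
            i = end + 1  # skip the remaining tokens of the span
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = ["João", "Paulo", "Rosinha", "Daniel", "apresentou", "o",
          "requerimento", "n.º", "2683/20020627", ".",
          "O", "pedido", "de", "João", "Paulo", "foi", "aprovado", "."]
spans = [(0, 3, "NAME"), (8, 8, "ADMIN"), (13, 14, "NAME")]
cluster_ids = [0, 1, 0]  # hypothetical output of the coreference stage

print(" ".join(pseudonymize(tokens, spans, cluster_ids)))
```

Numbering placeholders per label in order of first appearance keeps the output stable across runs, which is one plausible way to obtain the `<NAME-1>` / `<ADMIN-1>` pattern shown above.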

Architecture Details

The mention-ranking model represents each span as:

span_repr = [h_start ; h_end ; head_attn ; width_emb]
           = [768 ; 768 ; 768 ; 20] = 2324 dims
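
A minimal PyTorch sketch of this span representation is given below. The class name `SpanRepresentation`, the single-layer head scorer, and `max_width=30` are assumptions for illustration; only the four components and their sizes come from the description above.

```python
import torch
import torch.nn as nn


class SpanRepresentation(nn.Module):
    """Builds [h_start ; h_end ; head_attn ; width_emb] for one span."""

    def __init__(self, hidden=768, width_dim=20, max_width=30):
        super().__init__()
        self.head_scorer = nn.Linear(hidden, 1)            # token-level attention scores
        self.width_emb = nn.Embedding(max_width, width_dim)
        self.max_width = max_width

    def forward(self, hidden_states, start, end):
        # hidden_states: (seq_len, hidden); start and end are inclusive indices.
        span_tokens = hidden_states[start:end + 1]
        weights = torch.softmax(self.head_scorer(span_tokens).squeeze(-1), dim=0)
        head_attn = weights @ span_tokens                   # attention-weighted sum, (hidden,)
        width = min(end - start, self.max_width - 1)        # clamp very long spans
        width_vec = self.width_emb.weight[width]
        return torch.cat([hidden_states[start], hidden_states[end], head_attn, width_vec])


hidden_states = torch.randn(12, 768)  # stand-in for encoder output
span_repr = SpanRepresentation()(hidden_states, start=3, end=5)
print(span_repr.shape)  # torch.Size([2324])
```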

Antecedent scoring uses:

score(i,j) = mention_score(i) + mention_score(j) + antecedent_score([vi; vj; vi*vj; dist_emb])
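
The scoring formula can be sketched in PyTorch as below. The hidden size of 150, the ten distance buckets, the log-scale bucketing, and the class name `PairScorer` are all assumptions chosen for the example; the card only specifies the three summed terms and the pair features.

```python
import torch
import torch.nn as nn


class PairScorer(nn.Module):
    """score(i, j) = mention_score(i) + mention_score(j) + antecedent_score(pair)."""

    def __init__(self, span_dim=2324, dist_dim=20, n_buckets=10, ffnn=150):
        super().__init__()
        self.n_buckets = n_buckets
        self.dist_emb = nn.Embedding(n_buckets, dist_dim)
        self.mention_score = nn.Sequential(
            nn.Linear(span_dim, ffnn), nn.ReLU(), nn.Linear(ffnn, 1))
        self.antecedent_score = nn.Sequential(
            nn.Linear(3 * span_dim + dist_dim, ffnn), nn.ReLU(), nn.Linear(ffnn, 1))

    def forward(self, v_i, v_j, distance):
        # Assumed bucketing: roughly log2 of the distance, clamped to the last bucket.
        bucket = min(int(distance).bit_length(), self.n_buckets - 1)
        pair = torch.cat([v_i, v_j, v_i * v_j, self.dist_emb.weight[bucket]])
        total = self.mention_score(v_i) + self.mention_score(v_j) + self.antecedent_score(pair)
        return total.squeeze(-1)


scorer = PairScorer()
v_i, v_j = torch.randn(2324), torch.randn(2324)  # stand-ins for two span vectors
with torch.no_grad():
    pair_score = scorer(v_i, v_j, distance=9)
print(pair_score.shape)  # torch.Size([])
```

The element-wise product `v_i * v_j` gives the scorer a direct similarity signal between the two spans, alongside the raw vectors and the distance feature.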

Clusters are built with Union-Find: two spans are merged only if they belong to the same NER class and their pair score is above a threshold.
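
The clustering step can be sketched as follows. The function `cluster_mentions`, the example scores, and the threshold value of 0.0 are hypothetical; this sketch links every same-class pair whose score clears the threshold, which may differ from how the released code selects antecedents.

```python
def cluster_mentions(spans, pair_scores, threshold=0.0):
    """Group spans with Union-Find.

    spans       : list of (start, end, label)
    pair_scores : dict mapping (i, j) span-index pairs to a score
    Returns one cluster id per span (the index of the cluster's first span).
    """
    parent = list(range(len(spans)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the earliest span as root

    for (i, j), score in pair_scores.items():
        same_class = spans[i][2] == spans[j][2]
        if same_class and score > threshold:
            union(i, j)

    return [find(i) for i in range(len(spans))]


spans = [(0, 3, "NAME"), (8, 8, "ADMIN"), (13, 14, "NAME"), (21, 21, "ADMIN")]
pair_scores = {
    (2, 0): 3.1,   # "João Paulo" -> "João Paulo Rosinha Daniel"
    (3, 1): 2.4,   # second process number -> first process number
    (2, 1): 5.0,   # NAME vs ADMIN: ignored despite the high score
    (3, 0): -1.2,  # below threshold and different class
}
print(cluster_mentions(spans, pair_scores))  # [0, 1, 0, 1]
```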

Results

Overall Performance (Test Set)

| Metric | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|
| MUC | 90.20 | 92.80 | 91.50 |
| B³ | 86.70 | 92.50 | 89.50 |
| CEAF-e | 89.20 | 74.40 | 81.20 |
| CoNLL F1 | - | - | 87.40 |
| LEA | 64.80 | 88.20 | 74.80 |
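
As a quick consistency check, the CoNLL F1 is conventionally the unweighted mean of the MUC, B³, and CEAF-e F1 scores, and each F1 is the harmonic mean of precision and recall. The short sketch below recomputes both from the reported figures; small differences are expected because the table values are rounded.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F1 values as reported in the table: MUC, B³, CEAF-e.
reported_f1 = [91.50, 89.50, 81.20]
conll_f1 = sum(reported_f1) / len(reported_f1)
print(round(conll_f1, 2))  # 87.4

# Recomputing MUC F1 from its precision and recall.
print(round(f1(90.20, 92.80), 2))
```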

Usage

Quick Start

The simplest way to use the model:

import torch
import json
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import hf_hub_download

model_id = "inesctec/CitiLink-SpanBERT-Coreference-pt"

# Load tokenizer and encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder   = AutoModel.from_pretrained(model_id)

# Load coreference heads and config
cabecas_path   = hf_hub_download(repo_id=model_id, filename="cabecas_coref.pt")
coref_cfg_path = hf_hub_download(repo_id=model_id, filename="coref_config.json")

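with open(coref_cfg_path, encoding="utf-8") as f:
    coref_cfg = json.load(f)

# weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust
cabecas = torch.load(cabecas_path, map_location="cpu", weights_only=False)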

# Input text, already split into word-level tokens (punctuation separated)
tokens = [
    "João", "Paulo", "Rosinha", "Daniel", "apresentou", "o",
    "requerimento", "n.º", "2683/20020627", ".",
    "O", "pedido", "de", "João", "Paulo", "foi", "aprovado",
    "ao", "abrigo", "do", "processo", "2683/20020627", "."
]

# NER spans already extracted by a NER model — (start, end, class)
ner_spans = [
    (0,  3,  "NAME"),   # "João Paulo Rosinha Daniel"
    (8,  8,  "ADMIN"),  # "2683/20020627"
    (13, 14, "NAME"),   # "João Paulo"
    (21, 21, "ADMIN"),  # "2683/20020627" (second occurrence)
]

# Tokenize for encoder
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

# Run encoder
with torch.no_grad():
    outputs = encoder(**inputs)

print("Coreference model loaded successfully.")
print(f"Hidden states shape: {outputs.last_hidden_state.shape}")
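
The NER spans above use word indices, while the encoder output is indexed by subword. One way to bridge the two is the word-to-subword alignment that fast tokenizers expose through `inputs.word_ids()`. The helper below is a sketch under that assumption; the name `word_spans_to_subword_spans` and the hand-written `word_ids` list are hypothetical and only stand in for real tokenizer output.

```python
def word_spans_to_subword_spans(word_ids, spans):
    """Map (start_word, end_word, label) spans to subword index spans.

    word_ids : list with one entry per subword, giving the index of the word
               it came from, or None for special tokens
    spans    : list of (start, end, label) with inclusive word indices
    """
    first, last = {}, {}
    for idx, wid in enumerate(word_ids):
        if wid is None:
            continue                 # skip special tokens such as [CLS] and [SEP]
        first.setdefault(wid, idx)   # first subword of the word
        last[wid] = idx              # last subword of the word
    return [(first[s], last[e], label) for s, e, label in spans]


# Hypothetical alignment: word 0 splits into 2 subwords, word 2 into 3.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
spans = [(0, 1, "NAME"), (2, 2, "ADMIN")]
print(word_spans_to_subword_spans(word_ids, spans))
# [(1, 3, 'NAME'), (4, 6, 'ADMIN')]
```

With the Quick Start variables, the call would be `word_spans_to_subword_spans(inputs.word_ids(), ner_spans)`. If the tokenizer truncates the input, words beyond the limit have no subwords and the lookup raises `KeyError`, so long documents need to be windowed first.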

Limitations

  • Domain Specificity: Best performance on administrative/governmental meeting minutes
  • Language: Trained exclusively on European Portuguese (PT-PT)
  • Window Size: Processes documents via 512-token sliding windows; very long-range coreference (>5,000 tokens apart) may be missed
  • NER Dependency: Requires entity spans from a NER model as input — does not perform span detection independently
  • Same-class Restriction: Only links mentions of the same NER entity class (e.g., NAME to NAME, never NAME to POSITION)
  • Dataset Size: Fine-tuned on 61 training documents; performance on highly specific sub-domains may vary

Version: 1.0
Last Updated: 2026-06-25


License

This project uses a custom dual-license based on AGPL v3.

See the full license terms here: LICENSE
