SpanBERT-Coreference: Coreference Resolution for Portuguese Municipal Meeting Minutes
This model is a SpanBERT encoder fine-tuned for coreference resolution in the minutes of Portuguese municipal meetings: it identifies when different expressions in a document refer to the same real-world entity.
Model Description
SpanBERT-Coreference uses a mention-ranking architecture (Lee et al., 2017) on top of a SpanBERT encoder pre-trained on Portuguese text and fine-tuned for the linguistic and formal structure of administrative minutes in Portugal. The model groups coreferent mentions into clusters across 17 entity categories, which enables consistent pseudonymization of sensitive personal data throughout long documents.
Key Features
- 🏛️ Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes
- 🔗 Coreference-Aware Pseudonymization: Links multiple mentions of the same entity, ensuring consistent ID assignment across the document
- 📄 Long Document Support: Handles documents of up to ~28,000 tokens via sliding window encoding with stride overlap
- ⚙️ Mention-Ranking Architecture: Scores all antecedent candidates for each mention span and clusters them via Union-Find
Model Details
- Base Model: SpanBERT (Portuguese continued pre-training)
- Architecture: Mention-ranking coreference (Lee et al., 2017)
- Parameters: ~110M (encoder) + ~25M (coreference heads)
- Max Window Size: 512 tokens (stride: 128)
- Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities, ~1.3M tokens)
- Evaluation Metrics: MUC, B³, CEAF-e, CoNLL F1, LEA
- Training Framework: PyTorch + Transformers
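The window size and stride listed above imply an overlapping-window pass over long documents. The following is a minimal sketch of how such windows could be laid out, not the model's actual preprocessing. It assumes that "stride" means the number of tokens shared by consecutive windows (the convention of the Hugging Face tokenizer `stride` argument), which the card does not state; if it instead means the step between window starts, the step below becomes 128. The function name `sliding_windows` is hypothetical, and special tokens are ignored.

```python
def sliding_windows(n_tokens, window=512, overlap=128):
    """Return (start, end) token offsets, end exclusive, that cover n_tokens.

    Assumption: consecutive windows share `overlap` tokens, so each new
    window starts `window - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    if n_tokens <= 0:
        return []
    step = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:  # the last window reaches the document end
            break
        start += step
    return spans


windows = sliding_windows(1000)
print(windows)  # [(0, 512), (384, 896), (768, 1000)]
```

Under this reading, a mention near a window boundary is seen with context from both neighbouring windows, at the cost of encoding the shared tokens twice.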
Entity Types
The model resolves coreference across 17 entity categories:
| Entity Type | Description | Example |
|---|---|---|
| PERSONAL-NAME | Proper names of individuals | João Silva → João → ele |
| PERSONAL-ADMIN | Administrative identifiers and process numbers | 5597/2023 → o processo |
| PERSONAL-POSITION | Professional roles and political positions | Presidente da Câmara → o Presidente |
| PERSONAL-ADDRESS | Addresses and street names | Rua das Flores n.º 10 → Rua das Flores |
| PERSONAL-LOCATION | Cities, parishes, or geographic locations | Porto → o concelho |
| PERSONAL-DATE | Dates of events or decisions | 20/05/2023 |
| PERSONAL-COMPANY | Companies or private legal entities | Construções & Filho, Lda |
| PERSONAL-INFO | Biographical or sensitive personal data | NIF, número de contribuinte |
| PERSONAL-DEGREE | Academic titles or professional degrees | Licenciatura de Psicologia |
| PERSONAL-TIME | References to specific times | 14:30h |
| PERSONAL-LICENSE | License plates or registration numbers | 48-RF-99 |
| PERSONAL-JOB | Person's profession or occupation | Professor |
| PERSONAL-VEHICLE | Vehicle identification and models | Mercedes-Benz Classe S |
| PERSONAL-FACULTY | Higher education institutions | Faculdade de Economia |
| PERSONAL-FAMILY | Kinship or family relationships | Marido |
| PERSONAL-ARTISTIC | Artistic names or pseudonyms | Pintura |
| PERSONAL-OTHER | Miscellaneous personal information | Dados diversos |
Cluster Distribution (Training Set)
| Entity Class | Number of Clusters |
|---|---|
| PERSONAL-ADMIN | 273 |
| PERSONAL-NAME | 259 |
| PERSONAL-POSITION | 174 |
| PERSONAL-ADDRESS | 115 |
| PERSONAL-LOCATION | 64 |
| PERSONAL-DATE | 49 |
| PERSONAL-OTHER | 16 |
| PERSONAL-INFO | 12 |
| PERSONAL-COMPANY | 10 |
| PERSONAL-DEGREE | 6 |
| PERSONAL-LICENSE | 6 |
| PERSONAL-VEHICLE | 4 |
| PERSONAL-JOB | 4 |
| PERSONAL-TIME | 3 |
| PERSONAL-FAMILY | 2 |
| PERSONAL-ARTISTIC | 1 |
| Total | 998 |
How It Works
The model uses a two-stage pipeline. First, entity spans are identified by a NER model (liaad/CitiLink-XLMR-Anonymization-pt). Then, the coreference model groups spans that refer to the same entity, assigning consistent IDs throughout the document.
INPUT:
João Paulo Rosinha Daniel apresentou o requerimento n.º 2683/20020627.
O pedido de João Paulo, referente ao processo 2683/20020627, foi aprovado.
OUTPUT:
<NAME-1> apresentou o requerimento n.º <ADMIN-1>.
O pedido de <NAME-1>, referente ao processo <ADMIN-1>, foi aprovado.
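To make the second stage concrete, here is a minimal sketch of turning coreference clusters into the tagged output shown above. It is an illustration under stated assumptions, not the project's implementation: the function name `pseudonymize`, the `(start, end, class)` span format with inclusive token indices, and the per-class numbering by first appearance are all hypothetical choices.

```python
def pseudonymize(tokens, spans, cluster_ids):
    """Replace each span with a <CLASS-k> tag shared by its whole cluster.

    tokens: list of word strings
    spans: list of (start, end, cls) with inclusive token indices
    cluster_ids: one cluster id per span (same id = same entity)
    """
    next_index = {}   # class -> last number handed out for that class
    tag_of = {}       # (class, cluster id) -> tag string
    replacement = {}  # span start -> (span end, tag)
    # Visit spans in document order so numbering follows first appearance.
    for (start, end, cls), cid in sorted(zip(spans, cluster_ids)):
        key = (cls, cid)
        if key not in tag_of:
            next_index[cls] = next_index.get(cls, 0) + 1
            tag_of[key] = f"<{cls}-{next_index[cls]}>"
        replacement[start] = (end, tag_of[key])

    out, i = [], 0
    while i < len(tokens):
        if i in replacement:
            end, tag = replacement[i]
            out.append(tag)
            i = end + 1  # skip the rest of the replaced span
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = ("João Paulo Rosinha Daniel apresentou o requerimento n.º "
          "2683/20020627 . O pedido de João Paulo , referente ao processo "
          "2683/20020627 , foi aprovado .").split()
spans = [(0, 3, "NAME"), (8, 8, "ADMIN"), (13, 14, "NAME"), (19, 19, "ADMIN")]
cluster_ids = [0, 1, 0, 1]  # hand-written here; normally produced by the model
result = pseudonymize(tokens, spans, cluster_ids)
print(" ".join(result))
```

Because both NAME spans carry the same cluster id, the full name and the shortened "João Paulo" receive the same tag, which is the consistency property the pipeline is designed to provide.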
Architecture Details
The mention-ranking model represents each span as:
span_repr = [h_start ; h_end ; head_attn ; width_emb]
= [768 ; 768 ; 768 ; 20] = 2324 dims
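The concatenation above can be sketched in plain Python to check the arithmetic (3 × 768 + 20 = 2324). This is an illustrative sketch only: it assumes, following Lee et al. (2017), that `head_attn` is a softmax-weighted sum of the token vectors inside the span, and that the width embedding is looked up by span length. The per-token attention scores, the width table size of 30, and the function name are hypothetical stand-ins for learned components.

```python
import math


def span_representation(hidden, start, end, attn_scores, width_table):
    """Build [h_start ; h_end ; head_attn ; width_emb] for span [start, end]."""
    # Softmax over the attention scores of the tokens inside the span.
    scores = attn_scores[start:end + 1]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Attention-weighted sum of the token vectors inside the span.
    dim = len(hidden[0])
    head = [
        sum(w * hidden[start + k][d] for k, w in enumerate(weights))
        for d in range(dim)
    ]

    # Width embedding, clamped to the last row for very long spans.
    width = end - start
    width_emb = width_table[min(width, len(width_table) - 1)]

    return hidden[start] + hidden[end] + head + width_emb


# Toy inputs: 6 tokens with 768-dim states, uniform attention scores,
# and a zero-filled width table with 20-dim rows.
hidden = [[0.1 * t] * 768 for t in range(6)]
attn_scores = [0.0] * 6
width_table = [[0.0] * 20 for _ in range(30)]

vec = span_representation(hidden, 1, 3, attn_scores, width_table)
print(len(vec))  # 2324
```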
Antecedent scoring uses:
score(i,j) = mention_score(i) + mention_score(j) + antecedent_score([vi; vj; vi*vj; dist_emb])
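The scoring formula can be read as the sketch below, which builds the pair feature vector and sums the three terms. The two scorers are passed in as plain callables because the card does not describe their internals; in the demo they are simple stand-ins (`sum`), not the trained heads. With 2324-dimensional span vectors, the pair vector would have 3 × 2324 entries plus the size of the distance embedding, which the card does not specify.

```python
def pair_features(vi, vj, dist_emb):
    """Build [vi ; vj ; vi*vj ; dist_emb], with vi*vj taken element-wise."""
    product = [a * b for a, b in zip(vi, vj)]
    return vi + vj + product + dist_emb


def coref_score(vi, vj, dist_emb, mention_score, antecedent_score):
    """score(i, j) = mention_score(i) + mention_score(j) + antecedent_score(pair)."""
    pair = pair_features(vi, vj, dist_emb)
    return mention_score(vi) + mention_score(vj) + antecedent_score(pair)


# Toy 2-dim span vectors and a 1-dim distance embedding.
vi, vj, dist_emb = [1.0, 2.0], [3.0, 4.0], [0.5]
features = pair_features(vi, vj, dist_emb)
# Stand-in scorers: the real ones would be learned feed-forward heads.
score = coref_score(vi, vj, dist_emb, mention_score=sum, antecedent_score=sum)
print(features)  # [1.0, 2.0, 3.0, 4.0, 3.0, 8.0, 0.5]
print(score)     # 31.5
```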
Clusters are built via Union-Find, restricted to spans of the same NER class and scores above threshold.
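A minimal sketch of this clustering step follows, assuming pairwise scores are already available as a dictionary keyed by span index pairs. The threshold value of 0.0, the function name, and the input format are assumptions for illustration; the card does not give the actual threshold.

```python
def cluster_mentions(spans, pair_scores, threshold=0.0):
    """Group spans with Union-Find; returns one cluster root per span.

    spans: list of (start, end, cls)
    pair_scores: {(i, j): score} for candidate mention/antecedent pairs
    """
    parent = list(range(len(spans)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), score in pair_scores.items():
        same_class = spans[i][2] == spans[j][2]
        if same_class and score > threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the earliest span index as the cluster root.
                parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(len(spans))]


spans = [(0, 3, "NAME"), (8, 8, "ADMIN"), (13, 14, "NAME"), (21, 21, "ADMIN")]
pair_scores = {
    (2, 0): 4.2,   # NAME-NAME, above threshold: linked
    (3, 1): 3.1,   # ADMIN-ADMIN, above threshold: linked
    (2, 1): 5.0,   # NAME-ADMIN: skipped despite the high score
    (3, 0): -1.0,  # ADMIN-NAME, below threshold: skipped
}
clusters = cluster_mentions(spans, pair_scores)
print(clusters)  # [0, 1, 0, 1]
```

The class check is what enforces the same-class restriction: a high score between spans of different NER classes never produces a link.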
Results
Overall Performance (Test Set)
| Metric | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|
| MUC | 90.20 | 92.80 | 91.50 |
| B³ | 86.70 | 92.50 | 89.50 |
| CEAF-e | 89.20 | 74.40 | 81.20 |
| CoNLL F1 | — | — | 87.40 |
| LEA | 64.80 | 88.20 | 74.80 |
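The CoNLL F1 row has no precision or recall because, by the standard CoNLL-2012 definition, it is the unweighted mean of the MUC, B³, and CEAF-e F1 scores. A quick check against the values in the table:

```python
# F1 scores copied from the table above.
f1 = {"MUC": 91.50, "B3": 89.50, "CEAF-e": 81.20}

conll_f1 = sum(f1.values()) / len(f1)
print(f"{conll_f1:.2f}")  # 87.40
```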
Usage
Quick Start
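The snippet below loads the tokenizer, the encoder, the coreference heads, and their config, then runs the encoder on a pre-tokenized example. It stops after the encoder forward pass and does not apply the coreference heads: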
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

model_id = "liaad/CitiLink-SpanBERT-Coreference-pt"

# Load tokenizer and encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)

# Load coreference heads and config
cabecas_path = hf_hub_download(repo_id=model_id, filename="cabecas_coref.pt")
coref_cfg_path = hf_hub_download(repo_id=model_id, filename="coref_config.json")
with open(coref_cfg_path) as f:
    coref_cfg = json.load(f)
# weights_only=False unpickles arbitrary objects; only load files you trust
cabecas = torch.load(cabecas_path, map_location="cpu", weights_only=False)

# Input text (already tokenized by whitespace)
tokens = [
    "João", "Paulo", "Rosinha", "Daniel", "apresentou", "o",
    "requerimento", "n.º", "2683/20020627", ".",
    "O", "pedido", "de", "João", "Paulo", "foi", "aprovado",
    "ao", "abrigo", "do", "processo", "2683/20020627", ".",
]

# NER spans already extracted by a NER model: (start, end, class),
# with inclusive word indices into `tokens`
ner_spans = [
    (0, 3, "NAME"),     # "João Paulo Rosinha Daniel"
    (8, 8, "ADMIN"),    # "2683/20020627"
    (13, 14, "NAME"),   # "João Paulo"
    (21, 21, "ADMIN"),  # "2683/20020627" (second occurrence)
]

# Tokenize for encoder
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

# Run encoder
with torch.no_grad():
    outputs = encoder(**inputs)

print("Coreference model loaded successfully.")
print(f"Hidden states shape: {outputs.last_hidden_state.shape}")
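The NER spans above are indexed by word, while the encoder output is indexed by subword. One way to bridge the two is sketched below. It assumes a fast tokenizer, for which `inputs.word_ids(batch_index=0)` returns one word index per subword position (and `None` for special tokens), and it assumes no truncation. The helper name is hypothetical, and the demo uses a hand-written `word_ids` list so that it runs without downloading the model.

```python
def word_spans_to_subword_spans(word_ids, word_spans):
    """Map (start_word, end_word, cls) spans to inclusive subword positions.

    word_ids: list with one word index per subword, None for special tokens
    """
    first, last = {}, {}
    for pos, wid in enumerate(word_ids):
        if wid is None:  # special tokens such as [CLS] / [SEP]
            continue
        first.setdefault(wid, pos)  # first subword of each word
        last[wid] = pos             # last subword of each word
    return [(first[s], last[e], cls) for s, e, cls in word_spans]


# Hand-written example: word 0 -> 2 subwords, word 1 -> 1, word 2 -> 3.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
subword_spans = word_spans_to_subword_spans(
    word_ids, [(0, 1, "NAME"), (2, 2, "ADMIN")]
)
print(subword_spans)  # [(1, 3, 'NAME'), (4, 6, 'ADMIN')]
```

With the Quick Start variables, the call would be `word_spans_to_subword_spans(inputs.word_ids(batch_index=0), ner_spans)`, and the resulting positions index into `outputs.last_hidden_state[0]`.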
Limitations
- Domain Specificity: Best performance on administrative/governmental meeting minutes
- Language: Trained exclusively on European Portuguese (PT-PT)
- Window Size: Processes documents via 512-token sliding windows; very long-range coreference (>5,000 tokens apart) may be missed
- NER Dependency: Requires entity spans from a NER model as input — does not perform span detection independently
- Same-class Restriction: Only links mentions of the same NER entity class (e.g., NAME to NAME, never NAME to POSITION)
- Dataset Size: Fine-tuned on 61 training documents; performance on highly specific sub-domains may vary
Version: 1.0
Last Updated: 2026-06-25
License
This project is released under a custom dual license based on AGPL v3.
See the full license terms here: LICENSE
Model tree for liaad/CitiLink-SpanBERT-Coreference-pt
Base model
neuralmind/bert-base-portuguese-cased