XLMR-Base-Council-Anonymizer: Personal Data Identification for Portuguese Municipal Meeting Minutes

This model is a fine-tuned XLM-RoBERTa Base for extracting and identifying sensitive personal data in the minutes of Portuguese municipal meetings.

Model Description

XLMR-BCA builds on the multilingual contextual representations of FacebookAI's XLM-RoBERTa, adapted to the linguistic and formal structure of administrative minutes in Portugal. Unlike generic NER models, it was trained with Weighted Cross-Entropy Loss to handle class imbalance, which helps it detect entity types that have few occurrences.
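As an illustration of how a weighted cross-entropy loss counteracts class imbalance, the sketch below computes per-class weights and the resulting loss in plain Python. The inverse-frequency weighting scheme and the function names are assumptions made for this example; the model card does not state how the actual weights were chosen. The normalization (weighted sum divided by the sum of weights) mirrors what `torch.nn.CrossEntropyLoss(weight=...)` does with its default mean reduction.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Hypothetical weighting scheme: rare classes get larger weights."""
    counts = Counter(labels)
    total = len(labels)
    # Classes that never occur get weight 0.0 to avoid dividing by zero.
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-token negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, y in zip(logits, labels):
        # Numerically stable log-sum-exp over the class scores.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        weighted_sum += weights[y] * nll
        weight_total += weights[y]
    return weighted_sum / weight_total if weight_total else 0.0


# Toy data: class 0 stands for "O", class 1 for a rare entity label.
labels = [0, 0, 0, 1]
class_weights = inverse_frequency_weights(labels, num_classes=2)

# Uninformative logits: every token has equal scores for both classes.
logits = [[0.0, 0.0] for _ in labels]
loss = weighted_cross_entropy(logits, labels, class_weights)

print(class_weights)  # the rare class receives the larger weight
print(loss)
```

With these weights, a mistake on the single rare-class token contributes three times as much to the loss as a mistake on one of the frequent "O" tokens.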

Key Features

  • 🏛️ Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes.
  • 🛡️ Privacy-Focused NER: Identifies and classifies sensitive entities (PII) to support automatic anonymization processes.
  • ⚙️ Transformer-based Architecture: Uses XLM-RoBERTa to capture the grammatical and formal context of administrative documents.

Model Details

  • Base Model: XLM-RoBERTa Base
  • Architecture: Token Classification (NER) with Weighted Cross-Entropy Loss
  • Parameters: ~270M
  • Max Sequence Length: 512 tokens
  • Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities)
  • Evaluation Metrics: F1-Score, Recall and Precision
  • Training Framework: PyTorch + Transformers + Seqeval

Entity Types

The model recognizes 19 entity types in BIO format (49 labels total):

| Entity Type | Description | Example |
| --- | --- | --- |
| PERSONAL-NAME | Proper names of individuals | João Silva |
| PERSONAL-ADMIN | Administrative identifiers and case/process numbers | 5597/2023 |
| PERSONAL-POSITION | Professional roles, political positions, or technical functions | Diretor do Departamento dos Recursos Humanos |
| PERSONAL-ADDRESS | Addresses, street names, and door/plot numbers | Rua das Flores n.º 10, Avenida Central |
| PERSONAL-DATE | Dates of events, decisions, or time periods | 20/05/2023 |
| PERSONAL-LOCATION | Cities, parishes, districts, or geographic locations | Freguesia do Porto |
| PERSONAL-OTHER | Generic personal information and miscellaneous contact data | Referências de contacto, dados diversos |
| PERSONAL-INFO | Biographical data or sensitive personal information | 11490753 |
| PERSONAL-COMPANY | Companies or private legal entities | Construções & Filho, Lda |
| PERSONAL-ARTISTIC | Stage names, pseudonyms | Pintura |
| PERSONAL-DEGREE | Academic titles or professional degrees | Licenciatura de Psicologia |
| PERSONAL-TIME | References to specific times | 14:30h |
| PERSONAL-LICENSE | License plates or registration numbers | 48-RF-99 |
| PERSONAL-JOB | Person's profession or occupation | Professor |
| PERSONAL-VEHICLE | Vehicle identification and models | Mercedes-Benz Classe S |
| PERSONAL-FACULTY | Higher education institutions or university faculties | Faculdade de Economia da Universidade do Porto |
| PERSONAL-FAMILY | Mentions of kinship, family relationships, or heirs | Marido |

How It Works

The model performs token-level classification: each token receives one of the labels listed above based on its surrounding context. Consecutive tokens that belong to the same entity form a span, and each detected span can then be replaced with a placeholder to anonymize the text automatically.

Input:

O interessado João Silva submeteu o processo administrativo 5597/2023 no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10.

Output:

O interessado <NAME> submeteu o processo administrativo <ADMIN> no dia <DATE>, relativo ao imóvel localizado na <ADDRESS>.
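The replacement step shown above could be sketched as follows. This is a minimal illustration, not the authors' anonymization code: the `anonymize` and `span` helpers are hypothetical, and the entity dictionaries are hand-built to imitate the `entity_group`, `start` and `end` keys that a Transformers NER pipeline returns when an aggregation strategy is set. It also assumes the placeholder is the label with its `PERSONAL-` prefix removed.

```python
def anonymize(text, entities, prefix="PERSONAL-"):
    """Replace each detected span with a placeholder such as <NAME>."""
    out = text
    # Process spans from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent["entity_group"]
        tag = label[len(prefix):] if label.startswith(prefix) else label
        out = out[:ent["start"]] + "<" + tag + ">" + out[ent["end"]:]
    return out


def span(text, substring, label):
    """Build a pipeline-like entity dict for a known substring (demo helper)."""
    start = text.index(substring)
    return {"entity_group": label, "start": start, "end": start + len(substring)}


text = "Presidida por Manuel Brito em 20/05/2023."
entities = [
    span(text, "Manuel Brito", "PERSONAL-NAME"),
    span(text, "20/05/2023", "PERSONAL-DATE"),
]

anonymized = anonymize(text, entities)
print(anonymized)  # Presidida por <NAME> em <DATE>.
```

Working from the last span to the first avoids having to recompute offsets after each substitution, since placeholders rarely have the same length as the text they replace.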

Results

Entity-Level Performance (Test Set)

| Metric | Score |
| --- | --- |
| F1 Score | X% |
| Precision | X% |
| Recall | X% |
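For readers unfamiliar with entity-level scoring, the sketch below shows one way such metrics can be computed from BIO tag sequences: a prediction counts as correct only if both the span boundaries and the entity type match the gold annotation exactly. It is a simplified stand-in written for this example, in the same spirit as Seqeval but not a reproduction of it, and the function names and toy sequences are assumptions.

```python
def bio_spans(tags):
    """Collect (label, start, end) spans from a BIO tag sequence."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that is still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag == "O":
            start, label = None, None
        else:
            # "B-X" opens a span; a stray "I-X" is also treated as an opening.
            start, label = i, tag[2:]
    return spans


def entity_prf(gold_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold_tags = ["B-PERSONAL-NAME", "I-PERSONAL-NAME", "O", "B-PERSONAL-DATE"]
pred_tags = ["B-PERSONAL-NAME", "I-PERSONAL-NAME", "O", "O"]

precision, recall, f1 = entity_prf(gold_tags, pred_tags)
print(precision, recall, f1)
```

In this toy case the single predicted span is correct, but one of the two gold spans is missed, so precision is perfect while recall is halved.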

Per-Entity Performance

| Entity Type | Precision | Recall | F1 Score | Support |
| --- | --- | --- | --- | --- |
| PERSONAL-NAME | — | — | — | 1186 |
| PERSONAL-ADMIN | — | — | — | 1186 |
| PERSONAL-POSITION | — | — | — | 716 |
| PERSONAL-ADDRESS | — | — | — | 368 |
| PERSONAL-DATE | — | — | — | 249 |
| PERSONAL-LOCATION | — | — | — | 191 |
| PERSONAL-OTHER | — | — | — | 70 |
| PERSONAL-INFO | — | — | — | 43 |
| PERSONAL-COMPANY | — | — | — | 29 |
| PERSONAL-TIME | — | — | — | 22 |
| PERSONAL-LICENSE | — | — | — | 19 |
| PERSONAL-DEGREE | — | — | — | 18 |
| PERSONAL-VEHICLE | — | — | — | 14 |
| PERSONAL-FAMILY | — | — | — | 7 |
| PERSONAL-FACULTY | — | — | — | 6 |
| PERSONAL-ARTISTIC | — | — | — | 4 |

Usage

Quick Start

The simplest way to use the model:

from transformers import pipeline

# Model identifier on the Hugging Face Hub
model_name = "anonymous270126/XLMR-anonymization-council-pt"

# aggregation_strategy="simple" groups sub-word tokens into entity spans
nlp = pipeline("ner", model=model_name, tokenizer=model_name, aggregation_strategy="simple")

text = "A reunião foi presidida por Manuel Brito no concelho de Alandroal."

results = nlp(text)

# Each result carries the matched text, its entity type, and a confidence score
for entity in results:
    print(f"Entity: {entity['word']} | Category: {entity['entity_group']} | Score: {entity['score']:.4f}")

Limitations

  • Domain specificity: Trained specifically on Portuguese municipal meeting minutes; performs best on administrative/governmental meeting minutes and may not generalize well to other document types
  • Sequence length: Limited to 512 tokens per window
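Because of the 512-token window, longer minutes have to be processed in pieces. The sketch below shows one common workaround, overlapping sliding windows over a token sequence. It is an assumption for illustration only: the model card does not describe how long documents were handled, and the `sliding_windows` helper, the overlap of 64 tokens and the window size of 510 (leaving room for the two special tokens XLM-RoBERTa adds) are choices made for this example.

```python
def sliding_windows(tokens, max_len=510, overlap=64):
    """Split a token sequence into overlapping windows.

    Returns a list of (start_offset, window) pairs, where start_offset is
    the index of the window's first token in the original sequence.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


# Stand-in for the token ids of a long document.
tokens = list(range(1000))
windows = sliding_windows(tokens, max_len=510, overlap=64)

for start, window in windows:
    print(start, len(window))
```

The overlap gives entities near a window boundary a chance to appear whole in at least one window; predictions from overlapping regions then need to be reconciled, for example by keeping the higher-scoring span.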

Version: 1.0
Last Updated: 2026-01-27


License: cc-by-nc-nd-4.0
