XLMR-Base-Council-Anonymizer: Personal Data Identification for Portuguese Municipal Meeting Minutes

This model is a fine-tuned XLM-RoBERTa Base that extracts and identifies sensitive personal data in the minutes of Portuguese municipal meetings.

Model Description

XLMR-BCA builds on the multilingual contextual representations of FacebookAI's XLM-RoBERTa and is optimized for the linguistic and formal structure of administrative minutes in Portugal. Unlike generic NER models, it was trained with a weighted cross-entropy loss to handle class imbalance, which helps it detect entity types that have few occurrences.
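To illustrate the idea behind the weighted loss, the sketch below computes inverse-frequency class weights and a weighted cross-entropy in plain Python. The label counts, the weighting scheme, and the function names are assumptions made for illustration; this card does not state how the actual weights were chosen.

```python
import math

def inverse_frequency_weights(label_counts):
    """Weight each label by total / (num_labels * count), so rare labels weigh more."""
    total = sum(label_counts.values())
    num_labels = len(label_counts)
    return {label: total / (num_labels * count) for label, count in label_counts.items()}

def weighted_cross_entropy(logits, gold_labels, labels, weights):
    """Weighted mean of per-token negative log-likelihoods.

    logits: one list of scores per token, ordered like `labels`.
    The result is sum(w_y * nll) / sum(w_y) over the tokens.
    """
    weighted_sum, weight_total = 0.0, 0.0
    for scores, gold in zip(logits, gold_labels):
        # Numerically stable log of the softmax normalizer.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        nll = log_norm - scores[labels.index(gold)]
        weighted_sum += weights[gold] * nll
        weight_total += weights[gold]
    return weighted_sum / weight_total

# Hypothetical token counts: "O" dominates, the entity labels are rare.
counts = {"O": 9000, "B-PERSONAL-NAME": 900, "B-PERSONAL-LICENSE": 100}
labels = list(counts)
weights = inverse_frequency_weights(counts)

# Two tokens with made-up scores, one per label.
logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]]
gold = ["O", "B-PERSONAL-LICENSE"]
loss = weighted_cross_entropy(logits, gold, labels, weights)
```

With these weights, an error on the rare `B-PERSONAL-LICENSE` label contributes far more to the loss than an error on `O`, which is the effect a weighted loss is meant to have on imbalanced label sets.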

Key Features

  • 🏛️ Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes.
  • 🛡️ Privacy-Focused NER: Identifies and classifies sensitive entities (PII) to support automatic anonymization processes.
  • ⚙️ Transformer-based Architecture: Uses XLM-RoBERTa to capture the grammatical and formal context of administrative documents.

Model Details

  • Base Model: XLM-RoBERTa Base
  • Architecture: Token Classification (NER) with Weighted Cross-Entropy Loss
  • Parameters: ~270M
  • Max Sequence Length: 512 tokens
  • Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities)
  • Evaluation Metrics: F1-Score, Recall and Precision
  • Training Framework: PyTorch + Transformers + Seqeval

Entity Types

The model recognizes 19 entity types in BIO format (49 labels total):

| Entity Type | Description | Example |
|---|---|---|
| PERSONAL-PUBLIC | Public offices, government bodies, or official collective entities | Câmara Municipal, Executivo, Assembleia |
| PERSONAL-ADMIN | Administrative identifiers and case/process numbers | 5597/2023 |
| PERSONAL-NAME | Proper names of individuals | João Silva |
| PERSONAL-POSITION | Professional roles, political positions, or technical functions | Diretor do Departamento dos Recursos Humanos |
| PERSONAL-ADDRESS | Addresses, street names, and door/plot numbers | Rua das Flores n.º 10, Avenida Central |
| PERSONAL-DATE | Dates of events, decisions, or time periods | 20/05/2023 |
| PERSONAL-LOCATION | Cities, parishes, districts, or geographic locations | Freguesia do Porto |
| PERSONAL-OTHER | Generic personal information and miscellaneous contact data | Contact references, miscellaneous data |
| PERSONAL-INFO | Biographical data or sensitive personal information | 11490753 |
| PERSONAL-COMPANY | Companies or private legal entities | Construções & Filho, Lda |
| PERSONAL-TIME | References to specific times | 14:30h |
| PERSONAL-LICENSE | License plates or registration numbers | 48-RF-99 |
| PERSONAL-DEGREE | Academic titles or professional degrees | Licenciatura de Psicologia |
| PERSONAL-VEHICLE | Vehicle identification and models | Mercedes-Benz Classe S |
| PERSONAL-FAMILY | Mentions of kinship, family relationships, or heirs | Marido |
| PERSONAL-FACULTY | Higher education institutions or university faculties | Faculdade de Economia da Universidade do Porto |
| PERSONAL-ARTISTIC | Stage names, pseudonyms | Pintura |

"PROFISSAO / TELEMOVEL"

How It Works

The model performs token-level classification: it assigns one of the labels above to each token based on its surrounding context. Consecutive tokens that belong to the same entity form a span, and each detected span can then be replaced automatically to anonymize the text.
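As a rough illustration of how per-token BIO labels turn into entities, the sketch below groups a tag sequence into spans. It works on whole words for readability, whereas the model itself operates on subword tokens; the function name and the example tags are hypothetical and not taken from the model's code.

```python
def bio_to_spans(tags):
    """Group BIO tags into (entity_type, start_index, end_index_exclusive) spans."""
    spans = []
    current_type, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        # An "I-" tag continues a span only if it matches the open entity type.
        continues = prefix == "I" and entity_type == current_type
        if not continues:
            if current_type is not None:
                spans.append((current_type, start, i))
            if prefix in ("B", "I"):
                # A stray "I-" tag is treated as the start of a new span.
                current_type, start = entity_type, i
            else:
                current_type, start = None, None
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans

# Hand-written tags standing in for model predictions.
tokens = ["O", "interessado", "João", "Silva", "submeteu", "o", "processo", "5597/2023"]
tags = ["O", "O", "B-PERSONAL-NAME", "I-PERSONAL-NAME", "O", "O", "O", "B-PERSONAL-ADMIN"]

spans = bio_to_spans(tags)
entities = [(etype, " ".join(tokens[s:e])) for etype, s, e in spans]
```

Here `entities` pairs each entity type with the words it covers, so "João" and "Silva" end up in a single PERSONAL-NAME span.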

Input:

O interessado João Silva submeteu o processo administrativo 5597/2023 no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10.

Output:

O interessado <NAME> submeteu o processo administrativo <ADMIN> no dia <DATE>, relativo ao imóvel localizado na <ADDRESS>.
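The replacement step shown in this example can be sketched as follows, assuming the detected entities are available as character spans with their labels. The helper names and the hand-written predictions are hypothetical stand-ins for real model output, and the placeholder format simply mirrors the example above.

```python
def placeholder(label):
    """Map a label such as 'PERSONAL-NAME' to a placeholder such as '<NAME>'."""
    return "<" + label.split("-", 1)[-1] + ">"

def anonymize(text, entities):
    """Replace each (start, end, label) character span with its placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + placeholder(label) + text[end:]
    return text

text = ("O interessado João Silva submeteu o processo administrativo 5597/2023 "
        "no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10.")

# Hand-written predictions standing in for real model output.
predicted = [
    ("João Silva", "PERSONAL-NAME"),
    ("5597/2023", "PERSONAL-ADMIN"),
    ("20/05/2023", "PERSONAL-DATE"),
    ("Rua das Flores n.º 10", "PERSONAL-ADDRESS"),
]
# Locate each surface string to obtain (start, end, label) character spans.
entities = [(text.index(s), text.index(s) + len(s), label) for s, label in predicted]

anonymized = anonymize(text, entities)
```

Replacing from the end of the text backwards avoids having to recompute offsets after each substitution.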

Results

Entity-Level Performance (Test Set)

| Metric | Score |
|---|---|
| F1 Score | X% |
| Precision | X% |
| Recall | X% |
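For reference, entity-level scores of this kind count a prediction as correct only when both the entity type and the span boundaries match the annotation. The sketch below is a plain-Python approximation of that computation, not Seqeval itself; the spans are invented for illustration and are unrelated to the model's actual results.

```python
def entity_scores(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of (type, start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical spans for one document: (entity_type, start_token, end_token).
gold = [("PERSONAL-NAME", 2, 4), ("PERSONAL-ADMIN", 7, 8), ("PERSONAL-DATE", 11, 12)]
predicted = [("PERSONAL-NAME", 2, 4), ("PERSONAL-ADMIN", 7, 9)]

precision, recall, f1 = entity_scores(gold, predicted)
```

In this toy case the PERSONAL-ADMIN prediction has the right type but the wrong end boundary, so it counts as both a false positive and a false negative.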

Per-Entity Performance

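| Entity Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| PERSONAL-PUBLIC | | | | 12827 |
| PERSONAL-ADMIN | | | | 1186 |
| PERSONAL-NAME | | | | 1186 |
| PERSONAL-POSITION | | | | 716 |
| PERSONAL-ADDRESS | | | | 368 |
| PERSONAL-DATE | | | | 249 |
| PERSONAL-LOCATION | | | | 191 |
| PERSONAL-OTHER | | | | 70 |
| PERSONAL-INFO | | | | 43 |
| PERSONAL-COMPANY | | | | 29 |
| PERSONAL-TIME | | | | 22 |
| PERSONAL-LICENSE | | | | 19 |
| PERSONAL-DEGREE | | | | 18 |
| PERSONAL-VEHICLE | | | | 14 |
| PERSONAL-FAMILY | | | | 7 |
| PERSONAL-FACULTY | | | | 6 |
| PERSONAL-ARTISTIC | | | | 4 |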

Usage

Quick Start

The simplest way to use the model:

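from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "tiagomfmarques/xlmr-base-council-anonymizer"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # matching SentencePiece tokenizer
model = AutoModelForTokenClassification.from_pretrained(model_name)  # loads the NER head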

Limitations

  • Domain specificity: Trained specifically on Portuguese municipal meeting minutes and performs best on administrative/governmental minutes; may not generalize well to other document types
  • Sequence length: Limited to 512 tokens per window
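Because of the 512-token limit, longer minutes have to be processed in pieces. One common workaround, sketched below, is to split the token sequence into overlapping windows; the `stride` value and the function name are assumptions for illustration, not the author's documented procedure. In practice the tokenizer's special tokens also count toward the limit, so the usable window is slightly smaller.

```python
def sliding_windows(token_ids, max_length=512, stride=128):
    """Split a token sequence into windows of at most `max_length` tokens.

    Consecutive windows overlap by `stride` tokens so that entities near a
    window boundary still appear with some context in the next window.
    Returns a list of (start_offset, window_tokens) pairs.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must be in [0, max_length)")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + max_length]))
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows

token_ids = list(range(1000))  # stand-in for the token ids of one set of minutes
windows = sliding_windows(token_ids)
```

Predictions for tokens that fall in an overlap then need to be merged, for example by keeping the prediction from the window where the token has the most surrounding context.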

License

This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.

Version: 1.0
Last Updated: 2025-12-22
