XLMR-Base-Council-Anonymizer: Personal Data Identification for Portuguese Municipal Meeting Minutes
This model is an XLM-RoBERTa Base checkpoint fine-tuned to identify and extract sensitive personal data in the minutes of Portuguese municipal meetings.
Model Description
XLMR-BCA builds on the multilingual contextual representations of FacebookAI's XLM-RoBERTa and is optimized for the linguistic and formal structure of administrative minutes in Portugal. Unlike generic NER models, it was trained with a Weighted Cross-Entropy Loss to handle class imbalance, which helps it detect entity types that occur only rarely in the training data.
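To make the weighting idea concrete, here is a minimal pure-Python sketch of a class-weighted cross-entropy. The inverse-frequency weighting scheme, the function names, and the toy labels are assumptions for illustration only; this card does not state how the actual class weights were derived. Dividing by the sum of the applied weights mirrors what PyTorch's `torch.nn.CrossEntropyLoss(weight=...)` does under its default `mean` reduction.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight_c = N / (num_classes * count_c)."""
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(correct label) over all tokens.

    probs[i] maps each label to the predicted probability for token i.
    """
    weighted_sum = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    weight_total = sum(weights[y] for y in labels)
    return weighted_sum / weight_total


# Toy data: 8 "O" tokens and 2 tokens of a rare entity class.
labels = ["O"] * 8 + ["PERSONAL-NAME"] * 2
weights = inverse_frequency_weights(labels)

# A model that always leans towards "O", so both rare tokens are missed.
leaning_o = {"O": 0.9, "PERSONAL-NAME": 0.1}
probs = [leaning_o] * len(labels)

uniform = {label: 1.0 for label in weights}
loss_weighted = weighted_cross_entropy(probs, labels, weights)
loss_uniform = weighted_cross_entropy(probs, labels, uniform)
print(weights)
print(loss_weighted, loss_uniform)
```

With these toy numbers the rare class receives four times the weight of `O`, so missing its two tokens contributes as much to the loss as all eight `O` tokens combined.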
Key Features
- 🏛️ Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes.
- 🛡️ Privacy-Focused NER: Identifies and classifies sensitive entities (PII) to support automatic anonymization processes.
- ⚙️ Transformer-based Architecture: Builds on XLM-RoBERTa to capture the grammatical and formal context of administrative documents.
Model Details
- Base Model: XLM-RoBERTa Base
- Architecture: Token Classification (NER) with Weighted Cross-Entropy Loss
- Parameters: ~270M
- Max Sequence Length: 512 tokens
- Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities)
- Evaluation Metrics: F1-Score, Recall and Precision
- Training Framework: PyTorch + Transformers + Seqeval
Entity Types
The model recognizes 19 entity types in BIO format (49 labels total):
| Entity Type | Description | Example |
|---|---|---|
| PERSONAL-PUBLIC | Public offices, government bodies, or official collective entities | Câmara Municipal, Executivo, Assembleia |
| PERSONAL-ADMIN | Administrative identifiers and case/process numbers | 5597/2023 |
| PERSONAL-NAME | Proper names of individuals | João Silva |
| PERSONAL-POSITION | Professional roles, political positions, or technical functions | Diretor do Departamento dos Recursos Humanos |
| PERSONAL-ADDRESS | Addresses, street names, and door/plot numbers | Rua das Flores n.º 10, Avenida Central |
| PERSONAL-DATE | Dates of events, decisions, or time periods | 20/05/2023 |
| PERSONAL-LOCATION | Cities, parishes, districts, or geographic locations | Freguesia do Porto |
| PERSONAL-OTHER | Generic personal information and miscellaneous contact data | Contact references, miscellaneous data |
| PERSONAL-INFO | Biographical data or sensitive personal information | 11490753 |
| PERSONAL-COMPANY | Companies or private legal entities | Construções & Filho, Lda |
| PERSONAL-TIME | References to specific times | 14:30h |
| PERSONAL-LICENSE | License plates or registration numbers | 48-RF-99 |
| PERSONAL-DEGREE | Academic titles or professional degrees | Licenciatura de Psicologia |
| PERSONAL-VEHICLE | Vehicle identification and models | Mercedes-Benz Classe S |
| PERSONAL-FAMILY | Mentions of kinship, family relationships, or heirs | Marido |
| PERSONAL-FACULTY | Higher education institutions or university faculties | Faculdade de Economia da Universidade do Porto |
| PERSONAL-ARTISTIC | Stage names, pseudonyms | Pintura |
"PROFISSAO / TELEMOVEL"
How It Works
The model performs token-level classification: each token is labeled according to its surrounding linguistic context, using the BIO labels for the entity types listed above. Consecutive tokens that carry the same entity type form a span, and each span can then be replaced with a placeholder for its type, which allows the data to be anonymized automatically.
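The grouping of token-level BIO labels into entity spans can be sketched as follows. This is an illustrative decoder, not the model's own post-processing: the function name, the word-level tokens, and the exact label strings (`B-PERSONAL-NAME`, `I-PERSONAL-NAME`, ...) are assumptions. As a simplification, an `I-` tag that does not continue an open entity of the same type is ignored.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []

    def close_entity():
        # Store the open entity, if any, before starting something new.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_entity()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or a stray I- tag ends the current entity.
            close_entity()
            current_type, current_tokens = None, []
    close_entity()
    return spans


tokens = ["O", "interessado", "João", "Silva", "submeteu", "o", "processo", "5597/2023"]
tags = ["O", "O", "B-PERSONAL-NAME", "I-PERSONAL-NAME", "O", "O", "O", "B-PERSONAL-ADMIN"]

spans = bio_to_spans(tokens, tags)
print(spans)
```

In practice the model labels subword tokens rather than whole words, so a real decoder would also need to merge subword pieces back into words before or during this step.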
Input:
O interessado João Silva submeteu o processo administrativo 5597/2023 no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10.
Output:
O interessado <NAME> submeteu o processo administrativo <ADMIN> no dia <DATE>, relativo ao imóvel localizado na <ADDRESS>.
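The input/output pair above can be reproduced with a small replacement routine once entity spans are known. The sketch below assumes spans arrive as dictionaries with `entity_group`, `start`, and `end` keys, which is the shape returned by the Transformers `pipeline("token-classification", aggregation_strategy="simple")`; whether this model emits group names exactly like `PERSONAL-NAME` is an assumption. The helper `span` is hypothetical and only builds such dictionaries by hand, so the example runs without downloading the model.

```python
def anonymize(text, entities, prefix="PERSONAL-"):
    """Replace each detected span with a <TYPE> placeholder."""
    result = text
    # Work from right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent["entity_group"]
        if label.startswith(prefix):
            label = label[len(prefix):]
        result = result[:ent["start"]] + f"<{label}>" + result[ent["end"]:]
    return result


def span(text, surface, label):
    """Hypothetical helper: build a pipeline-style dict for a known substring."""
    start = text.index(surface)
    return {"entity_group": label, "start": start, "end": start + len(surface)}


text = (
    "O interessado João Silva submeteu o processo administrativo 5597/2023 "
    "no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10."
)
entities = [
    span(text, "João Silva", "PERSONAL-NAME"),
    span(text, "5597/2023", "PERSONAL-ADMIN"),
    span(text, "20/05/2023", "PERSONAL-DATE"),
    span(text, "Rua das Flores n.º 10", "PERSONAL-ADDRESS"),
]

anonymized = anonymize(text, entities)
print(anonymized)
```

Replacing from the last span to the first avoids having to recompute offsets after each substitution changes the string length.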
Results
Entity-Level Performance (Test Set)
| Metric | Score |
|---|---|
| F1 Score | X% |
| Precision | X% |
| Recall | X% |
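For readers unfamiliar with entity-level scoring, the sketch below shows how precision, recall, and F1 are typically computed when a prediction counts as correct only if both its type and its exact boundaries match a gold span. The function name and the toy spans are illustrative assumptions, not values from this model's evaluation.

```python
def entity_prf(gold_spans, predicted_spans):
    """Exact-match entity-level precision, recall, and F1.

    Each span is a (entity_type, start, end) tuple.
    """
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: the DATE boundaries are off by one and LOCATION is spurious.
gold = [("NAME", 2, 4), ("ADMIN", 7, 8), ("DATE", 10, 11)]
predicted = [("NAME", 2, 4), ("ADMIN", 7, 8), ("DATE", 10, 12), ("LOCATION", 15, 16)]

precision, recall, f1 = entity_prf(gold, predicted)
print(precision, recall, f1)
```

Because partial overlaps earn no credit under this scheme, a boundary error counts both as a missed gold entity and as a wrong prediction.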
Per-Entity Performance
| Entity Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| PERSONAL-PUBLIC | — | — | — | 12827 |
| PERSONAL-ADMIN | — | — | — | 1186 |
| PERSONAL-NAME | — | — | — | 1186 |
| PERSONAL-POSITION | — | — | — | 716 |
| PERSONAL-ADDRESS | — | — | — | 368 |
| PERSONAL-DATE | — | — | — | 249 |
| PERSONAL-LOCATION | — | — | — | 191 |
| PERSONAL-OTHER | — | — | — | 70 |
| PERSONAL-INFO | — | — | — | 43 |
| PERSONAL-COMPANY | — | — | — | 29 |
| PERSONAL-TIME | — | — | — | 22 |
| PERSONAL-LICENSE | — | — | — | 19 |
| PERSONAL-DEGREE | — | — | — | 18 |
| PERSONAL-VEHICLE | — | — | — | 14 |
| PERSONAL-FAMILY | — | — | — | 7 |
| PERSONAL-FACULTY | — | — | — | 6 |
| PERSONAL-ARTISTIC | — | — | — | 4 |
Usage
Quick Start
The simplest way to use the model:
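from transformers import AutoTokenizer, AutoModelForTokenClassification
model_name = "tiagomfmarques/xlmr-base-council-anonymizer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)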
Limitations
- Domain specificity: Trained specifically on Portuguese municipal meeting minutes; performs best on administrative/governmental meeting minutes and may not generalize well to other document types
- Sequence length: Limited to 512 tokens per window
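Documents longer than one window need to be split before inference. One common approach, sketched here as an assumption rather than as this model's documented procedure, is to cut the token sequence into overlapping windows so that entities near a boundary appear whole in at least one window. The function name and the overlap of 128 tokens are illustrative choices. In practice the 512-token budget also has to leave room for the special tokens the tokenizer adds.

```python
def sliding_windows(token_ids, max_len=512, overlap=128):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens. Returns a list of
    (start_index, window) pairs so predictions can be mapped back.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not token_ids:
        return []
    step = max_len - overlap
    windows, start = [], 0
    while True:
        end = min(start + max_len, len(token_ids))
        windows.append((start, token_ids[start:end]))
        if end == len(token_ids):
            break
        start += step
    return windows


# Toy document of 1000 token ids.
token_ids = list(range(1000))
windows = sliding_windows(token_ids)
starts = [start for start, _ in windows]
lengths = [len(window) for _, window in windows]
print(starts, lengths)
```

Hugging Face tokenizers offer a comparable mechanism through `return_overflowing_tokens=True` together with the `stride` argument, which may be more convenient than splitting by hand.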
License
This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).
Version: 1.0
Last Updated: 2025-12-22