XLMR-Base-Council-Anonymizer: Personal Data Identification for Portuguese Municipal Meeting Minutes
This model is a fine-tuned XLM-RoBERTa Base that identifies and extracts sensitive personal data in the minutes of Portuguese municipal meetings.
Model Description
XLMR-BCA builds on the multilingual contextual representations of FacebookAI's XLM-RoBERTa, fine-tuned for the linguistic and formal structure of administrative minutes in Portugal. Unlike generic NER models, it was trained with a weighted cross-entropy loss to handle class imbalance, which helps it detect entity types that have few occurrences in the training data.
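To make the idea of a weighted loss concrete, here is a minimal pure-Python sketch. The card does not say how the class weights were chosen, so the inverse-frequency scheme, the function names, and the toy data below are all assumptions for illustration, not the training code. In PyTorch the same effect is usually obtained by passing a `weight` tensor to `torch.nn.CrossEntropyLoss`.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * count), so rare classes weigh more.

    Assumption: this weighting scheme is illustrative; the card does not state the one used.
    """
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}

def weighted_cross_entropy(probs, gold, weights):
    """Weighted mean of -log p(gold label), normalized by the sum of the weights.

    probs: one dict per token, mapping label -> predicted probability.
    gold:  one gold label per token.
    """
    weighted_losses = [weights[g] * -math.log(p[g]) for p, g in zip(probs, gold)]
    return sum(weighted_losses) / sum(weights[g] for g in gold)

# Hypothetical toy data: "O" dominates, "B-PERSONAL-FAMILY" is rare.
gold = ["O"] * 9 + ["B-PERSONAL-FAMILY"]
weights = inverse_frequency_weights(gold)
probs = [{"O": 0.9, "B-PERSONAL-FAMILY": 0.1}] * 10

loss = weighted_cross_entropy(probs, gold, weights)
unweighted = sum(-math.log(p[g]) for p, g in zip(probs, gold)) / len(gold)
print(f"weighted: {loss:.4f} | unweighted: {unweighted:.4f}")
```

With these toy numbers the single misclassified rare token dominates the weighted loss, whereas in the unweighted mean it is diluted by the nine easy `O` tokens.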
Key Features
- Specialized for Municipal Minutes: Fine-tuned on authentic Portuguese council meeting minutes.
- Privacy-Focused NER: Identifies and classifies sensitive entities (PII) to support automatic anonymization processes.
- Transformer-based Architecture: Uses XLM-RoBERTa to capture the grammatical and formal context of administrative documents.
Model Details
- Base Model: XLM-RoBERTa Base
- Architecture: Token Classification (NER) with Weighted Cross-Entropy Loss
- Parameters: ~270M
- Max Sequence Length: 512 tokens
- Fine-tuning Dataset: 120 Portuguese meeting minutes (6 municipalities)
- Evaluation Metrics: F1-Score, Recall and Precision
- Training Framework: PyTorch + Transformers + Seqeval
Entity Types
The model recognizes 19 entity types in BIO format (49 labels total):
| Entity Type | Description | Example |
|---|---|---|
| PERSONAL-NAME | Proper names of individuals | João Silva |
| PERSONAL-ADMIN | Administrative identifiers and case/process numbers | 5597/2023 |
| PERSONAL-POSITION | Professional roles, political positions, or technical functions | Diretor do Departamento dos Recursos Humanos |
| PERSONAL-ADDRESS | Addresses, street names, and door/plot numbers | Rua das Flores n.º 10, Avenida Central |
| PERSONAL-DATE | Dates of events, decisions, or time periods | 20/05/2023 |
| PERSONAL-LOCATION | Cities, parishes, districts, or geographic locations | Freguesia do Porto |
| PERSONAL-OTHER | Generic personal information and miscellaneous contact data | Referências de contacto, dados diversos |
| PERSONAL-INFO | Biographical data or sensitive personal information | 11490753 |
| PERSONAL-COMPANY | Companies or private legal entities | Construções & Filho, Lda |
| PERSONAL-ARTISTIC | Stage names, pseudonyms | Pintura |
| PERSONAL-DEGREE | Academic titles or professional degrees | Licenciatura de Psicologia |
| PERSONAL-TIME | References to specific times | 14:30h |
| PERSONAL-LICENSE | License plates or registration numbers | 48-RF-99 |
| PERSONAL-JOB | Person's profession or occupation | Professor |
| PERSONAL-VEHICLE | Vehicle identification and models | Mercedes-Benz Classe S |
| PERSONAL-FACULTY | Higher education institutions or university faculties | Faculdade de Economia da Universidade do Porto |
| PERSONAL-FAMILY | Mentions of kinship, family relationships, or heirs | Marido |
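As a rough illustration of the BIO scheme used for these entity types, the sketch below builds a label list for a two-type subset and groups a tagged word sequence back into entity spans. The helper names `build_bio_labels` and `bio_to_spans`, the subset, and the example tags are hypothetical; the model's actual label map lives in its configuration.

```python
def build_bio_labels(entity_types):
    """Return the label list: "O" plus a B- and an I- tag for each entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels

def bio_to_spans(tags):
    """Group a BIO tag sequence into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current_type, start = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and tag[2:] == current_type:
            continue  # still inside the current entity
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type, start = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag without a matching B- is treated as a new span.
            current_type, start = tag[2:], i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans

# Hypothetical subset of the entity types listed above.
labels = build_bio_labels(["PERSONAL-NAME", "PERSONAL-DATE"])
label2id = {label: i for i, label in enumerate(labels)}

words = ["O", "interessado", "João", "Silva", "no", "dia", "20/05/2023"]
tags = ["O", "O", "B-PERSONAL-NAME", "I-PERSONAL-NAME", "O", "O", "B-PERSONAL-DATE"]
for entity_type, start, end in bio_to_spans(tags):
    print(entity_type, " ".join(words[start:end]))
```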
How It Works
The model performs token-level classification: each token is assigned one of the labels listed above based on its surrounding context. Consecutive tokens that share an entity type are then grouped into spans, which can be replaced with placeholders to anonymize the text automatically.
Input:
O interessado João Silva submeteu o processo administrativo 5597/2023 no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10.
Output:
O interessado <NAME> submeteu o processo administrativo <ADMIN> no dia <DATE>, relativo ao imóvel localizado na <ADDRESS>.
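The replacement step in this example can be sketched as follows. This is not the authors' anonymization code: `redact` and `find_span` are hypothetical helpers, the entity dictionaries imitate the `start`, `end`, and `entity_group` fields returned by the `transformers` NER pipeline when an aggregation strategy is set, and stripping the `PERSONAL-` prefix is an assumption made so the placeholders match the output shown above.

```python
def redact(text, entities, prefix="PERSONAL-"):
    """Replace each entity span with a <LABEL> placeholder.

    entities: dicts with "start" and "end" character offsets and an "entity_group".
    """
    pieces, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip spans that overlap one already replaced
        label = ent["entity_group"]
        if label.startswith(prefix):
            label = label[len(prefix):]
        pieces.append(text[cursor:ent["start"]])
        pieces.append(f"<{label}>")
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)

def find_span(text, surface, group):
    """Build a pipeline-like entity dict for the first occurrence of `surface`."""
    start = text.index(surface)
    return {"start": start, "end": start + len(surface), "entity_group": group}

text = ("O interessado João Silva submeteu o processo administrativo 5597/2023 "
        "no dia 20/05/2023, relativo ao imóvel localizado na Rua das Flores n.º 10.")

# Hand-written spans standing in for model predictions.
entities = [
    find_span(text, "João Silva", "PERSONAL-NAME"),
    find_span(text, "5597/2023", "PERSONAL-ADMIN"),
    find_span(text, "20/05/2023", "PERSONAL-DATE"),
    find_span(text, "Rua das Flores n.º 10", "PERSONAL-ADDRESS"),
]
redacted = redact(text, entities)
print(redacted)
```

Working from character offsets rather than the `word` field avoids problems when the tokenizer changes spacing or casing in the reconstructed entity text.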
Results
Entity-Level Performance (Test Set)
| Metric | Score |
|---|---|
| F1 Score | X% |
| Precision | X% |
| Recall | X% |
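For readers unfamiliar with entity-level scoring, the sketch below shows exact-match precision, recall, and F1 over entity spans, which is the general style of evaluation Seqeval performs. The function name `entity_prf` and the toy spans are hypothetical and are not results from this model.

```python
def entity_prf(gold_spans, pred_spans):
    """Exact-match entity-level precision, recall and F1.

    Each span is a hashable tuple such as (entity_type, start, end).
    A prediction counts only if type and both boundaries match a gold span.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical toy spans: one exact match, one boundary error, one missed entity.
gold = [("PERSONAL-NAME", 2, 4), ("PERSONAL-DATE", 6, 7), ("PERSONAL-ADMIN", 9, 10)]
pred = [("PERSONAL-NAME", 2, 4), ("PERSONAL-DATE", 6, 8)]

precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because matching is exact, a prediction with the right type but a wrong boundary counts as both a false positive and a false negative.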
Per-Entity Performance
| Entity Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| PERSONAL-NAME | - | - | - | 1186 |
| PERSONAL-ADMIN | - | - | - | 1186 |
| PERSONAL-POSITION | - | - | - | 716 |
| PERSONAL-ADDRESS | - | - | - | 368 |
| PERSONAL-DATE | - | - | - | 249 |
| PERSONAL-LOCATION | - | - | - | 191 |
| PERSONAL-OTHER | - | - | - | 70 |
| PERSONAL-INFO | - | - | - | 43 |
| PERSONAL-COMPANY | - | - | - | 29 |
| PERSONAL-TIME | - | - | - | 22 |
| PERSONAL-LICENSE | - | - | - | 19 |
| PERSONAL-DEGREE | - | - | - | 18 |
| PERSONAL-VEHICLE | - | - | - | 14 |
| PERSONAL-FAMILY | - | - | - | 7 |
| PERSONAL-FACULTY | - | - | - | 6 |
| PERSONAL-ARTISTIC | - | - | - | 4 |
Usage
Quick Start
The simplest way to use the model:
```python
from transformers import pipeline

model_name = "anonymous270126/XLMR-anonymization-council-pt"
# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
nlp = pipeline("ner", model=model_name, tokenizer=model_name, aggregation_strategy="simple")

text = "A reunião foi presidida por Manuel Brito no concelho de Alandroal."
results = nlp(text)

# Each result carries the entity text, its predicted category, and a confidence score
for entity in results:
    print(f"Entity: {entity['word']} | Category: {entity['entity_group']} | Score: {entity['score']:.4f}")
```
Limitations
- Domain specificity: Trained specifically on Portuguese municipal meeting minutes; it performs best on administrative/governmental minutes and may not generalize well to other document types
- Sequence length: Limited to 512 tokens per window
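One common way to work around the 512-token window on long minutes is to split the token sequence into overlapping windows and run the model on each. The sketch below is an assumption about how that could be done, not part of the released model: `sliding_windows`, the `stride` value of 128, and the dummy tokens are hypothetical. In practice the window would also need to leave room for the tokenizer's special tokens, and Hugging Face tokenizers can produce such windows directly via `return_overflowing_tokens=True` together with `stride`.

```python
def sliding_windows(tokens, max_len=512, stride=128):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens, so an entity cut at one
    window's edge appears whole in the next. Returns (start_offset, window) pairs.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += max_len - stride
    return windows

# Hypothetical document of 1000 dummy tokens.
tokens = [f"tok{i}" for i in range(1000)]
windows = sliding_windows(tokens)
for offset, window in windows:
    print(offset, len(window))
```

Predictions from overlapping regions then have to be merged, for example by keeping the higher-scoring span when two windows disagree.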
Version: 1.0
Last Updated: 2026-01-27
license: cc-by-nc-nd-4.0