# XLM-RoBERTa-CRF-VotIE: Portuguese Voting Information Extraction

This model is a fine-tuned XLM-RoBERTa Base with a Conditional Random Field (CRF) layer for extracting structured voting information from Portuguese municipal meeting minutes. It achieves state-of-the-art performance on the Citilink dataset.
## Model Description
XLM-RoBERTa-CRF-VotIE combines the robust multilingual representations of Facebook AI's XLM-RoBERTa base model with a CRF layer for structured sequence prediction. The model performs token-level classification to identify and extract voting-related entities from Portuguese administrative text.
### Key Features
- Architecture: XLM-RoBERTa Base (768-dim, 12 layers) + Linear + CRF
- Task: Sequence Labeling with BIO tagging
- Language: Portuguese (Portugal)
- Domain: Municipal meeting minutes and voting records
- Entity Types: 8 types (17 labels with BIO encoding)
- Performance: 93.22% entity-level F1 score
## Intended Uses
This model is designed for:
- Extracting voting information from Portuguese municipal documents
- Identifying participants and their voting positions (favor, against, abstention, absent)
- Recognizing voting subjects and counting methods
- Structuring unstructured administrative text
- Research in information extraction from Portuguese administrative documents
## Entity Types

The model recognizes 8 entity types in BIO format (17 labels total):

| Entity Type | Description | Example |
|---|---|---|
| VOTER-FAVOR | Participants who voted in favor | "The Municipal Executive" |
| VOTER-AGAINST | Participants who voted against | "João Silva" |
| VOTER-ABSTENTION | Participants who abstained | "The councilor from PS" |
| VOTER-ABSENT | Participants who were absent | "Ana Simões" |
| VOTING | Voting action expressions | "deliberado", "aprovado" |
| SUBJECT | The subject matter being voted on | "budget changes" |
| COUNTING-UNANIMITY | Unanimous vote indicators | "unanimously" |
| COUNTING-MAJORITY | Majority vote indicators | "by majority" |
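With BIO encoding, the 8 entity types expand to 17 labels: a `B-` and an `I-` tag per type, plus the `O` tag for tokens outside any entity. A quick sanity check of that arithmetic (the type names come from the table above, but the ordering here is illustrative — the model's actual label order lives in `model.config.id2label`):

```python
# Entity types from the table above; ordering is illustrative,
# not necessarily the model's internal label order.
ENTITY_TYPES = [
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
    "VOTING", "SUBJECT", "COUNTING-UNANIMITY", "COUNTING-MAJORITY",
]

# BIO encoding: one B- and one I- label per type, plus the single O tag.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

print(len(LABELS))  # 17
```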
## Training Details

### Training Data

The model was trained on the [Citilink dataset](https://rdm.inesctec.pt/dataset/cs-2025-007), which consists of Portuguese municipal meeting minutes annotated with voting information:
- Training set: 1,737 examples
- Validation set: 433 examples
- Test set: 529 examples
- Total tokens: ~300K tokens
- Total entities: ~5K entities
### Training Procedure

Hyperparameters:

- Base model: FacebookAI/xlm-roberta-base
- Batch size: 16
- Learning rate: 5e-5 (linear decay)
- Weight decay: 0.01
- Dropout: 0.1
- Max sequence length: 512 tokens
- Epochs: 10
- Optimizer: AdamW
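"Linear decay" means the learning rate falls linearly from 5e-5 to 0 over the course of training. A minimal sketch of that schedule in plain Python (the actual training loop presumably uses a scheduler such as `transformers`' `get_linear_schedule_with_warmup`; this only shows the shape of the decay):

```python
def linear_decay_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    """Learning rate after `step` optimizer steps under pure linear decay."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# With 10 epochs over 1,737 examples at batch size 16 -> ~109 steps/epoch.
total_steps = 10 * ((1737 + 15) // 16)
print(linear_decay_lr(0, total_steps))            # 5e-05 at the start
print(linear_decay_lr(total_steps, total_steps))  # 0.0 at the end
```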
## Results

### Entity-Level Performance (Test Set)
| Metric | Score |
|---|---|
| F1 Score | 93.22% |
| Precision | 90.99% |
| Recall | 95.62% |
| Accuracy | 98.88% |
### Per-Entity Performance
| Entity Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| COUNTING-MAJORITY | 93.94% | 100.00% | 96.88% | 62 |
| COUNTING-UNANIMITY | 93.15% | 98.03% | 95.53% | 305 |
| SUBJECT | 74.11% | 83.81% | 78.66% | 420 |
| VOTER-ABSENT | 88.89% | 88.89% | 88.89% | 18 |
| VOTER-ABSTENTION | 95.68% | 100.00% | 97.79% | 133 |
| VOTER-AGAINST | 92.31% | 100.00% | 96.00% | 36 |
| VOTER-FAVOR | 93.83% | 96.33% | 95.07% | 300 |
| VOTING | 96.01% | 97.86% | 96.92% | 467 |
### Relaxed Boundary Metrics
When allowing partial entity matches:
| Metric | Score |
|---|---|
| F1 Score | 95.91% |
| Precision | 95.17% |
| Recall | 96.81% |
Relaxed boundary evaluation yields a +2.7-point F1 improvement (93.22% → 95.91%), indicating that the model captures entity content well even when its predicted boundaries differ slightly from the gold spans.
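Relaxed matching counts a prediction as correct if it overlaps a gold entity of the same type, rather than requiring exact start/end boundaries. A minimal sketch of that overlap test (the evaluation's exact matching protocol may differ in detail):

```python
def spans_overlap(pred, gold):
    """True if two (start, end, type) spans share at least one character
    and carry the same entity type -- the 'relaxed' match criterion."""
    ps, pe, pt = pred
    gs, ge, gt = gold
    return pt == gt and ps < ge and gs < pe

# An exact-boundary miss that still counts as a relaxed-boundary hit:
print(spans_overlap((0, 19, "VOTER-FAVOR"), (2, 19, "VOTER-FAVOR")))  # True
# Different types never match, regardless of overlap:
print(spans_overlap((0, 19, "VOTER-FAVOR"), (20, 29, "VOTING")))      # False
```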
## Usage

### Quick Start

The simplest way to use the model:

```python
from transformers import AutoTokenizer, AutoModel

# Load model
model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Analyze text
text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text)

# Print results
for pred in predictions:
    print(f"{pred['word']:20} {pred['label']}")
```
Output:

```text
O                    B-VOTER-FAVOR
Executivo            I-VOTER-FAVOR
deliberou            B-VOTING
aprovar              O
o                    O
projeto              O
por                  B-COUNTING-UNANIMITY
unanimidade.         I-COUNTING-UNANIMITY
```
### Extract Structured Entities

The model includes a convenient `extract_entities` method:

```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = """A Câmara Municipal deliberou aprovar a proposta apresentada pelo
Senhor Presidente. Votaram a favor os Senhores Vereadores João Silva e
Maria Costa. Votou contra o Senhor Vereador Pedro Santos."""

# Get structured entities with character offsets
entities = model.extract_entities(text, tokenizer)

# Print entities by type
for entity_type, mentions in entities.items():
    print(f"\n{entity_type}:")
    for mention in mentions:
        print(f"  - {mention['text']} [{mention['start']}:{mention['end']}]")
```
Output:

```text
VOTER-FAVOR:
  - A Câmara Municipal [0:19]
  - João Silva [95:105]
  - Maria Costa [108:119]

VOTING:
  - deliberou [20:29]

VOTER-AGAINST:
  - Pedro Santos [152:164]
```
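The per-type mention lists can be folded into a single flat voting record for downstream use. A small helper over the output format shown above (the record schema here is illustrative, not part of the model):

```python
def to_voting_record(entities):
    """Collapse extract_entities-style output into one flat voting record."""
    return {
        "in_favor":  [m["text"] for m in entities.get("VOTER-FAVOR", [])],
        "against":   [m["text"] for m in entities.get("VOTER-AGAINST", [])],
        "abstained": [m["text"] for m in entities.get("VOTER-ABSTENTION", [])],
        "absent":    [m["text"] for m in entities.get("VOTER-ABSENT", [])],
        "subject":   [m["text"] for m in entities.get("SUBJECT", [])],
    }

# Hand-built example mimicking the output shape above:
example = {
    "VOTER-FAVOR": [{"text": "A Câmara Municipal"}, {"text": "João Silva"}],
    "VOTER-AGAINST": [{"text": "Pedro Santos"}],
}
print(to_voting_record(example)["against"])  # ['Pedro Santos']
```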
### Token-Level Predictions

For low-level token-based predictions:

```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "O Executivo deliberou aprovar por unanimidade."
inputs = tokenizer(text, return_tensors="pt")

# Get raw token-level predictions (list of label IDs)
predictions = model.decode(inputs["input_ids"], inputs["attention_mask"])
# Returns: [[0, 7, 15, 8, 0, 0, 2, 10, 0]]

# Convert IDs to labels, skipping special tokens
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, predictions[0]):
    if token not in ('<s>', '</s>', '<pad>'):
        print(f"{token:20} {id2label[pred_id]}")
```
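Token-level BIO labels can be merged into entity spans by grouping each `B-` tag with the `I-` tags that follow it. A minimal BIO decoder over (token, label) pairs — the model's built-in `extract_entities` presumably does something similar, plus subword merging and character-offset tracking:

```python
def bio_to_entities(tokens, labels):
    """Group B-/I- tagged tokens into (entity_type, text) spans."""
    entities, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_type:  # close the previous entity, if any
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)  # continue the open entity
        else:  # O tag, or an I- tag that does not continue the open entity
            if current_type:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:  # flush an entity that runs to the end of the sequence
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["O", "Executivo", "deliberou", "aprovar"]
labels = ["B-VOTER-FAVOR", "I-VOTER-FAVOR", "B-VOTING", "O"]
print(bio_to_entities(tokens, labels))
# [('VOTER-FAVOR', 'O Executivo'), ('VOTING', 'deliberou')]
```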
### With Character Offsets

Useful for highlighting entities in your UI:

```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions with character positions
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text, return_offsets=True)

# Show only entities (non-O tags)
for pred in predictions:
    if pred['label'] != 'O':
        print(f"{pred['word']:20} {pred['label']:25} [{pred['start']}:{pred['end']}]")
```
Output:

```text
O                    B-VOTER-FAVOR             [0:1]
Executivo            I-VOTER-FAVOR             [2:11]
deliberou            B-VOTING                  [12:21]
por                  B-COUNTING-UNANIMITY      [39:42]
unanimidade.         I-COUNTING-UNANIMITY      [43:55]
```
### Batch Processing

For processing multiple documents:

```python
from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/XLM-RoBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

texts = [
    "A proposta foi aprovada por unanimidade.",
    "Votou contra o Vereador João Silva.",
    "O Presidente estava ausente na votação."
]

for text in texts:
    entities = model.extract_entities(text, tokenizer, return_offsets=False)
    print(f"\nText: {text}")
    for entity_type, mentions in entities.items():
        print(f"  {entity_type}: {[m['text'] for m in mentions]}")
```
## Limitations and Bias

### Limitations
- Domain-specific: Trained specifically on Portuguese municipal meeting minutes; may not generalize well to other document types
- Portuguese only: Optimized for European Portuguese
- Sequence length: Limited to 512 tokens per window (handles longer documents via windowing)
- Entity types: Limited to 8 predefined voting-related entity types
- Complex sentences: May struggle with highly complex or nested voting descriptions
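Documents longer than 512 tokens are handled by windowing: the token sequence is split into overlapping windows so that no entity is lost at a hard boundary. A minimal sketch over a list of token IDs (the window and stride values here are illustrative; the model's remote code may use different settings and a different merge strategy):

```python
def sliding_windows(token_ids, window=512, stride=384):
    """Split a token-ID list into overlapping windows of at most
    `window` tokens, advancing by `stride` tokens each time."""
    if len(token_ids) <= window:
        return [token_ids]  # short input: a single window suffices
    windows = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # last window already reaches the end
        start += stride
    return windows

chunks = sliding_windows(list(range(1000)), window=512, stride=384)
print([len(c) for c in chunks])  # [512, 512, 232]
```

Overlapping windows mean the middle tokens are predicted twice; merging typically keeps the prediction from the window where the token sits furthest from an edge.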
## Requirements

Install the required dependencies (quoting the version specifiers so the shell does not treat `>=` as redirection):

```shell
pip install "torch>=2.0.0" "transformers>=4.30.0" "pytorch-crf>=0.7.2"
```
## Model Card Authors

- Anonymous Authors (for blind review)

## Model Card Contact

For questions or issues, please open an issue in the GitHub repository.
## Additional Resources
- GitHub Repository: https://github.com/Anonymous3445/VotIE
- Dataset: Citilink Dataset
- Paper: [Coming soon]
- Demo: VotIE Demo
## License
This model is released under the Creative Commons Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0) license.
- ✅ You can: Use the model for research and commercial purposes with attribution
- ❌ You cannot: Create derivative works or modified versions
- 📝 You must: Provide attribution to the original authors
See LICENSE for full details.
## Acknowledgments
This work builds upon:
- XLM-RoBERTa: Facebook AI's XLM-RoBERTa multilingual base model
- pytorch-crf: CRF implementation
- Transformers: Hugging Face Transformers library
Version: 1.0
Last Updated: 2026-01-08