---
language:
- pt
license: other
license_name: custom-agplv3-dual-license
license_link: https://huggingface.co/inesctec/Citilink-BERTimbau-large-Topic-Classification-pt/blob/main/LICENSE
tags:
- transformers
- text-classification
- multi-label-classification
- bertimbau
- bert
- portuguese
- municipal-documents
- meeting-minutes
- fine-tuned
library_name: transformers
base_model:
- neuralmind/bert-large-portuguese-cased
pipeline_tag: text-classification
---
# Baseline_BERTimbau-large-Topic_Classification-Council-PT: Multi-Label Topic Classification for Portuguese Council Discussion Subjects
## Model Description
**BERTimbau Large Topic Classification Council PT** is a baseline implementation: a fine-tuned version of [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) (BERTimbau Large) for multi-label topic classification. It automatically identifies and categorizes discussion subjects from Portuguese municipal council meeting minutes, using dynamic per-label thresholds to detect multiple simultaneous topics within a single discussion subject.
## Key Features
- 🎯 **Specialized for Municipal Topics**: Fine-tuned on Portuguese council meeting minutes discussion subjects
- 🧠 **Large Transformer Model**: 335M parameters (24 layers, 1024 hidden dim, 16 attention heads)
- 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per subject
- ⚡ **Dynamic Thresholds**: Optimized per-label classification thresholds (not a fixed 0.5)
- 🇵🇹 **Portuguese-Native**: Built on BERTimbau, pre-trained on a Brazilian Portuguese corpus
- 🔄 **End-to-End Learning**: Direct fine-tuning on task-specific data
## Model Details
- **Architecture**: BERT Large (Transformer Encoder)
- **Base Model**: neuralmind/bert-large-portuguese-cased (BERTimbau Large)
- **Parameters**: ~335M
- 24 transformer layers
- 1024 hidden dimensions
- 16 attention heads
- 4096 intermediate size
- **Max Sequence Length**: 512 tokens
- **Learning Rate**: 5e-5
- **Warmup Ratio**: 0.1
- **Batch Size**: 16
- **Optimizer**: AdamW
- **Weight Decay**: 0.01
- **Classification Head**: Linear layer (1024 → 22 labels)
- **Loss Function**: BCEWithLogitsLoss (multi-label)
- **Optimization**: Dynamic per-label thresholds (F1-maximization)
- **Framework**: PyTorch + Hugging Face Transformers
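The hyperparameters above map directly onto the Hugging Face `Trainer` API. The following is a minimal, hypothetical fine-tuning sketch, not the published training script; the toy dataset and `output_dir` are placeholder assumptions:
```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# problem_type selects BCEWithLogitsLoss internally for multi-label heads
model = AutoModelForSequenceClassification.from_pretrained(
    "neuralmind/bert-large-portuguese-cased",
    num_labels=22,
    problem_type="multi_label_classification",
)
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-large-portuguese-cased")

# Tiny stand-in dataset so the sketch runs end to end; replace with real data
texts = ["A Câmara aprovou o orçamento para obras públicas."]
labels = [[0.0] * 22]  # one float multi-hot vector per example
enc = tokenizer(texts, truncation=True, max_length=512, padding="max_length")

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(texts)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

# Hyperparameters as listed in "Model Details"
args = TrainingArguments(
    output_dir="bertimbau-topic-clf",
    learning_rate=5e-5,
    warmup_ratio=0.1,
    per_device_train_batch_size=16,
    weight_decay=0.01,
)

Trainer(model=model, args=args, train_dataset=ToyDataset()).train()
```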
## How It Works
The model processes Portuguese municipal texts through a fine-tuned transformer architecture:
1. **Text Preprocessing**
- Basic normalization (lowercasing, whitespace cleanup)
- Minimum text length filtering (>10 chars)
- Punctuation removal for noise reduction
2. **Tokenization**
- WordPiece tokenization (BERTimbau vocabulary)
- Max length truncation at 512 tokens
- Padding to max length for batching
3. **Transformer Encoding**
- 24-layer BERT encoder processes token sequences
- Self-attention captures contextual dependencies
- [CLS] token representation used for classification
4. **Multi-Label Prediction**
- Linear classification head outputs 22 logits
- Sigmoid activation for independent probabilities
- Dynamic per-label thresholds for final predictions
5. **Threshold Optimization**
- Each label has optimal threshold (0.10-0.90 range)
- Optimized via F1-score grid search on validation set
   - Handles class imbalance better than a fixed 0.5 threshold (see the sketch below)
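A minimal sketch of this per-label search, assuming `val_probs` (sigmoid outputs, shape `(n_samples, 22)`) and `val_labels` (binary ground truth) computed on the validation set; the exact procedure behind the released `optimal_thresholds.npy` may differ:
```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(val_probs, val_labels, grid=np.arange(0.10, 0.91, 0.01)):
    """Pick, for each label, the threshold that maximizes its F1 on validation data."""
    thresholds = np.full(val_probs.shape[1], 0.5)
    for j in range(val_probs.shape[1]):
        scores = [
            f1_score(val_labels[:, j], (val_probs[:, j] >= t).astype(int), zero_division=0)
            for t in grid
        ]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# thresholds = optimize_thresholds(val_probs, val_labels)
# np.save("optimal_thresholds.npy", thresholds)
```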
## Usage
```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from joblib import load

# Load model and tokenizer
model_path = "inesctec/Citilink-BERTimbau-large-Topic-Classification-pt"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load optimal per-label thresholds and the fitted label binarizer
thresholds = np.load("optimal_thresholds.npy")
mlb = load("mlb_encoder.joblib")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Classify text
text = "A Câmara Municipal aprovou o orçamento de 2024 para obras públicas e educação."

# Tokenize (matching the 512-token training configuration)
inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=512,
    truncation=True,
    padding="max_length",
).to(device)

# Predict independent per-label probabilities
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply the optimized per-label thresholds
predictions = (probs >= thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions.reshape(1, -1))[0]

# Display results
print(f"Text: {text}")
print("\nPredicted Topics:")
for label in predicted_labels:
    idx = list(mlb.classes_).index(label)
    print(f"  - {label}: {probs[idx]:.4f} (threshold: {thresholds[idx]:.2f})")
```
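The usage example assumes the input text has already been cleaned. A minimal preprocessing sketch mirroring step 1 of "How It Works" (the exact normalization code is not published, so treat this as illustrative; note that `string.punctuation` covers ASCII punctuation only):
```python
import re
import string

def preprocess(text: str):
    """Illustrative normalization: lowercase, strip punctuation, clean whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    # Minimum length filter (>10 chars), as described above
    return text if len(text) > 10 else None

print(preprocess("A Câmara Municipal aprovou o orçamento de 2024!"))
```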
## Categories
The model classifies topics into 22 Portuguese administrative categories:
| Category | Portuguese Name |
|----------|-----------------|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |
## Evaluation Results
### Comprehensive Performance Metrics
| Metric | Score | Description |
|--------|-------|-------------|
| **F1-macro** | **0.6419** | Macro-averaged F1 score |
| **F1-micro** | **0.8224** | Micro-averaged F1 score |
| **Accuracy** | **0.5709** | Subset accuracy (exact match) |
| **Hamming Loss** | **0.0286** | Label-wise error rate |
| **Average Precision (macro)** | **0.681** | Macro-averaged AP |
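These metrics can be reproduced with scikit-learn. A minimal sketch, assuming `y_true` and `y_pred` are binary `(n_samples, 22)` indicator matrices (predictions after thresholding) and `y_prob` holds the raw sigmoid probabilities:
```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    hamming_loss,
)

def evaluate(y_true, y_pred, y_prob):
    """Compute the metrics reported above on multi-label indicator matrices."""
    return {
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_micro": f1_score(y_true, y_pred, average="micro", zero_division=0),
        "subset_accuracy": accuracy_score(y_true, y_pred),  # exact-match accuracy
        "hamming_loss": hamming_loss(y_true, y_pred),
        "ap_macro": average_precision_score(y_true, y_prob, average="macro"),
    }
```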
## License
This project uses a custom dual-license based on AGPL v3.
See the full license terms here: [LICENSE](./LICENSE)