---
language:
- pt
license: other
license_name: custom-agplv3-dual-license
license_link: https://huggingface.co/inesctec/Citilink-BERTimbau-large-Topic-Classification-pt/blob/main/LICENSE
tags:
- transformers
- text-classification
- multi-label-classification
- bertimbau
- bert
- portuguese
- municipal-documents
- meeting-minutes
- fine-tuned
library_name: transformers
base_model:
- neuralmind/bert-large-portuguese-cased
pipeline_tag: text-classification
---

# Baseline_BERTimbau-large-Topic_Classification-Council-PT: Multi-Label Topic Classification for Portuguese Council Discussion Subjects

## Model Description

**BERTimbau Large Topic Classification Council PT** is a baseline model: a fine-tuned version of [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) (BERTimbau Large) for multi-label topic classification. It automatically identifies and categorizes discussion subjects from Portuguese municipal council meeting minutes, and uses dynamic per-label decision thresholds so that multiple simultaneous topics can be assigned to a single discussion subject.
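The dynamic per-label thresholds mentioned above can be obtained with a simple grid search that maximizes F1 per label on a validation set. A minimal sketch of the idea (the helper name and toy arrays are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(probs, labels, grid=np.arange(0.10, 0.91, 0.05)):
    """For each label, pick the threshold in `grid` that maximizes F1
    on validation data. `probs` and `labels` are (n_samples, n_labels)."""
    n_labels = probs.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best_f1 = -1.0
        for t in grid:
            f1 = f1_score(labels[:, j], (probs[:, j] >= t).astype(int),
                          zero_division=0)
            if f1 > best_f1:
                best_f1, thresholds[j] = f1, t
    return thresholds

# Toy example: 2 labels with different optimal operating points
probs = np.array([[0.2, 0.8], [0.3, 0.4], [0.7, 0.9], [0.15, 0.35]])
labels = np.array([[1, 1], [1, 0], [1, 1], [0, 0]])
thresholds = optimize_thresholds(probs, labels)
# np.save("optimal_thresholds.npy", thresholds)  # persisted like the file used below
```

Because each label gets its own operating point, rare labels can use a lower threshold than frequent ones, which is what makes this scheme more robust to class imbalance than a single fixed cut-off.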
## Key Features

- 🎯 **Specialized for Municipal Topics**: Fine-tuned on discussion subjects from Portuguese council meeting minutes
- 🧠 **Large Transformer Model**: 335M parameters (24 layers, 1024 hidden dim, 16 attention heads)
- 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per subject
- ⚡ **Dynamic Thresholds**: Optimized per-label classification thresholds (not a fixed 0.5)
- 🇵🇹 **Portuguese-Native**: Built on BERTimbau, pre-trained on a Brazilian Portuguese corpus
- 🔄 **End-to-End Learning**: Direct fine-tuning on task-specific data

## Model Details

- **Architecture**: BERT Large (Transformer encoder)
- **Base Model**: neuralmind/bert-large-portuguese-cased (BERTimbau Large)
- **Parameters**: ~335M
  - 24 transformer layers
  - 1024 hidden dimensions
  - 16 attention heads
  - 4096 intermediate size
- **Max Sequence Length**: 512 tokens
- **Learning Rate**: 5e-5
- **Warmup Ratio**: 0.1
- **Batch Size**: 16
- **Optimizer**: AdamW
- **Weight Decay**: 0.01
- **Classification Head**: Linear layer (1024 → 22 labels)
- **Loss Function**: BCEWithLogitsLoss (multi-label)
- **Optimization**: Dynamic per-label thresholds (F1 maximization)
- **Framework**: PyTorch + Hugging Face Transformers

## How It Works

The model processes Portuguese municipal texts through a fine-tuned transformer architecture:

1. **Text Preprocessing**
   - Basic normalization (lowercasing, whitespace cleanup)
   - Minimum text length filtering (>10 chars)
   - Punctuation removal for noise reduction
2. **Tokenization**
   - WordPiece tokenization (BERTimbau vocabulary)
   - Truncation at 512 tokens
   - Padding to max length for batching
3. **Transformer Encoding**
   - 24-layer BERT encoder processes token sequences
   - Self-attention captures contextual dependencies
   - [CLS] token representation used for classification
4. **Multi-Label Prediction**
   - Linear classification head outputs 22 logits
   - Sigmoid activation yields independent per-label probabilities
   - Dynamic per-label thresholds produce the final predictions
5. **Threshold Optimization**
   - Each label has its own optimal threshold (0.10-0.90 range)
   - Optimized via F1-score grid search on the validation set
   - Handles class imbalance better than a fixed 0.5 threshold

## Usage

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from joblib import load

# Load model and tokenizer
model_path = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load optimal thresholds and label encoder
thresholds = np.load("optimal_thresholds.npy")
mlb = load("mlb_encoder.joblib")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Classify text
text = "A Câmara Municipal aprovou o orçamento de 2024 para obras públicas e educação."

# Tokenize (512 matches the model's maximum sequence length)
inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=512,
    truncation=True,
    padding="max_length",
).to(device)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply optimized thresholds
predictions = (probs >= thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions.reshape(1, -1))[0]

# Display results
print(f"Text: {text}")
print("\nPredicted Topics:")
for label in predicted_labels:
    idx = list(mlb.classes_).index(label)
    print(f"  - {label}: {probs[idx]:.4f} (threshold: {thresholds[idx]:.2f})")
```

## Categories

The model classifies topics into 22 Portuguese administrative categories:

| Category | Portuguese Name |
|----------|-----------------|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |

## Evaluation Results

### Comprehensive Performance Metrics

| Metric | Score | Description |
|--------|-------|-------------|
| **F1-macro** | **0.6419** | Macro-averaged F1 score |
| **F1-micro** | **0.8224** | Micro-averaged F1 score |
| **Accuracy** | **0.5709** | Subset accuracy (exact match) |
| **Hamming Loss** | **0.0286** | Label-wise error rate |
| **Average Precision (macro)** | **0.681** | Macro-averaged AP |

## License

This project uses a custom dual license based on AGPL v3. See the full license terms here: [LICENSE](./LICENSE)
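For reference, the multi-label metrics reported in the evaluation table can all be computed with standard scikit-learn functions. A minimal sketch with toy indicator matrices (not the project's actual evaluation script):

```python
import numpy as np
from sklearn.metrics import (
    f1_score, accuracy_score, hamming_loss, average_precision_score
)

# Toy multi-label data: 4 samples, 3 labels (binary indicator matrices)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.7],
                   [0.1, 0.8, 0.6],   # one false positive on label 2
                   [0.6, 0.7, 0.4],
                   [0.2, 0.1, 0.9]])
y_pred = (y_prob >= 0.5).astype(int)

print("F1-macro:", f1_score(y_true, y_pred, average="macro"))
print("F1-micro:", f1_score(y_true, y_pred, average="micro"))
# On indicator matrices, accuracy_score is subset accuracy (exact match)
print("Subset accuracy:", accuracy_score(y_true, y_pred))
print("Hamming loss:", hamming_loss(y_true, y_pred))
print("AP (macro):", average_precision_score(y_true, y_prob, average="macro"))
```

Note that subset accuracy only counts a sample as correct when every one of its 22 labels is right, which is why it is much lower than the label-wise metrics above.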