Baseline_BERTimbau-large-Topic_Classification-Council-PT: Multi-Label Topic Classification for Portuguese Council Discussion Subjects

Model Description

BERTimbau Large Topic Classification Council PT is a baseline model: a fine-tuned version of neuralmind/bert-large-portuguese-cased (BERTimbau Large) for multi-label topic classification. It automatically identifies and categorizes discussion subjects from Portuguese municipal council meeting minutes. Because a single discussion subject can cover several topics at once, predictions use dynamic per-label thresholds rather than one fixed cutoff.

Key Features

🎯 Specialized for Municipal Topics: Fine-tuned on discussion subjects from Portuguese council meeting minutes
🧠 Large Transformer Model: 335M parameters (24 layers, 1024 hidden dim, 16 attention heads)
📊 Multi-Label Classification: Identifies multiple co-occurring topics per subject
⚡ Dynamic Thresholds: Optimized per-label classification thresholds (not fixed 0.5)
🇵🇹 Portuguese-Native: Built on BERTimbau, which was pre-trained on a Brazilian Portuguese corpus
🔄 End-to-End Learning: Direct fine-tuning on task-specific data

Model Details

Architecture: BERT Large (Transformer Encoder)
Base Model: neuralmind/bert-large-portuguese-cased (BERTimbau Large)
Parameters: ~335M
- 24 transformer layers
- 1024 hidden dimensions
- 16 attention heads
- 4096 intermediate size
Max Sequence Length: 512 tokens
Learning Rate: 5e-5
Warmup Ratio: 0.1
Batch Size: 16
Optimizer: AdamW
Weight Decay: 0.01
Classification Head: Linear layer (1024 → 22 labels)
Loss Function: BCEWithLogitsLoss (multi-label)
Optimization: Dynamic per-label thresholds (F1-maximization)
Framework: PyTorch + Hugging Face Transformers
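
As a sanity check on the ~335M figure, the parameter count can be estimated from the sizes listed above. This is a rough sketch, not output from the released checkpoint: the vocabulary size of 29,794 is an assumption about the BERTimbau tokenizer, and the layout (embeddings, 24 encoder layers, pooler, linear head) is the standard BERT sequence-classification layout.

```python
# Rough parameter count for BERT Large with a 22-label linear head.
# ASSUMPTION: vocab_size = 29_794 (BERTimbau vocabulary); the other sizes
# are taken from the Model Details list.
hidden, layers, intermediate = 1024, 24, 4096
vocab_size, max_positions, type_vocab, num_labels = 29_794, 512, 2, 22


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by one LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value and output projections, followed by one LayerNorm.
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers (1024 -> 4096 -> 1024), followed by one LayerNorm.
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total = embeddings + encoder + pooler + classifier
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the total lands close to the reported ~335M, with the classification head contributing only a few tens of thousands of parameters.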

How It Works

The model processes Portuguese municipal texts through a fine-tuned transformer architecture:

Text Preprocessing
- Basic normalization (lowercasing, whitespace cleanup)
- Minimum text length filtering (>10 chars)
- Punctuation removal for noise reduction
Tokenization
- WordPiece tokenization (BERTimbau vocabulary)
- Max length truncation at 512 tokens
- Padding to max length for batching
Transformer Encoding
- 24-layer BERT encoder processes token sequences
- Self-attention captures contextual dependencies
- [CLS] token representation used for classification
Multi-Label Prediction
- Linear classification head outputs 22 logits
- Sigmoid activation for independent probabilities
- Dynamic per-label thresholds for final predictions
Threshold Optimization
- Each label has its own optimal threshold (searched within the 0.10-0.90 range)
- Thresholds are selected by an F1-score grid search on the validation set
- Handles class imbalance better than a fixed 0.5 threshold
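
The threshold optimization step can be sketched for a single label as follows. This is an illustrative pure-Python version, not the authors' code: the 0.05 grid step, the tie-breaking rule, and the helper names `f1_score` and `best_threshold` are assumptions, and the toy data is invented for the example.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one label; defined as 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, probs, grid):
    """Return (threshold, F1) for the grid value with the highest F1.

    On ties the lowest threshold wins (an arbitrary choice for this sketch).
    """
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Candidate thresholds 0.10, 0.15, ..., 0.90 (ASSUMPTION: step of 0.05).
grid = [i / 100 for i in range(10, 95, 5)]

# Toy validation data for one rare label: both positives score below 0.5.
y_true = [1, 1, 0, 0, 0, 0]
probs = [0.42, 0.37, 0.22, 0.12, 0.05, 0.31]

threshold, f1 = best_threshold(y_true, probs, grid)
fixed_f1 = f1_score(y_true, [1 if p >= 0.5 else 0 for p in probs])
print(f"best threshold={threshold:.2f}, F1={f1:.2f}, F1 at 0.5={fixed_f1:.2f}")
```

In this toy case a fixed 0.5 cutoff misses every positive, while the searched threshold recovers them. Repeating the search independently for each of the 22 labels would yield one threshold per label.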

Usage

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from joblib import load

# Load model and tokenizer
model_path = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load optimal thresholds and label encoder
thresholds = np.load("optimal_thresholds.npy")
mlb = load("mlb_encoder.joblib")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Classify text
text = "A Câmara Municipal aprovou o orçamento de 2024 para obras públicas e educação."

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=512,  # matches the 512-token limit listed under Model Details
    truncation=True,
    padding="max_length"
).to(device)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply optimized thresholds
predictions = (probs >= thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions.reshape(1, -1))[0]

# Display results
print(f"Text: {text}")
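print("\nPredicted Topics:")
# Map each label name to its column index once, instead of searching per label
label_to_idx = {label: i for i, label in enumerate(mlb.classes_)}
for label in predicted_labels:
    idx = label_to_idx[label]
    print(f"  - {label}: {probs[idx]:.4f} (threshold: {thresholds[idx]:.2f})")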
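
The usage example passes raw text to the tokenizer, while "How It Works" lists a preprocessing step. The card does not give the exact rules, so the following is only a sketch of what that step might look like: the helper name `preprocess`, the order of operations, and the choice to replace punctuation with spaces are all assumptions. Whether to apply it at inference time should be checked against the actual training pipeline.

```python
import re

MIN_CHARS = 10  # texts must be longer than this, per the ">10 chars" rule


def preprocess(text):
    """Lowercase, strip punctuation, collapse whitespace; None if too short."""
    text = text.lower()
    # \w is Unicode-aware in Python 3, so accented letters (ç, ã, é) are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the removed punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text) > MIN_CHARS else None


print(preprocess("A Câmara  Municipal aprovou, hoje!"))
print(preprocess("Sim."))
```

Returning `None` for short inputs lets the caller skip them before tokenization instead of classifying near-empty strings.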

Categories

The model predicts the following 22 topic labels:

Category	Portuguese Name
General Administration	Administração Geral, Finanças e Recursos Humanos
Environment	Ambiente
Economic Activities	Atividades Económicas
Social Action	Ação Social
Science	Ciência
Communication	Comunicação e Relações Públicas
External Cooperation	Cooperação Externa e Relações Internacionais
Culture	Cultura
Sports	Desporto
Education	Educação e Formação Profissional
Energy & Telecommunications	Energia e Telecomunicações
Housing	Habitação
Private Construction	Obras Particulares
Public Works	Obras Públicas
Territorial Planning	Ordenamento do Território
Other	Outros
Heritage	Património
Municipal Police	Polícia Municipal
Animal Protection	Proteção Animal
Civil Protection	Proteção Civil
Health	Saúde
Traffic & Transport	Trânsito, Transportes e Comunicações

Evaluation Results

Comprehensive Performance Metrics

Metric	Score	Description
F1-macro	0.6419	Macro-averaged F1 score
F1-micro	0.8224	Micro-averaged F1 score
Accuracy	0.5709	Subset accuracy (exact match)
Hamming Loss	0.0286	Label-wise error rate
Average Precision (macro)	0.681	Macro-averaged AP
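
To make the metric definitions concrete, here is a small pure-Python sketch that computes F1-macro, F1-micro, subset accuracy, and Hamming loss on invented toy data (4 subjects, 3 labels). It is not the evaluation script behind the table: the helper names are hypothetical, and treating F1 as 0 for a label with no true or predicted positives is an assumption. Average precision is omitted because it needs probabilities rather than binary predictions.

```python
def counts(y_true, y_pred, j):
    """True positives, false positives, false negatives for label column j."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 from raw counts; 0.0 when the label never occurs or is never predicted."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: rows are subjects, columns are labels. Label 1 is rare and missed.
y_true = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [1, 0, 0], [1, 0, 0], [0, 0, 1]]

n_samples, n_labels = len(y_true), len(y_true[0])
per_label = [counts(y_true, y_pred, j) for j in range(n_labels)]

# Macro: average the per-label F1 scores, so every label weighs the same.
f1_macro = sum(f1(*c) for c in per_label) / n_labels
# Micro: pool the counts over all labels, then compute a single F1.
f1_micro = f1(*(sum(col) for col in zip(*per_label)))
# Subset accuracy: a subject counts only if its whole label set is exact.
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples
# Hamming loss: fraction of individual label decisions that are wrong.
hamming_loss = sum(
    tj != pj for t, p in zip(y_true, y_pred) for tj, pj in zip(t, p)
) / (n_samples * n_labels)

print(f"F1-macro={f1_macro:.4f}  F1-micro={f1_micro:.4f}")
print(f"Subset accuracy={subset_accuracy:.4f}  Hamming loss={hamming_loss:.4f}")
```

In the toy data one missed rare label pulls F1-macro well below F1-micro, the same kind of gap seen between the reported 0.6419 and 0.8224.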

License

This model is released under the cc-by-nc-nd-4.0 license.
