Baseline_BERTimbau-large-Topic_Classification-Council-PT: Multi-Label Topic Classification for Portuguese Council Discussion Subjects

Model Description

BERTimbau Large Topic Classification Council PT is a baseline model: a fine-tuned version of neuralmind/bert-large-portuguese-cased (BERTimbau Large) for multi-label topic classification. It automatically identifies and categorizes discussion subjects in Portuguese municipal council meeting minutes, and it uses dynamic per-label thresholds so that several topics can be assigned to the same discussion subject.

Key Features

  • 🎯 Specialized for Municipal Topics: Fine-tuned on Portuguese council meeting minutes discussion subjects
  • 🧠 Large Transformer Model: 335M parameters (24 layers, 1024 hidden dim, 16 attention heads)
  • 📊 Multi-Label Classification: Identifies multiple co-occurring topics per subject
  • Dynamic Thresholds: Optimized per-label classification thresholds (not fixed 0.5)
  • 🇵🇹 Portuguese-Native: Built on BERTimbau, pre-trained on a Brazilian Portuguese corpus
  • 🔄 End-to-End Learning: Direct fine-tuning on task-specific data

Model Details

  • Architecture: BERT Large (Transformer Encoder)
  • Base Model: neuralmind/bert-large-portuguese-cased (BERTimbau Large)
  • Parameters: ~335M
    • 24 transformer layers
    • 1024 hidden dimensions
    • 16 attention heads
    • 4096 intermediate size
  • Max Sequence Length: 512 tokens
  • Learning Rate: 5e-5
  • Warmup: 0.1
  • Batch Size: 16
  • Optimizer: AdamW
  • Weight Decay: 0.01
  • Classification Head: Linear layer (1024 → 22 labels)
  • Loss Function: BCEWithLogitsLoss (multi-label)
  • Optimization: Dynamic per-label thresholds (F1-maximization)
  • Framework: PyTorch + Hugging Face Transformers
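
As a sanity check on the ~335M figure, the parameter count can be approximated from the dimensions listed above. This is a rough sketch, not an inspection of the released checkpoint: the vocabulary size (29,794), the 512 position embeddings, the 2 token-type embeddings, and the presence of a pooler layer are assumptions based on the standard BERT layout and are not stated in this card.

```python
# Rough parameter count for a BERT-large encoder with a 22-label head.
# VOCAB_SIZE, MAX_POSITIONS, TYPE_VOCAB and the pooler are assumptions,
# not values taken from this model card.
HIDDEN = 1024
LAYERS = 24
INTERMEDIATE = 4096
NUM_LABELS = 22
VOCAB_SIZE = 29_794   # assumed BERTimbau vocabulary size
MAX_POSITIONS = 512   # assumed number of position embeddings
TYPE_VOCAB = 2        # assumed number of token-type embeddings


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


LAYER_NORM = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + LAYER_NORM
per_layer = (
    4 * linear(HIDDEN, HIDDEN)        # query, key, value, attention output
    + LAYER_NORM                      # normalization after attention
    + linear(HIDDEN, INTERMEDIATE)    # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)    # feed-forward down-projection
    + LAYER_NORM                      # normalization after feed-forward
)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total = embeddings + LAYERS * per_layer + pooler + classifier
print(f"per layer:  {per_layer:,}")
print(f"classifier: {classifier:,}")
print(f"total:      {total:,} (~{total / 1e6:.0f}M)")
```

Under these assumptions the total comes out at roughly 334M, in line with the ~335M stated above. The classification head itself contributes only about 22.5K parameters, so almost all capacity sits in the encoder.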

How It Works

The model processes Portuguese municipal texts through a fine-tuned transformer architecture:

  1. Text Preprocessing

    • Basic normalization (lowercasing, whitespace cleanup)
    • Minimum text length filtering (>10 chars)
    • Punctuation removal for noise reduction
  2. Tokenization

    • WordPiece tokenization (BERTimbau vocabulary)
    • Max length truncation at 512 tokens
    • Padding to max length for batching
  3. Transformer Encoding

    • 24-layer BERT encoder processes token sequences
    • Self-attention captures contextual dependencies
    • [CLS] token representation used for classification
  4. Multi-Label Prediction

    • Linear classification head outputs 22 logits
    • Sigmoid activation for independent probabilities
    • Dynamic per-label thresholds for final predictions
  5. Threshold Optimization

    • Each label has its own optimal threshold (0.10-0.90 range)
    • Thresholds are chosen by an F1-score grid search on the validation set
    • Handles class imbalance better than a fixed 0.5 threshold
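
Steps 4 and 5 can be illustrated with a small, dependency-free sketch: logits are turned into independent probabilities with a sigmoid, and each label gets the grid threshold that maximizes its F1 on validation data. The 0.05 grid step, the tie-breaking rule, the helper names, and the toy data are assumptions made for illustration; the card only states the 0.10-0.90 range and the F1 criterion.

```python
import math


def sigmoid(z):
    """Map a logit to an independent probability."""
    return 1.0 / (1.0 + math.exp(-z))


def f1_score(y_true, y_pred):
    """Binary F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probs, grid):
    """Return the grid value with the highest F1 (the first one wins ties)."""
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Candidate thresholds 0.10, 0.15, ..., 0.90 (the step size is an assumption).
GRID = [round(0.10 + 0.05 * i, 2) for i in range(17)]

# Toy validation data: rows are examples, columns are labels.
val_logits = [[2.0, -1.5], [0.3, -0.8], [-0.4, -2.0], [-2.2, 0.1]]
val_labels = [[1, 0], [1, 1], [1, 0], [0, 1]]

thresholds = []
for j in range(len(val_labels[0])):
    probs_j = [sigmoid(row[j]) for row in val_logits]
    truth_j = [row[j] for row in val_labels]
    t, f1 = best_threshold(truth_j, probs_j, GRID)
    thresholds.append(t)
    print(f"label {j}: threshold={t:.2f}, validation F1={f1:.3f}")
```

Because each label is searched independently, a rare label can end up with a threshold well below 0.5, which is how this approach compensates for class imbalance.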

Usage

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from joblib import load

# Load model and tokenizer
model_path = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load optimal thresholds and label encoder
thresholds = np.load("optimal_thresholds.npy")
mlb = load("mlb_encoder.joblib")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Classify text
text = "A Câmara Municipal aprovou o orçamento de 2024 para obras públicas e educação."

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=512,  # matches the 512-token limit listed under Model Details
    truncation=True,
    padding="max_length"
).to(device)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply optimized thresholds
predictions = (probs >= thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions.reshape(1, -1))[0]

# Display results
print(f"Text: {text}")
print("\nPredicted Topics:")
for label in predicted_labels:
    idx = list(mlb.classes_).index(label)
    prob = probs[idx]
    thresh = thresholds[idx]
    print(f"  - {label}: {prob:.4f} (threshold: {thresh:.2f})")
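
The usage example passes raw text to the tokenizer. The card does not say whether the preprocessing listed under How It Works is also expected at inference time, and lowercasing is worth double-checking against a cased base model. If the training-time normalization does need to be mirrored, a minimal sketch could look like the following. The helper names and the exact punctuation set are assumptions, not the authors' pipeline.

```python
import re
import string

MIN_CHARS = 10  # the card mentions filtering texts of 10 characters or fewer

# Assumption: "punctuation" means ASCII punctuation plus a few typographic marks.
_PUNCT = re.compile("[" + re.escape(string.punctuation + "«»“”‘’…") + "]")
_SPACES = re.compile(r"\s+")


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = _PUNCT.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def keep(text):
    """Return True when the text is longer than the minimum length."""
    return len(text) > MIN_CHARS


sample = "  A Câmara Municipal aprovou o orçamento de 2024,\n para obras públicas e educação.  "
cleaned = normalize(sample)
print(cleaned)
print(keep(cleaned), keep(normalize("Ok.")))
```

The cleaned string would then be passed to the tokenizer in place of the raw text.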

Categories

The model classifies topics into 22 Portuguese administrative categories:

| Category | Portuguese Name |
|---|---|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |

Evaluation Results

Comprehensive Performance Metrics

| Metric | Score | Description |
|---|---|---|
| F1-macro | 0.6419 | Macro-averaged F1 score |
| F1-micro | 0.8224 | Micro-averaged F1 score |
| Accuracy | 0.5709 | Subset accuracy (exact match) |
| Hamming Loss | 0.0286 | Label-wise error rate |
| Average Precision (macro) | 0.681 | Macro-averaged AP |
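
To make the metric definitions concrete, the sketch below computes Hamming loss, subset accuracy, and micro and macro F1 on a tiny made-up example. The data and function names are illustrative only and have no relation to the scores reported above. Average precision is left out because it depends on ranked probabilities rather than thresholded predictions.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose full label set is predicted exactly."""
    return sum(tr == pr for tr, pr in zip(y_true, y_pred)) / len(y_true)


def label_counts(y_true, y_pred, j):
    """True positives, false positives and false negatives for label j."""
    tp = sum(tr[j] == 1 and pr[j] == 1 for tr, pr in zip(y_true, y_pred))
    fp = sum(tr[j] == 0 and pr[j] == 1 for tr, pr in zip(y_true, y_pred))
    fn = sum(tr[j] == 1 and pr[j] == 0 for tr, pr in zip(y_true, y_pred))
    return tp, fp, fn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_macro(y_true, y_pred):
    """Average of per-label F1 scores; every label weighs the same."""
    n_labels = len(y_true[0])
    return sum(f1(*label_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


def f1_micro(y_true, y_pred):
    """F1 over counts pooled across labels; frequent labels dominate."""
    counts = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    tp, fp, fn = (sum(c[k] for c in counts) for k in range(3))
    return f1(tp, fp, fn)


# Toy data: 4 examples, 3 labels (illustrative only).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

print(f"Hamming loss:    {hamming_loss(y_true, y_pred):.4f}")
print(f"Subset accuracy: {subset_accuracy(y_true, y_pred):.4f}")
print(f"F1-micro:        {f1_micro(y_true, y_pred):.4f}")
print(f"F1-macro:        {f1_macro(y_true, y_pred):.4f}")
```

A gap between micro and macro F1 like the one in the table typically means that some less frequent labels are predicted noticeably worse than the common ones, since macro averaging gives every label equal weight.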

License

This model is released under the cc-by-nc-nd-4.0 license.
