---
language:
- pt
license: other
license_name: custom-agplv3-dual-license
license_link: https://huggingface.co/inesctec/Citilink-BERTimbau-large-Topic-Classification-pt/blob/main/LICENSE
tags:
- transformers
- text-classification
- multi-label-classification
- bertimbau
- bert
- portuguese
- municipal-documents
- meeting-minutes
- fine-tuned
library_name: transformers
base_model:
- neuralmind/bert-large-portuguese-cased
pipeline_tag: text-classification
---

# Baseline_BERTimbau-large-Topic_Classification-Council-PT: Multi-Label Topic Classification for Portuguese Council Discussion Subjects

## Model Description

**BERTimbau Large Topic Classification Council PT** is a baseline implementation: a fine-tuned version of [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) (BERTimbau Large) for multi-label topic classification of discussion subjects from Portuguese municipal council meeting minutes. It uses dynamically optimized per-label thresholds to identify multiple simultaneous topics within a single discussion subject.
## Key Features

- 🎯 **Specialized for Municipal Topics**: Fine-tuned on discussion subjects from Portuguese council meeting minutes
- 🧠 **Large Transformer Model**: 335M parameters (24 layers, 1024 hidden dim, 16 attention heads)
- 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per subject
- ⚡ **Dynamic Thresholds**: Optimized per-label classification thresholds (not a fixed 0.5)
- 🇵🇹 **Portuguese-Native**: Built on BERTimbau, pre-trained on a Brazilian Portuguese corpus
- 🔄 **End-to-End Learning**: Direct fine-tuning on task-specific data

## Model Details

- **Architecture**: BERT Large (Transformer encoder)
- **Base Model**: neuralmind/bert-large-portuguese-cased (BERTimbau Large)
- **Parameters**: ~335M
  - 24 transformer layers
  - 1024 hidden dimensions
  - 16 attention heads
  - 4096 intermediate size
- **Max Sequence Length**: 512 tokens
- **Learning Rate**: 5e-5
- **Warmup Ratio**: 0.1
- **Batch Size**: 16
- **Optimizer**: AdamW
- **Weight Decay**: 0.01
- **Classification Head**: Linear layer (1024 → 22 labels)
- **Loss Function**: BCEWithLogitsLoss (multi-label)
- **Optimization**: Dynamic per-label thresholds (F1 maximization)
- **Framework**: PyTorch + Hugging Face Transformers
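
The head and loss configuration listed above can be sketched in plain PyTorch. This is an illustrative reconstruction, not the actual training script; the variable names and the toy batch are ours:

```python
import torch
import torch.nn as nn

# Sketch of the multi-label setup: linear head (1024 -> 22),
# BCEWithLogitsLoss, and AdamW with the stated hyperparameters.
hidden_size, num_labels = 1024, 22
classifier = nn.Linear(hidden_size, num_labels)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5e-5, weight_decay=0.01)

# Toy batch: pooled [CLS] embeddings and multi-hot label targets
cls_embeddings = torch.randn(16, hidden_size)
labels = torch.randint(0, 2, (16, num_labels)).float()

logits = classifier(cls_embeddings)  # shape: (16, 22), one logit per label
loss = loss_fn(logits, labels)       # averaged over every (sample, label) slot
loss.backward()
optimizer.step()
```

In the full model the `cls_embeddings` come from the 24-layer BERT encoder; only the head and loss are shown here.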

## How It Works

The model processes Portuguese municipal texts through a fine-tuned transformer architecture:

1. **Text Preprocessing**
   - Basic normalization (lowercasing, whitespace cleanup)
   - Minimum text length filtering (>10 chars)
   - Punctuation removal for noise reduction

2. **Tokenization**
   - WordPiece tokenization (BERTimbau vocabulary)
   - Max-length truncation at 512 tokens
   - Padding to max length for batching

3. **Transformer Encoding**
   - 24-layer BERT encoder processes token sequences
   - Self-attention captures contextual dependencies
   - [CLS] token representation used for classification

4. **Multi-Label Prediction**
   - Linear classification head outputs 22 logits
   - Sigmoid activation for independent per-label probabilities
   - Dynamic per-label thresholds for final predictions

5. **Threshold Optimization**
   - Each label has an optimal threshold (0.10–0.90 range)
   - Optimized via F1-score grid search on the validation set
   - Handles class imbalance better than a fixed 0.5 threshold
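
The per-label grid search in step 5 can be sketched as follows. This is a minimal reconstruction of the described procedure, not the project's code; the function name and grid step are our assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(probs, y_true, grid=np.arange(0.10, 0.91, 0.05)):
    """For each label, pick the threshold in `grid` that maximizes F1.
    probs: (n_samples, n_labels) sigmoid outputs from validation data;
    y_true: matching multi-hot ground-truth array."""
    n_labels = y_true.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        scores = [f1_score(y_true[:, j], (probs[:, j] >= t).astype(int),
                           zero_division=0) for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# Toy example: two labels whose best cut-offs differ from 0.5
probs = np.array([[0.2, 0.8], [0.3, 0.6], [0.9, 0.1]])
y_true = np.array([[0, 1], [1, 1], [1, 0]])
thresholds = optimize_thresholds(probs, y_true)
```

Because each label gets its own cut-off, rare labels can use a low threshold while frequent ones use a stricter one, which is why this beats a uniform 0.5.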

## Usage

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from joblib import load

# Load model and tokenizer
model_path = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load optimal thresholds and label encoder
thresholds = np.load("optimal_thresholds.npy")
mlb = load("mlb_encoder.joblib")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Classify text
text = "A Câmara Municipal aprovou o orçamento de 2024 para obras públicas e educação."

# Tokenize (the model accepts up to 512 tokens)
inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=512,
    truncation=True,
    padding="max_length"
).to(device)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply optimized thresholds
predictions = (probs >= thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions.reshape(1, -1))[0]

# Display results
print(f"Text: {text}")
print("\nPredicted Topics:")
for label in predicted_labels:
    idx = list(mlb.classes_).index(label)
    prob = probs[idx]
    thresh = thresholds[idx]
    print(f"  - {label}: {prob:.4f} (threshold: {thresh:.2f})")
```

## Categories

The model classifies topics into 22 Portuguese administrative categories:

| Category | Portuguese Name |
|----------|-----------------|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |
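
The multi-hot encoding over these category names can be illustrated with scikit-learn's `MultiLabelBinarizer`. This is a sketch with an abbreviated category list; the shipped `mlb_encoder.joblib` is assumed to be an equivalent encoder fitted on all 22 categories:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# A few of the 22 category names above (abbreviated for illustration)
categories = ["Ambiente", "Cultura", "Educação e Formação Profissional",
              "Obras Públicas", "Saúde"]
mlb = MultiLabelBinarizer(classes=sorted(categories))
mlb.fit([categories])

# A subject tagged with two co-occurring topics becomes one multi-hot row
encoded = mlb.transform([["Obras Públicas", "Educação e Formação Profissional"]])
print(dict(zip(mlb.classes_, encoded[0])))
```

`inverse_transform` maps thresholded prediction vectors back to category names, as in the usage example above.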

## Evaluation Results

### Comprehensive Performance Metrics

| Metric | Score | Description |
|--------|-------|-------------|
| **F1-macro** | **0.6419** | Macro-averaged F1 score |
| **F1-micro** | **0.8224** | Micro-averaged F1 score |
| **Accuracy** | **0.5709** | Subset accuracy (exact match) |
| **Hamming Loss** | **0.0286** | Label-wise error rate |
| **Average Precision (macro)** | **0.681** | Macro-averaged AP |
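
These metrics follow standard scikit-learn definitions and can be computed from multi-hot predictions and sigmoid scores. The arrays below are toy values for illustration, not the model's test-set outputs:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, hamming_loss)

# Toy multi-hot ground truth, thresholded predictions, and sigmoid scores
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.8], [0.1, 0.7, 0.6], [0.8, 0.9, 0.3]])

f1_macro = f1_score(y_true, y_pred, average="macro")
f1_micro = f1_score(y_true, y_pred, average="micro")
subset_acc = accuracy_score(y_true, y_pred)  # exact-match ratio: whole row must be right
h_loss = hamming_loss(y_true, y_pred)        # fraction of wrong individual label slots
ap_macro = average_precision_score(y_true, y_prob, average="macro")
```

Note that subset accuracy is strict (one wrong label fails the whole sample), which is why it is much lower than the micro F1 here.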

## License

This project is released under a custom dual license based on AGPL v3.

See the full license terms here: [LICENSE](./LICENSE)