---
language:
- pt
license: other
license_name: custom-agplv3-dual-license
license_link: https://huggingface.co/inesctec/Citilink-BERTimbau-large-Topic-Classification-pt/blob/main/LICENSE
tags:
- transformers
- text-classification
- multi-label-classification
- bertimbau
- bert
- portuguese
- municipal-documents
- meeting-minutes
- fine-tuned
library_name: transformers
base_model:
- neuralmind/bert-large-portuguese-cased
pipeline_tag: text-classification
---

# Baseline_BERTimbau-large-Topic_Classification-Council-PT: Multi-Label Topic Classification for Portuguese Council Discussion Subjects

## Model Description

**BERTimbau Large Topic Classification Council PT** is a baseline model: a fine-tuned version of [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) (BERTimbau Large) for multi-label topic classification. It automatically identifies and categorizes discussion subjects from Portuguese municipal council meeting minutes, and uses dynamic per-label decision thresholds so that multiple simultaneous topics can be assigned to a single discussion subject.
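The dynamic per-label thresholds mentioned above can be obtained with a simple grid search that maximizes F1 per label on a validation set. A minimal sketch of the idea (the helper name and toy arrays are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(probs, labels, grid=np.arange(0.10, 0.91, 0.05)):
    """For each label, pick the threshold in `grid` that maximizes F1
    on validation data. `probs` and `labels` are (n_samples, n_labels)."""
    n_labels = probs.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best_f1 = -1.0
        for t in grid:
            f1 = f1_score(labels[:, j], (probs[:, j] >= t).astype(int),
                          zero_division=0)
            if f1 > best_f1:
                best_f1, thresholds[j] = f1, t
    return thresholds

# Toy example: 2 labels with different optimal operating points
probs = np.array([[0.2, 0.8], [0.3, 0.4], [0.7, 0.9], [0.15, 0.35]])
labels = np.array([[1, 1], [1, 0], [1, 1], [0, 0]])
thresholds = optimize_thresholds(probs, labels)
# np.save("optimal_thresholds.npy", thresholds)  # persisted like the file used below
```

Because each label gets its own operating point, rare labels can use a lower threshold than frequent ones, which is what makes this scheme more robust to class imbalance than a single fixed cut-off.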
## Key Features

- 🎯 **Specialized for Municipal Topics**: Fine-tuned on discussion subjects from Portuguese council meeting minutes
- 🧠 **Large Transformer Model**: 335M parameters (24 layers, 1024 hidden dim, 16 attention heads)
- 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per subject
- ⚡ **Dynamic Thresholds**: Optimized per-label classification thresholds (not a fixed 0.5)
- 🇵🇹 **Portuguese-Native**: Built on BERTimbau, pre-trained on a Brazilian Portuguese corpus
- 🔄 **End-to-End Learning**: Direct fine-tuning on task-specific data

## Model Details

- **Architecture**: BERT Large (Transformer encoder)
- **Base Model**: neuralmind/bert-large-portuguese-cased (BERTimbau Large)
- **Parameters**: ~335M
  - 24 transformer layers
  - 1024 hidden dimensions
  - 16 attention heads
  - 4096 intermediate size
- **Max Sequence Length**: 512 tokens
- **Learning Rate**: 5e-5
- **Warmup Ratio**: 0.1
- **Batch Size**: 16
- **Optimizer**: AdamW
- **Weight Decay**: 0.01
- **Classification Head**: Linear layer (1024 → 22 labels)
- **Loss Function**: BCEWithLogitsLoss (multi-label)
- **Optimization**: Dynamic per-label thresholds (F1 maximization)
- **Framework**: PyTorch + Hugging Face Transformers

## How It Works

The model processes Portuguese municipal texts through a fine-tuned transformer architecture:

1. **Text Preprocessing**
   - Basic normalization (lowercasing, whitespace cleanup)
   - Minimum text length filtering (>10 chars)
   - Punctuation removal for noise reduction
2. **Tokenization**
   - WordPiece tokenization (BERTimbau vocabulary)
   - Truncation at 512 tokens
   - Padding to max length for batching
3. **Transformer Encoding**
   - 24-layer BERT encoder processes token sequences
   - Self-attention captures contextual dependencies
   - [CLS] token representation used for classification
4. **Multi-Label Prediction**
   - Linear classification head outputs 22 logits
   - Sigmoid activation yields independent per-label probabilities
   - Dynamic per-label thresholds produce the final predictions
5. **Threshold Optimization**
   - Each label has its own optimal threshold (0.10-0.90 range)
   - Optimized via F1-score grid search on the validation set
   - Handles class imbalance better than a fixed 0.5 threshold

## Usage

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from joblib import load

# Load model and tokenizer
model_path = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load optimal thresholds and label encoder
thresholds = np.load("optimal_thresholds.npy")
mlb = load("mlb_encoder.joblib")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Classify text
text = "A Câmara Municipal aprovou o orçamento de 2024 para obras públicas e educação."

# Tokenize (512 matches the model's maximum sequence length)
inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=512,
    truncation=True,
    padding="max_length",
).to(device)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply optimized thresholds
predictions = (probs >= thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions.reshape(1, -1))[0]

# Display results
print(f"Text: {text}")
print("\nPredicted Topics:")
for label in predicted_labels:
    idx = list(mlb.classes_).index(label)
    print(f"  - {label}: {probs[idx]:.4f} (threshold: {thresholds[idx]:.2f})")
```

## Categories

The model classifies topics into 22 Portuguese administrative categories:

| Category | Portuguese Name |
|----------|-----------------|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |

## Evaluation Results

### Comprehensive Performance Metrics

| Metric | Score | Description |
|--------|-------|-------------|
| **F1-macro** | **0.6419** | Macro-averaged F1 score |
| **F1-micro** | **0.8224** | Micro-averaged F1 score |
| **Accuracy** | **0.5709** | Subset accuracy (exact match) |
| **Hamming Loss** | **0.0286** | Label-wise error rate |
| **Average Precision (macro)** | **0.681** | Macro-averaged AP |

## License

This project uses a custom dual license based on AGPL v3. See the full license terms here: [LICENSE](./LICENSE)
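For reference, the multi-label metrics reported in the evaluation table can all be computed with standard scikit-learn functions. A minimal sketch with toy indicator matrices (not the project's actual evaluation script):

```python
import numpy as np
from sklearn.metrics import (
    f1_score, accuracy_score, hamming_loss, average_precision_score
)

# Toy multi-label data: 4 samples, 3 labels (binary indicator matrices)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.7],
                   [0.1, 0.8, 0.6],   # one false positive on label 2
                   [0.6, 0.7, 0.4],
                   [0.2, 0.1, 0.9]])
y_pred = (y_prob >= 0.5).astype(int)

print("F1-macro:", f1_score(y_true, y_pred, average="macro"))
print("F1-micro:", f1_score(y_true, y_pred, average="micro"))
# On indicator matrices, accuracy_score is subset accuracy (exact match)
print("Subset accuracy:", accuracy_score(y_true, y_pred))
print("Hamming loss:", hamming_loss(y_true, y_pred))
print("AP (macro):", average_precision_score(y_true, y_prob, average="macro"))
```

Note that subset accuracy only counts a sample as correct when every one of its 22 labels is right, which is why it is much lower than the label-wise metrics above.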