metadata
language:
  - pt
license: cc-by-nc-nd-4.0
colorTo: blue
sdk: streamlit
app_port: 8501
tags:
  - streamlit
  - text-classification
  - multi-label-classification
  - gradient-boosting
  - active-learning
  - bertimbau
  - municipal-documents
  - meeting-minutes
library_name: transformers
base_model:
  - neuralmind/bert-base-portuguese-cased

Municipal Topics Classifier: Multi-Label Topic Classification for Portuguese Council Texts

Model Description

Municipal Topics Classifier is an ensemble machine learning system for multi-label topic classification of Portuguese municipal council meeting minutes. It combines Gradient Boosting, Active Learning, and BERTimbau embeddings to identify multiple simultaneous topics within municipal discussion subjects, which makes it well suited to categorizing complex governmental content.

🚀 Try out the model: Hugging Face Space Demo

Key Features

  • 🎯 Specialized for Municipal Topics: Trained on discussion subjects from Portuguese council meeting minutes, with domain-specific preprocessing
  • 🏆 Advanced Ensemble: Combines LogisticRegression + 3x GradientBoosting models with adaptive weighting
  • 🧠 Deep + Classical Features: Merges TF-IDF vectors (10k features) with BERTimbau embeddings (768 dims)
  • 📊 Multi-Label Classification: Identifies multiple co-occurring topics per subject
  • Optimized Thresholds: Dynamic per-label thresholds tuned on validation data
  • 🔄 Active Learning Ready: Adaptive weighting based on label frequency for continuous improvement

Model Details

  • Architecture: Ensemble (LogisticRegression + 3x GradientBoosting)
  • Base Models:
    • 1x LogisticRegression (L2 regularization, C=1.0)
    • GradientBoosting Model #1 (n_estimators=100, max_depth=3, learning_rate=0.1)
    • GradientBoosting Model #2 (n_estimators=150, max_depth=5, learning_rate=0.05)
    • GradientBoosting Model #3 (n_estimators=200, max_depth=4, learning_rate=0.1)
  • Feature Extractor: TF-IDF (n-grams 1-3, 10k features, Portuguese stopwords)
  • Embedding Model: neuralmind/bert-base-portuguese-cased (BERTimbau)
  • Total Features: 10,768 dimensions (10k TF-IDF + 768 BERT)
  • Training Method: One-vs-Rest with class weighting + Focal Loss
  • Optimization: Adaptive ensemble weighting by label frequency
  • Framework: Scikit-learn + PyTorch + Transformers

How It Works

The model processes Portuguese municipal texts through a four-stage pipeline to identify relevant topics:

  1. Portuguese-Specific Preprocessing

    • Lowercasing and normalization
    • Municipal entity recognition (e.g., "Câmara Municipal" → "camara_municipal")
    • Legal term preservation (e.g., "Art. 5" → "artigo_5")
    • Number and currency standardization
  2. Dual Feature Extraction

    • TF-IDF: Captures term frequency patterns with n-grams (1-3)
    • BERTimbau: Provides contextual semantic embeddings
  3. Ensemble Prediction

    • Each base model predicts probabilities for all labels
    • Adaptive weighted combination based on label rarity:
      • Rare labels: Higher LogisticRegression weight
      • Common labels: Higher GradientBoosting weight
  4. Dynamic Thresholding

    • Per-label optimal thresholds (not fixed 0.5)
    • Optimized for F1-score on validation set
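Steps 3 and 4 can be sketched with NumPy as follows. This is a minimal illustration, not the code from app.py: the linear weighting rule, the 0.7 and 0.3 bounds, and all the numbers are assumptions chosen only to show rare labels leaning on LogisticRegression and common labels leaning on GradientBoosting.

```python
import numpy as np


def combine(logistic_proba, gb_probas, lr_weight):
    """Blend per-label probabilities from the base models.

    logistic_proba: (n_samples, n_labels) array
    gb_probas: list of (n_samples, n_labels) arrays, one per GB model
    lr_weight: (n_labels,) weight given to LogisticRegression per label
    """
    gb_mean = np.mean(gb_probas, axis=0)
    return lr_weight * logistic_proba + (1.0 - lr_weight) * gb_mean


# Toy setup: one document, three labels (rare, medium, common).
label_freq = np.array([0.02, 0.10, 0.40])

# Assumed rule: LR weight falls linearly from 0.7 (rarest) to 0.3 (most common).
scaled = (label_freq - label_freq.min()) / (label_freq.max() - label_freq.min())
lr_weight = 0.7 - 0.4 * scaled

logistic_proba = np.array([[0.6, 0.5, 0.2]])
gb_probas = [
    np.array([[0.1, 0.5, 0.9]]),
    np.array([[0.2, 0.5, 0.8]]),
    np.array([[0.3, 0.5, 0.7]]),
]
ensemble_proba = combine(logistic_proba, gb_probas, lr_weight)

# Per-label thresholds instead of a fixed 0.5 (values are made up).
thresholds = np.array([0.35, 0.55, 0.50])
predictions = (ensemble_proba >= thresholds).astype(int)
print(predictions)
```

With per-label thresholds, a rare label can be accepted at a lower probability than a common one, which a single global cut-off of 0.5 cannot express.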

Example

Input:

A Câmara Municipal aprovou o orçamento de 2024 com investimentos em infraestruturas 
e transportes públicos. O vereador apresentou uma proposta para melhorar o sistema 
de recolha de resíduos.

Output:

Orçamento e Finanças (Confidence: 89%)
Obras Públicas (Confidence: 76%)
Transportes (Confidence: 68%)
Ambiente (Confidence: 54%)

Usage

import numpy as np
from joblib import load
from transformers import AutoTokenizer, AutoModel
import torch

# Load models
models_dir = 'models'
tfidf = load(f'{models_dir}/tfidf_vectorizer.joblib')
mlb = load(f'{models_dir}/mlb_encoder.joblib')
optimal_thresholds = np.load(f'{models_dir}/optimal_thresholds.npy')
adaptive_weights = np.load(f'{models_dir}/adaptive_weights.npy')
logistic_model = load(f'{models_dir}/logistic_model.joblib')
gb_models = load(f'{models_dir}/gb_models.joblib')

# Load BERTimbau
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
bert_model = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased").to(device)

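# Preprocess text
text = "A Câmara Municipal aprovou o orçamento de 2024..."
# (apply smart_preprocess function - see app.py)

# Extract TF-IDF features
tfidf_features = tfidf.transform([text])

# Extract mean-pooled BERTimbau embeddings (768 dims).
# max_length=512 is an assumption; see app.py for the exact settings.
bert_model.eval()
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
with torch.no_grad():
    hidden = bert_model(**inputs).last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1).float()
bert_embeddings = ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).cpu().numpy()

# Combine features
X_combined = np.hstack([tfidf_features.toarray(), bert_embeddings])

# Get ensemble predictions
logistic_proba = logistic_model.predict_proba(X_combined)
# (apply GB models and adaptive weighting - see app.py)
# Placeholder so this snippet runs end to end: replace it with the
# weighted ensemble probabilities computed as in app.py.
ensemble_proba = logistic_proba

# Apply optimal thresholds
predictions = (ensemble_proba >= optimal_thresholds).astype(int)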
predicted_labels = mlb.inverse_transform(predictions)

print(f"Predicted Topics: {predicted_labels}")

Evaluation Results

Test Set Performance

| Metric | Score |
|---|---|
| Micro F1-Score | 0.82 |
| Macro F1-Score | 0.74 |
| Hamming Loss | 0.08 |
| Subset Accuracy | 0.45 |
| Average Precision | 0.79 |
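For readers unfamiliar with these multi-label metrics, the sketch below computes each of them with scikit-learn on a tiny made-up example. The arrays are illustrative only and unrelated to the reported scores, and micro-averaging for Average Precision is an assumption, since the averaging mode is not stated here.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    hamming_loss,
)

# Toy ground truth and predicted probabilities: 4 documents, 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_proba = np.array(
    [[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.6, 0.4, 0.2], [0.3, 0.55, 0.8]]
)
y_pred = (y_proba >= 0.5).astype(int)

micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools all label decisions
macro_f1 = f1_score(y_true, y_pred, average="macro")  # mean of per-label F1
h_loss = hamming_loss(y_true, y_pred)                 # fraction of wrong label bits
subset_acc = accuracy_score(y_true, y_pred)           # exact match of the full label set
avg_precision = average_precision_score(y_true, y_proba, average="micro")

print(micro_f1, macro_f1, h_loss, subset_acc, avg_precision)
```

Subset accuracy is the strictest of the five because a document counts as correct only when every one of its labels is predicted exactly, which is why it is typically much lower than the F1 scores.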

Dataset

The model was trained on a curated dataset of Portuguese municipal council meeting minutes:

  • Documents: 2,500+ discussion subjects from meeting minutes
  • Time Period: 2021-2024
  • Source: Portuguese municipalities (anonymized)
  • Labels: 22 topic categories
  • Annotation: Multi-label (avg. 1.69 labels per document)
  • Split: 60% train / 20% validation / 20% test
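A 60/20/20 split like the one listed above could be produced along these lines. This is a sketch under assumptions: the function name, the seed, and the round figure of 2,500 documents are hypothetical, and the plain random shuffle shown here does not balance labels across the splits.

```python
import numpy as np


def split_indices(n_docs, seed=42):
    """Return shuffled train/validation/test index arrays in 60/20/20 proportions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_docs)
    n_train = n_docs * 60 // 100
    n_val = n_docs * 20 // 100
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(2500)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 1500 500 500
```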

Training Details

Preprocessing

  • Portuguese stopword removal
  • Municipal entity recognition
  • Legal term preservation
  • N-gram extraction (1-3)
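The entity and legal-term steps can be illustrated with a few lines of standard-library Python. This is not the smart_preprocess function from app.py: the function name, the phrase list, and the regular expression are assumptions made for illustration, and stopword removal is left out.

```python
import re
import unicodedata

# Hypothetical entity phrases; the real list in app.py may differ.
ENTITY_PHRASES = ["câmara municipal", "assembleia municipal", "junta de freguesia"]


def strip_accents(text: str) -> str:
    """Remove diacritics, e.g. 'câmara' -> 'camara'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sketch_preprocess(text: str) -> str:
    text = text.lower()
    # Join known multi-word entities into single underscore-separated tokens.
    for phrase in ENTITY_PHRASES:
        token = strip_accents(phrase).replace(" ", "_")
        text = text.replace(phrase, token)
    # Keep legal references as one token: "art. 5" -> "artigo_5".
    text = re.sub(r"\bart(?:igo)?\.?\s*(\d+)", r"artigo_\1", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(sketch_preprocess("A Câmara Municipal aprovou o Art. 5"))
# prints: a camara_municipal aprovou o artigo_5
```

Merging such phrases into single tokens keeps them intact through tokenization, so TF-IDF treats "camara_municipal" as one term rather than two unrelated words.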

Feature Engineering

  • TF-IDF: 10,000 features with sublinear scaling
  • BERTimbau: Mean-pooled embeddings (768 dims)
  • Feature concatenation: 10,768 total dimensions
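The sketch below shows how the two feature blocks could be built and concatenated. The TF-IDF settings come from this card, but the Portuguese stopword list is omitted, and random arrays stand in for BERTimbau token embeddings so the example runs without downloading the model; the helper name mean_pool is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per document, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)
    return (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)


docs = [
    "a camara_municipal aprovou o orçamento de 2024",
    "proposta para melhorar a recolha de resíduos",
]

# 1-3 grams, at most 10,000 features, sublinear term-frequency scaling.
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=10_000, sublinear_tf=True)
tfidf_features = tfidf.fit_transform(docs).toarray()

# Stand-in for BERT output: (n_docs, n_tokens, 768), second doc has 2 padded tokens.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(len(docs), 6, 768))
attention_mask = np.array([[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]])
bert_embeddings = mean_pool(token_embeddings, attention_mask)

# Dense concatenation: TF-IDF columns first, then the 768 embedding columns.
X_combined = np.hstack([tfidf_features, bert_embeddings])
print(X_combined.shape)
```

On this toy corpus the TF-IDF block has far fewer than 10,000 columns; the 10,768 total applies once the vocabulary of the full training set fills the 10,000-feature cap.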

Model Training

  • Strategy: One-vs-Rest multi-label classification
  • Class Balancing: Inverse frequency weighting
  • Validation: Stratified 5-fold cross-validation
  • Threshold Optimization: Per-label F1-maximization
  • Active Learning: Adaptive ensemble weights
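Per-label F1-maximization can be sketched as a simple grid search over candidate thresholds on validation predictions. The grid, the function names, and the toy data are assumptions; the actual search procedure used for this model may differ.

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1 for one label; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(y_true, y_proba, grid=None):
    """Pick, for each label, the grid threshold with the highest validation F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    n_labels = y_true.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        scores = [
            f1_binary(y_true[:, j], (y_proba[:, j] >= t).astype(int)) for t in grid
        ]
        thresholds[j] = grid[int(np.argmax(scores))]  # first best on ties
    return thresholds


# Toy validation set: label 0 scores low overall, label 1 scores high overall.
y_val = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
proba_val = np.array([[0.42, 0.55], [0.33, 0.61], [0.21, 0.78], [0.08, 0.90]])
optimal = tune_thresholds(y_val, proba_val)
print(optimal)
```

In this toy case a fixed 0.5 cut-off would miss every positive for the first label and accept every negative for the second, while the tuned thresholds separate both cleanly.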

Hyperparameters

LogisticRegression:

{
    'penalty': 'l2',
    'C': 1.0,
    'max_iter': 1000,
    'class_weight': 'balanced'
}

GradientBoosting Models:

# Model #1
{'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1}

# Model #2
{'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.05}

# Model #3
{'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.1}
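As a rough guide, the base models could be instantiated in scikit-learn from the settings above as shown below. Wrapping each estimator in OneVsRestClassifier is an assumption based on the One-vs-Rest strategy described in this card; the variable names mirror the Usage section, but the saved artifacts may be structured differently.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# L2 is scikit-learn's default penalty, so it is not passed explicitly.
logistic_model = OneVsRestClassifier(
    LogisticRegression(C=1.0, max_iter=1000, class_weight="balanced")
)

GB_CONFIGS = [
    {"n_estimators": 100, "max_depth": 3, "learning_rate": 0.1},
    {"n_estimators": 150, "max_depth": 5, "learning_rate": 0.05},
    {"n_estimators": 200, "max_depth": 4, "learning_rate": 0.1},
]

# One One-vs-Rest wrapper per configuration; each fits one binary GB model per label.
gb_models = [
    OneVsRestClassifier(GradientBoostingClassifier(**cfg)) for cfg in GB_CONFIGS
]
```

After fitting on a binary label-indicator matrix, each wrapper's predict_proba returns one probability column per label, which is the shape the ensemble weighting step expects.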

Limitations

  • Language Specificity: Optimized for Portuguese
  • Domain Focus: Best performance on municipal/administrative texts
  • Label Set: Fixed to 22 predefined categories
  • Rare Topics: Lower performance on infrequent labels (<20 training examples)
  • Ambiguous Cases: May over-predict for texts with multiple overlapping themes

License

This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).