anonymous12321
/

CouncilTopics-PT

+---
+language:
+- pt
+license: cc-by-nc-nd-4.0
+colorTo: blue
+sdk: docker
+app_port: 8501
+tags:
+- streamlit
+- text-classification
+- multilabel-classification
+- portuguese
+- administrative-documents
+- intelligent-stacking
+- ensemble-learning
+- bert
+- tfidf
+library_name: scikit-learn
+base_model:
+- neuralmind/bert-base-portuguese-cased
+---
+# Intelligent Stacking: Multilabel Portuguese Administrative Document Classifier
+## Model Description
+**Intelligent Stacking** is an ensemble learning system for multilabel classification of Portuguese administrative documents. It combines 12 base models through a stacked meta-learner and reaches an F1-macro of 0.5486 and an F1-micro of 0.7379 on municipal and governmental document categorization.
+**Try out the model**: [Hugging Face Space Demo](https://huggingface.co/spaces/YOUR_USERNAME/intelligent-stacking-demo)
+### Key Features
+- 🧠 **Intelligent Meta-Learning**: Base model predictions are combined by a meta-learner (stacked generalization)
+- 📚 **12 Base Models**: 3 feature sets × 4 algorithms for robust predictions
+- 🇵🇹 **Portuguese Optimized**: Fine-tuned for Portuguese administrative language
+- ⚡ **High Performance**: F1-macro score of 0.5486, a 54.7% improvement over a Decision Tree baseline
+- 🏢 **22 Categories**: Comprehensive municipal administrative document classification
+- 🎯 **Dynamic Thresholds**: Optimized per-category decision boundaries
+## Model Details
+- **Architecture**: Intelligent Stacking with Meta-Learning
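+- **Base Models**: 12 classifiers, one per feature set and algorithm (Logistic Regression with C=1.0 and C=0.5, Random Forest, Gradient Boosting)
+- **Feature Engineering**: TF-IDF + BERTimbau embeddings + Statistical features
+- **Meta-Learner**: Stacked meta-learner trained on base model predictions, blended 70/30 with a simple ensemble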
+- **Categories**: 22 Portuguese administrative document types
+- **Training Method**: Cross-validation stacking with dynamic threshold optimization
+- **Framework**: Scikit-learn + Transformers
+## How It Works
+The Intelligent Stacking system operates in multiple stages:
+1. **Feature Extraction**: Three complementary feature sets
+   - TF-IDF vectorization (word and character n-grams)
+   - BERTimbau embeddings from `neuralmind/bert-base-portuguese-cased`
+   - Statistical text features
+2. **Base Model Ensemble**: 12 diverse classifiers trained on different feature combinations
+   - Logistic Regression (C=1.0, C=0.5)
+   - Random Forest
+   - Gradient Boosting
+3. **Meta-Learning**: A meta-learner trained on cross-validated base model predictions combines them into one probability per category
+4. **Dynamic Thresholds**: Per-category optimized decision boundaries for multilabel output
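
The stacking stages above can be sketched with scikit-learn. This is a minimal illustration, not the training script used for this model: the synthetic data from `make_multilabel_classification`, the two base models, the `OneVsRestClassifier` wrappers, and the helper name `out_of_fold_probabilities` are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for document features and multilabel targets (assumption).
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)

# Two illustrative base models; the card describes 12.
base_models = {
    "LogReg_C1": OneVsRestClassifier(LogisticRegression(C=1.0, max_iter=1000)),
    "RandomForest": OneVsRestClassifier(
        RandomForestClassifier(n_estimators=50, random_state=0)
    ),
}


def out_of_fold_probabilities(model, X, Y, n_splits=5):
    """Predict each sample with a model that never saw it during fitting."""
    oof = np.zeros(Y.shape, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X):
        fitted = clone(model).fit(X[train_idx], Y[train_idx])
        oof[val_idx] = fitted.predict_proba(X[val_idx])
    return oof


# Stack to (n_samples, n_labels, n_models), then flatten to one row per sample.
oof_stack = np.stack(
    [out_of_fold_probabilities(m, X, Y) for m in base_models.values()], axis=2
)
meta_features = oof_stack.reshape(len(X), -1)

# The meta-learner is fitted on out-of-fold predictions only, which limits leakage.
meta_learner = OneVsRestClassifier(LogisticRegression(max_iter=1000))
meta_learner.fit(meta_features, Y)
meta_probs = meta_learner.predict_proba(meta_features)
```

Fitting the meta-learner on out-of-fold predictions rather than on in-sample predictions is the usual reason for cross-validated stacking: the base models would otherwise look more reliable to the meta-learner than they are on unseen documents.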
+## Usage
+### Quick Start with Python
+```python
+import joblib
+import numpy as np
+
+# Load the model components
+tfidf_vectorizer = joblib.load("int_stacking_tfidf_vectorizer.joblib")
+meta_learner = joblib.load("int_stacking_meta_learner.joblib")
+mlb_encoder = joblib.load("int_stacking_mlb_encoder.joblib")
+base_models = joblib.load("int_stacking_base_models.joblib")
+optimal_thresholds = np.load("int_stacking_optimal_thresholds.npy")
+
+# Prepare text
+text = """CONTRATO DE PRESTAÇÃO DE SERVIÇOS
+Entre a Administração Pública Municipal e a empresa contratada,
+fica estabelecido o presente contrato para prestação de serviços
+de manutenção e conservação de vias públicas."""
+
+# Extract features. Only TF-IDF is computed in this quick start; models trained
+# on BERT or TF-IDF+BERT features need BERTimbau embeddings built the same way
+# as in training, so add those matrices to this dict before using them.
+features = {"TF-IDF": tfidf_vectorizer.transform([text])}
+
+# Generate base model predictions: one slice of label probabilities per model.
+# Slots for models that are missing or lack input features stay at zero.
+feature_sets = ["TF-IDF", "BERT", "TF-IDF+BERT"]
+algorithms = ["LogReg_C1", "LogReg_C05", "GradBoost", "RandomForest"]
+n_labels = len(mlb_encoder.classes_)
+base_predictions = np.zeros((1, n_labels, len(feature_sets) * len(algorithms)))
+model_idx = 0
+for feat_name in feature_sets:
+    for algo_name in algorithms:
+        model_key = f"{feat_name}_{algo_name}"
+        if model_key in base_models and feat_name in features:
+            pred = base_models[model_key].predict_proba(features[feat_name])
+            base_predictions[0, :, model_idx] = pred[0]
+        model_idx += 1
+
+# Meta-learner prediction on the flattened base model probabilities
+meta_features = base_predictions.reshape(1, -1)
+meta_pred = meta_learner.predict_proba(meta_features)[0]
+
+# Apply the per-category thresholds
+predicted_labels = []
+for i, (prob, threshold) in enumerate(zip(meta_pred, optimal_thresholds)):
+    if prob > threshold:
+        predicted_labels.append({
+            "label": mlb_encoder.classes_[i],
+            "probability": float(prob),
+            "confidence": "high" if prob > 0.7 else "medium" if prob > 0.4 else "low"
+        })
+
+# Sort by probability, highest first
+predicted_labels.sort(key=lambda x: x["probability"], reverse=True)
+print("Predicted categories:", predicted_labels)
+```
+### Streamlit Demo
+The model includes a complete Streamlit web interface for easy testing:
+```bash
+streamlit run app.py
+```
+## Categories
+The model classifies documents into 22 Portuguese administrative categories:
+| Category | Portuguese Name |
+|----------|-----------------|
+| General Administration, Finance & Human Resources | Administração Geral, Finanças e Recursos Humanos |
+| Environment | Ambiente |
+| Economic Activities | Atividades Económicas |
+| Social Action | Ação Social |
+| Science | Ciência |
+| Communication & Public Relations | Comunicação e Relações Públicas |
+| External Cooperation & International Relations | Cooperação Externa e Relações Internacionais |
+| Culture | Cultura |
+| Sports | Desporto |
+| Education & Vocational Training | Educação e Formação Profissional |
+| Energy & Telecommunications | Energia e Telecomunicações |
+| Housing | Habitação |
+| Private Construction | Obras Particulares |
+| Public Works | Obras Públicas |
+| Territorial Planning | Ordenamento do Território |
+| Other | Outros |
+| Heritage | Património |
+| Municipal Police | Polícia Municipal |
+| Animal Protection | Proteção Animal |
+| Civil Protection | Proteção Civil |
+| Health | Saúde |
+| Traffic, Transport & Communications | Trânsito, Transportes e Comunicações |
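
A document can carry several of these categories at once, so targets are typically stored as a binary indicator matrix. The sketch below assumes the saved `mlb_encoder` is a scikit-learn `MultiLabelBinarizer` (the file name suggests it, but this card does not confirm it); the three example label sets are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label sets for three documents, using the Portuguese names.
docs_labels = [
    ["Obras Públicas", "Trânsito, Transportes e Comunicações"],
    ["Ambiente"],
    ["Ação Social", "Habitação", "Saúde"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(docs_labels)  # one row per document, one column per label

print(mlb.classes_)  # column order of the indicator matrix
print(Y)

# Map indicator rows back to label tuples, e.g. after thresholding probabilities.
decoded = mlb.inverse_transform(Y)
print(decoded)
```

The column order in `mlb.classes_` is what ties each probability and each threshold to a category name at prediction time.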
+## Evaluation Results
+### Comprehensive Performance Metrics
+| Metric | Score | Description |
+|--------|-------|-------------|
+| **F1-macro** | **0.5486** | Macro-averaged F1 score |
+| **F1-micro** | **0.7379** | Micro-averaged F1 score |
+| **F1-weighted** | **0.742** | Weighted-averaged F1 score |
+| **Accuracy** | **0.4259** | Subset accuracy (exact match) |
+| **Hamming Loss** | **0.0426** | Label-wise error rate |
+| **Average Precision (macro)** | **0.608** | Macro-averaged AP |
+| **Average Precision (micro)** | **0.785** | Micro-averaged AP |
+| **Improvement** | **+54.7%** | Over Decision Tree baseline |
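
For readers who want to compute the same kinds of metrics on their own predictions, the following sketch uses scikit-learn's metric functions. The tiny `y_true`, `y_pred`, and `y_score` arrays are invented for illustration and have no relation to the scores reported above.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    hamming_loss,
)

# Toy multilabel data: 4 documents, 3 labels (illustrative only).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])
y_score = np.array(
    [[0.9, 0.1, 0.4], [0.2, 0.8, 0.1], [0.7, 0.3, 0.2], [0.1, 0.2, 0.6]]
)

metrics = {
    "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    "f1_micro": f1_score(y_true, y_pred, average="micro", zero_division=0),
    "f1_weighted": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    # For multilabel input, accuracy_score is subset accuracy (exact match).
    "subset_accuracy": accuracy_score(y_true, y_pred),
    "hamming_loss": hamming_loss(y_true, y_pred),
    # Average precision works on scores, not on thresholded predictions.
    "ap_macro": average_precision_score(y_true, y_score, average="macro"),
    "ap_micro": average_precision_score(y_true, y_score, average="micro"),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Subset accuracy only counts a document as correct when every label matches, which is why it sits well below the micro-averaged F1 in multilabel settings.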
+## Technical Architecture
+### Base Model Ensemble
+- **Feature Set 1**: TF-IDF (word + character n-grams)
+- **Feature Set 2**: BERTimbau embeddings (768 dimensions)
+- **Feature Set 3**: Combined TF-IDF + BERT features
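
The three feature sets can be assembled as sketched below. The n-gram ranges and the `char_wb` analyzer are assumptions, and the BERTimbau embeddings are replaced by a random 768-dimensional placeholder so the example runs without downloading `neuralmind/bert-base-portuguese-cased`; real embeddings would come from that model.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Contrato de prestação de serviços de manutenção de vias públicas.",
    "Ata da reunião ordinária da câmara municipal sobre habitação social.",
]

# Feature set 1: word and character n-gram TF-IDF (ranges are assumptions).
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_tfidf = hstack(
    [word_tfidf.fit_transform(texts), char_tfidf.fit_transform(texts)]
).tocsr()

# Feature set 2: placeholder for 768-dimensional BERTimbau embeddings.
rng = np.random.default_rng(0)
X_bert = rng.normal(size=(len(texts), 768))

# Feature set 3: TF-IDF and BERT columns side by side in one sparse matrix.
X_combined = hstack([X_tfidf, csr_matrix(X_bert)]).tocsr()

print(X_tfidf.shape, X_bert.shape, X_combined.shape)
```

Keeping the combined matrix sparse avoids densifying the TF-IDF block, which is usually far wider than the embedding block.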
+### Algorithms per Feature Set
+1. **Logistic Regression** (C=1.0)
+2. **Logistic Regression** (C=0.5)
+3. **Gradient Boosting Classifier**
+4. **Random Forest Classifier**
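
Crossing the three feature sets with these four algorithm configurations gives the 12 base models. A sketch of how such a grid could be declared follows; the key format mirrors the names used in the Quick Start, while the `OneVsRestClassifier` wrapper, `max_iter`, and the helper name `make_algorithms` are assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

FEATURE_SETS = ["TF-IDF", "BERT", "TF-IDF+BERT"]


def make_algorithms():
    """Return fresh, unfitted estimators so no instance is shared between keys."""
    return {
        "LogReg_C1": LogisticRegression(C=1.0, max_iter=1000),
        "LogReg_C05": LogisticRegression(C=0.5, max_iter=1000),
        "GradBoost": GradientBoostingClassifier(),
        "RandomForest": RandomForestClassifier(),
    }


# One multilabel classifier per (feature set, algorithm) pair.
base_models = {
    f"{feat}_{algo}": OneVsRestClassifier(estimator)
    for feat in FEATURE_SETS
    for algo, estimator in make_algorithms().items()
}

print(len(base_models))
print(sorted(base_models))
```

Each entry would then be fitted on the matrix for its own feature set, so a model keyed `BERT_...` never sees TF-IDF columns.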
+### Meta-Learning Strategy
+- **Cross-validation stacking** for robust meta-features
+- **Intelligent combination**: final probabilities weight the meta-learner output at 70% and a simple ensemble of the base models at 30%
+- **Dynamic threshold optimization** per category using differential evolution
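
The blending and threshold steps can be illustrated as follows. This is a sketch under stated assumptions: the "simple ensemble" is taken to be the mean of the base model probabilities, the search objective is per-label F1, and the helper names `blend`, `f1_binary`, and `optimise_thresholds` are hypothetical. Only the 70/30 weighting and the use of differential evolution come from the description above.

```python
import numpy as np
from scipy.optimize import differential_evolution


def blend(meta_probs, base_probs, meta_weight=0.7):
    """Weighted mix of meta-learner output and the mean of the base models."""
    simple_ensemble = base_probs.mean(axis=2)  # (n_samples, n_labels)
    return meta_weight * meta_probs + (1.0 - meta_weight) * simple_ensemble


def f1_binary(y_true, y_pred):
    """F1 for a single label; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def optimise_thresholds(y_true, probs, bounds=(0.05, 0.95), seed=0):
    """Search one decision threshold per label that maximises that label's F1."""
    thresholds = np.empty(y_true.shape[1])
    for j in range(y_true.shape[1]):
        def negative_f1(t, j=j):
            return -f1_binary(y_true[:, j], (probs[:, j] > t[0]).astype(int))

        result = differential_evolution(
            negative_f1, bounds=[bounds], seed=seed, maxiter=30, polish=False
        )
        thresholds[j] = result.x[0]
    return thresholds


# Synthetic validation data: 200 documents, 3 labels, 12 base models.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, 3))
base_probs = np.clip(
    0.6 * y_true[:, :, None] + rng.uniform(0, 0.4, size=(200, 3, 12)), 0, 1
)
meta_probs = np.clip(0.6 * y_true + rng.uniform(0, 0.4, size=(200, 3)), 0, 1)

final_probs = blend(meta_probs, base_probs)
thresholds = optimise_thresholds(y_true, final_probs)
y_pred = (final_probs > thresholds).astype(int)
print(thresholds)
```

Thresholds found this way should be tuned on held-out data, since fitting them on the training set tends to overstate the gain.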
+## Training Data
+The model was trained on a curated dataset of Portuguese administrative documents including:
+- Municipal council meeting minutes
+- Administrative contracts and agreements
+- Environmental reports and assessments
+- Traffic regulations and urban planning documents
+- Public health and safety communications
+- Cultural and educational program descriptions
+## Limitations
+- **Language Specificity**: Optimized for Portuguese administrative language
+- **Domain Focus**: Best performance on governmental/municipal documents
+- **Computational Requirements**: Requires significant memory for all model components
+- **Threshold Sensitivity**: Performance depends on carefully tuned per-category thresholds
+- **Class Imbalance**: Some categories may have lower precision due to limited training examples
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@article{intelligent_stacking_2024,
+  title={Intelligent Stacking for Multilabel Portuguese Administrative Document Classification},
+  author={[Your Name]},
+  journal={[Journal Name]},
+  year={2024},
+  note={Model available at https://huggingface.co/YOUR_USERNAME/intelligent-stacking}
+}
+```
+## License
+This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).