anonymous12321 committed on
Commit 41b9bab · verified · 1 Parent(s): 9ecca21

Update README.md

Files changed (1)
  1. README.md +231 -3
README.md CHANGED
@@ -1,3 +1,231 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - pt
+ license: cc-by-nc-nd-4.0
+ colorTo: blue
+ sdk: docker
+ app_port: 8501
+ tags:
+ - streamlit
+ - text-classification
+ - multilabel-classification
+ - portuguese
+ - administrative-documents
+ - intelligent-stacking
+ - ensemble-learning
+ - bert
+ - tfidf
+ library_name: scikit-learn
+ base_model:
+ - neuralmind/bert-base-portuguese-cased
+ ---
+
+ # Intelligent Stacking: Multilabel Portuguese Administrative Document Classifier
+
+ ## Model Description
+
+ **Intelligent Stacking** is an ensemble learning system for multilabel classification of Portuguese administrative documents. It combines 12 base models with a stacked meta-learner and reaches an F1-macro of 0.5486 on municipal and governmental document categorization.
+
+ **Try out the model**: [Hugging Face Space Demo](https://huggingface.co/spaces/YOUR_USERNAME/intelligent-stacking-demo)
+
+ ### Key Features
+
+ - 🧠 **Intelligent Meta-Learning**: Ensemble combination using stacked generalization
+ - 📚 **12 Base Models**: 3 feature sets × 4 algorithms for robust predictions
+ - 🇵🇹 **Portuguese Optimized**: Fine-tuned for Portuguese administrative language
+ - ⚡ **High Performance**: F1-macro of 0.5486, a 54.7% improvement over a Decision Tree baseline
+ - 🏢 **22 Categories**: Comprehensive municipal administrative document classification
+ - 🎯 **Dynamic Thresholds**: Optimized per-category decision boundaries
+
+ ## Model Details
+
+ - **Architecture**: Intelligent Stacking with Meta-Learning
+ - **Base Models**: 12 classifiers (Logistic Regression with C=1.0 and C=0.5, Gradient Boosting, and Random Forest, each trained on 3 feature sets)
+ - **Feature Engineering**: TF-IDF + BERTimbau embeddings + Statistical features
+ - **Meta-Learner**: Stacked meta-learner blended with a simple ensemble (70% / 30%)
+ - **Categories**: 22 Portuguese administrative document types
+ - **Training Method**: Cross-validation stacking with dynamic threshold optimization
+ - **Framework**: Scikit-learn + Transformers
+
+ ## How It Works
+
+ The Intelligent Stacking system operates in four stages:
+
+ 1. **Feature Extraction**: Three complementary feature sets
+    - TF-IDF vectorization (word and character n-grams)
+    - BERTimbau embeddings from `neuralmind/bert-base-portuguese-cased`
+    - Statistical text features
+
+ 2. **Base Model Ensemble**: 12 classifiers, 4 algorithms trained on each feature set
+    - Logistic Regression (C=1.0, C=0.5)
+    - Random Forest
+    - Gradient Boosting
+
+ 3. **Meta-Learning**: A stacked meta-learner combines the base model predictions
+
+ 4. **Dynamic Thresholds**: Per-category optimized decision boundaries for multilabel output
+
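To make stage 1 more concrete, here is a minimal sketch of combining word-level and character-level TF-IDF with scikit-learn's `FeatureUnion`. The n-gram ranges, the `char_wb` analyzer, and the feature caps are illustrative assumptions; the settings of the released vectorizer are not documented here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical settings: n-gram ranges and feature caps are illustrative,
# not the values used to train the released vectorizer.
tfidf = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=5000)),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=5000)),
])

docs = [
    "Contrato de prestação de serviços de manutenção de vias públicas.",
    "Ata da reunião ordinária da câmara municipal.",
    "Relatório de avaliação de impacte ambiental.",
]

# One sparse row per document; word and character columns sit side by side.
X = tfidf.fit_transform(docs)
print(X.shape)
```

Character n-grams tend to help with inflected forms and OCR noise, while word n-grams capture domain terms, which is one plausible reason to use both.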
+ ## Usage
+
+ ### Quick Start with Python
+
+ ```python
+ import joblib
+ import numpy as np
+ import torch
+ from scipy.sparse import csr_matrix, hstack
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load the model components
+ tfidf_vectorizer = joblib.load("int_stacking_tfidf_vectorizer.joblib")
+ meta_learner = joblib.load("int_stacking_meta_learner.joblib")
+ mlb_encoder = joblib.load("int_stacking_mlb_encoder.joblib")
+ base_models = joblib.load("int_stacking_base_models.joblib")
+ optimal_thresholds = np.load("int_stacking_optimal_thresholds.npy")
+
+ # Prepare text
+ text = """CONTRATO DE PRESTAÇÃO DE SERVIÇOS
+ Entre a Administração Pública Municipal e a empresa contratada,
+ fica estabelecido o presente contrato para prestação de serviços
+ de manutenção e conservação de vias públicas."""
+
+ # Extract TF-IDF features
+ tfidf_features = tfidf_vectorizer.transform([text])
+
+ # Extract BERTimbau embeddings.
+ # ASSUMPTION: mean pooling over the last hidden state with 512-token
+ # truncation; use the same pooling the base models were trained with.
+ bert_name = "neuralmind/bert-base-portuguese-cased"
+ tokenizer = AutoTokenizer.from_pretrained(bert_name)
+ bert = AutoModel.from_pretrained(bert_name).eval()
+ with torch.no_grad():
+     enc = tokenizer([text], truncation=True, max_length=512, return_tensors="pt")
+     hidden = bert(**enc).last_hidden_state  # (1, seq_len, 768)
+     mask = enc["attention_mask"].unsqueeze(-1)
+     bert_features = ((hidden * mask).sum(1) / mask.sum(1)).numpy()
+
+ # Each base model must receive the feature set it was trained on.
+ # ASSUMPTION: the combined set is [TF-IDF | BERT], stacked in that order.
+ features = {
+     "TF-IDF": tfidf_features,
+     "BERT": bert_features,
+     "TF-IDF+BERT": hstack([tfidf_features, csr_matrix(bert_features)]).tocsr(),
+ }
+
+ # Generate base model predictions: (samples, labels, models)
+ base_predictions = np.zeros((1, len(mlb_encoder.classes_), 12))
+ model_idx = 0
+
+ for feat_name, feat_matrix in features.items():
+     for algo_name in ["LogReg_C1", "LogReg_C05", "GradBoost", "RandomForest"]:
+         model_key = f"{feat_name}_{algo_name}"
+         if model_key in base_models:
+             model = base_models[model_key]
+             pred = model.predict_proba(feat_matrix)
+             base_predictions[0, :, model_idx] = pred[0]
+             model_idx += 1
+
+ # Meta-learner prediction
+ meta_features = base_predictions.reshape(1, -1)
+ meta_pred = meta_learner.predict_proba(meta_features)[0]
+
+ # Apply dynamic thresholds (one per category)
+ predicted_labels = []
+ for i, (prob, threshold) in enumerate(zip(meta_pred, optimal_thresholds)):
+     if prob > threshold:
+         predicted_labels.append({
+             "label": mlb_encoder.classes_[i],
+             "probability": float(prob),
+             "confidence": "high" if prob > 0.7 else "medium" if prob > 0.4 else "low"
+         })
+
+ # Sort by probability
+ predicted_labels.sort(key=lambda x: x["probability"], reverse=True)
+ print("Predicted categories:", predicted_labels)
+ ```
+
+ ### Streamlit Demo
+
+ The model includes a complete Streamlit web interface for easy testing:
+
+ ```bash
+ streamlit run app.py
+ ```
+
+ ## Categories
+
+ The model classifies documents into 22 Portuguese administrative categories:
+
+ | Category | Portuguese Name |
+ |----------|-----------------|
+ | General Administration | Administração Geral, Finanças e Recursos Humanos |
+ | Environment | Ambiente |
+ | Economic Activities | Atividades Económicas |
+ | Social Action | Ação Social |
+ | Science | Ciência |
+ | Communication | Comunicação e Relações Públicas |
+ | External Cooperation | Cooperação Externa e Relações Internacionais |
+ | Culture | Cultura |
+ | Sports | Desporto |
+ | Education | Educação e Formação Profissional |
+ | Energy & Telecommunications | Energia e Telecomunicações |
+ | Housing | Habitação |
+ | Private Construction | Obras Particulares |
+ | Public Works | Obras Públicas |
+ | Territorial Planning | Ordenamento do Território |
+ | Other | Outros |
+ | Heritage | Património |
+ | Municipal Police | Polícia Municipal |
+ | Animal Protection | Proteção Animal |
+ | Civil Protection | Proteção Civil |
+ | Health | Saúde |
+ | Traffic & Transport | Trânsito, Transportes e Comunicações |
+
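Because a document can belong to several of these categories at once, labels are usually encoded as a binary indicator matrix. The sketch below shows that encoding with scikit-learn's `MultiLabelBinarizer`, using three category names from the table for brevity. Whether the released `mlb_encoder` was built exactly this way is an assumption.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three of the 22 category names from the table, for brevity.
categories = ["Ambiente", "Obras Públicas", "Saúde"]
mlb = MultiLabelBinarizer(classes=categories)

# Each document may carry several labels, or none.
doc_labels = [["Obras Públicas"], ["Ambiente", "Saúde"], []]
Y = mlb.fit_transform(doc_labels)  # one row per document, one column per category
print(Y)

# Map indicator rows back to category names.
decoded = mlb.inverse_transform(Y)
print(decoded)
```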
+ ## Evaluation Results
+
+ ### Comprehensive Performance Metrics
+
+ | Metric | Score | Description |
+ |--------|-------|-------------|
+ | **F1-macro** | **0.5486** | Macro-averaged F1 score |
+ | **F1-micro** | **0.7379** | Micro-averaged F1 score |
+ | **F1-weighted** | **0.742** | Weighted-averaged F1 score |
+ | **Accuracy** | **0.4259** | Subset accuracy (exact match) |
+ | **Hamming Loss** | **0.0426** | Label-wise error rate |
+ | **Average Precision (macro)** | **0.608** | Macro-averaged AP |
+ | **Average Precision (micro)** | **0.785** | Micro-averaged AP |
+ | **Improvement** | **+54.7%** | Over Decision Tree baseline |
+
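For readers less familiar with multilabel metrics, the following toy example shows how the F1 variants, subset accuracy, and Hamming loss in the table can be computed with scikit-learn. The arrays are made up for illustration and are unrelated to the model's evaluation data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Toy indicator matrices: 4 documents, 3 labels (illustrative only).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

f1_macro = f1_score(y_true, y_pred, average="macro")        # mean of per-label F1
f1_micro = f1_score(y_true, y_pred, average="micro")        # pooled over all label decisions
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # per-label F1 weighted by support
subset_acc = accuracy_score(y_true, y_pred)                 # every label of a document must match
h_loss = hamming_loss(y_true, y_pred)                       # fraction of wrong label decisions

print(f1_macro, f1_micro, f1_weighted, subset_acc, h_loss)
```

In this toy case a single missed label costs one document its exact match (subset accuracy 0.75) while only 1 of 12 label decisions is wrong, which illustrates why subset accuracy is typically much lower than micro-averaged F1.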
+ ## Technical Architecture
+
+ ### Base Model Ensemble
+ - **Feature Set 1**: TF-IDF (word + character n-grams)
+ - **Feature Set 2**: BERTimbau embeddings (768 dimensions)
+ - **Feature Set 3**: Combined TF-IDF + BERT features
+
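As a rough illustration of Feature Set 3, sparse TF-IDF columns and dense 768-dimensional embeddings can be concatenated with `scipy.sparse.hstack`. The embeddings below are random stand-ins rather than real BERTimbau outputs, and the column order (TF-IDF first) is an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Ata da reunião ordinária da câmara municipal.",
    "Regulamento de trânsito e estacionamento.",
]
tfidf_features = TfidfVectorizer().fit_transform(docs)  # sparse, shape (2, vocabulary size)

# Stand-in for BERTimbau sentence embeddings: random values, 768 dimensions.
rng = np.random.default_rng(0)
bert_features = rng.normal(size=(len(docs), 768))

# Assumed column order: TF-IDF first, then the embedding dimensions.
combined = hstack([tfidf_features, csr_matrix(bert_features)]).tocsr()
print(combined.shape)
```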
+ ### Algorithms per Feature Set
+ 1. **Logistic Regression** (C=1.0)
+ 2. **Logistic Regression** (C=0.5)
+ 3. **Gradient Boosting Classifier**
+ 4. **Random Forest Classifier**
+
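The 3 × 4 grid of base models can be sketched as a dictionary keyed the same way as in the Quick Start (`"<feature set>_<algorithm>"`). The one-vs-rest wrapper and every hyperparameter other than `C` are assumptions made for illustration, not the released configuration.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

feature_sets = ["TF-IDF", "BERT", "TF-IDF+BERT"]


def make_algorithms():
    """Return fresh, unfitted estimators so no instance is shared between feature sets."""
    return {
        "LogReg_C1": LogisticRegression(C=1.0, max_iter=1000),
        "LogReg_C05": LogisticRegression(C=0.5, max_iter=1000),
        "GradBoost": GradientBoostingClassifier(random_state=0),
        "RandomForest": RandomForestClassifier(random_state=0),
    }


# Assumption: each algorithm is wrapped one-vs-rest to produce one probability per label.
base_models = {
    f"{feat}_{algo}": OneVsRestClassifier(estimator)
    for feat in feature_sets
    for algo, estimator in make_algorithms().items()
}
print(len(base_models), list(base_models)[:4])
```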
+ ### Meta-Learning Strategy
+ - **Cross-validation stacking** for robust meta-features
+ - **Intelligent combination**: 70% meta-learner + 30% simple ensemble
+ - **Dynamic threshold optimization** per category using differential evolution
+
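The three points above can be sketched end to end on synthetic data. This is a simplified illustration under stated assumptions, not the training script: the data comes from `make_multilabel_classification`, only two base models are used, the "simple ensemble" is taken to be the mean of base probabilities, and the objective for threshold search is assumed to be macro F1.

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.base import clone
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in data: 300 samples, 4 labels.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)


def out_of_fold_proba(model, X, Y, n_splits=3):
    """Predict every sample with a clone that did not see it during fitting."""
    oof = np.zeros(Y.shape, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X):
        fitted = clone(model).fit(X[train_idx], Y[train_idx])
        oof[val_idx] = fitted.predict_proba(X[val_idx])
    return oof


# 1. Cross-validation stacking: out-of-fold probabilities become meta-features.
base_models = [
    OneVsRestClassifier(LogisticRegression(C=1.0, max_iter=1000)),
    OneVsRestClassifier(RandomForestClassifier(n_estimators=50, random_state=0)),
]
base_oof = [out_of_fold_proba(m, X, Y) for m in base_models]
meta_X = np.hstack(base_oof)  # shape: (samples, labels * number of base models)
meta_proba = out_of_fold_proba(
    OneVsRestClassifier(LogisticRegression(max_iter=1000)), meta_X, Y
)

# 2. Blend: 70% meta-learner + 30% simple ensemble (assumed: mean of base probabilities).
simple_proba = np.mean(base_oof, axis=0)
blended = 0.7 * meta_proba + 0.3 * simple_proba


# 3. Per-label thresholds via differential evolution (assumed objective: macro F1).
def negative_f1_macro(thresholds):
    preds = (blended > thresholds).astype(int)
    return -f1_score(Y, preds, average="macro", zero_division=0)


result = differential_evolution(
    negative_f1_macro,
    bounds=[(0.05, 0.95)] * Y.shape[1],
    maxiter=20,
    popsize=10,
    seed=0,
    polish=False,  # the objective is piecewise constant, so gradient polishing adds nothing
)
optimal_thresholds = result.x
print(optimal_thresholds, -result.fun)
```

In practice the thresholds would be tuned on a held-out validation split; tuning and scoring on the same predictions, as this sketch does, gives an optimistic estimate.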
+ ## Training Data
+
+ The model was trained on a curated dataset of Portuguese administrative documents including:
+ - Municipal council meeting minutes
+ - Administrative contracts and agreements
+ - Environmental reports and assessments
+ - Traffic regulations and urban planning documents
+ - Public health and safety communications
+ - Cultural and educational program descriptions
+
+ ## Limitations
+
+ - **Language Specificity**: Optimized for Portuguese administrative language
+ - **Domain Focus**: Best performance on governmental/municipal documents
+ - **Computational Requirements**: Requires significant memory for all model components
+ - **Threshold Sensitivity**: Performance depends on carefully tuned per-category thresholds
+ - **Class Imbalance**: Some categories may have lower precision due to limited training examples
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @article{intelligent_stacking_2024,
+   title={Intelligent Stacking for Multilabel Portuguese Administrative Document Classification},
+   author={[Your Name]},
+   journal={[Journal Name]},
+   year={2024},
+   note={Model available at https://huggingface.co/YOUR_USERNAME/intelligent-stacking}
+ }
+ ```
+
+ ## License
+
+ This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).