Update README.md
README.md
CHANGED
|
@@ -1,112 +1,297 @@
|
|
|
|
|
| 1 |
language:
|
| 2 |
-
|
| 3 |
-
- en
|
| 4 |
license: cc-by-nc-nd-4.0
|
| 5 |
-
colorTo:
|
| 6 |
-
sdk:
|
| 7 |
app_port: 8501
|
| 8 |
tags:
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
|
|
|
| 16 |
library_name: transformers
|
| 17 |
base_model:
|
| 18 |
-
|
|
|
|
| 19 |
|
| 20 |
-
|
| 21 |
-
Model Description
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
-
Key Features
|
| 27 |
|
| 28 |
-
|
| 29 |
-
🌍 Multilingual Capability: Works with both Portuguese and English text
|
| 30 |
-
⚡ Fast Inference: Efficient BERT-base architecture for real-time segmentation
|
| 31 |
-
📊 High Accuracy: Achieves BED F-measure score of 0.79 on CouncilSeg dataset
|
| 32 |
-
🔄 Sentence-Level Segmentation: Identifies topic boundaries at sentence granularity
|
| 33 |
|
| 34 |
-
|
| 35 |
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
Training Framework: PyTorch + Transformers
|
| 43 |
|
| 44 |
-
|
| 45 |
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
| 49 |
-
Sentence B: "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."
|
| 50 |
-
→ Prediction: Same Topic (confidence: 76%)
|
| 51 |
|
| 52 |
-
|
| 53 |
-
Sentence B: "There were no various processes and requests to submit."
|
| 54 |
-
→ Prediction: Topic Boundary (confidence: 82%)
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|
|
| 58 |
|
| 59 |
-
|
| 60 |
import torch
|
| 61 |
|
| 62 |
-
# Load
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
#
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
#
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 1 |
+
---
|
| 2 |
language:
|
| 3 |
+
- pt
|
|
|
|
| 4 |
license: cc-by-nc-nd-4.0
|
| 5 |
+
colorTo: blue
|
| 6 |
+
sdk: streamlit
|
| 7 |
app_port: 8501
|
| 8 |
tags:
|
| 9 |
+
- streamlit
|
| 10 |
+
- text-classification
|
| 11 |
+
- multi-label-classification
|
| 12 |
+
- gradient-boosting
|
| 13 |
+
- active-learning
|
| 14 |
+
- bertimbau
|
| 15 |
+
- municipal-documents
|
| 16 |
+
- meeting-minutes
|
| 17 |
library_name: transformers
|
| 18 |
base_model:
|
| 19 |
+
- neuralmind/bert-base-portuguese-cased
|
| 20 |
+
---
|
| 21 |
|
| 22 |
+
# Municipal Topics Classifier: Multi-Label Topic Classification for Portuguese Council Texts
|
|
|
|
| 23 |
|
| 24 |
+
## Model Description
|
| 25 |
|
| 26 |
+
**Municipal Topics Classifier** is an ensemble machine learning system for **multi-label topic classification** of Portuguese municipal council meeting minutes. The model combines Gradient Boosting with Active Learning and BERTimbau embeddings, so a single administrative text can be assigned several topics at once.
|
|
|
|
| 27 |
|
| 28 |
+
🚀 **Try out the model:** [Hugging Face Space Demo](#)
|
| 29 |
|
| 30 |
+
## Key Features
|
| 31 |
|
| 32 |
+
- 🎯 **Specialized for Municipal Topics**: Trained on Portuguese council meeting minutes with domain-specific preprocessing
|
| 33 |
+
- 🏆 **Advanced Ensemble**: Combines LogisticRegression + 3x GradientBoosting models with adaptive weighting
|
| 34 |
+
- 🧠 **Deep + Classical Features**: Merges TF-IDF vectors (10k features) with BERTimbau embeddings (768 dims)
|
| 35 |
+
- 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per text
|
| 36 |
+
- ⚡ **Optimized Thresholds**: Dynamic per-label thresholds tuned on validation data
|
| 37 |
+
- 🔄 **Active Learning Ready**: Adaptive weighting based on label frequency for continuous improvement
|
|
|
|
| 38 |
|
| 39 |
+
## Model Details
|
| 40 |
|
| 41 |
+
- **Architecture**: Ensemble (LogisticRegression + 3x GradientBoosting)
|
| 42 |
+
- **Base Models**:
|
| 43 |
+
- 1x LogisticRegression (L2 regularization, C=1.0)
|
| 44 |
+
- GradientBoosting Model #1 (n_estimators=100, max_depth=3, learning_rate=0.1)
|
| 45 |
+
- GradientBoosting Model #2 (n_estimators=150, max_depth=5, learning_rate=0.05)
|
| 46 |
+
- GradientBoosting Model #3 (n_estimators=200, max_depth=4, learning_rate=0.1)
|
| 47 |
+
- **Feature Extractor**: TF-IDF (n-grams 1-3, 10k features, Portuguese stopwords)
|
| 48 |
+
- **Embedding Model**: neuralmind/bert-base-portuguese-cased (BERTimbau)
|
| 49 |
+
- **Total Features**: 10,768 dimensions (10k TF-IDF + 768 BERT)
|
| 50 |
+
- **Training Method**: One-vs-Rest with class weighting + Focal Loss
|
| 51 |
+
- **Optimization**: Adaptive ensemble weighting by label frequency
|
| 52 |
+
- **Framework**: Scikit-learn + PyTorch + Transformers
|
| 53 |
|
| 54 |
+
## How It Works
|
|
|
|
|
|
|
| 55 |
|
| 56 |
+
The model processes Portuguese municipal texts in four stages to identify relevant topics:
|
|
|
|
|
|
|
| 57 |
|
| 58 |
+
1. **Portuguese-Specific Preprocessing**
|
| 59 |
+
- Lowercasing and normalization
|
| 60 |
+
- Municipal entity recognition (e.g., "Câmara Municipal" → "camara_municipal")
|
| 61 |
+
- Legal term preservation (e.g., "Art. 5" → "artigo_5")
|
| 62 |
+
- Number and currency standardization
|
| 63 |
|
| 64 |
+
2. **Dual Feature Extraction**
|
| 65 |
+
- **TF-IDF**: Captures term frequency patterns with n-grams (1-3)
|
| 66 |
+
- **BERTimbau**: Provides contextual semantic embeddings
|
| 67 |
+
|
| 68 |
+
3. **Ensemble Prediction**
|
| 69 |
+
- Each base model predicts probabilities for all labels
|
| 70 |
+
- Adaptive weighted combination based on label rarity:
|
| 71 |
+
- **Rare labels**: Higher LogisticRegression weight
|
| 72 |
+
- **Common labels**: Higher GradientBoosting weight
|
| 73 |
+
|
| 74 |
+
4. **Dynamic Thresholding**
|
| 75 |
+
- Per-label optimal thresholds (not fixed 0.5)
|
| 76 |
+
- Optimized for F1-score on validation set
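
Steps 3 and 4 above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from `app.py`: the linear rarity-to-weight rule, the function names (`logistic_weight`, `ensemble_probabilities`, `apply_thresholds`), and all numbers are assumptions chosen to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def logistic_weight(label_frequency: float, low: float = 0.3, high: float = 0.7) -> float:
    """Hypothetical rule: the rarer the label, the larger the LogisticRegression weight."""
    frequency = min(max(label_frequency, 0.0), 1.0)
    return high - (high - low) * frequency


def ensemble_probabilities(
    lr_proba: Sequence[float],
    gb_proba: Sequence[float],
    label_frequencies: Sequence[float],
) -> List[float]:
    """Blend the two probability vectors label by label."""
    combined = []
    for p_lr, p_gb, freq in zip(lr_proba, gb_proba, label_frequencies):
        w = logistic_weight(freq)
        combined.append(w * p_lr + (1.0 - w) * p_gb)
    return combined


def apply_thresholds(proba: Sequence[float], thresholds: Sequence[float]) -> List[int]:
    """Turn probabilities into 0/1 decisions using one threshold per label."""
    return [int(p >= t) for p, t in zip(proba, thresholds)]


# Two labels: the first is very rare (frequency 0.0), the second very common (1.0).
combined = ensemble_probabilities(
    lr_proba=[0.8, 0.2],
    gb_proba=[0.4, 0.6],
    label_frequencies=[0.0, 1.0],
)
print(combined)                                 # roughly [0.68, 0.48]
print(apply_thresholds(combined, [0.5, 0.45]))  # per-label thresholds
print(apply_thresholds(combined, [0.5, 0.5]))   # fixed 0.5 for comparison
```

With a fixed 0.5 cut-off the second label is missed, while its lower per-label threshold accepts it. That is the effect dynamic thresholding is meant to capture.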
|
| 77 |
+
|
| 78 |
+
### Example
|
| 79 |
+
|
| 80 |
+
**Input:**
|
| 81 |
+
```
|
| 82 |
+
A Câmara Municipal aprovou o orçamento de 2024 com investimentos em infraestruturas
|
| 83 |
+
e transportes públicos. O vereador apresentou uma proposta para melhorar o sistema
|
| 84 |
+
de recolha de resíduos.
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
**Output:**
|
| 88 |
+
```
|
| 89 |
+
Orçamento e Finanças (Confidence: 89%)
|
| 90 |
+
Obras Públicas (Confidence: 76%)
|
| 91 |
+
Transportes (Confidence: 68%)
|
| 92 |
+
Ambiente (Confidence: 54%)
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
+
## Usage
|
| 96 |
+
|
| 97 |
+
### Quick Start with Streamlit Demo
|
| 98 |
+
|
| 99 |
+
```bash
|
| 100 |
+
# Clone the repository
|
| 101 |
+
git clone https://huggingface.co/spaces/YOUR_USERNAME/municipal-topics-classifier
|
| 102 |
+
cd municipal-topics-classifier
|
| 103 |
+
|
| 104 |
+
# Install dependencies
|
| 105 |
+
pip install -r requirements.txt
|
| 106 |
+
|
| 107 |
+
# Run the Streamlit app
|
| 108 |
+
streamlit run app.py
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
### Programmatic Usage
|
| 112 |
+
|
| 113 |
+
```python
|
| 114 |
+
import numpy as np
|
| 115 |
+
from joblib import load
|
| 116 |
+
from transformers import AutoTokenizer, AutoModel
|
| 117 |
import torch
|
| 118 |
|
| 119 |
+
# Load models
|
| 120 |
+
models_dir = 'models'
|
| 121 |
+
tfidf = load(f'{models_dir}/tfidf_vectorizer.joblib')
|
| 122 |
+
mlb = load(f'{models_dir}/mlb_encoder.joblib')
|
| 123 |
+
optimal_thresholds = np.load(f'{models_dir}/optimal_thresholds.npy')
|
| 124 |
+
adaptive_weights = np.load(f'{models_dir}/adaptive_weights.npy')
|
| 125 |
+
logistic_model = load(f'{models_dir}/logistic_model.joblib')
|
| 126 |
+
gb_models = load(f'{models_dir}/gb_models.joblib')
|
| 127 |
+
|
| 128 |
+
# Load BERTimbau
|
| 129 |
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 130 |
+
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
|
| 131 |
+
bert_model = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased").to(device)
|
| 132 |
+
|
| 133 |
+
# Preprocess text
|
| 134 |
+
text = "A Câmara Municipal aprovou o orçamento de 2024..."
|
| 135 |
+
# (apply smart_preprocess function - see app.py)
|
| 136 |
+
|
| 137 |
+
# Extract features
|
| 138 |
+
tfidf_features = tfidf.transform([text])
|
| 139 |
+
# Extract BERT embeddings into `bert_embeddings`, shape (1, 768) - see app.py
|
| 140 |
+
|
| 141 |
+
# Combine features and predict
|
| 142 |
+
X_combined = np.hstack([tfidf_features.toarray(), bert_embeddings])
|
| 143 |
+
|
| 144 |
+
# Get ensemble predictions
|
| 145 |
+
logistic_proba = logistic_model.predict_proba(X_combined)
|
| 146 |
+
# Apply the GB models and adaptive weighting to obtain `ensemble_proba` - see app.py
|
| 147 |
+
|
| 148 |
+
# Apply optimal thresholds
|
| 149 |
+
predictions = (ensemble_proba >= optimal_thresholds).astype(int)
|
| 150 |
+
predicted_labels = mlb.inverse_transform(predictions)
|
| 151 |
+
|
| 152 |
+
print(f"Predicted Topics: {predicted_labels}")
|
| 153 |
+
```
|
| 154 |
+
|
| 155 |
+
## Evaluation Results
|
| 156 |
+
|
| 157 |
+
### Test Set Performance
|
| 158 |
+
|
| 159 |
+
| Metric | Score |
|
| 160 |
+
|--------|-------|
|
| 161 |
+
| **Micro F1-Score** | 0.82 |
|
| 162 |
+
| **Macro F1-Score** | 0.74 |
|
| 163 |
+
| **Hamming Loss** | 0.08 |
|
| 164 |
+
| **Subset Accuracy** | 0.45 |
|
| 165 |
+
| **Average Precision** | 0.79 |
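
For readers less familiar with multi-label metrics, the sketch below computes four of the metrics in the table on a tiny invented example (three documents, three labels). The data and function names are assumptions for illustration only and are unrelated to the reported scores. Average Precision is omitted because it needs ranked probabilities rather than 0/1 decisions.

```python
from typing import List

Matrix = List[List[int]]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true: Matrix, y_pred: Matrix, label: int):
    """Count true positives, false positives and false negatives for one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 0 and p[label] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 0)
    return tp, fp, fn


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts over all labels, then compute one F1."""
    counts = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    return f1_from_counts(*(sum(c[i] for c in counts) for i in range(3)))


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    return sum(f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(t != p for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p))
    return wrong / (len(y_true) * len(y_true[0]))


def subset_accuracy(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of documents whose full label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]

print(micro_f1(y_true, y_pred), macro_f1(y_true, y_pred))
print(hamming_loss(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

In this toy case the third label is never predicted correctly, which drags Macro F1 (about 0.67) below Micro F1 (0.8). Subset accuracy is stricter still, because one wrong label makes the whole document count as an error.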
|
| 166 |
+
|
| 167 |
+
### Per-Label Performance (Top Categories)
|
| 168 |
+
|
| 169 |
+
| Label | Precision | Recall | F1-Score | Support |
|
| 170 |
+
|-------|-----------|--------|----------|---------|
|
| 171 |
+
| Orçamento e Finanças | 0.88 | 0.85 | 0.86 | 145 |
|
| 172 |
+
| Obras Públicas | 0.84 | 0.81 | 0.82 | 132 |
|
| 173 |
+
| Recursos Humanos | 0.79 | 0.76 | 0.77 | 98 |
|
| 174 |
+
| Educação | 0.82 | 0.78 | 0.80 | 87 |
|
| 175 |
+
| Ambiente | 0.75 | 0.72 | 0.73 | 76 |
|
| 176 |
+
|
| 177 |
+
### Ensemble Performance vs. Individual Models
|
| 178 |
+
|
| 179 |
+
| Model | Micro F1 | Macro F1 |
|
| 180 |
+
|-------|----------|----------|
|
| 181 |
+
| LogisticRegression | 0.76 | 0.68 |
|
| 182 |
+
| GradientBoosting #1 | 0.78 | 0.70 |
|
| 183 |
+
| GradientBoosting #2 | 0.79 | 0.71 |
|
| 184 |
+
| GradientBoosting #3 | 0.80 | 0.72 |
|
| 185 |
+
| **Adaptive Ensemble** | **0.82** | **0.74** |
|
| 186 |
+
|
| 187 |
+
## Dataset
|
| 188 |
+
|
| 189 |
+
The model was trained on a curated dataset of Portuguese municipal council meeting minutes:
|
| 190 |
+
|
| 191 |
+
- **Documents**: 2,500+ meeting minutes
|
| 192 |
+
- **Time Period**: 2018-2024
|
| 193 |
+
- **Source**: Portuguese municipalities (anonymized)
|
| 194 |
+
- **Labels**: 25 topic categories
|
| 195 |
+
- **Annotation**: Multi-label (avg. 2.3 labels per document)
|
| 196 |
+
- **Split**: 60% train / 20% validation / 20% test
|
| 197 |
+
|
| 198 |
+
### Label Distribution
|
| 199 |
+
|
| 200 |
+
Common topics include:
|
| 201 |
+
- Orçamento e Finanças (Budget & Finance)
|
| 202 |
+
- Obras Públicas (Public Works)
|
| 203 |
+
- Recursos Humanos (Human Resources)
|
| 204 |
+
- Educação (Education)
|
| 205 |
+
- Ambiente (Environment)
|
| 206 |
+
- Saúde (Health)
|
| 207 |
+
- Transportes (Transportation)
|
| 208 |
+
- Urbanismo (Urban Planning)
|
| 209 |
+
|
| 210 |
+
## Training Details
|
| 211 |
+
|
| 212 |
+
### Preprocessing
|
| 213 |
+
- Portuguese stopword removal
|
| 214 |
+
- Municipal entity recognition
|
| 215 |
+
- Legal term preservation
|
| 216 |
+
- N-gram extraction (1-3)
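
The normalization steps listed above might look roughly like the following. This is a hedged sketch and not the `smart_preprocess` function from `app.py`: the function name `preprocess_sketch`, the entity table, the regular expressions, and the `<valor>` placeholder token are all assumptions. Stopword removal and n-gram extraction are left to the TF-IDF vectorizer and are not shown.

```python
import re

# Hypothetical entity table: multi-word names are collapsed into single tokens.
ENTITY_PATTERNS = {
    r"câmara municipal": "camara_municipal",
    r"assembleia municipal": "assembleia_municipal",
}


def preprocess_sketch(text: str) -> str:
    """Lowercase, collapse known entities, keep legal references, mask amounts."""
    text = text.lower()

    # Municipal entities, e.g. "câmara municipal" -> "camara_municipal".
    for pattern, token in ENTITY_PATTERNS.items():
        text = re.sub(pattern, token, text)

    # Legal references, e.g. "art. 5" or "artigo 5" -> "artigo_5".
    text = re.sub(r"\bart(?:igo)?\.?\s*(\d+)", r"artigo_\1", text)

    # Currency amounts in Portuguese notation, e.g. "1.500,00 €" -> "<valor>".
    text = re.sub(r"\d+(?:\.\d{3})*(?:,\d+)?\s*(?:€|euros?)", "<valor>", text)

    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


example = "A Câmara Municipal aprovou, nos termos do Art. 5, a verba de 1.500,00 €."
print(preprocess_sketch(example))
```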
|
| 217 |
+
|
| 218 |
+
### Feature Engineering
|
| 219 |
+
- TF-IDF: 10,000 features with sublinear scaling
|
| 220 |
+
- BERTimbau: Mean-pooled embeddings (768 dims)
|
| 221 |
+
- Feature concatenation: 10,768 total dimensions
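
To make the three feature-engineering items concrete, here is a dependency-free sketch. Sublinear scaling follows the usual definition (a raw term count `tf` becomes `1 + ln(tf)`); the masked mean pooling and concatenation helpers are illustrative assumptions about how the embeddings are pooled and joined, with plain lists standing in for the real arrays.

```python
import math
from typing import List, Sequence


def sublinear_tf(count: int) -> float:
    """Sublinear term-frequency scaling: 1 + ln(count), or 0 for absent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def mean_pool(token_vectors: Sequence[Sequence[float]], attention_mask: Sequence[int]) -> List[float]:
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def concatenate(tfidf_row: Sequence[float], embedding: Sequence[float]) -> List[float]:
    """Join the sparse-style TF-IDF row and the dense embedding into one vector."""
    return list(tfidf_row) + list(embedding)


# Toy example: two real tokens plus one padding token that must be ignored.
pooled = mean_pool(
    token_vectors=[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    attention_mask=[1, 1, 0],
)
print(pooled)  # [2.0, 3.0]

# With the dimensions stated above: 10,000 TF-IDF features + 768 embedding dims.
features = concatenate([0.0] * 10_000, [0.0] * 768)
print(len(features))  # 10768
```

Masking matters because padded positions would otherwise pull the average toward meaningless vectors, especially for short texts in a padded batch.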
|
| 222 |
+
|
| 223 |
+
### Model Training
|
| 224 |
+
- **Strategy**: One-vs-Rest multi-label classification
|
| 225 |
+
- **Class Balancing**: Inverse frequency weighting
|
| 226 |
+
- **Validation**: Stratified 5-fold cross-validation
|
| 227 |
+
- **Threshold Optimization**: Per-label F1-maximization
|
| 228 |
+
- **Active Learning**: Adaptive ensemble weights
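
Per-label F1-maximization can be done with a simple grid search over candidate thresholds on the validation set. The sketch below assumes that approach; the grid step of 0.05, the function names, and the toy validation data are illustrative assumptions, not the training code.

```python
from typing import List, Sequence, Tuple


def f1_score_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one label: 2*TP / (2*TP + FP + FN), or 0 if nothing is positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(
    y_true: Sequence[int],
    proba: Sequence[float],
    candidates: Sequence[float],
) -> Tuple[float, float]:
    """Return the first candidate threshold with the highest validation F1."""
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = [int(p >= t) for p in proba]
        score = f1_score_binary(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation data for a single label.
y_val = [1, 1, 0, 0, 1]
p_val = [0.92, 0.38, 0.31, 0.12, 0.61]
grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, score = best_threshold(y_val, p_val, grid)
f1_at_half = f1_score_binary(y_val, [int(p >= 0.5) for p in p_val])
print(threshold, score, f1_at_half)
```

Here the search settles on 0.35, which separates the positives from the negatives perfectly, whereas the default 0.5 misses the positive scored at 0.38. In the full model this search would be repeated once per label.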
|
| 229 |
+
|
| 230 |
+
### Hyperparameters
|
| 231 |
+
|
| 232 |
+
**LogisticRegression:**
|
| 233 |
+
```python
|
| 234 |
+
{
|
| 235 |
+
'penalty': 'l2',
|
| 236 |
+
'C': 1.0,
|
| 237 |
+
'max_iter': 1000,
|
| 238 |
+
'class_weight': 'balanced'
|
| 239 |
+
}
|
| 240 |
+
```
|
| 241 |
+
|
| 242 |
+
**GradientBoosting Models:**
|
| 243 |
+
```python
|
| 244 |
+
# Model #1
|
| 245 |
+
{'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1}
|
| 246 |
+
|
| 247 |
+
# Model #2
|
| 248 |
+
{'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.05}
|
| 249 |
+
|
| 250 |
+
# Model #3
|
| 251 |
+
{'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.1}
|
| 252 |
+
```
|
| 253 |
+
|
| 254 |
+
## Limitations
|
| 255 |
+
|
| 256 |
+
- **Language Specificity**: Optimized for Portuguese; other languages not supported
|
| 257 |
+
- **Domain Focus**: Best performance on municipal/administrative texts
|
| 258 |
+
- **Label Set**: Fixed to 25 predefined categories (not extensible without retraining)
|
| 259 |
+
- **Context Length**: BERTimbau limited to 512 tokens (long documents are truncated)
|
| 260 |
+
- **Rare Topics**: Lower performance on infrequent labels (<20 training examples)
|
| 261 |
+
- **Ambiguous Cases**: May over-predict for texts with multiple overlapping themes
|
| 262 |
+
|
| 263 |
+
## Model Card Contact
|
| 264 |
+
|
| 265 |
+
For questions, feedback, or collaboration:
|
| 266 |
+
- 📧 Email: [your-email@example.com]
|
| 267 |
+
- 🐛 Issues: [GitHub Issues](#)
|
| 268 |
+
- 💬 Discussions: [Hugging Face Discussions](#)
|
| 269 |
+
|
| 270 |
+
## Citation
|
| 271 |
+
|
| 272 |
+
If you use this model in your research, please cite:
|
| 273 |
+
|
| 274 |
+
```bibtex
|
| 275 |
+
@misc{municipal-topics-classifier,
|
| 276 |
+
author = {Your Name},
|
| 277 |
+
title = {Municipal Topics Classifier: Multi-Label Topic Classification for Portuguese Council Texts},
|
| 278 |
+
year = {2024},
|
| 279 |
+
publisher = {Hugging Face},
|
| 280 |
+
howpublished = {\url{https://huggingface.co/YOUR_USERNAME/municipal-topics-classifier}}
|
| 281 |
+
}
|
| 282 |
+
```
|
| 283 |
+
|
| 284 |
+
## License
|
| 285 |
+
|
| 286 |
+
This model is released under the **Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International** license (CC BY-NC-ND 4.0).
|
| 287 |
+
|
| 288 |
+
- ✅ **Allowed**: Non-commercial use, redistribution with attribution
|
| 289 |
+
- ❌ **Not Allowed**: Commercial use, distribution of modified or derivative works
|
| 290 |
+
|
| 291 |
+
## Acknowledgments
|
| 292 |
+
|
| 293 |
+
- **BERTimbau**: neuralmind/bert-base-portuguese-cased
|
| 294 |
+
- **Framework**: Hugging Face Transformers, Scikit-learn
|
| 295 |
+
- **Dataset**: Portuguese municipalities (anonymized)
|
| 296 |
+
|
| 297 |
+
---
|