Update README.md
README.md
CHANGED
|
@@ -23,16 +23,16 @@ base_model:
|
|
| 23 |
|
| 24 |
## Model Description
|
| 25 |
|
| 26 |
-
**Municipal Topics Classifier** is an ensemble machine learning system specialized in **multi-label topic classification** for Portuguese municipal council meeting minutes. The model combines Gradient Boosting with Active Learning and BERTimbau embeddings to identify multiple simultaneous topics within
|
| 27 |
|
| 28 |
-
**Try out the model:** [Hugging Face Space Demo](
|
| 29 |
|
| 30 |
## Key Features
|
| 31 |
|
| 32 |
-
- **Specialized for Municipal Topics**: Trained on Portuguese council meeting minutes with domain-specific preprocessing
|
| 33 |
- **Advanced Ensemble**: Combines LogisticRegression + 3x GradientBoosting models with adaptive weighting
|
| 34 |
- **Deep + Classical Features**: Merges TF-IDF vectors (10k features) with BERTimbau embeddings (768 dims)
|
| 35 |
-
- **Multi-Label Classification**: Identifies multiple co-occurring topics per
|
| 36 |
- **Optimized Thresholds**: Dynamic per-label thresholds tuned on validation data
|
| 37 |
- **Active Learning Ready**: Adaptive weighting based on label frequency for continuous improvement
|
| 38 |
|
|
@@ -94,22 +94,6 @@ Ambiente (Confidence: 54%)
|
|
| 94 |
|
| 95 |
## Usage
|
| 96 |
|
| 97 |
-
### Quick Start with Streamlit Demo
|
| 98 |
-
|
| 99 |
-
```bash
|
| 100 |
-
# Clone the repository
|
| 101 |
-
git clone https://huggingface.co/spaces/YOUR_USERNAME/municipal-topics-classifier
|
| 102 |
-
cd municipal-topics-classifier
|
| 103 |
-
|
| 104 |
-
# Install dependencies
|
| 105 |
-
pip install -r requirements.txt
|
| 106 |
-
|
| 107 |
-
# Run the Streamlit app
|
| 108 |
-
streamlit run app.py
|
| 109 |
-
```
|
| 110 |
-
|
| 111 |
-
### Programmatic Usage
|
| 112 |
-
|
| 113 |
```python
|
| 114 |
import numpy as np
|
| 115 |
from joblib import load
|
|
@@ -164,49 +148,18 @@ print(f"Predicted Topics: {predicted_labels}")
|
|
| 164 |
| **Subset Accuracy** | 0.45 |
|
| 165 |
| **Average Precision** | 0.79 |
|
| 166 |
|
| 167 |
-
### Per-Label Performance (Top Categories)
|
| 168 |
-
|
| 169 |
-
| Label | Precision | Recall | F1-Score | Support |
|
| 170 |
-
|-------|-----------|--------|----------|---------|
|
| 171 |
-
| Orçamento e Finanças | 0.88 | 0.85 | 0.86 | 145 |
|
| 172 |
-
| Obras Públicas | 0.84 | 0.81 | 0.82 | 132 |
|
| 173 |
-
| Recursos Humanos | 0.79 | 0.76 | 0.77 | 98 |
|
| 174 |
-
| Educação | 0.82 | 0.78 | 0.80 | 87 |
|
| 175 |
-
| Ambiente | 0.75 | 0.72 | 0.73 | 76 |
|
| 176 |
-
|
| 177 |
-
### Ensemble Performance vs. Individual Models
|
| 178 |
-
|
| 179 |
-
| Model | Micro F1 | Macro F1 |
|
| 180 |
-
|-------|----------|----------|
|
| 181 |
-
| LogisticRegression | 0.76 | 0.68 |
|
| 182 |
-
| GradientBoosting #1 | 0.78 | 0.70 |
|
| 183 |
-
| GradientBoosting #2 | 0.79 | 0.71 |
|
| 184 |
-
| GradientBoosting #3 | 0.80 | 0.72 |
|
| 185 |
-
| **Adaptive Ensemble** | **0.82** | **0.74** |
|
| 186 |
|
| 187 |
## Dataset
|
| 188 |
|
| 189 |
The model was trained on a curated dataset of Portuguese municipal council meeting minutes:
|
| 190 |
|
| 191 |
-
- **Documents**: 2,500+ meeting minutes
|
| 192 |
-
- **Time Period**:
|
| 193 |
- **Source**: Portuguese municipalities (anonymized)
|
| 194 |
-
- **Labels**:
|
| 195 |
-
- **Annotation**: Multi-label (avg.
|
| 196 |
- **Split**: 60% train / 20% validation / 20% test
|
| 197 |
|
| 198 |
-
### Label Distribution
|
| 199 |
-
|
| 200 |
-
Common topics include:
|
| 201 |
-
- Orçamento e Finanças (Budget & Finance)
|
| 202 |
-
- Obras Públicas (Public Works)
|
| 203 |
-
- Recursos Humanos (Human Resources)
|
| 204 |
-
- Educação (Education)
|
| 205 |
-
- Ambiente (Environment)
|
| 206 |
-
- Saúde (Health)
|
| 207 |
-
- Transportes (Transportation)
|
| 208 |
-
- Urbanismo (Urban Planning)
|
| 209 |
-
|
| 210 |
## Training Details
|
| 211 |
|
| 212 |
### Preprocessing
|
|
@@ -253,45 +206,15 @@ Common topics include:
|
|
| 253 |
|
| 254 |
## Limitations
|
| 255 |
|
| 256 |
-
- **Language Specificity**: Optimized for Portuguese
|
| 257 |
- **Domain Focus**: Best performance on municipal/administrative texts
|
| 258 |
-
- **Label Set**: Fixed to
|
| 259 |
-
- **Context Length**: BERTimbau limited to 512 tokens (long documents are truncated)
|
| 260 |
- **Rare Topics**: Lower performance on infrequent labels (<20 training examples)
|
| 261 |
- **Ambiguous Cases**: May over-predict for texts with multiple overlapping themes
|
| 262 |
|
| 263 |
-
## Model Card Contact
|
| 264 |
-
|
| 265 |
-
For questions, feedback, or collaboration:
|
| 266 |
-
- Email: [your-email@example.com]
|
| 267 |
-
- Issues: [GitHub Issues](#)
|
| 268 |
-
- Discussions: [Hugging Face Discussions](#)
|
| 269 |
-
|
| 270 |
-
## Citation
|
| 271 |
-
|
| 272 |
-
If you use this model in your research, please cite:
|
| 273 |
-
|
| 274 |
-
```bibtex
|
| 275 |
-
@misc{municipal-topics-classifier,
|
| 276 |
-
author = {Your Name},
|
| 277 |
-
title = {Municipal Topics Classifier: Multi-Label Topic Classification for Portuguese Council Texts},
|
| 278 |
-
year = {2024},
|
| 279 |
-
publisher = {Hugging Face},
|
| 280 |
-
howpublished = {\url{https://huggingface.co/YOUR_USERNAME/municipal-topics-classifier}}
|
| 281 |
-
}
|
| 282 |
-
```
|
| 283 |
|
| 284 |
## License
|
| 285 |
|
| 286 |
This model is released under the **Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International** license (CC BY-NC-ND 4.0).
|
| 287 |
|
| 288 |
-
- **Allowed**: Non-commercial use, redistribution with attribution
|
| 289 |
-
- **Not Allowed**: Commercial use, modifications, derivative works
|
| 290 |
-
|
| 291 |
-
## Acknowledgments
|
| 292 |
-
|
| 293 |
-
- **BERTimbau**: neuralmind/bert-base-portuguese-cased
|
| 294 |
-
- **Framework**: Hugging Face Transformers, Scikit-learn
|
| 295 |
-
- **Dataset**: Portuguese municipalities (anonymized)
|
| 296 |
-
|
| 297 |
---
|
|
|
|
| 23 |
|
| 24 |
## Model Description
|
| 25 |
|
| 26 |
+
**Municipal Topics Classifier** is an ensemble machine learning system specialized in **multi-label topic classification** for Portuguese municipal council meeting minutes. The model combines Gradient Boosting with Active Learning and BERTimbau embeddings to identify multiple simultaneous topics within municipal discussion subjects, making it particularly effective for categorizing complex governmental content.
|
| 27 |
|
| 28 |
+
**Try out the model:** [Hugging Face Space Demo](https://huggingface.co/spaces/anonymous12321/GB_CouncilTopics-PT)
|
| 29 |
|
| 30 |
## Key Features
|
| 31 |
|
| 32 |
+
- **Specialized for Municipal Topics**: Trained on discussion subjects from Portuguese council meeting minutes, with domain-specific preprocessing
|
| 33 |
- **Advanced Ensemble**: Combines LogisticRegression + 3x GradientBoosting models with adaptive weighting
|
| 34 |
- **Deep + Classical Features**: Merges TF-IDF vectors (10k features) with BERTimbau embeddings (768 dims)
|
| 35 |
+
- **Multi-Label Classification**: Identifies multiple co-occurring topics per subject
|
| 36 |
- **Optimized Thresholds**: Dynamic per-label thresholds tuned on validation data
|
| 37 |
- **Active Learning Ready**: Adaptive weighting based on label frequency for continuous improvement
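
The ensemble weighting and per-label thresholds listed above could be combined roughly as follows. This is a minimal sketch, not the model's actual code: the function names `weighted_ensemble` and `apply_thresholds`, the weights, the thresholds, and the probabilities are all hypothetical values chosen for illustration.

```python
from typing import Dict, List, Sequence


def weighted_ensemble(
    model_probs: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Average per-label probabilities from several models using normalized weights."""
    if len(model_probs) != len(weights):
        raise ValueError("one weight is required per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_labels = len(model_probs[0])
    return [
        sum(w * probs[j] for w, probs in zip(weights, model_probs)) / total
        for j in range(n_labels)
    ]


def apply_thresholds(
    probs: Sequence[float],
    labels: Sequence[str],
    thresholds: Dict[str, float],
    default: float = 0.5,
) -> List[str]:
    """Keep every label whose probability reaches its own threshold."""
    return [
        label
        for label, p in zip(labels, probs)
        if p >= thresholds.get(label, default)
    ]


# Hypothetical inputs: two models scoring three topic labels.
labels = ["Ambiente", "Urbanismo", "Transportes"]
model_probs = [
    [0.60, 0.30, 0.10],  # stand-in for a LogisticRegression output
    [0.50, 0.50, 0.20],  # stand-in for a GradientBoosting output
]
weights = [1.0, 3.0]  # assumed weights, not the trained ones
thresholds = {"Ambiente": 0.5, "Urbanismo": 0.4, "Transportes": 0.3}

ensemble_probs = weighted_ensemble(model_probs, weights)
predicted = apply_thresholds(ensemble_probs, labels, thresholds)
print(predicted)
```

Because each label has its own cut-off, a topic with a lower threshold can be predicted at a probability that would be rejected for another topic.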
|
| 38 |
|
|
|
|
| 94 |
|
| 95 |
## Usage
|
| 96 |
|
| 97 |
```python
|
| 98 |
import numpy as np
|
| 99 |
from joblib import load
|
|
|
|
| 148 |
| **Subset Accuracy** | 0.45 |
|
| 149 |
| **Average Precision** | 0.79 |
|
| 150 |
|
| 151 |
|
| 152 |
## Dataset
|
| 153 |
|
| 154 |
The model was trained on a curated dataset of Portuguese municipal council meeting minutes:
|
| 155 |
|
| 156 |
+
- **Documents**: 2,500+ discussion subjects from meeting minutes
|
| 157 |
+
- **Time Period**: 2021-2024
|
| 158 |
- **Source**: Portuguese municipalities (anonymized)
|
| 159 |
+
- **Labels**: 22 topic categories
|
| 160 |
+
- **Annotation**: Multi-label (avg. 1.69 labels per document)
|
| 161 |
- **Split**: 60% train / 20% validation / 20% test
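
A 60/20/20 split like the one listed above can be reproduced with a seeded shuffle. The sketch below is an assumption about the procedure: the README does not say whether the split was random or stratified by label, and `split_60_20_20`, the seed, and the placeholder document IDs are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_60_20_20(
    items: Sequence[T], seed: int = 42
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed, then cut into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(shuffled) * 60 // 100
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder IDs standing in for the real documents.
documents = [f"doc_{i}" for i in range(2500)]
train, val, test = split_60_20_20(documents)
print(len(train), len(val), len(test))
```

A plain random split does not guarantee that rare labels appear in every part, so a multi-label stratified split may be preferable in practice.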
|
| 162 |
|
| 163 |
## Training Details
|
| 164 |
|
| 165 |
### Preprocessing
|
|
|
|
| 206 |
|
| 207 |
## Limitations
|
| 208 |
|
| 209 |
+
- **Language Specificity**: Optimized for Portuguese
|
| 210 |
- **Domain Focus**: Best performance on municipal/administrative texts
|
| 211 |
+
- **Label Set**: Fixed to 22 predefined categories
|
| 212 |
- **Rare Topics**: Lower performance on infrequent labels (<20 training examples)
|
| 213 |
- **Ambiguous Cases**: May over-predict for texts with multiple overlapping themes
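
To see which labels fall under the rare-topic limitation above, one option is to count training examples per label and flag those below 20. This is an illustrative sketch only: `rare_labels` and the toy label lists are hypothetical and not part of the released code.

```python
from collections import Counter
from typing import Dict, Iterable


def rare_labels(
    label_sets: Iterable[Iterable[str]], min_examples: int = 20
) -> Dict[str, int]:
    """Return labels that occur in fewer than `min_examples` training documents."""
    # set(labels) ensures a label is counted once per document.
    counts = Counter(label for labels in label_sets for label in set(labels))
    return {label: n for label, n in counts.items() if n < min_examples}


# Toy training labels: 25 documents with two topics, 5 with a third.
train_labels = [["Ambiente", "Urbanismo"]] * 25 + [["Transportes"]] * 5
print(rare_labels(train_labels))
```

Labels that never occur in the training set do not show up in the result, so the full label list should be checked separately.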
|
| 214 |
|
| 215 |
|
| 216 |
## License
|
| 217 |
|
| 218 |
This model is released under the **Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International** license (CC BY-NC-ND 4.0).
|
| 219 |
|
| 220 |
---
|