anonymous12321 committed on
Commit e402267 · verified · 1 Parent(s): 6cdd194

Update README.md

Files changed (1)
  1. README.md +10 -87
README.md CHANGED
@@ -23,16 +23,16 @@ base_model:
23
 
24
  ## Model Description
25
 
26
- **Municipal Topics Classifier** is an ensemble machine learning system specialized in **multi-label topic classification** for Portuguese municipal council meeting minutes. The model combines Gradient Boosting with Active Learning and BERTimbau embeddings to identify multiple simultaneous topics within administrative texts, making it particularly effective for categorizing complex governmental content.
27
 
28
- 🚀 **Try out the model:** [Hugging Face Space Demo](#)
29
 
30
  ## Key Features
31
 
32
- - 🎯 **Specialized for Municipal Topics**: Trained on Portuguese council meeting minutes with domain-specific preprocessing
33
  - 🏆 **Advanced Ensemble**: Combines LogisticRegression + 3x GradientBoosting models with adaptive weighting
34
  - 🧠 **Deep + Classical Features**: Merges TF-IDF vectors (10k features) with BERTimbau embeddings (768 dims)
35
- - 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per text
36
  - ⚡ **Optimized Thresholds**: Dynamic per-label thresholds tuned on validation data
37
  - 🔄 **Active Learning Ready**: Adaptive weighting based on label frequency for continuous improvement
38
 
@@ -94,22 +94,6 @@ Ambiente (Confidence: 54%)
94
 
95
  ## Usage
96
 
97
- ### Quick Start with Streamlit Demo
98
-
99
- ```bash
100
- # Clone the repository
101
- git clone https://huggingface.co/spaces/YOUR_USERNAME/municipal-topics-classifier
102
- cd municipal-topics-classifier
103
-
104
- # Install dependencies
105
- pip install -r requirements.txt
106
-
107
- # Run the Streamlit app
108
- streamlit run app.py
109
- ```
110
-
111
- ### Programmatic Usage
112
-
113
  ```python
114
  import numpy as np
115
  from joblib import load
@@ -164,49 +148,18 @@ print(f"Predicted Topics: {predicted_labels}")
164
  | **Subset Accuracy** | 0.45 |
165
  | **Average Precision** | 0.79 |
166
 
167
- ### Per-Label Performance (Top Categories)
168
-
169
- | Label | Precision | Recall | F1-Score | Support |
170
- |-------|-----------|--------|----------|---------|
171
- | Orçamento e Finanças | 0.88 | 0.85 | 0.86 | 145 |
172
- | Obras Públicas | 0.84 | 0.81 | 0.82 | 132 |
173
- | Recursos Humanos | 0.79 | 0.76 | 0.77 | 98 |
174
- | Educação | 0.82 | 0.78 | 0.80 | 87 |
175
- | Ambiente | 0.75 | 0.72 | 0.73 | 76 |
176
-
177
- ### Ensemble Performance vs. Individual Models
178
-
179
- | Model | Micro F1 | Macro F1 |
180
- |-------|----------|----------|
181
- | LogisticRegression | 0.76 | 0.68 |
182
- | GradientBoosting #1 | 0.78 | 0.70 |
183
- | GradientBoosting #2 | 0.79 | 0.71 |
184
- | GradientBoosting #3 | 0.80 | 0.72 |
185
- | **Adaptive Ensemble** | **0.82** | **0.74** |
186
 
187
  ## Dataset
188
 
189
  The model was trained on a curated dataset of Portuguese municipal council meeting minutes:
190
 
191
- - **Documents**: 2,500+ meeting minutes
192
- - **Time Period**: 2018-2024
193
  - **Source**: Portuguese municipalities (anonymized)
194
- - **Labels**: 25 topic categories
195
- - **Annotation**: Multi-label (avg. 2.3 labels per document)
196
  - **Split**: 60% train / 20% validation / 20% test
197
 
198
- ### Label Distribution
199
-
200
- Common topics include:
201
- - Orçamento e Finanças (Budget & Finance)
202
- - Obras Públicas (Public Works)
203
- - Recursos Humanos (Human Resources)
204
- - Educação (Education)
205
- - Ambiente (Environment)
206
- - Saúde (Health)
207
- - Transportes (Transportation)
208
- - Urbanismo (Urban Planning)
209
-
210
  ## Training Details
211
 
212
  ### Preprocessing
@@ -253,45 +206,15 @@ Common topics include:
253
 
254
  ## Limitations
255
 
256
- - **Language Specificity**: Optimized for Portuguese; other languages not supported
257
  - **Domain Focus**: Best performance on municipal/administrative texts
258
- - **Label Set**: Fixed to 25 predefined categories (not extensible without retraining)
259
- - **Context Length**: BERTimbau limited to 512 tokens (long documents are truncated)
260
  - **Rare Topics**: Lower performance on infrequent labels (<20 training examples)
261
  - **Ambiguous Cases**: May over-predict for texts with multiple overlapping themes
262
 
263
- ## Model Card Contact
264
-
265
- For questions, feedback, or collaboration:
266
- - 📧 Email: [your-email@example.com]
267
- - 🐛 Issues: [GitHub Issues](#)
268
- - 💬 Discussions: [Hugging Face Discussions](#)
269
-
270
- ## Citation
271
-
272
- If you use this model in your research, please cite:
273
-
274
- ```bibtex
275
- @misc{municipal-topics-classifier,
276
- author = {Your Name},
277
- title = {Municipal Topics Classifier: Multi-Label Topic Classification for Portuguese Council Texts},
278
- year = {2024},
279
- publisher = {Hugging Face},
280
- howpublished = {\url{https://huggingface.co/YOUR_USERNAME/municipal-topics-classifier}}
281
- }
282
- ```
283
 
284
  ## License
285
 
286
  This model is released under the **Attribution-NonCommercial-NoDerivatives 4.0 International** (CC BY-NC-ND 4.0).
287
 
288
- - ✅ **Allowed**: Non-commercial use, redistribution with attribution
289
- - ❌ **Not Allowed**: Commercial use, modifications, derivative works
290
-
291
- ## Acknowledgments
292
-
293
- - **BERTimbau**: neuralmind/bert-base-portuguese-cased
294
- - **Framework**: Hugging Face Transformers, Scikit-learn
295
- - **Dataset**: Portuguese municipalities (anonymized)
296
-
297
  ---
 
23
 
24
  ## Model Description
25
 
26
+ **Municipal Topics Classifier** is an ensemble machine learning system specialized in **multi-label topic classification** for Portuguese municipal council meeting minutes. The model combines Gradient Boosting with Active Learning and BERTimbau embeddings to identify multiple simultaneous topics within municipal discussion subjects, making it particularly effective for categorizing complex governmental content.
27
 
28
+ 🚀 **Try out the model:** [Hugging Face Space Demo](https://huggingface.co/spaces/anonymous12321/GB_CouncilTopics-PT)
29
 
30
  ## Key Features
31
 
32
+ - 🎯 **Specialized for Municipal Topics**: Trained on Portuguese council meeting minutes discussion subjects with domain-specific preprocessing
33
  - 🏆 **Advanced Ensemble**: Combines LogisticRegression + 3x GradientBoosting models with adaptive weighting
34
  - 🧠 **Deep + Classical Features**: Merges TF-IDF vectors (10k features) with BERTimbau embeddings (768 dims)
35
+ - 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per subject
36
  - ⚡ **Optimized Thresholds**: Dynamic per-label thresholds tuned on validation data
37
  - 🔄 **Active Learning Ready**: Adaptive weighting based on label frequency for continuous improvement
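
The ensemble and threshold features listed above can be illustrated with a minimal sketch. It assumes that each model outputs per-label probabilities, that "adaptive weighting" amounts to a weighted average of those probabilities, and that each label's threshold is chosen to maximize F1 on validation data. None of these details are confirmed by the model card, and all function names here are hypothetical.

```python
# Hypothetical sketch: weighted soft voting plus per-label threshold tuning.
# The weighting scheme and the F1 criterion are assumptions, not the
# model's documented implementation.


def weighted_average(prob_sets, weights):
    """Combine per-model probability matrices (samples x labels)."""
    total = sum(weights)
    n_samples, n_labels = len(prob_sets[0]), len(prob_sets[0][0])
    return [
        [
            sum(w * probs[i][j] for probs, w in zip(prob_sets, weights)) / total
            for j in range(n_labels)
        ]
        for i in range(n_samples)
    ]


def f1(y_true, y_pred):
    """Binary F1 for one label; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def tune_thresholds(y_val, p_val, grid=None):
    """Pick, for each label, the grid threshold with the best validation F1."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95
    thresholds = []
    for j in range(len(y_val[0])):
        truth = [row[j] for row in y_val]
        scores = [row[j] for row in p_val]
        best = max(grid, key=lambda t: f1(truth, [s >= t for s in scores]))
        thresholds.append(best)
    return thresholds


def apply_thresholds(probs, thresholds):
    """Turn probabilities into 0/1 predictions, one threshold per label."""
    return [[int(p >= t) for p, t in zip(row, thresholds)] for row in probs]
```

In this sketch the thresholds are fitted on the validation split only and then reused unchanged on the test split, so the reported test metrics are not tuned on test data.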
38
 
 
94
 
95
  ## Usage
96
 
97
  ```python
98
  import numpy as np
99
  from joblib import load
 
148
  | **Subset Accuracy** | 0.45 |
149
  | **Average Precision** | 0.79 |
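
Subset accuracy is the strictest multi-label metric in the table: a sample counts as correct only when its full predicted label vector matches the reference exactly, which can explain why it is lower than the averaged scores. A minimal sketch on hypothetical toy data (the helper name is illustrative and not part of the model's code):

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label vector matches exactly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Toy example: 4 samples, 3 labels; samples 2 and 4 each miss one label.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 0]]
print(subset_accuracy(y_true, y_pred))  # 0.5
```

For multi-label indicator inputs, scikit-learn's `accuracy_score` computes this same exact-match ratio.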
150
 
151
 
152
  ## Dataset
153
 
154
  The model was trained on a curated dataset of Portuguese municipal council meeting minutes:
155
 
156
+ - **Documents**: 2,500+ meeting minutes subjects
157
+ - **Time Period**: 2021-2024
158
  - **Source**: Portuguese municipalities (anonymized)
159
+ - **Labels**: 22 topic categories
160
+ - **Annotation**: Multi-label (avg. 1.69 labels per document)
161
  - **Split**: 60% train / 20% validation / 20% test
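
As an illustration of the 60/20/20 split, the sketch below shuffles item indices with a fixed seed and cuts them into three parts. The model card does not say whether the split was random or stratified by label, so plain shuffling, the seed, the helper name, and the round figure of 2,500 items are all assumptions.

```python
import random


def split_60_20_20(items, seed=42):
    """Shuffle a copy of items and cut it into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100  # integer arithmetic avoids float rounding
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Hypothetical corpus size; the card only states "2,500+".
train, val, test = split_60_20_20(range(2500))
print(len(train), len(val), len(test))  # 1500 500 500
```

With rare labels, a plain random split can leave some categories missing from the validation or test part, so a label-aware split may be preferable in practice.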
162
 
163
  ## Training Details
164
 
165
  ### Preprocessing
 
206
 
207
  ## Limitations
208
 
209
+ - **Language Specificity**: Optimized for Portuguese
210
  - **Domain Focus**: Best performance on municipal/administrative texts
211
+ - **Label Set**: Fixed to 22 predefined categories
 
212
  - **Rare Topics**: Lower performance on infrequent labels (<20 training examples)
213
  - **Ambiguous Cases**: May over-predict for texts with multiple overlapping themes
214
 
215
 
216
  ## License
217
 
218
  This model is released under the **Attribution-NonCommercial-NoDerivatives 4.0 International** (CC BY-NC-ND 4.0).
219
 
220
  ---