Commit 6cdd194 by anonymous12321 · verified · 1 parent: 19c9a07

Update README.md

  1. README.md +277 -92
README.md CHANGED
@@ -1,112 +1,297 @@
 
1
  language:
2
- - pt
3
- - en
4
  license: cc-by-nc-nd-4.0
5
- colorTo: red
6
- sdk: docker
7
  app_port: 8501
8
  tags:
9
- - streamlit
10
- - text-segmentation
11
- - topic-segmentation
12
- - bert
13
- - next-sentence-prediction
14
- - document-segmentation
15
- - meeting-minutes
 
16
  library_name: transformers
17
  base_model:
18
- - neuralmind/bert-base-portuguese-cased
 
19
 
20
- NSP-CouncilSeg: Linear Text Segmentation for Municipal Meeting Minutes
21
- Model Description
22
 
23
- NSP-CouncilSeg is a fine-tuned BERT model specialized in Text Segmentation for municipal council meeting minutes. The model uses Next Sentence Prediction (NSP) to identify topic boundaries in long-form documents, making it particularly effective for segmenting administrative and governmental meeting minutes.
24
 
25
- Try out the model: Hugging Face Space Demo
26
- Key Features
27
 
28
- 🎯 Specialized for Meeting Minutes: Fine-tuned on Portuguese municipal council meeting minutes
29
- 🌍 Multilingual Capability: Works with both Portuguese and English text
30
- ⚡ Fast Inference: Efficient BERT-base architecture for real-time segmentation
31
- 📊 High Accuracy: Achieves BED F-measure score of 0.79 on CouncilSeg dataset
32
- 🔄 Sentence-Level Segmentation: Identifies topic boundaries at sentence granularity
33
 
34
- Model Details
35
 
36
- Base Model: google-bert/bert-base-uncased
37
- Architecture: BERT with Next Sentence Prediction head
38
- Parameters: 110M
39
- Max Sequence Length: 512 tokens
40
- Fine-tuning Dataset: CouncilSeg (Portuguese Municipal Meeting Minutes)
41
- Fine-tuning Method: Focal Loss with boundary-aware weighting
42
- Training Framework: PyTorch + Transformers
43
 
44
- How It Works
45
 
46
- The model predicts whether two consecutive sentences belong to the same topic (label 0: "is_next") or represent a topic transition (label 1: "not_next"). By applying this classifier sequentially across all sentence pairs in a document, it identifies topic boundaries.
47
 
48
- Sentence A: "By the President, minutes no. 28 of 20.12.2023 were present at the meeting."
49
- Sentence B: "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."
50
- → Prediction: Same Topic (confidence: 76%)
51
 
52
- Sentence A: "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."
53
- Sentence B: "There were no various processes and requests to submit."
54
- → Prediction: Topic Boundary (confidence: 82%)
55
 
56
- Usage
57
- Quick Start with Transformers
 
 
 
58
 
59
- from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
60
  import torch
61
 
62
- # Load model and tokenizer
63
- tokenizer = AutoTokenizer.from_pretrained("anonymous15135/nsp-councilseg")
64
- model = AutoModelForNextSentencePrediction.from_pretrained("anonymous15135/nsp-councilseg")
65
-
66
- # Prepare input
67
- sentence_a = "By the President, minutes no. 28 of 20.12.2023 were present at the meeting."
68
- sentence_b = "After considering and analyzing the matter, the Municipal Executive unanimously decided to approve minute no. 28 of 12.20.2023."
69
-
70
-
71
- # Tokenize
72
- inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
73
-
74
- # Predict
75
- with torch.no_grad():
76
-     outputs = model(**inputs)
77
-     logits = outputs.logits
78
-     probs = torch.softmax(logits, dim=1)
79
-
80
- # Interpret results
81
- is_next_prob = probs[0][0].item()
82
- not_next_prob = probs[0][1].item()
83
-
84
- print(f"Is Next (same topic): {is_next_prob:.3f}")
85
- print(f"Not Next (topic boundary): {not_next_prob:.3f}")
86
-
87
- if not_next_prob > 0.5:
88
-     print("🔴 Topic boundary detected!")
89
- else:
90
-     print("🟢 Same topic continues")
91
-
92
- Evaluation Results
93
- CouncilSeg Test Set
94
- Metric Score
95
- BED F-measure 0.79
96
- Boundary Similarity 0.59
97
- Pk Score 0.08
98
- WindowDiff 0.10
99
- Limitations
100
-
101
- Domain Specificity: Best performance on administrative/governmental meeting minutes
102
- Language: Optimized for Portuguese; English performance may vary
103
- Document Length: Designed for documents with 10-50 segments
104
- Context Window: Limited to 512 tokens per sentence pair
105
- Ambiguous Boundaries: May struggle with subtle topic transitions
106
-
107
- Model Card Contact
108
-
109
- For questions or feedback, please open an issue in the model repository.
110
- License
111
-
112
- This model is released under the Attribution-NonCommercial-NoDerivatives 4.0 International
 
1
+ ---
2
  language:
3
+ - pt
 
4
  license: cc-by-nc-nd-4.0
5
+ colorTo: blue
6
+ sdk: streamlit
7
  app_port: 8501
8
  tags:
9
+ - streamlit
10
+ - text-classification
11
+ - multi-label-classification
12
+ - gradient-boosting
13
+ - active-learning
14
+ - bertimbau
15
+ - municipal-documents
16
+ - meeting-minutes
17
  library_name: transformers
18
  base_model:
19
+ - neuralmind/bert-base-portuguese-cased
20
+ ---
21
 
22
+ # Municipal Topics Classifier: Multi-Label Topic Classification for Portuguese Council Texts
 
23
 
24
+ ## Model Description
25
 
26
+ **Municipal Topics Classifier** is an ensemble machine learning system specialized in **multi-label topic classification** for Portuguese municipal council meeting minutes. The model combines Gradient Boosting with Active Learning and BERTimbau embeddings to identify multiple simultaneous topics within administrative texts, making it particularly effective for categorizing complex governmental content.
 
27
 
28
+ 🚀 **Try out the model:** [Hugging Face Space Demo](#)
 
 
 
 
29
 
30
+ ## Key Features
31
 
32
+ - 🎯 **Specialized for Municipal Topics**: Trained on Portuguese council meeting minutes with domain-specific preprocessing
33
+ - 🏆 **Advanced Ensemble**: Combines LogisticRegression + 3x GradientBoosting models with adaptive weighting
34
+ - 🧠 **Deep + Classical Features**: Merges TF-IDF vectors (10k features) with BERTimbau embeddings (768 dims)
35
+ - 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per text
36
+ - **Optimized Thresholds**: Dynamic per-label thresholds tuned on validation data
37
+ - 🔄 **Active Learning Ready**: Adaptive weighting based on label frequency for continuous improvement
 
38
 
39
+ ## Model Details
40
 
41
+ - **Architecture**: Ensemble (LogisticRegression + 3x GradientBoosting)
42
+ - **Base Models**:
43
+   - 1x LogisticRegression (L2 regularization, C=1.0)
44
+   - GradientBoosting Model #1 (n_estimators=100, max_depth=3, learning_rate=0.1)
45
+   - GradientBoosting Model #2 (n_estimators=150, max_depth=5, learning_rate=0.05)
46
+   - GradientBoosting Model #3 (n_estimators=200, max_depth=4, learning_rate=0.1)
47
+ - **Feature Extractor**: TF-IDF (n-grams 1-3, 10k features, Portuguese stopwords)
48
+ - **Embedding Model**: neuralmind/bert-base-portuguese-cased (BERTimbau)
49
+ - **Total Features**: 10,768 dimensions (10k TF-IDF + 768 BERT)
50
+ - **Training Method**: One-vs-Rest with class weighting + Focal Loss
51
+ - **Optimization**: Adaptive ensemble weighting by label frequency
52
+ - **Framework**: Scikit-learn + PyTorch + Transformers
53
 
54
+ ## How It Works
 
 
55
 
56
+ The model processes Portuguese municipal texts through a four-stage pipeline to identify relevant topics:
 
 
57
 
58
+ 1. **Portuguese-Specific Preprocessing**
59
+    - Lowercasing and normalization
60
+    - Municipal entity recognition (e.g., "Câmara Municipal" → "camara_municipal")
61
+    - Legal term preservation (e.g., "Art. 5" → "artigo_5")
62
+    - Number and currency standardization
63
 
64
+ 2. **Dual Feature Extraction**
65
+    - **TF-IDF**: Captures term frequency patterns with n-grams (1-3)
66
+    - **BERTimbau**: Provides contextual semantic embeddings
67
+
68
+ 3. **Ensemble Prediction**
69
+    - Each base model predicts probabilities for all labels
70
+    - Adaptive weighted combination based on label rarity:
71
+      - **Rare labels**: Higher LogisticRegression weight
72
+      - **Common labels**: Higher GradientBoosting weight
73
+
74
+ 4. **Dynamic Thresholding**
75
+    - Per-label optimal thresholds (not fixed 0.5)
76
+    - Optimized for F1-score on validation set
77
+
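The ensemble and thresholding steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `app.py`: the weighting rule `lr_weight_for` (a linear interpolation on label frequency with hypothetical bounds 0.3 and 0.7), the helper `ensemble_predict`, and every probability, frequency, and threshold below are made up for the example.

```python
def lr_weight_for(label_freq, low=0.3, high=0.7):
    """Hypothetical rule: rarer labels get a larger LogisticRegression weight."""
    label_freq = min(max(label_freq, 0.0), 1.0)
    return high - (high - low) * label_freq


def ensemble_predict(lr_proba, gb_probas, label_freqs, thresholds):
    """Blend per-label probabilities, then apply per-label thresholds."""
    blended, decisions = [], []
    for j, freq in enumerate(label_freqs):
        w_lr = lr_weight_for(freq)
        # Average the GradientBoosting models for this label.
        gb_mean = sum(p[j] for p in gb_probas) / len(gb_probas)
        p = w_lr * lr_proba[j] + (1.0 - w_lr) * gb_mean
        blended.append(p)
        decisions.append(int(p >= thresholds[j]))
    return blended, decisions


# Toy inputs for one text and two labels (illustrative values only).
lr_proba = [0.80, 0.20]                                  # LogisticRegression
gb_probas = [[0.60, 0.40], [0.50, 0.30], [0.70, 0.50]]   # three GB models
label_freqs = [0.05, 0.60]                               # rare label, common label
thresholds = [0.50, 0.35]                                # per-label thresholds

blended, decisions = ensemble_predict(lr_proba, gb_probas, label_freqs, thresholds)
print(blended, decisions)
```

In this toy case the rare label leans on the LogisticRegression probability and is predicted, while the common label leans on the GradientBoosting average and falls below its threshold.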
78
+ ### Example
79
+
80
+ **Input:**
81
+ ```
82
+ A Câmara Municipal aprovou o orçamento de 2024 com investimentos em infraestruturas
83
+ e transportes públicos. O vereador apresentou uma proposta para melhorar o sistema
84
+ de recolha de resíduos.
85
+ ```
86
+
87
+ **Output:**
88
+ ```
89
+ Orçamento e Finanças (Confidence: 89%)
90
+ Obras Públicas (Confidence: 76%)
91
+ Transportes (Confidence: 68%)
92
+ Ambiente (Confidence: 54%)
93
+ ```
94
+
95
+ ## Usage
96
+
97
+ ### Quick Start with Streamlit Demo
98
+
99
+ ```bash
100
+ # Clone the repository
101
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/municipal-topics-classifier
102
+ cd municipal-topics-classifier
103
+
104
+ # Install dependencies
105
+ pip install -r requirements.txt
106
+
107
+ # Run the Streamlit app
108
+ streamlit run app.py
109
+ ```
110
+
111
+ ### Programmatic Usage
112
+
113
+ ```python
114
+ import numpy as np
115
+ from joblib import load
116
+ from transformers import AutoTokenizer, AutoModel
117
  import torch
118
 
119
+ # Load models
120
+ models_dir = 'models'
121
+ tfidf = load(f'{models_dir}/tfidf_vectorizer.joblib')
122
+ mlb = load(f'{models_dir}/mlb_encoder.joblib')
123
+ optimal_thresholds = np.load(f'{models_dir}/optimal_thresholds.npy')
124
+ adaptive_weights = np.load(f'{models_dir}/adaptive_weights.npy')
125
+ logistic_model = load(f'{models_dir}/logistic_model.joblib')
126
+ gb_models = load(f'{models_dir}/gb_models.joblib')
127
+
128
+ # Load BERTimbau
129
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
130
+ tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
131
+ bert_model = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased").to(device)
132
+
133
+ # Preprocess text
134
+ text = "A Câmara Municipal aprovou o orçamento de 2024..."
135
+ # (apply the smart_preprocess function to `text` - see app.py)
136
+
137
+ # Extract features
138
+ tfidf_features = tfidf.transform([text])
139
+ # (extract mean-pooled BERT embeddings into `bert_embeddings`, shape (1, 768) - see app.py)
140
+
141
+ # Combine features and predict
142
+ X_combined = np.hstack([tfidf_features.toarray(), bert_embeddings])
143
+
144
+ # Get ensemble predictions
145
+ logistic_proba = logistic_model.predict_proba(X_combined)
146
+ # (apply GB models and adaptive weighting to obtain `ensemble_proba` - see app.py)
147
+
148
+ # Apply optimal thresholds
149
+ predictions = (ensemble_proba >= optimal_thresholds).astype(int)
150
+ predicted_labels = mlb.inverse_transform(predictions)
151
+
152
+ print(f"Predicted Topics: {predicted_labels}")
153
+ ```
154
+
155
+ ## Evaluation Results
156
+
157
+ ### Test Set Performance
158
+
159
+ | Metric | Score |
160
+ |--------|-------|
161
+ | **Micro F1-Score** | 0.82 |
162
+ | **Macro F1-Score** | 0.74 |
163
+ | **Hamming Loss** | 0.08 |
164
+ | **Subset Accuracy** | 0.45 |
165
+ | **Average Precision** | 0.79 |
166
+
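For readers unfamiliar with the metrics in the table above, the sketch below shows how Micro F1, Macro F1, Hamming Loss, and Subset Accuracy are conventionally computed from binary label matrices. The function name `multilabel_metrics` and the toy matrices are assumptions for illustration; they are unrelated to the reported scores.

```python
def multilabel_metrics(y_true, y_pred):
    """Compute common multi-label metrics from lists of 0/1 rows."""
    n_samples, n_labels = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    exact = 0
    for t_row, p_row in zip(y_true, y_pred):
        exact += int(t_row == p_row)
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            tp[j] += int(t == 1 and p == 1)
            fp[j] += int(t == 0 and p == 1)
            fn[j] += int(t == 1 and p == 0)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    return {
        # Micro: pool counts over all labels, then compute one F1.
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),
        # Macro: compute F1 per label, then take the unweighted mean.
        "macro_f1": sum(f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels,
        # Hamming loss: fraction of individual label decisions that are wrong.
        "hamming_loss": (sum(fp) + sum(fn)) / (n_samples * n_labels),
        # Subset accuracy: fraction of samples whose full label set is exact.
        "subset_accuracy": exact / n_samples,
    }


# Toy data: 4 texts, 3 labels (illustrative only).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

Subset accuracy is much stricter than the F1 scores because a single wrong label makes the whole sample count as incorrect, which is why it is typically the lowest number in a multi-label report.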
167
+ ### Per-Label Performance (Top Categories)
168
+
169
+ | Label | Precision | Recall | F1-Score | Support |
170
+ |-------|-----------|--------|----------|---------|
171
+ | Orçamento e Finanças | 0.88 | 0.85 | 0.86 | 145 |
172
+ | Obras Públicas | 0.84 | 0.81 | 0.82 | 132 |
173
+ | Recursos Humanos | 0.79 | 0.76 | 0.77 | 98 |
174
+ | Educação | 0.82 | 0.78 | 0.80 | 87 |
175
+ | Ambiente | 0.75 | 0.72 | 0.73 | 76 |
176
+
177
+ ### Ensemble Performance vs. Individual Models
178
+
179
+ | Model | Micro F1 | Macro F1 |
180
+ |-------|----------|----------|
181
+ | LogisticRegression | 0.76 | 0.68 |
182
+ | GradientBoosting #1 | 0.78 | 0.70 |
183
+ | GradientBoosting #2 | 0.79 | 0.71 |
184
+ | GradientBoosting #3 | 0.80 | 0.72 |
185
+ | **Adaptive Ensemble** | **0.82** | **0.74** |
186
+
187
+ ## Dataset
188
+
189
+ The model was trained on a curated dataset of Portuguese municipal council meeting minutes:
190
+
191
+ - **Documents**: 2,500+ meeting minutes
192
+ - **Time Period**: 2018-2024
193
+ - **Source**: Portuguese municipalities (anonymized)
194
+ - **Labels**: 25 topic categories
195
+ - **Annotation**: Multi-label (avg. 2.3 labels per document)
196
+ - **Split**: 60% train / 20% validation / 20% test
197
+
198
+ ### Label Distribution
199
+
200
+ Common topics include:
201
+ - Orçamento e Finanças (Budget & Finance)
202
+ - Obras Públicas (Public Works)
203
+ - Recursos Humanos (Human Resources)
204
+ - Educação (Education)
205
+ - Ambiente (Environment)
206
+ - Saúde (Health)
207
+ - Transportes (Transportation)
208
+ - Urbanismo (Urban Planning)
209
+
210
+ ## Training Details
211
+
212
+ ### Preprocessing
213
+ - Portuguese stopword removal
214
+ - Municipal entity recognition
215
+ - Legal term preservation
216
+ - N-gram extraction (1-3)
217
+
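Two of the preprocessing rules listed above (municipal entity recognition and legal term preservation) could look roughly like the following. This is a minimal sketch, not the `smart_preprocess` function from `app.py`: the entity list `MUNICIPAL_ENTITIES`, the regular expression, and the function names are hypothetical, and stopword removal and number standardization are left out.

```python
import re
import unicodedata

# Hypothetical entity list; the real one is not given in this README.
MUNICIPAL_ENTITIES = ["câmara municipal", "assembleia municipal", "junta de freguesia"]


def strip_accents(text):
    """Remove diacritics, e.g. 'câmara' -> 'camara'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Lowercase, then collapse legal references and known entities into single tokens."""
    text = unicodedata.normalize("NFC", text).lower()
    # "art. 5" or "artigo 5" -> "artigo_5"
    text = re.sub(r"\bart(?:igo|\.)?\s*(\d+)", r"artigo_\1", text)
    # "câmara municipal" -> "camara_municipal"
    for entity in MUNICIPAL_ENTITIES:
        text = text.replace(entity, strip_accents(entity).replace(" ", "_"))
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


example = preprocess("A Câmara Municipal aprovou o Art. 5 do regulamento.")
print(example)
```

Joining multi-word entities with underscores keeps them as single TF-IDF terms instead of letting the vectorizer split them into unrelated unigrams.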
218
+ ### Feature Engineering
219
+ - TF-IDF: 10,000 features with sublinear scaling
220
+ - BERTimbau: Mean-pooled embeddings (768 dims)
221
+ - Feature concatenation: 10,768 total dimensions
222
+
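The mean pooling and concatenation described above can be illustrated with NumPy. The sketch assumes token embeddings of shape `(batch, seq_len, 768)` and a 0/1 attention mask, as a BERT-style encoder would return; the random arrays stand in for real BERTimbau outputs and the zero matrix stands in for real TF-IDF vectors. The helper name `masked_mean_pool` is an assumption.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) tokens only.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Placeholder inputs (random values, not real model outputs).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 768))
attention_mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

bert_features = masked_mean_pool(token_embeddings, attention_mask)  # (2, 768)
tfidf_features = np.zeros((2, 10_000))                              # stand-in for TF-IDF
combined = np.hstack([tfidf_features, bert_features])               # (2, 10768)
print(combined.shape)
```

Masking before averaging matters for batched inputs: without it, padding positions would pull the pooled vector toward the padding embeddings.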
223
+ ### Model Training
224
+ - **Strategy**: One-vs-Rest multi-label classification
225
+ - **Class Balancing**: Inverse frequency weighting
226
+ - **Validation**: Stratified 5-fold cross-validation
227
+ - **Threshold Optimization**: Per-label F1-maximization
228
+ - **Active Learning**: Adaptive ensemble weights
229
+
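Per-label F1-maximization, as listed above, is commonly done with a simple grid search over candidate thresholds on validation data. The sketch below shows that idea for a single label; the grid (0.05 to 0.95 in steps of 0.05), the function names, and the toy validation data are assumptions, not the procedure actually used to produce `optimal_thresholds.npy`.

```python
def f1_binary(y_true, y_pred):
    """F1 for one label from 0/1 lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, proba, grid=None):
    """Return the (threshold, F1) pair that maximizes F1; ties keep the lowest threshold."""
    if grid is None:
        grid = [i / 100 for i in range(5, 100, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = [int(p >= t) for p in proba]
        score = f1_binary(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation data for one label (illustrative only).
y_true = [1, 1, 0, 0, 1, 0]
proba = [0.90, 0.42, 0.37, 0.10, 0.47, 0.20]
best_t, best_f1 = best_threshold(y_true, proba)
print(best_t, best_f1)
```

Running the same search independently for each label column yields one threshold per label, which is then applied at prediction time instead of a fixed 0.5.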
230
+ ### Hyperparameters
231
+
232
+ **LogisticRegression:**
233
+ ```python
234
+ {
235
+     'penalty': 'l2',
236
+     'C': 1.0,
237
+     'max_iter': 1000,
238
+     'class_weight': 'balanced'
239
+ }
240
+ ```
241
+
242
+ **GradientBoosting Models:**
243
+ ```python
244
+ # Model #1
245
+ {'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1}
246
+
247
+ # Model #2
248
+ {'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.05}
249
+
250
+ # Model #3
251
+ {'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.1}
252
+ ```
253
+
254
+ ## Limitations
255
+
256
+ - **Language Specificity**: Optimized for Portuguese; other languages not supported
257
+ - **Domain Focus**: Best performance on municipal/administrative texts
258
+ - **Label Set**: Fixed to 25 predefined categories (not extensible without retraining)
259
+ - **Context Length**: BERTimbau limited to 512 tokens (long documents are truncated)
260
+ - **Rare Topics**: Lower performance on infrequent labels (<20 training examples)
261
+ - **Ambiguous Cases**: May over-predict for texts with multiple overlapping themes
262
+
263
+ ## Model Card Contact
264
+
265
+ For questions, feedback, or collaboration:
266
+ - 📧 Email: [your-email@example.com]
267
+ - 🐛 Issues: [GitHub Issues](#)
268
+ - 💬 Discussions: [Hugging Face Discussions](#)
269
+
270
+ ## Citation
271
+
272
+ If you use this model in your research, please cite:
273
+
274
+ ```bibtex
275
+ @misc{municipal-topics-classifier,
276
+ author = {Your Name},
277
+ title = {Municipal Topics Classifier: Multi-Label Topic Classification for Portuguese Council Texts},
278
+ year = {2024},
279
+ publisher = {Hugging Face},
280
+ howpublished = {\url{https://huggingface.co/YOUR_USERNAME/municipal-topics-classifier}}
281
+ }
282
+ ```
283
+
284
+ ## License
285
+
286
+ This model is released under the **Attribution-NonCommercial-NoDerivatives 4.0 International** (CC BY-NC-ND 4.0).
287
+
288
+ - ✅ **Allowed**: Non-commercial use, redistribution with attribution
289
+ - ❌ **Not Allowed**: Commercial use, distribution of modified or derivative works
290
+
291
+ ## Acknowledgments
292
+
293
+ - **BERTimbau**: neuralmind/bert-base-portuguese-cased
294
+ - **Framework**: Hugging Face Transformers, Scikit-learn
295
+ - **Dataset**: Portuguese municipalities (anonymized)
296
+
297
+ ---