NeboTech committed on
Commit
f7d7d8a
·
verified ·
1 Parent(s): ca636de

Create README.md

Files changed (1)
  1. README.md +276 -0
README.md ADDED
@@ -0,0 +1,276 @@
---
tags:
- swahili
- classification
- multilabel
- roberta
- transformers
- onnx
- africa
- nlp
license: apache-2.0
language:
- sw
- swa
datasets:
- custom
metrics:
- f1_score
- precision
- recall
- hamming_loss
pipeline_tag: text-classification
task_categories:
- text-classification
task_ids:
- multi-label-classification
base_model:
- benjamin/roberta-base-wechsel-swahili
library_name: transformers
---

# Swahili Topic Classifier - Multi-label Classification

## Model Details

### Model Description
A multi-label text classification model fine-tuned from RoBERTa-base Wechsel Swahili to classify Swahili text into 8 predefined topics. The model can assign several topics to a single text and returns a confidence score for each topic.

- **Developed by**: NeboTech
- **Model type**: Transformer-based (RoBERTa)
- **Language(s)**: Swahili (Kiswahili)
- **License**: Apache 2.0
- **Finetuned from**: [RoBERTa-base Wechsel Swahili](https://huggingface.co/benjamin/roberta-base-wechsel-swahili)
- **Model version**: v2.0 (Multi-label Classification)

### Model Architecture
- **Base Model**: RoBERTa-base Wechsel Swahili
- **Task**: Multi-label Sequence Classification
- **Problem Type**: `multi_label_classification`
- **Number of Labels**: 8
- **Activation Function**: Sigmoid, applied independently to each label
- **Loss Function**: BCEWithLogitsLoss
- **Output Format**: Logits of shape `[batch_size, num_labels]`; applying sigmoid and a threshold yields binary label vectors of the same shape
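
The sigmoid and loss entries above can be illustrated with a small NumPy sketch. It is not the training code: the logits and targets are made-up values, and `bce_with_logits` is a hypothetical helper that mirrors what a binary cross-entropy on logits computes.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid; each label gets its own independent probability."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits."""
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

# Hypothetical logits for one text over the 8 labels (illustrative values only)
logits = np.array([[3.0, -4.0, 2.0, -5.0, -6.0, -3.0, -4.0, -5.0]])
targets = np.array([[1, 0, 1, 0, 0, 0, 0, 0]], dtype=float)

probs = sigmoid(logits)            # shape [1, 8], values in (0, 1)
loss = bce_with_logits(logits, targets)
```

Unlike softmax, the probabilities do not sum to 1, which is what allows several topics to be active for the same text.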

### Model Variants
- **v2.0** (Current): Multi-label classification - Returns multiple topics with confidence scores
- **v1.0** (Legacy): Single-label classification - Returns a single topic (available at `revision="v1.0-single-label"`)

## Intended Use

### Primary Use Cases
- **Content Classification**: Categorize Swahili text messages, reports, or documents
- **Case Management**: Automatically tag and route cases to appropriate departments
- **Content Moderation**: Identify topics requiring attention (e.g., health emergencies, violence)
- **Data Analytics**: Analyze trends and patterns in Swahili text data
- **Information Routing**: Direct messages to relevant stakeholders based on topics

### Out-of-Scope Uses
- **Not suitable for**: Languages other than Swahili
- **Not suitable for**: Very short text (< 5 words) or very long text (> 512 tokens)
- **Not suitable for**: Real-time critical decision making without human oversight
- **Not suitable for**: Medical diagnosis or legal advice
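
The length limits above could be enforced with a small guard before inference. This is a minimal sketch: `check_input` is a hypothetical helper, words are counted by whitespace splitting, and the token count is assumed to be supplied by the caller (for example from the tokenizer output).

```python
def check_input(text, token_count, min_words=5, max_tokens=512):
    """Return the reasons a text falls outside the documented input range."""
    problems = []
    if len(text.split()) < min_words:
        problems.append("too short")
    if token_count > max_tokens:
        problems.append("too long")
    return problems

# Token counts here are illustrative, not produced by the real tokenizer
ok = check_input("Nataka kujua dalili za COVID-19 na jinsi ya kujilinda", token_count=20)
short = check_input("Habari yako", token_count=4)
```

An empty list means the text is within the documented range; anything else is a candidate for manual handling.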

## Training Details

### Training Data
- **Dataset**: Custom Swahili text dataset
- **Language**: Swahili (Kiswahili)
- **Data Collection**: U-Report platform messages and related Swahili text
- **Preprocessing**: Text cleaning, normalization, and tokenization
- **Data Balance**: Dataset balanced across 8 topics

### Training Procedure
- **Training Type**: Fine-tuning from pre-trained RoBERTa-base Wechsel Swahili
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 4 per device, for an effective batch size of 16 per device with gradient accumulation
- **Epochs**: 3
- **Gradient Accumulation**: 4 steps
- **Weight Decay**: 0.01
- **Mixed Precision**: Enabled (FP16)
- **Early Stopping**: Enabled (patience=2)
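
The interaction between batch size and gradient accumulation can be worked out with simple arithmetic. The sketch below uses the values from this section; the device count and the dataset size are assumptions, since neither is stated in this card, and the step count is only approximate.

```python
import math

# Values taken from the hyperparameters in this section
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1  # assumption: the number of devices is not stated in this card

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

def approx_optimizer_steps(num_examples, epochs=3):
    """Approximate number of optimizer updates for a dataset of the given size."""
    return math.ceil(num_examples / effective_batch_size) * epochs

# Hypothetical dataset size, used only to illustrate the arithmetic
steps = approx_optimizer_steps(1600)
```

With these values each optimizer update sees 16 examples, so a hypothetical 1,600-example training set would need about 100 updates per epoch.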

### Training Hyperparameters
```yaml
learning_rate: 2e-5
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
num_train_epochs: 3
weight_decay: 0.01
warmup_steps: 0
max_grad_norm: 1.0
fp16: true
```

## Evaluation

### Testing Data, Factors & Metrics
- **Evaluation Dataset**: Held-out test set from the balanced dataset
- **Evaluation Metrics**:
  - **F1 Score (Micro)**: F1 computed from counts pooled across all labels
  - **F1 Score (Macro)**: Unweighted mean of per-label F1
  - **F1 Score (Samples)**: Mean of per-sample F1
  - **Precision (Micro/Macro)**: Fraction of predicted labels that are correct
  - **Recall (Micro/Macro)**: Fraction of true labels that are predicted
  - **Hamming Loss**: Fraction of individual label decisions that are wrong
  - **Subset Accuracy**: Fraction of samples whose predicted label set matches the true set exactly
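
To make these definitions concrete, here is a minimal NumPy sketch of the micro-averaged metrics, Hamming loss, and subset accuracy. It is not the evaluation script behind the reported numbers; `multilabel_metrics` is a hypothetical helper and the label matrices are toy data.

```python
import numpy as np

def multilabel_metrics(y_true, y_pred):
    """Micro precision/recall/F1, Hamming loss, and subset accuracy."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # labels correctly predicted as present
    fp = np.sum(~y_true & y_pred)   # labels predicted but not present
    fn = np.sum(y_true & ~y_pred)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {
        "precision_micro": float(precision),
        "recall_micro": float(recall),
        "f1_micro": float(f1),
        "hamming_loss": float(np.mean(y_true != y_pred)),
        "subset_accuracy": float(np.mean(np.all(y_true == y_pred, axis=1))),
    }

# Toy example: two texts, 8 labels; one label is missed in the first text
y_true = [[1, 0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0]]
y_pred = [[1, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0]]
metrics = multilabel_metrics(y_true, y_pred)
```

In the toy example a single missed label costs one of 16 label decisions (Hamming loss) but makes one of the two texts an inexact match (subset accuracy), which shows why the two metrics can differ widely.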

### Results
| Metric | Score |
|--------|-------|
| F1 Score (Micro) | 0.96 |
| F1 Score (Macro) | 0.96 |
| F1 Score (Samples) | 0.96 |
| Precision (Micro) | 0.96 |
| Recall (Micro) | 0.96 |
| Hamming Loss | 0.009054 |
| Subset Accuracy | 0.962 |
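
As a reading aid, the Hamming loss and subset accuracy in the table can be related with a little arithmetic. This sketch assumes both figures were measured on the same test set, which the card does not state explicitly.

```python
# Figures copied from the results table
hamming_loss = 0.009054
subset_accuracy = 0.962
num_labels = 8

# Average number of wrong label decisions per text
wrong_labels_per_text = hamming_loss * num_labels

# Share of texts with at least one wrong label
imperfect_share = 1 - subset_accuracy

# Average number of wrong labels among texts that are not an exact match
errors_per_imperfect_text = wrong_labels_per_text / imperfect_share
```

Under that assumption, roughly 3.8% of texts contain an error, and those texts have a little under two wrong labels each on average.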

## Model Performance Characteristics

### Strengths
- **Multi-label Capability**: Can identify multiple topics in a single text
- **Confidence Scores**: Provides probability scores for each topic
- **Swahili Language Support**: Specifically fine-tuned for Swahili text
- **Efficient Inference**: ONNX format available for fast CPU inference
- **Balanced Performance**: Trained on a dataset balanced across all topics

### Limitations
- **Language Specific**: Only works with Swahili text
- **Topic Coverage**: Limited to 8 predefined topics
- **Context Dependency**: Performance may vary with text length and context
- **Dialect Variations**: May not handle all Swahili dialects equally well
- **Threshold Sensitivity**: Predictions depend on the decision threshold, which needs tuning for the target use case
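
One common way to handle threshold sensitivity is to pick a separate threshold per label on a validation set. The sketch below is an assumption about how this could be done, not the procedure used for this model: `tune_thresholds` is a hypothetical helper that grid-searches the threshold maximizing per-label F1, and the data is toy data.

```python
import numpy as np

def f1(y_true, y_pred):
    """Binary F1 for one label; returns 0.0 when there is nothing to score."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def tune_thresholds(probs, y_true, grid=None):
    """For each label, return the lowest grid threshold with the best F1."""
    if grid is None:
        grid = np.linspace(0.1, 0.9, 17)  # 0.10, 0.15, ..., 0.90
    best = []
    for j in range(probs.shape[1]):
        scores = [f1(y_true[:, j], (probs[:, j] > t).astype(int)) for t in grid]
        best.append(float(grid[int(np.argmax(scores))]))
    return best

# Toy validation data: 4 texts, 2 labels (illustrative values only)
val_probs = np.array([[0.90, 0.80], [0.35, 0.62], [0.22, 0.72], [0.05, 0.32]])
val_true = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
thresholds = tune_thresholds(val_probs, val_true)
```

The tuned thresholds should be chosen on held-out validation data and then kept fixed, otherwise the reported test metrics become optimistic.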

### Known Biases
- **Training Data Bias**: Model reflects biases present in training data
- **Geographic Bias**: May perform better on texts from regions in training data
- **Topic Imbalance**: Some topics may have better representation in training data
- **Cultural Context**: May not capture all cultural nuances in Swahili communication

## How to Get Started with the Model

### Using Transformers (PyTorch)

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "NeboTech/swahili-text-classifier",
    problem_type="multi_label_classification"  # CRITICAL for multi-label
)
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Prepare input
text = "Nataka kujua dalili za COVID-19 na jinsi ya kujilinda"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [1, 8]

# Apply sigmoid for multi-label
probs = torch.sigmoid(logits)

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).float()

# Get applicable topics (indices into the label mapping)
applicable_topics = torch.where(predictions[0] == 1)[0].tolist()
print(f"Applicable topics: {applicable_topics}")
print(f"Probabilities: {probs[0].tolist()}")
```

### Using ONNX Runtime

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Load ONNX model (path to a local copy of the exported file)
session = ort.InferenceSession("swahili_classifier.onnx")

# Prepare input
text = "Nataka kujua dalili za COVID-19"
inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=256)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64)
    }
)

logits = outputs[0]  # Shape: [1, 8]

# Apply sigmoid
probs = 1 / (1 + np.exp(-logits))

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).astype(float)

# Get topics (indices into the label mapping)
applicable_topics = np.where(predictions[0] == 1)[0]
print(f"Applicable topics: {applicable_topics}")
```

## Topics (Label Mapping)

| ID | Topic | Description |
|----|-------|-------------|
| 0 | COVID | COVID-19 related topics, symptoms, prevention |
| 1 | EDUCATION | Educational content, school-related topics |
| 2 | HEALTH | General health topics, medical information |
| 3 | HIV/AIDS | HIV/AIDS related information and support |
| 4 | MENSTRUAL HYGIENE | Menstrual health and hygiene topics |
| 5 | NUTRITION | Nutrition, food, and dietary information |
| 6 | U-REPORT | U-Report platform related content |
| 7 | VIOLENCE AGAINST CHILDREN | Child protection and violence prevention |
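
The table can be turned into a lookup that converts model probabilities into topic names. This is a minimal sketch: `ID2LABEL` and `decode` are hypothetical names, the mapping assumes the output indices follow the ID column above, and the probabilities are made-up values.

```python
# Label mapping copied from the table above
ID2LABEL = {
    0: "COVID",
    1: "EDUCATION",
    2: "HEALTH",
    3: "HIV/AIDS",
    4: "MENSTRUAL HYGIENE",
    5: "NUTRITION",
    6: "U-REPORT",
    7: "VIOLENCE AGAINST CHILDREN",
}

def decode(probs, threshold=0.5):
    """Return (topic, probability) pairs for every label above the threshold."""
    return [(ID2LABEL[i], round(p, 3)) for i, p in enumerate(probs) if p > threshold]

# Hypothetical probabilities for one text (one value per label)
example_probs = [0.97, 0.02, 0.81, 0.01, 0.01, 0.04, 0.03, 0.01]
topics = decode(example_probs)
```

The input would typically be `probs[0].tolist()` from either inference example, so the same helper works for the PyTorch and ONNX paths.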

## Ethical Considerations

### Ethical Use
- **Human Oversight**: Always include human review for critical decisions
- **Privacy**: Respect user privacy when processing text data
- **Transparency**: Inform users when automated classification is used
- **Fairness**: Monitor for biased outcomes across different user groups

### Potential Risks
- **Misclassification**: Incorrect topic assignment could misroute important messages
- **False Positives/Negatives**: May miss urgent cases or flag non-urgent content
- **Privacy Concerns**: Processing sensitive health and personal information
- **Cultural Sensitivity**: May not fully capture cultural context and nuances

### Recommendations
- **Regular Monitoring**: Continuously monitor model performance in production
- **Human Review**: Implement human review for high-stakes classifications
- **Feedback Loop**: Collect and incorporate user feedback for improvements
- **Bias Auditing**: Regularly audit for biases and fairness issues
- **Threshold Tuning**: Adjust thresholds based on use case requirements

## Citation

```bibtex
@misc{swahili-topic-classifier-multilabel,
  title={Swahili Topic Classifier - Multi-label Classification},
  author={NeboTech},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NeboTech/swahili-text-classifier}},
  note={Version 2.0 - Multi-label Classification}
}
```

## Additional Information

### Model Files
- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights
- `tokenizer.json`: Tokenizer model
- `tokenizer_config.json`: Tokenizer configuration
- `vocab.json`, `merges.txt`: Vocabulary files
- `swahili_classifier.onnx`: ONNX model (separate repository)

### Version History
- **v2.0** (Current): Multi-label classification with sigmoid activation
- **v1.0** (Legacy): Single-label classification with softmax activation

### Contact
For questions, issues, or contributions