---
tags:
- swahili
- classification
- multilabel
- roberta
- transformers
- onnx
- africa
- nlp
license: apache-2.0
language:
- sw
- swa
datasets:
- custom
metrics:
- f1_score
- precision
- recall
- hamming_loss
pipeline_tag: text-classification
task_categories:
- text-classification
task_ids:
- multi-label-classification
base_model:
- benjamin/roberta-base-wechsel-swahili
library_name: transformers
---

# Swahili Topic Classifier - Multi-label Classification

## Model Details

### Model Description
A multi-label text classification model fine-tuned from RoBERTa-base Wechsel Swahili to classify Swahili text into 8 predefined topics. The model can identify multiple applicable topics in a given text, with a confidence score for each topic.

- **Developed by**: NeboTech
- **Model type**: Transformer-based (RoBERTa)
- **Language(s)**: Swahili (Kiswahili)
- **License**: Apache 2.0
- **Finetuned from**: [RoBERTa-base Wechsel Swahili](https://huggingface.co/benjamin/roberta-base-wechsel-swahili)
- **Model version**: v2.0 (Multi-label Classification)

### Model Architecture
- **Base Model**: RoBERTa-base Wechsel Swahili
- **Task**: Multi-label Sequence Classification
- **Problem Type**: `multi_label_classification`
- **Number of Labels**: 8
- **Activation Function**: Sigmoid (for multi-label)
- **Loss Function**: BCEWithLogitsLoss
- **Output Format**: Binary vectors [batch_size, num_labels]
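Because each label is scored independently with a sigmoid (rather than a softmax over all labels), the per-topic probabilities need not sum to 1 and several topics can clear the decision threshold at once. A minimal sketch of the difference, in plain Python with hypothetical logits:

```python
import math

def sigmoid(x):
    """Independent per-label probability, as used for multi-label output."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Mutually exclusive probabilities, as used for single-label output."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 2.0, -3.0]  # raw scores for three hypothetical labels

multi = [sigmoid(x) for x in logits]
single = softmax(logits)

# Sigmoid: both high-scoring labels can exceed a 0.5 threshold at once.
print([round(p, 3) for p in multi])   # [0.881, 0.881, 0.047]
print(round(sum(single), 3))          # softmax probabilities sum to 1.0
```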

### Model Variants
- **v2.0** (Current): Multi-label classification - Returns multiple topics with confidence scores
- **v1.0** (Legacy): Single-label classification - Returns single topic (available at `revision="v1.0-single-label"`)

## Intended Use

### Primary Use Cases
- **Content Classification**: Categorize Swahili text messages, reports, or documents
- **Case Management**: Automatically tag and route cases to appropriate departments
- **Content Moderation**: Identify topics requiring attention (e.g., health emergencies, violence)
- **Data Analytics**: Analyze trends and patterns in Swahili text data
- **Information Routing**: Direct messages to relevant stakeholders based on topics

### Out-of-Scope Uses
- **Not suitable for**: Languages other than Swahili
- **Not suitable for**: Very short text (< 5 words) or very long text (> 512 tokens)
- **Not suitable for**: Real-time critical decision making without human oversight
- **Not suitable for**: Medical diagnosis or legal advice
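A simple pre-check can filter out inputs outside the supported length range before inference. This is a sketch: the 5-word floor comes from the model card's guidance, and word count is used as a rough proxy for the 512-token limit, with `max_words` a conservative hypothetical bound.

```python
def is_supported_length(text, min_words=5, max_words=400):
    # Word count as a cheap proxy; the hard limit is 512 tokens after
    # tokenization, so max_words is a conservative hypothetical bound.
    n = len(text.split())
    return min_words <= n <= max_words

print(is_supported_length("Habari"))                           # False (too short)
print(is_supported_length("Nataka kujua dalili za COVID-19"))  # True (5 words)
```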

## Training Details

### Training Data
- **Dataset**: Custom Swahili text dataset
- **Language**: Swahili (Kiswahili)
- **Data Collection**: U-Report platform messages and related Swahili text
- **Preprocessing**: Text cleaning, normalization, and tokenization
- **Data Balance**: Dataset balanced across 8 topics

### Training Procedure
- **Training Type**: Fine-tuning from pre-trained RoBERTa-base Wechsel Swahili
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: Variable (with gradient accumulation)
- **Epochs**: 3
- **Gradient Accumulation**: 4 steps
- **Weight Decay**: 0.01
- **Mixed Precision**: Enabled (FP16)
- **Early Stopping**: Enabled (patience=2)

### Training Hyperparameters

```yaml
learning_rate: 2e-5
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
num_train_epochs: 3
weight_decay: 0.01
warmup_steps: 0
max_grad_norm: 1.0
fp16: true
```

## Evaluation

### Testing Data, Factors & Metrics
- **Evaluation Dataset**: Held-out test set from balanced dataset
- **Evaluation Metrics**:
  - **F1 Score (Micro)**: Aggregated across all labels
  - **F1 Score (Macro)**: Average per-label F1
  - **F1 Score (Samples)**: Average per-sample F1
  - **Precision (Micro/Macro)**: Classification precision
  - **Recall (Micro/Macro)**: Classification recall
  - **Hamming Loss**: Fraction of incorrectly predicted labels
  - **Subset Accuracy**: Exact match accuracy
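Hamming loss and subset accuracy are easy to confuse: the first counts individual label cells, the second counts whole samples. A small sketch of both on binary label matrices, in plain Python with toy data:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label cells that are predicted wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred)
                for t, p in zip(rt, rp))
    return wrong / total

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose entire label vector matches exactly."""
    exact = sum(rt == rp for rt, rp in zip(y_true, y_pred))
    return exact / len(y_true)

y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]

print(round(hamming_loss(y_true, y_pred), 3))  # 1 wrong cell of 6 -> 0.167
print(subset_accuracy(y_true, y_pred))         # 1 exact match of 2 -> 0.5
```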

### Results
| Metric | Score |
|--------|-------|
| F1 Score (Micro) | 0.96 |
| F1 Score (Macro) | 0.96 |
| F1 Score (Samples) | 0.96 |
| Precision (Micro) | 0.96 |
| Recall (Micro) | 0.96 |
| Hamming Loss | 0.009054 |
| Subset Accuracy | 0.962 |

## Model Performance Characteristics

### Strengths
- **Multi-label Capability**: Can identify multiple topics in a single text
- **Confidence Scores**: Provides probability scores for each topic
- **Swahili Language Support**: Specifically fine-tuned for Swahili text
- **Efficient Inference**: ONNX format available for fast CPU inference
- **Balanced Performance**: Trained on balanced dataset across all topics

### Limitations
- **Language Specific**: Only works with Swahili text
- **Topic Coverage**: Limited to 8 predefined topics
- **Context Dependency**: Performance may vary with text length and context
- **Dialect Variations**: May not handle all Swahili dialects equally well
- **Threshold Sensitivity**: Requires careful threshold tuning for optimal performance
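The threshold-sensitivity point above can be handled empirically: sweep candidate thresholds on a held-out validation set and keep the one that maximizes micro-F1. A minimal sketch in plain Python; `probs` and `y_true` below are hypothetical validation-set values:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over a binary multi-label matrix."""
    tp = fp = fn = 0
    for rt, rp in zip(y_true, y_pred):
        for t, p in zip(rt, rp):
            tp += (t == 1 and p == 1)
            fp += (t == 0 and p == 1)
            fn += (t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def best_threshold(probs, y_true, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Pick the candidate threshold with the highest micro-F1."""
    def binarize(th):
        return [[1 if p > th else 0 for p in row] for row in probs]
    return max(candidates, key=lambda th: micro_f1(y_true, binarize(th)))

# Hypothetical validation probabilities and gold labels.
probs = [[0.9, 0.35], [0.2, 0.8]]
y_true = [[1, 1], [0, 1]]
print(best_threshold(probs, y_true))  # 0.3
```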

### Known Biases
- **Training Data Bias**: Model reflects biases present in training data
- **Geographic Bias**: May perform better on texts from regions in training data
- **Topic Imbalance**: Some topics may have better representation in training data
- **Cultural Context**: May not capture all cultural nuances in Swahili communication

## How to Get Started with the Model

### Using Transformers (PyTorch)

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "NeboTech/swahili-text-classifier",
    problem_type="multi_label_classification"  # CRITICAL for multi-label
)
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Prepare input
text = "Nataka kujua dalili za COVID-19 na jinsi ya kujilinda"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [1, 8]

# Apply sigmoid for multi-label
probs = torch.sigmoid(logits)

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).float()

# Get applicable topics
applicable_topics = torch.where(predictions[0] == 1)[0].tolist()
print(f"Applicable topics: {applicable_topics}")
print(f"Probabilities: {probs[0].tolist()}")
```

### Using ONNX Runtime

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Load ONNX model
session = ort.InferenceSession("swahili_classifier.onnx")

# Prepare input
text = "Nataka kujua dalili za COVID-19"
inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=256)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64)
    }
)

logits = outputs[0]  # Shape: [1, 8]

# Apply sigmoid
probs = 1 / (1 + np.exp(-logits))

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).astype(float)

# Get topics
applicable_topics = np.where(predictions[0] == 1)[0]
print(f"Applicable topics: {applicable_topics}")
```

## Topics (Label Mapping)

| ID | Topic | Description |
|----|-------|-------------|
| 0 | COVID | COVID-19 related topics, symptoms, prevention |
| 1 | EDUCATION | Educational content, school-related topics |
| 2 | HEALTH | General health topics, medical information |
| 3 | HIV/AIDS | HIV/AIDS related information and support |
| 4 | MENSTRUAL HYGIENE | Menstrual health and hygiene topics |
| 5 | NUTRITION | Nutrition, food, and dietary information |
| 6 | U-REPORT | U-Report platform related content |
| 7 | VIOLENCE AGAINST CHILDREN | Child protection and violence prevention |
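The table above can be used to turn predicted label indices into human-readable topic names. A minimal helper; the `ID2TOPIC` dict is transcribed directly from the table:

```python
ID2TOPIC = {
    0: "COVID",
    1: "EDUCATION",
    2: "HEALTH",
    3: "HIV/AIDS",
    4: "MENSTRUAL HYGIENE",
    5: "NUTRITION",
    6: "U-REPORT",
    7: "VIOLENCE AGAINST CHILDREN",
}

def decode_topics(indices):
    """Map predicted label indices (e.g. from torch.where) to topic names."""
    return [ID2TOPIC[i] for i in indices]

print(decode_topics([0, 2]))  # ['COVID', 'HEALTH']
```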

## Ethical Considerations

### Ethical Use
- **Human Oversight**: Always include human review for critical decisions
- **Privacy**: Respect user privacy when processing text data
- **Transparency**: Inform users when automated classification is used
- **Fairness**: Monitor for biased outcomes across different user groups

### Potential Risks
- **Misclassification**: Incorrect topic assignment could misroute important messages
- **False Positives/Negatives**: May miss urgent cases or flag non-urgent content
- **Privacy Concerns**: Processing sensitive health and personal information
- **Cultural Sensitivity**: May not fully capture cultural context and nuances

### Recommendations
- **Regular Monitoring**: Continuously monitor model performance in production
- **Human Review**: Implement human review for high-stakes classifications
- **Feedback Loop**: Collect and incorporate user feedback for improvements
- **Bias Auditing**: Regularly audit for biases and fairness issues
- **Threshold Tuning**: Adjust thresholds based on use case requirements

## Citation

```bibtex
@misc{swahili-topic-classifier-multilabel,
  title={Swahili Topic Classifier - Multi-label Classification},
  author={NeboTech},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NeboTech/swahili-text-classifier}},
  note={Version 2.0 - Multi-label Classification}
}
```

## Additional Information

### Model Files
- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights
- `tokenizer.json`: Tokenizer model
- `tokenizer_config.json`: Tokenizer configuration
- `vocab.json`, `merges.txt`: Vocabulary files
- `swahili_classifier.onnx`: ONNX model (separate repository)

### Version History
- **v2.0** (Current): Multi-label classification with sigmoid activation
- **v1.0** (Legacy): Single-label classification with softmax activation

### Contact
For questions, issues, or contributions, please open a discussion on the model repository.