---
tags:
- swahili
- classification
- multilabel
- roberta
- transformers
- onnx
- africa
- nlp
license: apache-2.0
language:
- sw
- swa
datasets:
- custom
metrics:
- f1_score
- precision
- recall
- hamming_loss
pipeline_tag: text-classification
task_categories:
- text-classification
task_ids:
- multi-label-classification
base_model:
- benjamin/roberta-base-wechsel-swahili
library_name: transformers
---
# Swahili Topic Classifier - Multi-label Classification
## Model Details
### Model Description
A multi-label text classification model fine-tuned from RoBERTa-base Wechsel Swahili to classify Swahili text into 8 predefined topics. The model can assign several topics to a single text and returns a confidence score for each topic.
- **Developed by**: NeboTech
- **Model type**: Transformer-based (RoBERTa)
- **Language(s)**: Swahili (Kiswahili)
- **License**: Apache 2.0
- **Finetuned from**: [RoBERTa-base Wechsel Swahili](https://huggingface.co/benjamin/roberta-base-wechsel-swahili)
- **Model version**: v2.0 (Multi-label Classification)
### Model Architecture
- **Base Model**: RoBERTa-base Wechsel Swahili
- **Task**: Multi-label Sequence Classification
- **Problem Type**: `multi_label_classification`
- **Number of Labels**: 8
- **Activation Function**: Sigmoid (for multi-label)
- **Loss Function**: BCEWithLogitsLoss
- **Output Format**: Logits of shape [batch_size, num_labels]; applying sigmoid and a threshold yields one binary label vector per text
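
The sigmoid activation and BCEWithLogitsLoss listed above treat each of the 8 labels as an independent yes/no decision. The following is a minimal pure-Python sketch of that computation for one text, not the training code itself; the logit values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Map a logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits: list[float], targets: list[float]) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# Hypothetical logits for one text over the 8 topics (illustrative values only).
logits = [2.0, -3.0, 1.5, -4.0, -5.0, -2.5, -3.5, -6.0]
# Multi-hot target: labels 0 and 2 both apply to this text.
targets = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

probs = [sigmoid(x) for x in logits]
predicted = [int(p > 0.5) for p in probs]
loss = bce_with_logits(logits, targets)

print(predicted)
print(round(loss, 4))
```

Because every label has its own sigmoid, the probabilities do not sum to 1, which is what lets several topics exceed the threshold at once.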
### Model Variants
- **v2.0** (Current): Multi-label classification - Returns multiple topics with confidence scores
- **v1.0** (Legacy): Single-label classification - Returns single topic (available at `revision="v1.0-single-label"`)
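
Selecting a variant could look roughly like the sketch below. It assumes that v2.0 lives on the repository's default branch (not stated in this card) and that v1.0 is reachable through the `revision` value quoted above. The helper names `variant_kwargs` and `load_variant` are hypothetical.

```python
MODEL_ID = "NeboTech/swahili-text-classifier"

def variant_kwargs(version: str) -> dict:
    """Return keyword arguments for from_pretrained for a given variant."""
    if version == "v2.0":
        # Assumption: the multi-label model is on the default branch.
        return {"problem_type": "multi_label_classification"}
    if version == "v1.0":
        # Revision name taken from the Model Variants list.
        return {"revision": "v1.0-single-label"}
    raise ValueError(f"Unknown model version: {version}")

def load_variant(version: str):
    """Load the requested variant (requires transformers and network access)."""
    # Imported lazily so variant_kwargs can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification
    return AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, **variant_kwargs(version)
    )

print(variant_kwargs("v1.0"))
```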
## Intended Use
### Primary Use Cases
- **Content Classification**: Categorize Swahili text messages, reports, or documents
- **Case Management**: Automatically tag and route cases to appropriate departments
- **Content Moderation**: Identify topics requiring attention (e.g., health emergencies, violence)
- **Data Analytics**: Analyze trends and patterns in Swahili text data
- **Information Routing**: Direct messages to relevant stakeholders based on topics
### Out-of-Scope Uses
- **Not suitable for**: Languages other than Swahili
- **Not suitable for**: Very short text (< 5 words) or very long text (> 512 tokens)
- **Not suitable for**: Real-time critical decision making without human oversight
- **Not suitable for**: Medical diagnosis or legal advice
## Training Details
### Training Data
- **Dataset**: Custom Swahili text dataset
- **Language**: Swahili (Kiswahili)
- **Data Collection**: U-Report platform messages and related Swahili text
- **Preprocessing**: Text cleaning, normalization, and tokenization
- **Data Balance**: Dataset balanced across 8 topics
### Training Procedure
- **Training Type**: Fine-tuning from pre-trained RoBERTa-base Wechsel Swahili
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 4 per device in the listed configuration (effective batch size of 16 per device with 4 gradient accumulation steps)
- **Epochs**: 3
- **Gradient Accumulation**: 4 steps
- **Weight Decay**: 0.01
- **Mixed Precision**: Enabled (FP16)
- **Early Stopping**: Enabled (patience=2)
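
The per-device batch size of 4 from the hyperparameters and the 4 gradient accumulation steps combine into the batch size that each optimizer update actually sees. A minimal sketch of that arithmetic follows; `num_devices` and `num_train_examples` are assumptions for illustration, not values reported for this model.

```python
import math

# Values taken from the training configuration in this card.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 3

# Assumptions for illustration only (not reported in this card).
num_devices = 1
num_train_examples = 8000

# Examples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Approximate number of optimizer updates, ignoring early stopping.
updates_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, updates_per_epoch, total_updates)
```

With early stopping enabled, the actual number of updates can be lower than this upper bound.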
### Training Hyperparameters
```yaml
learning_rate: 2e-5
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
num_train_epochs: 3
weight_decay: 0.01
warmup_steps: 0
max_grad_norm: 1.0
fp16: true
```
## Evaluation
### Testing Data, Factors & Metrics
- **Evaluation Dataset**: Held-out test set from the balanced dataset
- **Evaluation Metrics**:
  - **F1 Score (Micro)**: Aggregated across all labels
  - **F1 Score (Macro)**: Average per-label F1
  - **F1 Score (Samples)**: Average per-sample F1
  - **Precision (Micro/Macro)**: Classification precision
  - **Recall (Micro/Macro)**: Classification recall
  - **Hamming Loss**: Fraction of incorrectly predicted labels
  - **Subset Accuracy**: Exact match accuracy
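
To make the difference between these metrics concrete, here is a small pure-Python sketch that computes them on a toy example with 4 texts and 3 labels. The data is invented for illustration and has nothing to do with the reported results; the function names are hypothetical.

```python
def prf_counts(true_bits, pred_bits):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 0)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 from counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def multilabel_metrics(y_true, y_pred):
    n_samples, n_labels = len(y_true), len(y_true[0])
    # One (tp, fp, fn) triple per label column and per sample row.
    per_label = [
        prf_counts([row[j] for row in y_true], [row[j] for row in y_pred])
        for j in range(n_labels)
    ]
    per_sample = [prf_counts(t, p) for t, p in zip(y_true, y_pred)]
    # Micro-averaging pools the counts of all labels before scoring.
    tp, fp, fn = (sum(c[k] for c in per_label) for k in range(3))
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    exact = sum(rt == rp for rt, rp in zip(y_true, y_pred))
    return {
        "f1_micro": f1_from_counts(tp, fp, fn),
        "f1_macro": sum(f1_from_counts(*c) for c in per_label) / n_labels,
        "f1_samples": sum(f1_from_counts(*c) for c in per_sample) / n_samples,
        "precision_micro": tp / (tp + fp) if tp + fp else 0.0,
        "recall_micro": tp / (tp + fn) if tp + fn else 0.0,
        "hamming_loss": wrong / (n_samples * n_labels),
        "subset_accuracy": exact / n_samples,
    }

# Toy data: rows are texts, columns are labels (illustrative only).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
metrics = multilabel_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Subset accuracy is the strictest of these: a text counts as correct only if all of its labels match, so it can be much lower than Hamming-based scores on the same predictions.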
### Results
| Metric | Score |
|--------|-------|
| F1 Score (Micro) | 0.96 |
| F1 Score (Macro) | 0.96 |
| F1 Score (Samples) | 0.96 |
| Precision (Micro) | 0.96 |
| Recall (Micro) | 0.96 |
| Hamming Loss | 0.009054 |
| Subset Accuracy | 0.962 |
## Model Performance Characteristics
### Strengths
- **Multi-label Capability**: Can identify multiple topics in a single text
- **Confidence Scores**: Provides probability scores for each topic
- **Swahili Language Support**: Specifically fine-tuned for Swahili text
- **Efficient Inference**: ONNX format available for fast CPU inference
- **Balanced Performance**: Trained on a dataset balanced across all topics
### Limitations
- **Language Specific**: Only works with Swahili text
- **Topic Coverage**: Limited to 8 predefined topics
- **Context Dependency**: Performance may vary with text length and context
- **Dialect Variations**: May not handle all Swahili dialects equally well
- **Threshold Sensitivity**: Requires careful threshold tuning for optimal performance
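
One common way to handle threshold sensitivity is to pick a threshold per label on held-out validation data instead of using 0.5 everywhere. The sketch below shows that idea for a single label in pure Python. It is an assumption about how tuning could be done, not the procedure used for this model, and the validation probabilities are invented.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 for one label when predicting positive if prob > threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = p > threshold
        if pred and y == 1:
            tp += 1
        elif pred and y == 0:
            fp += 1
        elif not pred and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(probs, labels, candidates):
    """Return the first candidate threshold with the highest F1, and that F1."""
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        score = f1_at_threshold(probs, labels, t)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Hypothetical validation probabilities and gold labels for ONE topic.
val_probs = [0.95, 0.42, 0.37, 0.12, 0.62, 0.05]
val_labels = [1, 1, 1, 0, 0, 0]
candidates = [k / 10 for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9

default_f1 = f1_at_threshold(val_probs, val_labels, 0.5)
best_t, best_f1 = tune_threshold(val_probs, val_labels, candidates)
print(f"F1 at 0.5: {default_f1:.3f}; best threshold {best_t} gives F1 {best_f1:.3f}")
```

Repeating this for each of the 8 labels gives a threshold vector. Tuning on the test set itself would inflate the reported scores, so a separate validation split is the safer choice.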
### Known Biases
- **Training Data Bias**: Model reflects biases present in training data
- **Geographic Bias**: May perform better on texts from regions in training data
- **Topic Imbalance**: Some topics may have better representation in training data
- **Cultural Context**: May not capture all cultural nuances in Swahili communication
## How to Get Started with the Model
### Using Transformers (PyTorch)
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "NeboTech/swahili-text-classifier",
    problem_type="multi_label_classification"  # CRITICAL for multi-label
)
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Prepare input
text = "Nataka kujua dalili za COVID-19 na jinsi ya kujilinda"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [1, 8]

# Apply sigmoid for multi-label
probs = torch.sigmoid(logits)

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).float()

# Get applicable topics
applicable_topics = torch.where(predictions[0] == 1)[0].tolist()
print(f"Applicable topics: {applicable_topics}")
print(f"Probabilities: {probs[0].tolist()}")
```
### Using ONNX Runtime
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Load ONNX model
session = ort.InferenceSession("swahili_classifier.onnx")

# Prepare input
text = "Nataka kujua dalili za COVID-19"
inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=256)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64)
    }
)
logits = outputs[0]  # Shape: [1, 8]

# Apply sigmoid
probs = 1 / (1 + np.exp(-logits))

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).astype(float)

# Get topics
applicable_topics = np.where(predictions[0] == 1)[0]
print(f"Applicable topics: {applicable_topics}")
```
## Topics (Label Mapping)
| ID | Topic | Description |
|----|-------|-------------|
| 0 | COVID | COVID-19 related topics, symptoms, prevention |
| 1 | EDUCATION | Educational content, school-related topics |
| 2 | HEALTH | General health topics, medical information |
| 3 | HIV/AIDS | HIV/AIDS related information and support |
| 4 | MENSTRUAL HYGIENE | Menstrual health and hygiene topics |
| 5 | NUTRITION | Nutrition, food, and dietary information |
| 6 | U-REPORT | U-Report platform related content |
| 7 | VIOLENCE AGAINST CHILDREN | Child protection and violence prevention |
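
The label IDs in the table can be turned back into topic names with a small lookup. The sketch below hard-codes the mapping from the table; the helper name `decode_topics` and the example probabilities are hypothetical. If the checkpoint's `config.json` carries an `id2label` mapping, reading it from `model.config.id2label` is likely the safer source.

```python
# Mapping copied from the label table in this card.
ID2LABEL = {
    0: "COVID",
    1: "EDUCATION",
    2: "HEALTH",
    3: "HIV/AIDS",
    4: "MENSTRUAL HYGIENE",
    5: "NUTRITION",
    6: "U-REPORT",
    7: "VIOLENCE AGAINST CHILDREN",
}

def decode_topics(probs, threshold=0.5):
    """Return (topic, probability) pairs above threshold, most confident first."""
    hits = [(ID2LABEL[i], p) for i, p in enumerate(probs) if p > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Hypothetical sigmoid outputs for one text (illustrative values only).
example_probs = [0.81, 0.03, 0.97, 0.02, 0.01, 0.04, 0.10, 0.01]
topics = decode_topics(example_probs)
print(topics)
```

The input can be `probs[0].tolist()` from either inference example above.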
## Ethical Considerations
### Ethical Use
- **Human Oversight**: Always include human review for critical decisions
- **Privacy**: Respect user privacy when processing text data
- **Transparency**: Inform users when automated classification is used
- **Fairness**: Monitor for biased outcomes across different user groups
### Potential Risks
- **Misclassification**: Incorrect topic assignment could misroute important messages
- **False Positives/Negatives**: May miss urgent cases or flag non-urgent content
- **Privacy Concerns**: Processing sensitive health and personal information
- **Cultural Sensitivity**: May not fully capture cultural context and nuances
### Recommendations
- **Regular Monitoring**: Continuously monitor model performance in production
- **Human Review**: Implement human review for high-stakes classifications
- **Feedback Loop**: Collect and incorporate user feedback for improvements
- **Bias Auditing**: Regularly audit for biases and fairness issues
- **Threshold Tuning**: Adjust thresholds based on use case requirements
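
One way to combine threshold tuning with human review is to send borderline scores to a reviewer instead of deciding automatically. The sketch below illustrates that idea. The function name `triage`, the `margin` value, and the example probabilities are all hypothetical and would need to be chosen per use case.

```python
def triage(probs, threshold=0.5, margin=0.15):
    """Split label IDs into confident assignments and borderline ones.

    A label whose probability lies within `margin` of the threshold is
    queued for human review; otherwise it is accepted if above threshold.
    """
    accepted, review = [], []
    for label_id, p in enumerate(probs):
        if abs(p - threshold) < margin:
            review.append(label_id)
        elif p > threshold:
            accepted.append(label_id)
    return accepted, review

# Hypothetical sigmoid outputs for one text (illustrative values only).
example_probs = [0.97, 0.03, 0.58, 0.02, 0.01, 0.04, 0.10, 0.41]
accepted, review = triage(example_probs)
print(f"Auto-assigned label IDs: {accepted}")
print(f"Label IDs for human review: {review}")
```

A wider margin sends more cases to reviewers, which may be appropriate for sensitive topics such as violence against children.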
## Citation
```bibtex
@misc{swahili-topic-classifier-multilabel,
  title={Swahili Topic Classifier - Multi-label Classification},
  author={NeboTech},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NeboTech/swahili-text-classifier}},
  note={Version 2.0 - Multi-label Classification}
}
```
## Additional Information
### Model Files
- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights
- `tokenizer.json`: Tokenizer model
- `tokenizer_config.json`: Tokenizer configuration
- `vocab.json`, `merges.txt`: Vocabulary files
- `swahili_classifier.onnx`: ONNX model (separate repository)
### Version History
- **v2.0** (Current): Multi-label classification with sigmoid activation
- **v1.0** (Legacy): Single-label classification with softmax activation
### Contact
For questions, issues, or contributions