Swahili Topic Classifier - Multi-label Classification
Model Details
Model Description
A multi-label text classification model fine-tuned on RoBERTa-base Wechsel Swahili for classifying Swahili text into 8 predefined topics. The model can identify multiple applicable topics for a given text, providing confidence scores for each topic.
- Developed by: NeboTech
- Model type: Transformer-based (RoBERTa)
- Language(s): Swahili (Kiswahili)
- License: Apache 2.0
- Finetuned from: benjamin/roberta-base-wechsel-swahili (RoBERTa-base Wechsel Swahili)
- Model version: v2.0 (Multi-label Classification)
Model Architecture
- Base Model: RoBERTa-base Wechsel Swahili
- Task: Multi-label Sequence Classification
- Problem Type: multi_label_classification
- Number of Labels: 8
- Activation Function: Sigmoid (for multi-label)
- Loss Function: BCEWithLogitsLoss
- Output Format: Binary vectors [batch_size, num_labels]
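The sigmoid-plus-threshold decoding implied by this architecture can be sketched in plain NumPy (the logit values are illustrative, not real model outputs):

```python
import numpy as np

# Hypothetical logits for one text over the 8 topics (illustrative values)
logits = np.array([2.0, -1.0, 0.5, -3.0, 0.0, 1.5, -0.5, -2.0])

# Sigmoid turns each logit into an independent probability (multi-label),
# unlike softmax, which would force the 8 scores to sum to 1
probs = 1 / (1 + np.exp(-logits))

# Thresholding yields the binary output vector described above
preds = (probs > 0.5).astype(int)
```

Because each label is scored independently, zero, one, or several topics can be active for the same text.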
Model Variants
- v2.0 (Current): Multi-label classification - Returns multiple topics with confidence scores
- v1.0 (Legacy): Single-label classification - Returns a single topic (available at revision="v1.0-single-label")
Intended Use
Primary Use Cases
- Content Classification: Categorize Swahili text messages, reports, or documents
- Case Management: Automatically tag and route cases to appropriate departments
- Content Moderation: Identify topics requiring attention (e.g., health emergencies, violence)
- Data Analytics: Analyze trends and patterns in Swahili text data
- Information Routing: Direct messages to relevant stakeholders based on topics
Out-of-Scope Uses
- Not suitable for: Languages other than Swahili
- Not suitable for: Very short text (< 5 words) or very long text (> 512 tokens)
- Not suitable for: Real-time critical decision making without human oversight
- Not suitable for: Medical diagnosis or legal advice
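A cheap pre-filter for the length limits above might look like the following sketch. The 2-tokens-per-word upper-bound heuristic is an assumption; a production check should count tokens with the model's actual tokenizer.

```python
def is_in_scope(text: str, min_words: int = 5, max_tokens: int = 512) -> bool:
    """Rough guard matching the stated limits: reject very short texts
    (< 5 words) and texts likely to exceed the 512-token limit.

    The 2-tokens-per-word estimate is a heuristic, not the real tokenizer.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    if len(words) * 2 > max_tokens:
        return False
    return True
```

Texts rejected by the guard can be routed straight to human review instead of the classifier.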
Training Details
Training Data
- Dataset: Custom Swahili text dataset
- Language: Swahili (Kiswahili)
- Data Collection: U-Report platform messages and related Swahili text
- Preprocessing: Text cleaning, normalization, and tokenization
- Data Balance: Dataset balanced across 8 topics
Training Procedure
- Training Type: Fine-tuning from pre-trained RoBERTa-base Wechsel Swahili
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: Variable (with gradient accumulation)
- Epochs: 3
- Gradient Accumulation: 4 steps
- Weight Decay: 0.01
- Mixed Precision: Enabled (FP16)
- Early Stopping: Enabled (patience=2)
Training Hyperparameters

```yaml
learning_rate: 2e-5
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
num_train_epochs: 3
weight_decay: 0.01
warmup_steps: 0
max_grad_norm: 1.0
fp16: true
```

Evaluation
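Note how "Batch Size: Variable (with gradient accumulation)" resolves: with 4 samples per device and 4 accumulation steps, the effective batch size is 16. A minimal sketch of the derivation:

```python
# Hyperparameters from the table above, as a plain dict
hparams = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "warmup_steps": 0,
    "max_grad_norm": 1.0,
    "fp16": True,
}

# Gradients are accumulated over 4 micro-batches before each optimizer
# step, so the effective batch size is 4 x 4 = 16
effective_batch = (hparams["per_device_train_batch_size"]
                   * hparams["gradient_accumulation_steps"])
```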
Testing Data, Factors & Metrics
- Evaluation Dataset: Held-out test set from balanced dataset
- Evaluation Metrics:
- F1 Score (Micro): Aggregated across all labels
- F1 Score (Macro): Average per-label F1
- F1 Score (Samples): Average per-sample F1
- Precision (Micro/Macro): Classification precision
- Recall (Micro/Macro): Classification recall
- Hamming Loss: Fraction of incorrectly predicted labels
- Subset Accuracy: Exact match accuracy
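All of these metrics are available in scikit-learn; a toy sketch on 4 samples x 8 labels (the arrays are illustrative, not model outputs):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, accuracy_score

# Toy ground truth: 4 samples x 8 labels, one or two active topics each
y_true = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
])
y_pred = y_true.copy()
y_pred[1, 2] = 0  # one missed label out of 32 cells

f1_micro = f1_score(y_true, y_pred, average="micro")  # global TP/FP/FN
hl = hamming_loss(y_true, y_pred)             # 1 wrong cell / 32 = 0.03125
subset_acc = accuracy_score(y_true, y_pred)   # exact-match rows: 3/4 = 0.75
```

`accuracy_score` on multi-label arrays is exactly the subset (exact-match) accuracy reported below; `average="macro"` and `average="samples"` give the other two F1 variants.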
Results
| Metric | Score |
|---|---|
| F1 Score (Micro) | 0.96 |
| F1 Score (Macro) | 0.96 |
| F1 Score (Samples) | 0.96 |
| Precision (Micro) | 0.96 |
| Recall (Micro) | 0.96 |
| Hamming Loss | 0.009054 |
| Subset Accuracy | 0.962 |
Model Performance Characteristics
Strengths
- Multi-label Capability: Can identify multiple topics in a single text
- Confidence Scores: Provides probability scores for each topic
- Swahili Language Support: Specifically fine-tuned for Swahili text
- Efficient Inference: ONNX format available for fast CPU inference
- Balanced Performance: Trained on balanced dataset across all topics
Limitations
- Language Specific: Only works with Swahili text
- Topic Coverage: Limited to 8 predefined topics
- Context Dependency: Performance may vary with text length and context
- Dialect Variations: May not handle all Swahili dialects equally well
- Threshold Sensitivity: Requires careful threshold tuning for optimal performance
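The threshold-sensitivity point can be addressed with per-label tuning on a validation set. A minimal sketch (the grid and the F1 criterion are one reasonable choice, not the method used to produce the results above):

```python
import numpy as np

def tune_thresholds(probs, y_true, grid=np.arange(0.1, 0.9, 0.05)):
    """Pick a per-label threshold maximizing F1 on validation data.

    probs:  [n_samples, n_labels] sigmoid outputs
    y_true: [n_samples, n_labels] binary ground truth
    """
    n_labels = probs.shape[1]
    best = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best_f1 = -1.0
        for t in grid:
            pred = (probs[:, j] > t).astype(int)
            tp = int(((pred == 1) & (y_true[:, j] == 1)).sum())
            fp = int(((pred == 1) & (y_true[:, j] == 0)).sum())
            fn = int(((pred == 0) & (y_true[:, j] == 1)).sum())
            f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
            if f1 > best_f1:
                best_f1, best[j] = f1, t
    return best
```

Lower thresholds favor recall (fewer missed urgent cases), higher ones favor precision; safety-critical labels such as VIOLENCE AGAINST CHILDREN may warrant a deliberately low threshold plus human review.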
Known Biases
- Training Data Bias: Model reflects biases present in training data
- Geographic Bias: May perform better on texts from regions in training data
- Topic Imbalance: Some topics may have better representation in training data
- Cultural Context: May not capture all cultural nuances in Swahili communication
How to Get Started with the Model
Using Transformers (PyTorch)
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "NeboTech/swahili-text-classifier",
    problem_type="multi_label_classification",  # CRITICAL for multi-label
)
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Prepare input
text = "Nataka kujua dalili za COVID-19 na jinsi ya kujilinda"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [1, 8]

# Apply sigmoid for multi-label
probs = torch.sigmoid(logits)

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).float()

# Get applicable topics
applicable_topics = torch.where(predictions[0] == 1)[0].tolist()
print(f"Applicable topics: {applicable_topics}")
print(f"Probabilities: {probs[0].tolist()}")
```

Using ONNX Runtime
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Load ONNX model
session = ort.InferenceSession("swahili_classifier.onnx")

# Prepare input
text = "Nataka kujua dalili za COVID-19"
inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=256)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
    },
)
logits = outputs[0]  # Shape: [1, 8]

# Apply sigmoid
probs = 1 / (1 + np.exp(-logits))

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).astype(float)

# Get topics
applicable_topics = np.where(predictions[0] == 1)[0]
print(f"Applicable topics: {applicable_topics}")
```

Topics (Label Mapping)
| ID | Topic | Description |
|---|---|---|
| 0 | COVID | COVID-19 related topics, symptoms, prevention |
| 1 | EDUCATION | Educational content, school-related topics |
| 2 | HEALTH | General health topics, medical information |
| 3 | HIV/AIDS | HIV/AIDS related information and support |
| 4 | MENSTRUAL HYGIENE | Menstrual health and hygiene topics |
| 5 | NUTRITION | Nutrition, food, and dietary information |
| 6 | U-REPORT | U-Report platform related content |
| 7 | VIOLENCE AGAINST CHILDREN | Child protection and violence prevention |
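The table above maps directly to an id-to-label dictionary for turning binary prediction vectors into readable topic names:

```python
# Mapping from output indices to the topic names in the table above
ID2LABEL = {
    0: "COVID",
    1: "EDUCATION",
    2: "HEALTH",
    3: "HIV/AIDS",
    4: "MENSTRUAL HYGIENE",
    5: "NUTRITION",
    6: "U-REPORT",
    7: "VIOLENCE AGAINST CHILDREN",
}

def decode(binary_preds):
    """Turn a binary prediction vector into a list of topic names."""
    return [ID2LABEL[i] for i, v in enumerate(binary_preds) if v]
```

For example, a prediction vector with positions 0 and 2 set decodes to COVID and HEALTH.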
Ethical Considerations
Ethical Use
- Human Oversight: Always include human review for critical decisions
- Privacy: Respect user privacy when processing text data
- Transparency: Inform users when automated classification is used
- Fairness: Monitor for biased outcomes across different user groups
Potential Risks
- Misclassification: Incorrect topic assignment could misroute important messages
- False Positives/Negatives: May miss urgent cases or flag non-urgent content
- Privacy Concerns: Processing sensitive health and personal information
- Cultural Sensitivity: May not fully capture cultural context and nuances
Recommendations
- Regular Monitoring: Continuously monitor model performance in production
- Human Review: Implement human review for high-stakes classifications
- Feedback Loop: Collect and incorporate user feedback for improvements
- Bias Auditing: Regularly audit for biases and fairness issues
- Threshold Tuning: Adjust thresholds based on use case requirements
Citation
```bibtex
@misc{swahili-topic-classifier-multilabel,
  title={Swahili Topic Classifier - Multi-label Classification},
  author={NeboTech},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NeboTech/swahili-text-classifier}},
  note={Version 2.0 - Multi-label Classification}
}
```

Additional Information
Model Files
- config.json: Model configuration
- pytorch_model.bin or model.safetensors: Model weights
- tokenizer.json: Tokenizer model
- tokenizer_config.json: Tokenizer configuration
- vocab.json, merges.txt: Vocabulary files
- swahili_classifier.onnx: ONNX model (separate repository)
Version History
- v2.0 (Current): Multi-label classification with sigmoid activation
- v1.0 (Legacy): Single-label classification with softmax activation
Contact
For questions, issues, or contributions, please open a discussion on the model's Hugging Face repository.