+---
+tags:
+- swahili
+- classification
+- multilabel
+- roberta
+- transformers
+- onnx
+- africa
+- nlp
+license: apache-2.0
+language:
+- sw
+- swa
+datasets:
+- custom
+metrics:
+- f1_score
+- precision
+- recall
+- hamming_loss
+pipeline_tag: text-classification
+task_categories:
+- text-classification
+task_ids:
+- multi-label-classification
+base_model:
+- benjamin/roberta-base-wechsel-swahili
+library_name: transformers
+---
+# Swahili Topic Classifier - Multi-label Classification
+## Model Details
+### Model Description
+A multi-label text classification model fine-tuned from RoBERTa-base Wechsel Swahili to classify Swahili text into 8 predefined topics. It can assign several topics to a single text and returns a confidence score for each topic.
+- **Developed by**: NeboTech
+- **Model type**: Transformer-based (RoBERTa)
+- **Language(s)**: Swahili (Kiswahili)
+- **License**: Apache 2.0
+- **Finetuned from**: [RoBERTa-base Wechsel Swahili](https://huggingface.co/benjamin/roberta-base-wechsel-swahili)
+- **Model version**: v2.0 (Multi-label Classification)
+### Model Architecture
+- **Base Model**: RoBERTa-base Wechsel Swahili
+- **Task**: Multi-label Sequence Classification
+- **Problem Type**: `multi_label_classification`
+- **Number of Labels**: 8
+- **Activation Function**: Sigmoid (for multi-label)
+- **Loss Function**: BCEWithLogitsLoss
+- **Output Format**: Logits of shape `[batch_size, num_labels]`; applying sigmoid and a threshold yields one binary vector per input
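The relationship between the sigmoid activation and the BCE-with-logits loss listed above can be illustrated with a small standard-library sketch. The logit and target values are made up for illustration and are not model output; `bce_with_logits` is a hypothetical helper that mirrors the usual mean-reduced formula rather than the training code used for this model.

```python
import math

def sigmoid(x: float) -> float:
    """Map one logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits: list[float], targets: list[float]) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# Illustrative logits for the 8 labels (not real model output)
logits = [2.0, -3.0, 1.5, -4.0, -4.0, -2.0, -3.0, -5.0]
targets = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

probs = [sigmoid(x) for x in logits]
predicted = [int(p > 0.5) for p in probs]  # one independent decision per label
loss = bce_with_logits(logits, targets)

print(predicted)
print(round(loss, 4))
```

Because each label gets its own sigmoid, the probabilities do not need to sum to 1, which is what allows several topics to be active for the same text.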
+### Model Variants
+- **v2.0** (Current): Multi-label classification - Returns multiple topics with confidence scores
+- **v1.0** (Legacy): Single-label classification - Returns single topic (available at `revision="v1.0-single-label"`)
+## Intended Use
+### Primary Use Cases
+- **Content Classification**: Categorize Swahili text messages, reports, or documents
+- **Case Management**: Automatically tag and route cases to appropriate departments
+- **Content Moderation**: Identify topics requiring attention (e.g., health emergencies, violence)
+- **Data Analytics**: Analyze trends and patterns in Swahili text data
+- **Information Routing**: Direct messages to relevant stakeholders based on topics
+### Out-of-Scope Uses
+- **Not suitable for**: Languages other than Swahili
+- **Not suitable for**: Very short text (< 5 words) or very long text (> 512 tokens)
+- **Not suitable for**: Real-time critical decision making without human oversight
+- **Not suitable for**: Medical diagnosis or legal advice
+## Training Details
+### Training Data
+- **Dataset**: Custom Swahili text dataset
+- **Language**: Swahili (Kiswahili)
+- **Data Collection**: U-Report platform messages and related Swahili text
+- **Preprocessing**: Text cleaning, normalization, and tokenization
+- **Data Balance**: Dataset balanced across 8 topics
+### Training Procedure
+- **Training Type**: Fine-tuning from pre-trained RoBERTa-base Wechsel Swahili
+- **Optimizer**: AdamW
+- **Learning Rate**: 2e-5
+- **Batch Size**: 4 per device (effective 16 per device after gradient accumulation)
+- **Epochs**: 3
+- **Gradient Accumulation**: 4 steps
+- **Weight Decay**: 0.01
+- **Mixed Precision**: Enabled (FP16)
+- **Early Stopping**: Enabled (patience=2)
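As a rough illustration of how the per-device batch size and gradient accumulation combine, the arithmetic can be sketched as follows. The device count and the dataset size of 10,000 examples are assumptions for the example only; the card states neither, and training frameworks differ in how they treat a final partial accumulation group.

```python
import math

# Values taken from the hyperparameters listed in this card
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 3

# Assumption: single-device training (the card does not state the device count)
num_devices = 1

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

def approx_optimizer_steps(num_examples: int) -> int:
    """Approximate optimizer updates, counting a partial final group as one update."""
    batches_per_epoch = math.ceil(
        num_examples / (per_device_train_batch_size * num_devices)
    )
    steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    return steps_per_epoch * num_train_epochs

print(effective_batch_size)
print(approx_optimizer_steps(10_000))  # 10_000 is a hypothetical dataset size
```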
+### Training Hyperparameters
+```yaml
+learning_rate: 2e-5
+per_device_train_batch_size: 4
+gradient_accumulation_steps: 4
+num_train_epochs: 3
+weight_decay: 0.01
+warmup_steps: 0
+max_grad_norm: 1.0
+fp16: true
+```
+## Evaluation
+### Testing Data, Factors & Metrics
+- **Evaluation Dataset**: Held-out test set from balanced dataset
+- **Evaluation Metrics**:
+  - **F1 Score (Micro)**: Aggregated across all labels
+  - **F1 Score (Macro)**: Average per-label F1
+  - **F1 Score (Samples)**: Average per-sample F1
+  - **Precision (Micro/Macro)**: Classification precision
+  - **Recall (Micro/Macro)**: Classification recall
+  - **Hamming Loss**: Fraction of incorrectly predicted labels
+  - **Subset Accuracy**: Exact match accuracy
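The metrics listed above differ mainly in how they aggregate over labels and samples. The following standard-library sketch computes each of them on a tiny made-up example (4 samples, 3 labels); the data is illustrative and unrelated to the reported results, and the helper names are hypothetical.

```python
def prf_counts(true_bits, pred_bits):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 0)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 here when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Illustrative data: rows are samples, columns are labels
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
n_samples, n_labels = len(y_true), len(y_true[0])

# Micro: pool counts over every (sample, label) cell
flat_true = [v for row in y_true for v in row]
flat_pred = [v for row in y_pred for v in row]
micro_f1 = f1_from_counts(*prf_counts(flat_true, flat_pred))

# Macro: F1 per label column, then an unweighted mean
macro_f1 = sum(
    f1_from_counts(*prf_counts(t_col, p_col))
    for t_col, p_col in zip(zip(*y_true), zip(*y_pred))
) / n_labels

# Samples: F1 per row, then a mean over samples
samples_f1 = sum(
    f1_from_counts(*prf_counts(t_row, p_row))
    for t_row, p_row in zip(y_true, y_pred)
) / n_samples

# Hamming loss: fraction of individual label decisions that are wrong
hamming_loss = sum(t != p for t, p in zip(flat_true, flat_pred)) / (n_samples * n_labels)

# Subset accuracy: fraction of samples whose full label vector matches exactly
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples

print(micro_f1, macro_f1, samples_f1, hamming_loss, subset_accuracy)
```

Subset accuracy is the strictest of these, since a single wrong label makes the whole sample count as an error, while Hamming loss penalizes only the individual wrong label.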
+### Results
+| Metric | Score |
+|--------|-------|
+| F1 Score (Micro) | 0.96 |
+| F1 Score (Macro) | 0.96 |
+| F1 Score (Samples) | 0.96 |
+| Precision (Micro) | 0.96 |
+| Recall (Micro) | 0.96 |
+| Hamming Loss | 0.009054 |
+| Subset Accuracy | 0.962 |
+## Model Performance Characteristics
+### Strengths
+- **Multi-label Capability**: Can identify multiple topics in a single text
+- **Confidence Scores**: Provides probability scores for each topic
+- **Swahili Language Support**: Specifically fine-tuned for Swahili text
+- **Efficient Inference**: ONNX format available for fast CPU inference
+- **Balanced Performance**: Trained on balanced dataset across all topics
+### Limitations
+- **Language Specific**: Only works with Swahili text
+- **Topic Coverage**: Limited to 8 predefined topics
+- **Context Dependency**: Performance may vary with text length and context
+- **Dialect Variations**: May not handle all Swahili dialects equally well
+- **Threshold Sensitivity**: Requires careful threshold tuning for optimal performance
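One possible way to approach threshold tuning is to sweep candidate thresholds per label on a validation set and keep the one with the highest F1. The sketch below does this for a single label; the probabilities, gold labels, and candidate grid are invented for illustration, and this is not necessarily how the reported results were obtained.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 for one label when predicting positive wherever prob > threshold."""
    preds = [int(p > threshold) for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, labels, candidates):
    """Return the candidate with the highest F1; ties go to the earliest candidate."""
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))

# Hypothetical validation scores and gold labels for a single topic
val_probs = [0.92, 0.61, 0.42, 0.38, 0.15, 0.07]
val_labels = [1, 1, 1, 0, 0, 0]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7]

chosen = best_threshold(val_probs, val_labels, candidates)
print(chosen, f1_at_threshold(val_probs, val_labels, chosen))
```

Repeating the sweep for each of the 8 labels gives a per-label threshold vector. Thresholds chosen this way should be checked on data that was not used for the sweep.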
+### Known Biases
+- **Training Data Bias**: Model reflects biases present in training data
+- **Geographic Bias**: May perform better on texts from regions in training data
+- **Topic Imbalance**: Some topics may have better representation in training data
+- **Cultural Context**: May not capture all cultural nuances in Swahili communication
+## How to Get Started with the Model
+### Using Transformers (PyTorch)
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+import torch
+# Load model
+model = AutoModelForSequenceClassification.from_pretrained(
+    "NeboTech/swahili-text-classifier",
+    problem_type="multi_label_classification"  # CRITICAL for multi-label
+)
+tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")
+# Prepare input
+text = "Nataka kujua dalili za COVID-19 na jinsi ya kujilinda"
+inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)
+# Get predictions
+model.eval()
+with torch.no_grad():
+    outputs = model(**inputs)
+    logits = outputs.logits  # Shape: [1, 8]
+# Apply sigmoid for multi-label
+probs = torch.sigmoid(logits)
+# Apply threshold
+threshold = 0.5
+predictions = (probs > threshold).float()
+# Get applicable topics
+applicable_topics = torch.where(predictions[0] == 1)[0].tolist()
+print(f"Applicable topics: {applicable_topics}")
+print(f"Probabilities: {probs[0].tolist()}")
+```
+### Using ONNX Runtime
+```python
+import onnxruntime as ort
+import numpy as np
+from transformers import AutoTokenizer
+# Load tokenizer
+tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")
+# Load ONNX model
+session = ort.InferenceSession("swahili_classifier.onnx")
+# Prepare input
+text = "Nataka kujua dalili za COVID-19"
+inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=256)
+# Run inference
+outputs = session.run(
+    None,
+    {
+        "input_ids": inputs["input_ids"].astype(np.int64),
+        "attention_mask": inputs["attention_mask"].astype(np.int64)
+    }
+)
+logits = outputs[0]  # Shape: [1, 8]
+# Apply sigmoid
+probs = 1 / (1 + np.exp(-logits))
+# Apply threshold
+threshold = 0.5
+predictions = (probs > threshold).astype(float)
+# Get topics
+applicable_topics = np.where(predictions[0] == 1)[0]
+print(f"Applicable topics: {applicable_topics}")
+```
+## Topics (Label Mapping)
+| ID | Topic | Description |
+|----|-------|-------------|
+| 0 | COVID | COVID-19 related topics, symptoms, prevention |
+| 1 | EDUCATION | Educational content, school-related topics |
+| 2 | HEALTH | General health topics, medical information |
+| 3 | HIV/AIDS | HIV/AIDS related information and support |
+| 4 | MENSTRUAL HYGIENE | Menstrual health and hygiene topics |
+| 5 | NUTRITION | Nutrition, food, and dietary information |
+| 6 | U-REPORT | U-Report platform related content |
+| 7 | VIOLENCE AGAINST CHILDREN | Child protection and violence prevention |
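To turn the index-based output into readable topic names, the table above can be expressed as a lookup. This is a minimal sketch: `ID2LABEL` and `decode` are hypothetical names, the probabilities are illustrative, and whether the published `config.json` carries the same mapping in `id2label` is an assumption worth verifying before relying on it.

```python
# Mapping copied from the label table in this card
ID2LABEL = {
    0: "COVID",
    1: "EDUCATION",
    2: "HEALTH",
    3: "HIV/AIDS",
    4: "MENSTRUAL HYGIENE",
    5: "NUTRITION",
    6: "U-REPORT",
    7: "VIOLENCE AGAINST CHILDREN",
}

def decode(probs, threshold=0.5):
    """Return (topic, probability) pairs above the threshold, highest first."""
    hits = [(ID2LABEL[i], p) for i, p in enumerate(probs) if p > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Illustrative probabilities for one text (not real model output)
example_probs = [0.91, 0.03, 0.78, 0.02, 0.01, 0.04, 0.10, 0.01]
decoded = decode(example_probs)
print(decoded)
```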
+## Ethical Considerations
+### Ethical Use
+- **Human Oversight**: Always include human review for critical decisions
+- **Privacy**: Respect user privacy when processing text data
+- **Transparency**: Inform users when automated classification is used
+- **Fairness**: Monitor for biased outcomes across different user groups
+### Potential Risks
+- **Misclassification**: Incorrect topic assignment could misroute important messages
+- **False Positives/Negatives**: May miss urgent cases or flag non-urgent content
+- **Privacy Concerns**: Processing sensitive health and personal information
+- **Cultural Sensitivity**: May not fully capture cultural context and nuances
+### Recommendations
+- **Regular Monitoring**: Continuously monitor model performance in production
+- **Human Review**: Implement human review for high-stakes classifications
+- **Feedback Loop**: Collect and incorporate user feedback for improvements
+- **Bias Auditing**: Regularly audit for biases and fairness issues
+- **Threshold Tuning**: Adjust thresholds based on use case requirements
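The human-review recommendation could be implemented as a simple gate in front of any automated routing. The sketch below is one possible policy, not a prescribed one: the high-stakes topic set, the uncertainty band, and the function name `needs_human_review` are all hypothetical choices that a deployment would need to set for itself.

```python
# Hypothetical policy settings; adjust to the deployment's own risk tolerance
HIGH_STAKES_TOPICS = {"VIOLENCE AGAINST CHILDREN"}
UNCERTAIN_BAND = (0.35, 0.65)

def needs_human_review(scores: dict, threshold: float = 0.5) -> bool:
    """Flag a prediction for review if any score is borderline or a high-stakes topic fires."""
    low, high = UNCERTAIN_BAND
    for topic, prob in scores.items():
        if low <= prob <= high:
            return True  # the model is not confident either way
        if topic in HIGH_STAKES_TOPICS and prob > threshold:
            return True  # sensitive topic detected, always escalate
    return False

print(needs_human_review({"HEALTH": 0.95, "VIOLENCE AGAINST CHILDREN": 0.02}))
print(needs_human_review({"HEALTH": 0.55}))
print(needs_human_review({"VIOLENCE AGAINST CHILDREN": 0.90}))
```

A gate like this only reduces the share of cases that skip review; it does not replace monitoring for missed urgent messages.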
+## Citation
+```bibtex
+@misc{swahili-topic-classifier-multilabel,
+  title={Swahili Topic Classifier - Multi-label Classification},
+  author={NeboTech},
+  year={2024},
+  publisher={Hugging Face},
+  howpublished={\url{https://huggingface.co/NeboTech/swahili-text-classifier}},
+  note={Version 2.0 - Multi-label Classification}
+}
+```
+## Additional Information
+### Model Files
+- `config.json`: Model configuration
+- `pytorch_model.bin` or `model.safetensors`: Model weights
+- `tokenizer.json`: Tokenizer model
+- `tokenizer_config.json`: Tokenizer configuration
+- `vocab.json`, `merges.txt`: Vocabulary files
+- `swahili_classifier.onnx`: ONNX model (separate repository)
+### Version History
+- **v2.0** (Current): Multi-label classification with sigmoid activation
+- **v1.0** (Legacy): Single-label classification with softmax activation
+### Contact
+For questions, issues, or contributions