---
tags:
- swahili
- classification
- multilabel
- roberta
- transformers
- onnx
- africa
- nlp
license: apache-2.0
language:
- sw
- swa
datasets:
- custom
metrics:
- f1_score
- precision
- recall
- hamming_loss
pipeline_tag: text-classification
task_categories:
- text-classification
task_ids:
- multi-label-classification
base_model:
- benjamin/roberta-base-wechsel-swahili
library_name: transformers
---
# Swahili Topic Classifier - Multi-label Classification
## Model Details
### Model Description
A multi-label text classification model fine-tuned on RoBERTa-base Wechsel Swahili for classifying Swahili text into 8 predefined topics. The model can identify multiple applicable topics for a given text, providing confidence scores for each topic.
- **Developed by**: NeboTech
- **Model type**: Transformer-based (RoBERTa)
- **Language(s)**: Swahili (Kiswahili)
- **License**: Apache 2.0
- **Finetuned from**: [benjamin/roberta-base-wechsel-swahili](https://huggingface.co/benjamin/roberta-base-wechsel-swahili)
- **Model version**: v2.0 (Multi-label Classification)
### Model Architecture
- **Base Model**: RoBERTa-base Wechsel Swahili
- **Task**: Multi-label Sequence Classification
- **Problem Type**: `multi_label_classification`
- **Number of Labels**: 8
- **Activation Function**: Sigmoid (for multi-label)
- **Loss Function**: BCEWithLogitsLoss
- **Output Format**: Binary vectors [batch_size, num_labels]
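The sigmoid head is what makes the model multi-label: each of the 8 logits is squashed independently, so several topics can clear the decision threshold at once, whereas a softmax head (as in v1.0) forces the scores to compete and sum to 1. A minimal sketch with made-up logits:

```python
import math

logits = [2.2, -1.5, 1.1, -3.0, -2.4, 0.2, -1.0, -2.8]  # one sample, 8 labels

# Sigmoid: each label is scored independently in (0, 1).
sigmoid = [1 / (1 + math.exp(-z)) for z in logits]

# Softmax (the v1.0 single-label head): scores compete and sum to 1.
exps = [math.exp(z) for z in logits]
softmax = [e / sum(exps) for e in exps]

print(sum(s > 0.5 for s in sigmoid))  # several sigmoid scores can exceed 0.5
print(sum(softmax))                   # softmax scores always sum to 1
```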
### Model Variants
- **v2.0** (Current): Multi-label classification - Returns multiple topics with confidence scores
- **v1.0** (Legacy): Single-label classification - Returns single topic (available at `revision="v1.0-single-label"`)
## Intended Use
### Primary Use Cases
- **Content Classification**: Categorize Swahili text messages, reports, or documents
- **Case Management**: Automatically tag and route cases to appropriate departments
- **Content Moderation**: Identify topics requiring attention (e.g., health emergencies, violence)
- **Data Analytics**: Analyze trends and patterns in Swahili text data
- **Information Routing**: Direct messages to relevant stakeholders based on topics
### Out-of-Scope Uses
- **Not suitable for**: Languages other than Swahili
- **Not suitable for**: Very short text (< 5 words) or very long text (> 512 tokens)
- **Not suitable for**: Real-time critical decision making without human oversight
- **Not suitable for**: Medical diagnosis or legal advice
## Training Details
### Training Data
- **Dataset**: Custom Swahili text dataset
- **Language**: Swahili (Kiswahili)
- **Data Collection**: U-Report platform messages and related Swahili text
- **Preprocessing**: Text cleaning, normalization, and tokenization
- **Data Balance**: Dataset balanced across 8 topics
### Training Procedure
- **Training Type**: Fine-tuning from pre-trained RoBERTa-base Wechsel Swahili
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: Variable (with gradient accumulation)
- **Epochs**: 3
- **Gradient Accumulation**: 4 steps
- **Weight Decay**: 0.01
- **Mixed Precision**: Enabled (FP16)
- **Early Stopping**: Enabled (patience=2)
### Training Hyperparameters

```yaml
learning_rate: 2e-5
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
num_train_epochs: 3
weight_decay: 0.01
warmup_steps: 0
max_grad_norm: 1.0
fp16: true
```

## Evaluation
### Testing Data, Factors & Metrics
- **Evaluation Dataset**: Held-out test set from balanced dataset
- **Evaluation Metrics**:
- **F1 Score (Micro)**: Aggregated across all labels
- **F1 Score (Macro)**: Average per-label F1
- **F1 Score (Samples)**: Average per-sample F1
- **Precision (Micro/Macro)**: Classification precision
- **Recall (Micro/Macro)**: Classification recall
- **Hamming Loss**: Fraction of incorrectly predicted labels
- **Subset Accuracy**: Exact match accuracy
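For clarity, the multi-label metrics above can be computed directly from binary label matrices. A minimal pure-Python sketch on toy data (illustrative only, not the model's actual evaluation code):

```python
# Toy ground-truth and predicted binary label matrices (4 samples, 3 labels).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

n_samples, n_labels = len(y_true), len(y_true[0])
pairs = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]

# Hamming loss: fraction of individual label slots predicted incorrectly.
hamming_loss = sum(t != p for t, p in pairs) / (n_samples * n_labels)

# Subset accuracy: fraction of samples whose whole label vector matches exactly.
subset_accuracy = sum(rt == rp for rt, rp in zip(y_true, y_pred)) / n_samples

# Micro F1: pool TP/FP/FN across all labels before computing F1.
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
micro_f1 = 2 * tp / (2 * tp + fp + fn)

print(hamming_loss, subset_accuracy, micro_f1)
```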
### Results
| Metric | Score |
|--------|-------|
| F1 Score (Micro) | 0.96 |
| F1 Score (Macro) | 0.96 |
| F1 Score (Samples) | 0.96 |
| Precision (Micro) | 0.96 |
| Recall (Micro) | 0.96 |
| Hamming Loss | 0.009054 |
| Subset Accuracy | 0.962 |
## Model Performance Characteristics
### Strengths
- **Multi-label Capability**: Can identify multiple topics in a single text
- **Confidence Scores**: Provides probability scores for each topic
- **Swahili Language Support**: Specifically fine-tuned for Swahili text
- **Efficient Inference**: ONNX format available for fast CPU inference
- **Balanced Performance**: Trained on balanced dataset across all topics
### Limitations
- **Language Specific**: Only works with Swahili text
- **Topic Coverage**: Limited to 8 predefined topics
- **Context Dependency**: Performance may vary with text length and context
- **Dialect Variations**: May not handle all Swahili dialects equally well
- **Threshold Sensitivity**: Requires careful threshold tuning for optimal performance
### Known Biases
- **Training Data Bias**: Model reflects biases present in training data
- **Geographic Bias**: May perform better on texts from regions in training data
- **Topic Imbalance**: Some topics may have better representation in training data
- **Cultural Context**: May not capture all cultural nuances in Swahili communication
## How to Get Started with the Model
### Using Transformers (PyTorch)
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "NeboTech/swahili-text-classifier",
    problem_type="multi_label_classification",  # CRITICAL for multi-label
)
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Prepare input
text = "Nataka kujua dalili za COVID-19 na jinsi ya kujilinda"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [1, 8]

# Apply sigmoid for multi-label
probs = torch.sigmoid(logits)

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).float()

# Get applicable topics
applicable_topics = torch.where(predictions[0] == 1)[0].tolist()
print(f"Applicable topics: {applicable_topics}")
print(f"Probabilities: {probs[0].tolist()}")
```

### Using ONNX Runtime
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Load ONNX model
session = ort.InferenceSession("swahili_classifier.onnx")

# Prepare input
text = "Nataka kujua dalili za COVID-19"
inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=256)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
    },
)
logits = outputs[0]  # Shape: [1, 8]

# Apply sigmoid
probs = 1 / (1 + np.exp(-logits))

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).astype(float)

# Get topics
applicable_topics = np.where(predictions[0] == 1)[0]
print(f"Applicable topics: {applicable_topics}")
```

## Topics (Label Mapping)
| ID | Topic | Description |
|----|-------|-------------|
| 0 | COVID | COVID-19 related topics, symptoms, prevention |
| 1 | EDUCATION | Educational content, school-related topics |
| 2 | HEALTH | General health topics, medical information |
| 3 | HIV/AIDS | HIV/AIDS related information and support |
| 4 | MENSTRUAL HYGIENE | Menstrual health and hygiene topics |
| 5 | NUTRITION | Nutrition, food, and dietary information |
| 6 | U-REPORT | U-Report platform related content |
| 7 | VIOLENCE AGAINST CHILDREN | Child protection and violence prevention |
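The table above maps directly to an `id2label` dictionary. A small helper (illustrative, not part of the released code) that turns a vector of sigmoid probabilities into readable topic labels:

```python
ID2LABEL = {
    0: "COVID", 1: "EDUCATION", 2: "HEALTH", 3: "HIV/AIDS",
    4: "MENSTRUAL HYGIENE", 5: "NUTRITION", 6: "U-REPORT",
    7: "VIOLENCE AGAINST CHILDREN",
}

def decode(probs, threshold=0.5):
    """Map a vector of 8 sigmoid probabilities to (label, score) pairs."""
    return [(ID2LABEL[i], p) for i, p in enumerate(probs) if p > threshold]

# Example: made-up probabilities for a health-related COVID message.
probs = [0.91, 0.03, 0.72, 0.05, 0.02, 0.10, 0.04, 0.01]
print(decode(probs))  # COVID and HEALTH exceed the 0.5 threshold
```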
## Ethical Considerations
### Ethical Use
- **Human Oversight**: Always include human review for critical decisions
- **Privacy**: Respect user privacy when processing text data
- **Transparency**: Inform users when automated classification is used
- **Fairness**: Monitor for biased outcomes across different user groups
### Potential Risks
- **Misclassification**: Incorrect topic assignment could misroute important messages
- **False Positives/Negatives**: May miss urgent cases or flag non-urgent content
- **Privacy Concerns**: Processing sensitive health and personal information
- **Cultural Sensitivity**: May not fully capture cultural context and nuances
### Recommendations
- **Regular Monitoring**: Continuously monitor model performance in production
- **Human Review**: Implement human review for high-stakes classifications
- **Feedback Loop**: Collect and incorporate user feedback for improvements
- **Bias Auditing**: Regularly audit for biases and fairness issues
- **Threshold Tuning**: Adjust thresholds based on use case requirements
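Threshold tuning, recommended above, is typically done per label on a held-out validation split. A hypothetical sketch (function name and toy data are illustrative) that grid-searches the F1-maximizing threshold for a single label:

```python
def best_threshold(probs, labels, grid=None):
    """Grid-search the decision threshold that maximizes F1 for one label."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

    def f1(th):
        tp = sum(p > th and y == 1 for p, y in zip(probs, labels))
        fp = sum(p > th and y == 0 for p, y in zip(probs, labels))
        fn = sum(p <= th and y == 1 for p, y in zip(probs, labels))
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    return max(grid, key=f1)

# Toy validation probabilities and gold labels for a single topic.
val_probs = [0.9, 0.8, 0.3, 0.2]
val_labels = [1, 1, 0, 0]
print(best_threshold(val_probs, val_labels))
```

In practice this is run once per label on validation predictions, and the resulting 8 thresholds replace the single global 0.5 cutoff used in the quick-start examples.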
## Citation
```bibtex
@misc{swahili-topic-classifier-multilabel,
  title={Swahili Topic Classifier - Multi-label Classification},
  author={NeboTech},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NeboTech/swahili-text-classifier}},
  note={Version 2.0 - Multi-label Classification}
}
```

## Additional Information
### Model Files
- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights
- `tokenizer.json`: Tokenizer model
- `tokenizer_config.json`: Tokenizer configuration
- `vocab.json`, `merges.txt`: Vocabulary files
- `swahili_classifier.onnx`: ONNX model (separate repository)
### Version History
- **v2.0** (Current): Multi-label classification with sigmoid activation
- **v1.0** (Legacy): Single-label classification with softmax activation
### Contact
For questions, issues, or contributions, please open a discussion on the model's Hugging Face repository page.