File size: 9,656 Bytes

f7d7d8a

---
tags:
- swahili
- classification
- multilabel
- roberta
- transformers
- onnx
- africa
- nlp
license: apache-2.0
language:
- sw
- swa
datasets:
- custom
metrics:
- f1_score
- precision
- recall
- hamming_loss
pipeline_tag: text-classification
task_categories:
- text-classification
task_ids:
- multi-label-classification
base_model:
- benjamin/roberta-base-wechsel-swahili
library_name: transformers
---

# Swahili Topic Classifier - Multi-label Classification

## Model Details

### Model Description
A multi-label text classification model fine-tuned on RoBERTa-base Wechsel Swahili for classifying Swahili text into 8 predefined topics. The model can identify multiple applicable topics for a given text, providing confidence scores for each topic.

- **Developed by**: NeboTech
- **Model type**: Transformer-based (RoBERTa)
- **Language(s)**: Swahili (Kiswahili)
- **License**: Apache 2.0
- **Finetuned from**: [RoBERTa-base Wechsel Swahili](https://huggingface.co/roberta-base-wechsel-swahili)
- **Model version**: v2.0 (Multi-label Classification)

### Model Architecture
- **Base Model**: RoBERTa-base Wechsel Swahili
- **Task**: Multi-label Sequence Classification
- **Problem Type**: `multi_label_classification`
- **Number of Labels**: 8
- **Activation Function**: Sigmoid (for multi-label)
- **Loss Function**: BCEWithLogitsLoss
- **Output Format**: Binary vectors [batch_size, num_labels]

### Model Variants
- **v2.0** (Current): Multi-label classification - Returns multiple topics with confidence scores
- **v1.0** (Legacy): Single-label classification - Returns single topic (available at `revision="v1.0-single-label"`)

## Intended Use

### Primary Use Cases
- **Content Classification**: Categorize Swahili text messages, reports, or documents
- **Case Management**: Automatically tag and route cases to appropriate departments
- **Content Moderation**: Identify topics requiring attention (e.g., health emergencies, violence)
- **Data Analytics**: Analyze trends and patterns in Swahili text data
- **Information Routing**: Direct messages to relevant stakeholders based on topics

### Out-of-Scope Uses
- **Not suitable for**: Languages other than Swahili
- **Not suitable for**: Very short text (< 5 words) or very long text (> 512 tokens)
- **Not suitable for**: Real-time critical decision making without human oversight
- **Not suitable for**: Medical diagnosis or legal advice

## Training Details

### Training Data
- **Dataset**: Custom Swahili text dataset
- **Language**: Swahili (Kiswahili)
- **Data Collection**: U-Report platform messages and related Swahili text
- **Preprocessing**: Text cleaning, normalization, and tokenization
- **Data Balance**: Dataset balanced across 8 topics

### Training Procedure
- **Training Type**: Fine-tuning from pre-trained RoBERTa-base Wechsel Swahili
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: Variable (with gradient accumulation)
- **Epochs**: 3
- **Gradient Accumulation**: 4 steps
- **Weight Decay**: 0.01
- **Mixed Precision**: Enabled (FP16)
- **Early Stopping**: Enabled (patience=2)

### Training Hyperparametersl
learning_rate: 2e-5
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
num_train_epochs: 3
weight_decay: 0.01
warmup_steps: 0
max_grad_norm: 1.0
fp16: true## Evaluation

### Testing Data, Factors & Metrics
- **Evaluation Dataset**: Held-out test set from balanced dataset
- **Evaluation Metrics**:
  - **F1 Score (Micro)**: Aggregated across all labels
  - **F1 Score (Macro)**: Average per-label F1
  - **F1 Score (Samples)**: Average per-sample F1
  - **Precision (Micro/Macro)**: Classification precision
  - **Recall (Micro/Macro)**: Classification recall
  - **Hamming Loss**: Fraction of incorrectly predicted labels
  - **Subset Accuracy**: Exact match accuracy

### Results
| Metric | Score |
|--------|-------|
| F1 Score (Micro) | 0.96 |
| F1 Score (Macro) |0.96 |
| F1 Score (Samples) |0.96 |
| Precision (Micro) | 0.96 |
| Recall (Micro) | 0.96 |
| Hamming Loss | 0.009054 |
| Subset Accuracy | 0.962 |

## Model Performance Characteristics

### Strengths
- **Multi-label Capability**: Can identify multiple topics in a single text
- **Confidence Scores**: Provides probability scores for each topic
- **Swahili Language Support**: Specifically fine-tuned for Swahili text
- **Efficient Inference**: ONNX format available for fast CPU inference
- **Balanced Performance**: Trained on balanced dataset across all topics

### Limitations
- **Language Specific**: Only works with Swahili text
- **Topic Coverage**: Limited to 8 predefined topics
- **Context Dependency**: Performance may vary with text length and context
- **Dialect Variations**: May not handle all Swahili dialects equally well
- **Threshold Sensitivity**: Requires careful threshold tuning for optimal performance

### Known Biases
- **Training Data Bias**: Model reflects biases present in training data
- **Geographic Bias**: May perform better on texts from regions in training data
- **Topic Imbalance**: Some topics may have better representation in training data
- **Cultural Context**: May not capture all cultural nuances in Swahili communication

## How to Get Started with the Model

### Using Transformers (PyTorch)

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "NeboTech/swahili-text-classifier",
    problem_type="multi_label_classification"  # CRITICAL for multi-label
)
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Prepare input
text = "Nataka kujua dalili za COVID-19 na jinsi ya kujilinda"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [1, 8]

# Apply sigmoid for multi-label
probs = torch.sigmoid(logits)

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).float()

# Get applicable topics
applicable_topics = torch.where(predictions[0] == 1)[0].tolist()
print(f"Applicable topics: {applicable_topics}")
print(f"Probabilities: {probs[0].tolist()}")### Using ONNX Runtime

import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Load ONNX model
session = ort.InferenceSession("swahili_classifier.onnx")

# Prepare input
text = "Nataka kujua dalili za COVID-19"
inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=256)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64)
    }
)

logits = outputs[0]  # Shape: [1, 8]

# Apply sigmoid
probs = 1 / (1 + np.exp(-logits))

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).astype(float)

# Get topics
applicable_topics = np.where(predictions[0] == 1)[0]
print(f"Applicable topics: {applicable_topics}")## Topics (Label Mapping)

| ID | Topic | Description |
|----|-------|-------------|
| 0 | COVID | COVID-19 related topics, symptoms, prevention |
| 1 | EDUCATION | Educational content, school-related topics |
| 2 | HEALTH | General health topics, medical information |
| 3 | HIV/AIDS | HIV/AIDS related information and support |
| 4 | MENSTRUAL HYGIENE | Menstrual health and hygiene topics |
| 5 | NUTRITION | Nutrition, food, and dietary information |
| 6 | U-REPORT | U-Report platform related content |
| 7 | VIOLENCE AGAINST CHILDREN | Child protection and violence prevention |

## Ethical Considerations

### Ethical Use
- **Human Oversight**: Always include human review for critical decisions
- **Privacy**: Respect user privacy when processing text data
- **Transparency**: Inform users when automated classification is used
- **Fairness**: Monitor for biased outcomes across different user groups

### Potential Risks
- **Misclassification**: Incorrect topic assignment could misroute important messages
- **False Positives/Negatives**: May miss urgent cases or flag non-urgent content
- **Privacy Concerns**: Processing sensitive health and personal information
- **Cultural Sensitivity**: May not fully capture cultural context and nuances

### Recommendations
- **Regular Monitoring**: Continuously monitor model performance in production
- **Human Review**: Implement human review for high-stakes classifications
- **Feedback Loop**: Collect and incorporate user feedback for improvements
- **Bias Auditing**: Regularly audit for biases and fairness issues
- **Threshold Tuning**: Adjust thresholds based on use case requirements

## Citation

@misc{swahili-topic-classifier-multilabel,
  title={Swahili Topic Classifier - Multi-label Classification},
  author={NeboTech},
  year={2024},
  publisher={Hugging Face},
  howpublished={\\url{https://huggingface.co/NeboTech/swahili-text-classifier}},
  note={Version 2.0 - Multi-label Classification}
}## Additional Information

### Model Files
- `config.json`: Model configuration
- `pytorch_model.bin` or `model.safetensors`: Model weights
- `tokenizer.json`: Tokenizer model
- `tokenizer_config.json`: Tokenizer configuration
- `vocab.json`, `merges.txt`: Vocabulary files
- `swahili_classifier.onnx`: ONNX model (separate repository)

### Version History
- **v2.0** (Current): Multi-label classification with sigmoid activation
- **v1.0** (Legacy): Single-label classification with softmax activation

### Contact
For questions, issues, or contributions