Swahili Topic Classifier - Multi-label Classification

Model Details

Model Description

A multi-label text classification model fine-tuned on RoBERTa-base Wechsel Swahili for classifying Swahili text into 8 predefined topics. The model can identify multiple applicable topics for a given text, providing confidence scores for each topic.

  • Developed by: NeboTech
  • Model type: Transformer-based (RoBERTa)
  • Language(s): Swahili (Kiswahili)
  • License: Apache 2.0
  • Finetuned from: RoBERTa-base Wechsel Swahili
  • Model version: v2.0 (Multi-label Classification)

Model Architecture

  • Base Model: RoBERTa-base Wechsel Swahili
  • Task: Multi-label Sequence Classification
  • Problem Type: multi_label_classification
  • Number of Labels: 8
  • Activation Function: Sigmoid (for multi-label)
  • Loss Function: BCEWithLogitsLoss
  • Output Format: Binary vectors [batch_size, num_labels]

Model Variants

  • v2.0 (Current): Multi-label classification - Returns multiple topics with confidence scores
  • v1.0 (Legacy): Single-label classification - Returns single topic (available at revision="v1.0-single-label")

Intended Use

Primary Use Cases

  • Content Classification: Categorize Swahili text messages, reports, or documents
  • Case Management: Automatically tag and route cases to appropriate departments
  • Content Moderation: Identify topics requiring attention (e.g., health emergencies, violence)
  • Data Analytics: Analyze trends and patterns in Swahili text data
  • Information Routing: Direct messages to relevant stakeholders based on topics

Out-of-Scope Uses

  • Not suitable for: Languages other than Swahili
  • Not suitable for: Very short text (< 5 words) or very long text (> 512 tokens)
  • Not suitable for: Real-time critical decision making without human oversight
  • Not suitable for: Medical diagnosis or legal advice

Training Details

Training Data

  • Dataset: Custom Swahili text dataset
  • Language: Swahili (Kiswahili)
  • Data Collection: U-Report platform messages and related Swahili text
  • Preprocessing: Text cleaning, normalization, and tokenization
  • Data Balance: Dataset balanced across 8 topics

Training Procedure

  • Training Type: Fine-tuning from pre-trained RoBERTa-base Wechsel Swahili
  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Batch Size: 4 per device (effective batch of 16 with gradient accumulation)
  • Epochs: 3
  • Gradient Accumulation: 4 steps
  • Weight Decay: 0.01
  • Mixed Precision: Enabled (FP16)
  • Early Stopping: Enabled (patience=2)
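
If reproducing this setup with the Hugging Face `Trainer`, the values above map onto `TrainingArguments` roughly as follows (a sketch; `output_dir` is a placeholder name, not a released path):

```python
# Hyperparameters from above, as keyword arguments for transformers.TrainingArguments
training_kwargs = dict(
    output_dir="swahili-classifier-out",  # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size: 4 * 4 = 16
    num_train_epochs=3,
    weight_decay=0.01,
    max_grad_norm=1.0,
    warmup_steps=0,
    fp16=True,                       # mixed precision; requires a CUDA device
)

# Sketch of wiring it up (not executed here):
# args = TrainingArguments(**training_kwargs)
# trainer = Trainer(model=model, args=args, ...,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
```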

Training Hyperparameters

```yaml
learning_rate: 2e-5
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
num_train_epochs: 3
weight_decay: 0.01
warmup_steps: 0
max_grad_norm: 1.0
fp16: true
```

Evaluation

Testing Data, Factors & Metrics

  • Evaluation Dataset: Held-out test set from balanced dataset
  • Evaluation Metrics:
    • F1 Score (Micro): Aggregated across all labels
    • F1 Score (Macro): Average per-label F1
    • F1 Score (Samples): Average per-sample F1
    • Precision (Micro/Macro): Classification precision
    • Recall (Micro/Macro): Classification recall
    • Hamming Loss: Fraction of incorrectly predicted labels
    • Subset Accuracy: Exact match accuracy
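
All of these metrics are available in scikit-learn; the sketch below computes them on toy 4-sample x 8-label arrays (illustrative data, not the actual evaluation set):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, accuracy_score

# Toy ground truth and predictions: 4 samples x 8 labels
y_true = np.array([[1, 0, 0, 0, 0, 0, 0, 0],
                   [0, 1, 1, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 0, 1, 1]])
y_pred = np.array([[1, 0, 0, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0, 0, 0, 0],   # misses one label
                   [0, 0, 1, 0, 0, 0, 0, 0],
                   [0, 0, 0, 0, 0, 0, 1, 1]])

print("F1 micro:  ", f1_score(y_true, y_pred, average="micro"))
print("F1 macro:  ", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("F1 samples:", f1_score(y_true, y_pred, average="samples"))
print("Hamming:   ", hamming_loss(y_true, y_pred))   # fraction of wrong label bits
print("Subset acc:", accuracy_score(y_true, y_pred)) # exact-match rate per sample
```

Note how one missed label out of 32 bits yields a Hamming loss of 1/32 but costs a full sample in subset accuracy (3/4 exact matches).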

Results

Metric               Score
F1 Score (Micro)     0.96
F1 Score (Macro)     0.96
F1 Score (Samples)   0.96
Precision (Micro)    0.96
Recall (Micro)       0.96
Hamming Loss         0.009054
Subset Accuracy      0.962

Model Performance Characteristics

Strengths

  • Multi-label Capability: Can identify multiple topics in a single text
  • Confidence Scores: Provides probability scores for each topic
  • Swahili Language Support: Specifically fine-tuned for Swahili text
  • Efficient Inference: ONNX format available for fast CPU inference
  • Balanced Performance: Trained on balanced dataset across all topics

Limitations

  • Language Specific: Only works with Swahili text
  • Topic Coverage: Limited to 8 predefined topics
  • Context Dependency: Performance may vary with text length and context
  • Dialect Variations: May not handle all Swahili dialects equally well
  • Threshold Sensitivity: Requires careful threshold tuning for optimal performance

Known Biases

  • Training Data Bias: Model reflects biases present in training data
  • Geographic Bias: May perform better on texts from regions in training data
  • Topic Imbalance: Some topics may have better representation in training data
  • Cultural Context: May not capture all cultural nuances in Swahili communication

How to Get Started with the Model

Using Transformers (PyTorch)

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "NeboTech/swahili-text-classifier",
    problem_type="multi_label_classification",  # CRITICAL for multi-label
)
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Prepare input
text = "Nataka kujua dalili za COVID-19 na jinsi ya kujilinda"
inputs = tokenizer(text, return_tensors="pt", padding=True,
                   truncation=True, max_length=256)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [1, 8]

# Apply sigmoid for multi-label
probs = torch.sigmoid(logits)

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).float()

# Get applicable topics
applicable_topics = torch.where(predictions[0] == 1)[0].tolist()
print(f"Applicable topics: {applicable_topics}")
print(f"Probabilities: {probs[0].tolist()}")
```

Using ONNX Runtime

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Load ONNX model
session = ort.InferenceSession("swahili_classifier.onnx")

# Prepare input
text = "Nataka kujua dalili za COVID-19"
inputs = tokenizer(text, return_tensors="np", padding="max_length",
                   truncation=True, max_length=256)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
    },
)

logits = outputs[0]  # Shape: [1, 8]

# Apply sigmoid
probs = 1 / (1 + np.exp(-logits))

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).astype(float)

# Get topics
applicable_topics = np.where(predictions[0] == 1)[0]
print(f"Applicable topics: {applicable_topics}")
```

Topics (Label Mapping)

ID  Topic                      Description
0   COVID                      COVID-19 related topics, symptoms, prevention
1   EDUCATION                  Educational content, school-related topics
2   HEALTH                     General health topics, medical information
3   HIV/AIDS                   HIV/AIDS related information and support
4   MENSTRUAL HYGIENE          Menstrual health and hygiene topics
5   NUTRITION                  Nutrition, food, and dietary information
6   U-REPORT                   U-Report platform related content
7   VIOLENCE AGAINST CHILDREN  Child protection and violence prevention
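
The mapping above can be turned into a small lookup for converting thresholded prediction vectors into topic names (the `decode` helper is illustrative, not part of the released model files):

```python
# Label mapping from the table above
ID2LABEL = {
    0: "COVID",
    1: "EDUCATION",
    2: "HEALTH",
    3: "HIV/AIDS",
    4: "MENSTRUAL HYGIENE",
    5: "NUTRITION",
    6: "U-REPORT",
    7: "VIOLENCE AGAINST CHILDREN",
}

def decode(binary_vector):
    """Map a thresholded 8-element prediction vector to topic names."""
    return [ID2LABEL[i] for i, v in enumerate(binary_vector) if v == 1]

print(decode([1, 0, 1, 0, 0, 0, 0, 0]))  # → ['COVID', 'HEALTH']
```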

Ethical Considerations

Ethical Use

  • Human Oversight: Always include human review for critical decisions
  • Privacy: Respect user privacy when processing text data
  • Transparency: Inform users when automated classification is used
  • Fairness: Monitor for biased outcomes across different user groups

Potential Risks

  • Misclassification: Incorrect topic assignment could misroute important messages
  • False Positives/Negatives: May miss urgent cases or flag non-urgent content
  • Privacy Concerns: Processing sensitive health and personal information
  • Cultural Sensitivity: May not fully capture cultural context and nuances

Recommendations

  • Regular Monitoring: Continuously monitor model performance in production
  • Human Review: Implement human review for high-stakes classifications
  • Feedback Loop: Collect and incorporate user feedback for improvements
  • Bias Auditing: Regularly audit for biases and fairness issues
  • Threshold Tuning: Adjust thresholds based on use case requirements

Citation

```bibtex
@misc{swahili-topic-classifier-multilabel,
  title={Swahili Topic Classifier - Multi-label Classification},
  author={NeboTech},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NeboTech/swahili-text-classifier}},
  note={Version 2.0 - Multi-label Classification}
}
```

Additional Information

Model Files

  • config.json: Model configuration
  • pytorch_model.bin or model.safetensors: Model weights
  • tokenizer.json: Tokenizer model
  • tokenizer_config.json: Tokenizer configuration
  • vocab.json, merges.txt: Vocabulary files
  • swahili_classifier.onnx: ONNX model (separate repository)

Version History

  • v2.0 (Current): Multi-label classification with sigmoid activation
  • v1.0 (Legacy): Single-label classification with softmax activation

Contact

For questions, issues, or contributions, please use the model's Hugging Face repository page.
