|
|
--- |
|
|
tags: |
|
|
- swahili |
|
|
- classification |
|
|
- multilabel |
|
|
- roberta |
|
|
- transformers |
|
|
- onnx |
|
|
- africa |
|
|
- nlp |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- sw |
|
|
- swa |
|
|
datasets: |
|
|
- custom |
|
|
metrics: |
|
|
- f1_score |
|
|
- precision |
|
|
- recall |
|
|
- hamming_loss |
|
|
pipeline_tag: text-classification |
|
|
task_categories: |
|
|
- text-classification |
|
|
task_ids: |
|
|
- multi-label-classification |
|
|
base_model: |
|
|
- benjamin/roberta-base-wechsel-swahili |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# Swahili Topic Classifier - Multi-label Classification |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
A multi-label text classification model fine-tuned on RoBERTa-base Wechsel Swahili for classifying Swahili text into 8 predefined topics. The model can identify multiple applicable topics for a given text, providing confidence scores for each topic. |
|
|
|
|
|
- **Developed by**: NeboTech |
|
|
- **Model type**: Transformer-based (RoBERTa) |
|
|
- **Language(s)**: Swahili (Kiswahili) |
|
|
- **License**: Apache 2.0 |
|
|
- **Finetuned from**: [RoBERTa-base Wechsel Swahili](https://huggingface.co/benjamin/roberta-base-wechsel-swahili)
|
|
- **Model version**: v2.0 (Multi-label Classification) |
|
|
|
|
|
### Model Architecture |
|
|
- **Base Model**: RoBERTa-base Wechsel Swahili |
|
|
- **Task**: Multi-label Sequence Classification |
|
|
- **Problem Type**: `multi_label_classification` |
|
|
- **Number of Labels**: 8 |
|
|
- **Activation Function**: Sigmoid (for multi-label) |
|
|
- **Loss Function**: BCEWithLogitsLoss |
|
|
- **Output Format**: Binary vectors [batch_size, num_labels] |
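To make the sigmoid-based multi-label output concrete, here is a small NumPy sketch with illustrative logits (not real model outputs): each of the 8 logits is squashed independently, so probabilities do not sum to 1 and any number of labels can be active at once.

```python
import numpy as np

# Sigmoid maps each of the 8 logits to an independent probability,
# unlike softmax, where probabilities compete and must sum to 1.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative logits for one example across the 8 labels
logits = np.array([2.0, -1.5, 0.3, -3.0, 1.2, -0.5, -2.0, 0.8])
probs = sigmoid(logits)

# Thresholding at 0.5 yields the multi-hot output vector
multi_hot = (probs > 0.5).astype(int)
print(multi_hot)  # labels 0, 2, 4 and 7 active
```

Note that the sum of `probs` is generally not 1; that independence is what lets a single text carry several topics simultaneously.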
|
|
|
|
|
### Model Variants |
|
|
- **v2.0** (Current): Multi-label classification - Returns multiple topics with confidence scores |
|
|
- **v1.0** (Legacy): Single-label classification - Returns single topic (available at `revision="v1.0-single-label"`) |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
- **Content Classification**: Categorize Swahili text messages, reports, or documents |
|
|
- **Case Management**: Automatically tag and route cases to appropriate departments |
|
|
- **Content Moderation**: Identify topics requiring attention (e.g., health emergencies, violence) |
|
|
- **Data Analytics**: Analyze trends and patterns in Swahili text data |
|
|
- **Information Routing**: Direct messages to relevant stakeholders based on topics |
|
|
|
|
|
### Out-of-Scope Uses |
|
|
- **Not suitable for**: Languages other than Swahili |
|
|
- **Not suitable for**: Very short text (< 5 words) or very long text (> 512 tokens) |
|
|
- **Not suitable for**: Real-time critical decision making without human oversight |
|
|
- **Not suitable for**: Medical diagnosis or legal advice |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Dataset**: Custom Swahili text dataset |
|
|
- **Language**: Swahili (Kiswahili) |
|
|
- **Data Collection**: U-Report platform messages and related Swahili text |
|
|
- **Preprocessing**: Text cleaning, normalization, and tokenization |
|
|
- **Data Balance**: Dataset balanced across 8 topics |
|
|
|
|
|
### Training Procedure |
|
|
- **Training Type**: Fine-tuning from pre-trained RoBERTa-base Wechsel Swahili |
|
|
- **Optimizer**: AdamW |
|
|
- **Learning Rate**: 2e-5 |
|
|
- **Batch Size**: Variable (with gradient accumulation) |
|
|
- **Epochs**: 3 |
|
|
- **Gradient Accumulation**: 4 steps |
|
|
- **Weight Decay**: 0.01 |
|
|
- **Mixed Precision**: Enabled (FP16) |
|
|
- **Early Stopping**: Enabled (patience=2) |
|
|
|
|
|
### Training Hyperparameters

```yaml
learning_rate: 2e-5
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
num_train_epochs: 3
weight_decay: 0.01
warmup_steps: 0
max_grad_norm: 1.0
fp16: true
```

## Evaluation
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
- **Evaluation Dataset**: Held-out test set from balanced dataset |
|
|
- **Evaluation Metrics**: |
|
|
- **F1 Score (Micro)**: Aggregated across all labels |
|
|
- **F1 Score (Macro)**: Average per-label F1 |
|
|
- **F1 Score (Samples)**: Average per-sample F1 |
|
|
- **Precision (Micro/Macro)**: Classification precision |
|
|
- **Recall (Micro/Macro)**: Classification recall |
|
|
- **Hamming Loss**: Fraction of incorrectly predicted labels |
|
|
- **Subset Accuracy**: Exact match accuracy |
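For intuition on the two less common metrics, hamming loss and subset accuracy can be computed directly from multi-hot arrays; the values below are toy data, not the model's evaluation set:

```python
import numpy as np

# Toy multi-hot ground truth and predictions: 3 samples x 4 labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],   # exact match
                   [0, 1, 1, 0],   # one wrong label
                   [1, 1, 0, 1]])  # exact match

# Hamming loss: fraction of individual label decisions that are wrong
hamming = np.mean(y_true != y_pred)
print(hamming)  # 1 wrong out of 12 decisions

# Subset accuracy: fraction of samples whose entire label set matches
subset_acc = np.mean(np.all(y_true == y_pred, axis=1))
print(subset_acc)  # 2 of 3 samples match exactly
```

This is why subset accuracy is always the stricter number: one wrong label costs the whole sample, while hamming loss only charges for that single decision.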
|
|
|
|
|
### Results |
|
|
| Metric | Score |
|--------|-------|
| F1 Score (Micro) | 0.96 |
| F1 Score (Macro) | 0.96 |
| F1 Score (Samples) | 0.96 |
| Precision (Micro) | 0.96 |
| Recall (Micro) | 0.96 |
| Hamming Loss | 0.009054 |
| Subset Accuracy | 0.962 |
|
|
|
|
|
## Model Performance Characteristics |
|
|
|
|
|
### Strengths |
|
|
- **Multi-label Capability**: Can identify multiple topics in a single text |
|
|
- **Confidence Scores**: Provides probability scores for each topic |
|
|
- **Swahili Language Support**: Specifically fine-tuned for Swahili text |
|
|
- **Efficient Inference**: ONNX format available for fast CPU inference |
|
|
- **Balanced Performance**: Trained on balanced dataset across all topics |
|
|
|
|
|
### Limitations |
|
|
- **Language Specific**: Only works with Swahili text |
|
|
- **Topic Coverage**: Limited to 8 predefined topics |
|
|
- **Context Dependency**: Performance may vary with text length and context |
|
|
- **Dialect Variations**: May not handle all Swahili dialects equally well |
|
|
- **Threshold Sensitivity**: Requires careful threshold tuning for optimal performance |
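One common way to handle the threshold sensitivity noted above is to sweep candidate thresholds on a held-out validation set and keep the one maximizing micro-F1. A minimal sketch with toy probabilities (not real model outputs):

```python
import numpy as np

# Toy validation probabilities and multi-hot labels: 2 samples x 3 labels
probs = np.array([[0.90, 0.42, 0.18],
                  [0.61, 0.70, 0.12]])
y_true = np.array([[1, 1, 0],
                   [1, 1, 0]])

def micro_f1(y_true, y_pred):
    # Micro-F1 pools true/false positives and negatives across all labels
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Sweep candidate thresholds; keep the first with the best micro-F1
best_t, best_f1 = max(
    ((t, micro_f1(y_true, (probs >= t).astype(int))) for t in np.arange(0.1, 0.9, 0.05)),
    key=lambda x: x[1],
)
print(best_t, best_f1)
```

A variant worth considering is tuning a separate threshold per label, since per-label probability calibration often differs.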
|
|
|
|
|
### Known Biases |
|
|
- **Training Data Bias**: Model reflects biases present in training data |
|
|
- **Geographic Bias**: May perform better on texts from regions in training data |
|
|
- **Topic Imbalance**: Some topics may have better representation in training data |
|
|
- **Cultural Context**: May not capture all cultural nuances in Swahili communication |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
### Using Transformers (PyTorch) |
|
|
|
|
|
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "NeboTech/swahili-text-classifier",
    problem_type="multi_label_classification"  # CRITICAL for multi-label
)
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Prepare input
text = "Nataka kujua dalili za COVID-19 na jinsi ya kujilinda"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [1, 8]

# Apply sigmoid for multi-label
probs = torch.sigmoid(logits)

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).float()

# Get applicable topics
applicable_topics = torch.where(predictions[0] == 1)[0].tolist()
print(f"Applicable topics: {applicable_topics}")
print(f"Probabilities: {probs[0].tolist()}")
```

### Using ONNX Runtime
|
|
|
|
|
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NeboTech/swahili-text-classifier")

# Load ONNX model
session = ort.InferenceSession("swahili_classifier.onnx")

# Prepare input
text = "Nataka kujua dalili za COVID-19"
inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=256)

# Run inference
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64)
    }
)

logits = outputs[0]  # Shape: [1, 8]

# Apply sigmoid
probs = 1 / (1 + np.exp(-logits))

# Apply threshold
threshold = 0.5
predictions = (probs > threshold).astype(float)

# Get topics
applicable_topics = np.where(predictions[0] == 1)[0]
print(f"Applicable topics: {applicable_topics}")
```

## Topics (Label Mapping)
|
|
|
|
|
| ID | Topic | Description | |
|
|
|----|-------|-------------| |
|
|
| 0 | COVID | COVID-19 related topics, symptoms, prevention | |
|
|
| 1 | EDUCATION | Educational content, school-related topics | |
|
|
| 2 | HEALTH | General health topics, medical information | |
|
|
| 3 | HIV/AIDS | HIV/AIDS related information and support | |
|
|
| 4 | MENSTRUAL HYGIENE | Menstrual health and hygiene topics | |
|
|
| 5 | NUTRITION | Nutrition, food, and dietary information | |
|
|
| 6 | U-REPORT | U-Report platform related content | |
|
|
| 7 | VIOLENCE AGAINST CHILDREN | Child protection and violence prevention | |
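The indices returned by the inference examples can be turned into readable topic names with a mapping like the one below, which mirrors the table above. If the checkpoint's `config.json` includes `id2label`, prefer reading `model.config.id2label` instead of hardcoding.

```python
# Label mapping mirroring the table in this model card
ID2LABEL = {
    0: "COVID",
    1: "EDUCATION",
    2: "HEALTH",
    3: "HIV/AIDS",
    4: "MENSTRUAL HYGIENE",
    5: "NUTRITION",
    6: "U-REPORT",
    7: "VIOLENCE AGAINST CHILDREN",
}

# Convert predicted label indices into topic names
predicted_ids = [0, 2]  # example indices as returned by the code above
topics = [ID2LABEL[i] for i in predicted_ids]
print(topics)  # ['COVID', 'HEALTH']
```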
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
### Ethical Use |
|
|
- **Human Oversight**: Always include human review for critical decisions |
|
|
- **Privacy**: Respect user privacy when processing text data |
|
|
- **Transparency**: Inform users when automated classification is used |
|
|
- **Fairness**: Monitor for biased outcomes across different user groups |
|
|
|
|
|
### Potential Risks |
|
|
- **Misclassification**: Incorrect topic assignment could misroute important messages |
|
|
- **False Positives/Negatives**: May miss urgent cases or flag non-urgent content |
|
|
- **Privacy Concerns**: Processing sensitive health and personal information |
|
|
- **Cultural Sensitivity**: May not fully capture cultural context and nuances |
|
|
|
|
|
### Recommendations |
|
|
- **Regular Monitoring**: Continuously monitor model performance in production |
|
|
- **Human Review**: Implement human review for high-stakes classifications |
|
|
- **Feedback Loop**: Collect and incorporate user feedback for improvements |
|
|
- **Bias Auditing**: Regularly audit for biases and fairness issues |
|
|
- **Threshold Tuning**: Adjust thresholds based on use case requirements |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{swahili-topic-classifier-multilabel,
  title={Swahili Topic Classifier - Multi-label Classification},
  author={NeboTech},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NeboTech/swahili-text-classifier}},
  note={Version 2.0 - Multi-label Classification}
}
```

## Additional Information
|
|
|
|
|
### Model Files |
|
|
- `config.json`: Model configuration |
|
|
- `pytorch_model.bin` or `model.safetensors`: Model weights |
|
|
- `tokenizer.json`: Tokenizer model |
|
|
- `tokenizer_config.json`: Tokenizer configuration |
|
|
- `vocab.json`, `merges.txt`: Vocabulary files |
|
|
- `swahili_classifier.onnx`: ONNX model (separate repository) |
|
|
|
|
|
### Version History |
|
|
- **v2.0** (Current): Multi-label classification with sigmoid activation |
|
|
- **v1.0** (Legacy): Single-label classification with softmax activation |
|
|
|
|
|
### Contact |
|
|
For questions, issues, or contributions, please open a discussion on the model's Hugging Face repository.