Kiswahili Sentiment Analysis - DistilBERT
A fine-tuned DistilBERT model for sentiment classification of Kiswahili (Swahili) text. This model classifies text into positive or negative sentiment categories.
Model Description
This model was developed as part of the Tubonge - Kiswahili Speech Analytics System, a research project addressing the lack of NLP tools for low-resource African languages. Since no labeled sentiment dataset exists for Kiswahili, the model was trained using a pseudo-labeling methodology that transfers sentiment knowledge from English to Kiswahili through cross-lingual translation.
| Property | Value |
|---|---|
| Base Model | distilbert-base-multilingual-cased |
| Architecture | DistilBertForSequenceClassification |
| Parameters | ~134M (multilingual vocab: 119,547 tokens) |
| Language | Kiswahili (sw) |
| Task | Binary Sentiment Classification |
| Framework | PyTorch / Hugging Face Transformers 4.40.2 |
Training Procedure
Pseudo-Labeling Pipeline
Due to the absence of annotated sentiment data for Kiswahili, a cross-lingual pseudo-labeling approach was employed:
- Source Data: 18,629 validated Kiswahili transcriptions from the Mozilla Common Voice dataset
- Translation: Each Kiswahili text was translated to English using NLLB-200-distilled-600M
- Label Generation: A pre-trained English sentiment classifier assigned sentiment labels to the translated text
- Label Transfer: The generated labels were mapped back to the original Kiswahili text
- Fine-tuning: distilbert-base-multilingual-cased was fine-tuned on the pseudo-labeled Kiswahili dataset
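The data flow of the steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: `pseudo_label`, `toy_translate` and `toy_classify` are hypothetical names, and the toy functions stand in for NLLB-200-distilled-600M and for the English sentiment classifier, which this card does not name.

```python
from typing import Callable, Iterable, List, Tuple


def pseudo_label(
    swahili_texts: Iterable[str],
    translate: Callable[[str], str],
    classify: Callable[[str], int],
) -> List[Tuple[str, int]]:
    """Return (original Kiswahili text, label) pairs.

    The label is predicted on the English translation but attached to the
    untranslated Kiswahili text, which is what the model is fine-tuned on.
    """
    labeled = []
    for text in swahili_texts:
        english = translate(text)      # translation step
        label = classify(english)      # label generation step
        labeled.append((text, label))  # label transfer step
    return labeled


# Toy stand-ins so the sketch runs without downloading any model.
toy_dictionary = {
    "Hii ni siku nzuri sana": "This is a very good day",
    "Chakula hiki ni kibaya": "This food is bad",
}


def toy_translate(text: str) -> str:
    return toy_dictionary[text]


def toy_classify(english: str) -> int:
    # 1 = positive, 0 = negative, matching the label map in the Usage section
    return 1 if "good" in english.lower() else 0


pairs = pseudo_label(toy_dictionary.keys(), toy_translate, toy_classify)
print(pairs)
```

Passing the translator and classifier in as callables keeps the label-transfer logic separate from any particular model, so either component can be swapped without touching the loop.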
Training Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 2e-5 |
| Batch Size | 16 |
| Epochs | 3 |
| Optimizer | AdamW |
| Warmup Steps | 500 |
| Max Sequence Length | 128 |
| Weight Decay | 0.01 |
| Data Split | 80% train / 10% val / 10% test |
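
To put the warmup setting in context, the short calculation below estimates how many optimizer steps these hyperparameters imply. It is a rough sketch under assumptions this card does not confirm: that all 18,629 examples were pseudo-labeled and split 80/10/10, and that training ran on a single device without gradient accumulation.

```python
import math

num_examples = 18_629   # Common Voice transcriptions listed as source data
train_fraction = 0.80
batch_size = 16
epochs = 3
warmup_steps = 500

# Assumption: every source example ends up in the 80/10/10 split.
train_examples = int(num_examples * train_fraction)

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_share = warmup_steps / total_steps

print(train_examples, steps_per_epoch, total_steps, f"{warmup_share:.1%}")
```

Under these assumptions, training runs for roughly 2,800 steps, and the 500 warmup steps cover a little under a fifth of them.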
Evaluation Results
| Metric | Value |
|---|---|
| Weighted F1-Score | 0.6125 |
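The weighted F1-score of 0.6125 was measured on the pseudo-labeled test split, so it reflects agreement with labels produced by the translation pipeline rather than with human annotation. For a language with no manually annotated sentiment data, this is a usable starting point. The moderate score reflects inherent limitations of the pseudo-labeling approach, including translation noise and cultural differences in sentiment expression.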
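For readers unfamiliar with the metric, here is a minimal pure-Python sketch of a weighted F1-score: the per-class F1 values averaged with weights proportional to each class's share of the true labels. The labels are made up for illustration, and this is not the evaluation script used for the reported number.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Toy labels (made up for illustration): 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because each class is weighted by its support, the score is dominated by the majority class when the pseudo-labels are imbalanced.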
Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("RareElf/kiswahili-sentiment-distilbert")
model = AutoModelForSequenceClassification.from_pretrained("RareElf/kiswahili-sentiment-distilbert")

# Classify sentiment
text = "Habari yako, nimefurahi sana kukutana nawe"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

sentiment_map = {0: "negative", 1: "positive"}
print(f"Sentiment: {sentiment_map[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.2%}")
```
Using the Pipeline API
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="RareElf/kiswahili-sentiment-distilbert"
)

result = classifier("Hii ni siku nzuri sana")
print(result)
```
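The pipeline returns a list of dictionaries with `label` and `score` keys. If the model's config does not define label names, Transformers falls back to `LABEL_0` and `LABEL_1`. The helper below is a hypothetical convenience, not part of the model or the library, and it assumes the same index order as `sentiment_map` in the manual example. It runs on a hand-written result so no model download is needed.

```python
from typing import Dict, List

# Assumption: index 0 is negative and index 1 is positive, as in the
# sentiment_map of the manual example. LABEL_0 / LABEL_1 are the default
# names Transformers uses when a config does not define its own labels.
DEFAULT_LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(results: List[Dict]) -> List[str]:
    """Turn pipeline output into 'label (score)' strings."""
    lines = []
    for item in results:
        # Labels that already have real names pass through unchanged.
        name = DEFAULT_LABEL_NAMES.get(item["label"], item["label"])
        lines.append(f"{name} ({item['score']:.2%})")
    return lines


# Hand-written example in the shape the pipeline returns; the score is made up.
fake_result = [{"label": "LABEL_1", "score": 0.9}]
print(readable(fake_result))
```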
Intended Use
- Sentiment analysis of Kiswahili text from speech transcriptions
- Monitoring sentiment trends in Kiswahili audio content
- Research on low-resource language NLP and cross-lingual transfer learning
- Integration into Kiswahili speech analytics pipelines
Limitations
- Pseudo-label noise: Sentiment labels were generated through translation, introducing potential errors from translation inaccuracies and cultural differences in sentiment expression
- Domain specificity: Trained on Mozilla Common Voice transcriptions, which consist primarily of short read-aloud sentences; performance may vary on conversational or domain-specific text
- Binary classification: The model classifies into positive/negative categories; neutral sentiment detection is limited
- Dialect coverage: May not generalise equally to all Kiswahili dialects and regional variations
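One common workaround for the binary output, offered here only as a sketch and not as a feature of this model, is to treat low-confidence predictions as uncertain instead of forcing them into a class. The function name and the 0.7 threshold below are hypothetical; a real threshold would need tuning on held-out data.

```python
from typing import Sequence

LABELS = ("negative", "positive")  # index order from the Usage example


def label_with_abstention(probs: Sequence[float], threshold: float = 0.7) -> str:
    """Return the top label, or 'uncertain' if its probability is below threshold.

    `threshold` is a hypothetical value; it should be tuned on held-out data.
    """
    top = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[top] if probs[top] >= threshold else "uncertain"


print(label_with_abstention([0.08, 0.92]))  # confident prediction
print(label_with_abstention([0.45, 0.55]))  # near the decision boundary
```

An abstention is not the same thing as a neutral label: it only says the model's softmax output was close to the decision boundary.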
Citation
If you use this model in your research, please cite:
```bibtex
@misc{obote2025kiswahili-sentiment,
  author    = {Obote, Kevin},
  title     = {Kiswahili Sentiment Analysis using Pseudo-Labeled DistilBERT},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/RareElf/kiswahili-sentiment-distilbert}
}
```
Related Models
- RareElf/swahili-wav2vec2-asr - Kiswahili ASR model used in the same pipeline
- facebook/nllb-200-distilled-600M - Translation model used for pseudo-labeling
- google/mt5-small - Summarization model used in the pipeline
Model Card Contact
Kevin Obote - RareElf on Hugging Face