Kiswahili Sentiment Analysis — DistilBERT

A fine-tuned DistilBERT model that classifies Kiswahili (Swahili) text as positive or negative.

Model Description

This model was developed as part of the Tubonge — Kiswahili Speech Analytics System, a research project addressing the lack of NLP tools for low-resource African languages. Since no labeled sentiment dataset exists for Kiswahili, the model was trained using a pseudo-labeling methodology that transfers sentiment knowledge from English to Kiswahili through cross-lingual translation.

Property	Value
Base Model	`distilbert-base-multilingual-cased`
Architecture	DistilBertForSequenceClassification
Parameters	~134M (multilingual vocab: 119,547 tokens)
Language	Kiswahili (sw)
Task	Binary Sentiment Classification
Framework	PyTorch / Hugging Face Transformers 4.40.2
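
The parameter count is dominated by the multilingual vocabulary. As a rough check, assuming a hidden size of 768 (taken from the published configuration of distilbert-base-multilingual-cased, not stated in this card), the token-embedding matrix alone accounts for more than two thirds of the ~134M parameters:

```python
# Back-of-the-envelope estimate of the token-embedding share of the model.
# HIDDEN_SIZE is an assumption taken from the base model's published config.
VOCAB_SIZE = 119_547                 # from the table above
HIDDEN_SIZE = 768                    # assumed, not stated in this card
TOTAL_PARAMS_APPROX = 134_000_000    # "~134M" from the table above

embedding_params = VOCAB_SIZE * HIDDEN_SIZE
embedding_share = embedding_params / TOTAL_PARAMS_APPROX

print(f"Token embeddings: {embedding_params:,} parameters")
print(f"Approximate share of the model: {embedding_share:.0%}")
```

Because the total is only given approximately, the share is an estimate rather than an exact figure.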

Training Procedure

Pseudo-Labeling Pipeline

Due to the absence of annotated sentiment data for Kiswahili, a cross-lingual pseudo-labeling approach was employed:

Source Data: 18,629 validated Kiswahili transcriptions from the Mozilla Common Voice dataset
Translation: Each Kiswahili text was translated to English using NLLB-200-distilled-600M
Label Generation: A pre-trained English sentiment classifier assigned sentiment labels to the translated text
Label Transfer: The generated labels were mapped back to the original Kiswahili text
Fine-tuning: distilbert-base-multilingual-cased was fine-tuned on the pseudo-labeled Kiswahili dataset
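
The translate, label, and transfer steps above can be sketched as a small function. This is a minimal illustration, not the project's actual code: `pseudo_label`, `toy_translate`, and `toy_classify` are hypothetical names, and the toy stand-ins replace NLLB-200 and the English classifier (which the card does not name) so the sketch runs without downloading any model.

```python
from typing import Callable, Iterable, List, Tuple


def pseudo_label(
    texts: Iterable[str],
    translate: Callable[[str], str],
    classify: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Attach to each Kiswahili text the label predicted for its English translation."""
    labeled = []
    for text in texts:
        english = translate(text)      # Translation step
        label = classify(english)      # Label generation on the English text
        labeled.append((text, label))  # Label transfer back to the original text
    return labeled


# Toy stand-ins (assumptions for illustration only).
# In a real pipeline these would wrap NLLB-200 and an English sentiment classifier.
TOY_TRANSLATIONS = {
    "Hii ni siku nzuri sana": "This is a very good day",
    "Chakula hiki ni kibaya": "This food is bad",
}


def toy_translate(text: str) -> str:
    return TOY_TRANSLATIONS[text]


def toy_classify(english: str) -> str:
    return "positive" if "good" in english.lower() else "negative"


dataset = pseudo_label(TOY_TRANSLATIONS, toy_translate, toy_classify)
for text, label in dataset:
    print(f"{label}\t{text}")
```

Passing the translator and classifier in as callables keeps the label-transfer logic separate from the models, so either component could be swapped without touching the loop. Any error made by either callable is copied straight onto the Kiswahili text, which is the source of the label noise discussed under Limitations.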

Training Hyperparameters

Parameter	Value
Learning Rate	2e-5
Batch Size	16
Epochs	3
Optimizer	AdamW
Warmup Steps	500
Max Sequence Length	128
Weight Decay	0.01
Data Split	80% train / 10% val / 10% test
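
To put the warmup setting in context, the number of optimizer steps can be estimated from the figures above. The sketch assumes that all 18,629 pseudo-labeled examples were kept, that the split was rounded down, and that training used a single device without gradient accumulation; none of these is stated in the card.

```python
import math

# Values taken from the card.
NUM_EXAMPLES = 18_629
TRAIN_FRACTION = 0.80
BATCH_SIZE = 16
EPOCHS = 3
WARMUP_STEPS = 500

# Assumptions: every example is kept, the split is rounded down,
# and there is one device with no gradient accumulation.
train_examples = int(NUM_EXAMPLES * TRAIN_FRACTION)
steps_per_epoch = math.ceil(train_examples / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_share = WARMUP_STEPS / total_steps

print(f"Training examples: {train_examples}")
print(f"Steps per epoch:   {steps_per_epoch}")
print(f"Total steps:       {total_steps}")
print(f"Warmup share:      {warmup_share:.1%}")
```

Under these assumptions, training runs for about 2,800 steps and the 500 warmup steps cover roughly 18% of them.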

Evaluation Results

Metric	Value
Weighted F1-Score	0.6125

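The weighted F1-score of 0.6125 was measured on the held-out 10% test split, whose labels are themselves pseudo-labels, so it measures agreement with the translated-and-classified labels rather than with human judgments. The model was trained without any manually annotated sentiment data, and the moderate score reflects inherent limitations of the pseudo-labeling approach, including translation noise and cultural differences in sentiment expression.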
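
The card does not say how the weighted F1 was computed. A minimal sketch of the standard definition (per-class F1 averaged with weights proportional to each class's support, as in scikit-learn's `f1_score` with `average="weighted"`) is shown below. The labels are hypothetical and are not the model's real outputs.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1, averaged with weights equal to each class's count in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


# Hypothetical labels (1 = positive, 0 = negative), for illustration only.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
score = weighted_f1(y_true, y_pred)
print(f"Weighted F1: {score:.4f}")
```

Weighting by support means the majority class has the larger influence on the score, which matters if the pseudo-labeled data is imbalanced.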

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("RareElf/kiswahili-sentiment-distilbert")
model = AutoModelForSequenceClassification.from_pretrained("RareElf/kiswahili-sentiment-distilbert")

# Classify sentiment
text = "Habari yako, nimefurahi sana kukutana nawe"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probs, dim=-1).item()

sentiment_map = {0: "negative", 1: "positive"}
print(f"Sentiment: {sentiment_map[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.2%}")

Using the Pipeline API

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="RareElf/kiswahili-sentiment-distilbert"
)

result = classifier("Hii ni siku nzuri sana")
print(result)
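
The pipeline returns a list of dictionaries with `label` and `score` keys. Whether the labels appear as generic names such as `LABEL_0` and `LABEL_1` or as named classes depends on the model's configuration, which this card does not show. The helper below is a hypothetical post-processing sketch that handles both cases, assuming the same class order as the manual example above.

```python
from typing import Dict, List

# Mapping from the manual example above; assumed to match the model's class order.
SENTIMENT_MAP = {0: "negative", 1: "positive"}


def readable(results: List[Dict]) -> List[Dict]:
    """Replace generic 'LABEL_<n>' names with sentiment names; keep other labels as-is."""
    out = []
    for item in results:
        label = item["label"]
        if label.startswith("LABEL_"):
            label = SENTIMENT_MAP[int(label.split("_")[1])]
        out.append({"label": label, "score": item["score"]})
    return out


# Hypothetical output, shaped like what `classifier(...)` returns.
example = [{"label": "LABEL_1", "score": 0.91}]
print(readable(example))
```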

Intended Use

Sentiment analysis of Kiswahili text from speech transcriptions
Monitoring sentiment trends in Kiswahili audio content
Research on low-resource language NLP and cross-lingual transfer learning
Integration into Kiswahili speech analytics pipelines

Limitations

Pseudo-label noise: Sentiment labels were generated through translation, introducing potential errors from translation inaccuracies and cultural differences in sentiment expression
Domain specificity: Trained on Mozilla Common Voice transcriptions, which consist primarily of short read-aloud sentences; performance may vary on conversational or domain-specific text
Binary classification: The model outputs only positive or negative; there is no neutral class, so neutral or mixed text is assigned to one of the two labels
Dialect coverage: May not generalise equally to all Kiswahili dialects and regional variations
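
One way a downstream application might soften the binary-only limitation is to abstain when the model's confidence is low. The sketch below is an assumption for illustration, not part of the published model: `with_abstention` is a hypothetical helper, the 0.6 threshold is arbitrary, and a low score signals uncertainty rather than genuinely neutral sentiment.

```python
from typing import Dict


def with_abstention(prediction: Dict, threshold: float = 0.6) -> str:
    """Return the predicted label, or 'uncertain' when the score is below the threshold."""
    if prediction["score"] >= threshold:
        return prediction["label"]
    return "uncertain"


# Hypothetical predictions in the pipeline's output format.
confident = {"label": "positive", "score": 0.93}
borderline = {"label": "negative", "score": 0.54}

print(with_abstention(confident))
print(with_abstention(borderline))
```

Any threshold would need to be tuned on data with trusted labels before being relied on.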

Citation

If you use this model in your research, please cite:

@misc{obote2025kiswahili-sentiment,
  author = {Obote, Kevin},
  title = {Kiswahili Sentiment Analysis using Pseudo-Labeled DistilBERT},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/RareElf/kiswahili-sentiment-distilbert}
}

Related Models

RareElf/swahili-wav2vec2-asr — Kiswahili ASR model used in the same pipeline
facebook/nllb-200-distilled-600M — Translation model used for pseudo-labeling
google/mt5-small — Summarization model used in the pipeline

Model Card Contact

Kevin Obote — RareElf on Hugging Face
