metadata
language:
- en
- ny
- bem
tags:
- text-classification
- multilingual
- transformer
- zambia
- lusaka
- code-switching
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model:
- Kelvinmbewe/mbert_Lusaka_Language_Analysis
- google-bert/bert-base-multilingual-cased
metrics:
- accuracy
- precision
- recall
- macro_f1
- micro_f1
- validation_loss
- confusion_matrix
model-index:
- name: LusakaLang
  results:
  - task:
      type: text-classification
      name: Topic Classification
    dataset:
      name: LusakaLang Topic Dataset
      type: lusakalang
      config: default
      split: validation
    metrics:
    - type: accuracy
      value: 0.99259
      name: accuracy
    - type: precision
      value: 0.9873
      name: precision
    - type: recall
      value: 0.99128
      name: recall
    - type: f1
      value: 0.98926
      name: macro_f1
    - type: f1
      value: 0.99259
      name: micro_f1
    - type: loss
      value: 0.05233
      name: validation_loss
LusakaLang Topic Analysis Model
This model was fine‑tuned from its sister model, mbert_LusakaLang_Sentiment_Analysis, which was itself fine‑tuned on sentiment data
spanning English, Bemba, Nyanja, Zambian slang, and the mixed Zambian language varieties commonly used in everyday communication.
Training Details
- Base model: `mbert_LusakaLang_Sentiment_Analysis`
- Epochs: 20
- Class weights: enabled (to correct class imbalance)
- Optimizer: AdamW
- Loss: Weighted cross‑entropy
- Temperature scaling: T = 2.3 (applied at inference time)
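As a rough illustration of the class-weighted loss listed above, the sketch below computes inverse-frequency class weights and a weighted cross-entropy in plain Python. The weighting formula (`n_samples / (num_classes * count)`), the three-class label distribution, and the function names are assumptions for illustration; the model card does not state how the weights were actually derived. In PyTorch, the equivalent is passing a weight tensor to `torch.nn.CrossEntropyLoss(weight=...)`.

```python
import math
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights: n_samples / (num_classes * count_c)."""
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted negative log-likelihood, normalized by the sum of target weights."""
    total, norm = 0.0, 0.0
    for logits, y in zip(batch_logits, targets):
        total += -weights[y] * log_softmax(logits)[y]
        norm += weights[y]
    return total / norm


# Hypothetical, imbalanced label distribution over three topic classes.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(labels, num_classes=3)

# Rare classes receive larger weights, so their errors cost more.
loss = weighted_cross_entropy(
    batch_logits=[[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]],
    targets=[0, 2],
    weights=weights,
)
```

With this scheme a misclassified example from the rarest class contributes more to the loss than one from the most common class, which is the usual motivation for enabling class weights on imbalanced data.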
Why Temperature Scaling?
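Class‑weighted training tends to sharpen the logits, which makes the softmax probabilities overconfident.
Dividing the logits by T = 2.3 before the softmax softens the probabilities and improves: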
- Confidence calibration
- Noise robustness
- Handling of positive/neutral text
- Foreign‑language generalization
- Reduction of overconfident misclassifications
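A minimal sketch of inference-time temperature scaling, assuming the classifier's raw logits are available as a list of floats. The logit values and the function name are hypothetical; only T = 2.3 comes from this model card.

```python
import math


def softmax_with_temperature(logits, temperature=2.3):
    """Softmax over logits divided by a temperature (T > 1 flattens the distribution)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical logits for one input over three topic classes.
logits = [6.0, 1.0, 0.5]

raw = softmax_with_temperature(logits, temperature=1.0)
calibrated = softmax_with_temperature(logits, temperature=2.3)

# The predicted class is the same in both cases; only the confidence changes.
predicted = max(range(len(logits)), key=lambda i: calibrated[i])
```

Because dividing by a positive constant preserves the ordering of the logits, the predicted class is unchanged; temperature scaling only lowers the confidence assigned to it, which matters when predictions are filtered by a probability threshold.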
Training Data
The dataset was primarily synthetic, generated to simulate realistic ride‑hailing feedback in Zambia.
To ensure authenticity:
- All samples were reviewed by a native Zambian speaker
- Mixed language and slang patterns were corrected
- Local idioms and slang were added
- Unnatural AI‑generated phrasing was removed
- Bemba/Nyanja grammar and tone were validated
This hybrid approach ensures that the dataset reflects real Zambian communication styles.


