---
language:
- en
- ny
- bem
tags:
- text-classification
- multilingual
- transformer
- zambia
- lusaka
- code-switching
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model:
- Kelvinmbewe/mbert_Lusaka_Language_Analysis
- google-bert/bert-base-multilingual-cased
metrics:
- accuracy
- precision
- recall
- macro_f1
- micro_f1
- validation_loss
- confusion_matrix
model-index:
- name: LusakaLang
  results:
  - task:
      type: text-classification
      name: Topic Classification
    dataset:
      name: LusakaLang Topic Dataset
      type: lusakalang
      config: default
      split: validation
    metrics:
    - type: accuracy
      value: 0.99259
      name: accuracy
    - type: precision
      value: 0.98730
      name: precision
    - type: recall
      value: 0.99128
      name: recall
    - type: f1
      value: 0.98926
      name: macro_f1
    - type: f1
      value: 0.99259
      name: micro_f1
    - type: loss
      value: 0.05233
      name: validation_loss
---

# **LusakaLang Topic Analysis Model**

This topic classification model was fine-tuned from its sister model, `mbert_LusakaLang_Sentiment_Analysis`, which was itself fine-tuned on sentiment data spanning English, Bemba, Nyanja, Zambian slang, and the mixed Zambian language varieties commonly used in everyday communication.

## Training Details

- Base model: `mbert_LusakaLang_Sentiment_Analysis`
- Epochs: 20
- Class weights: enabled (to correct class imbalance)
- Optimizer: AdamW
- Loss: Weighted cross-entropy
- Temperature scaling: T = 2.3 (applied at inference time)

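
The class-weighted loss listed above can be sketched in plain Python as follows. This is a minimal illustration only: the inverse-frequency weighting scheme, the function names, and the toy inputs are assumptions, not details taken from the actual training code. The normalisation (dividing by the sum of the target-class weights) follows the convention PyTorch uses for a weighted `CrossEntropyLoss` with mean reduction.

```python
import math


def inverse_frequency_weights(class_counts):
    """Hypothetical weighting scheme: total / (num_classes * count).

    Rarer classes receive larger weights; a balanced dataset yields 1.0 for all.
    """
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]


def weighted_cross_entropy(logits_batch, labels, weights):
    """Weighted mean of per-sample negative log-likelihoods.

    Each sample's loss is scaled by the weight of its true class, and the sum
    is divided by the sum of those weights.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, label in zip(logits_batch, labels):
        # Numerically stable log-sum-exp of the logits.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        nll = log_norm - logits[label]
        weighted_sum += weights[label] * nll
        weight_total += weights[label]
    return weighted_sum / weight_total


# Toy example: three imbalanced classes (counts are made up for illustration).
weights = inverse_frequency_weights([80, 15, 5])
loss = weighted_cross_entropy(
    logits_batch=[[2.0, 0.1, -1.0], [0.3, 1.5, 0.2]],
    labels=[0, 1],
    weights=weights,
)
```

With weights like these, a mistake on a rare class contributes more to the loss than the same mistake on a frequent class, which is the usual motivation for enabling class weights on imbalanced data.
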
## **Why Temperature Scaling?**

Class-weighted training sharpens the logits, which tends to make the softmax probabilities overconfident. Dividing the logits by T = 2.3 before the softmax at inference time has the following benefits:

- Better confidence calibration
- Greater noise robustness
- Better handling of positive/neutral text
- Better foreign-language generalization
- Fewer overconfident misclassifications

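
As a rough illustration of the temperature scaling described above, the sketch below divides the logits by T before applying the softmax. It uses plain Python, and the function name and example logits are hypothetical; only the value T = 2.3 comes from this model card.

```python
import math


def softmax_with_temperature(logits, temperature=2.3):
    """Return softmax(logits / T). T > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical raw logits for three topic classes.
logits = [6.0, 1.0, 0.5]
raw = softmax_with_temperature(logits, temperature=1.0)
calibrated = softmax_with_temperature(logits, temperature=2.3)

# The predicted class is the same, but the top probability is lower after scaling.
predicted_raw = raw.index(max(raw))
predicted_calibrated = calibrated.index(max(calibrated))
```

Because dividing by a positive constant preserves the ordering of the logits, temperature scaling changes the reported confidence but not the predicted label.
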
## Training Data

The dataset is primarily synthetic, generated to simulate realistic ride-hailing feedback in Zambia. To ensure authenticity:

- All samples were reviewed by a native Zambian speaker
- Mixed-language and slang patterns were corrected
- Local idioms and slang were added
- Unnatural AI-generated phrasing was removed
- Bemba and Nyanja grammar and tone were validated

This hybrid approach helps ensure that the dataset reflects real Zambian communication styles.

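## Train and Validation Loss

![Train and validation loss](https://cdn-uploads.huggingface.co/production/uploads/6925a8e7bd0bbe0bd7e1bc5e/bqVwTLDZdYIpORxkldFyt.png)
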
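## Confusion Matrix

![Confusion matrix](https://cdn-uploads.huggingface.co/production/uploads/6925a8e7bd0bbe0bd7e1bc5e/XrLQ_EUcqM1ku4nfwLi_q.png)
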
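
For readers who want to relate a confusion matrix to the reported metrics, here is a minimal sketch of how accuracy, macro-F1, and micro-F1 can be derived from one. The function name and the 2x2 example matrix are hypothetical and are not the model's actual results. In single-label classification every error counts as one false positive and one false negative, so micro-F1 equals accuracy, which is consistent with the identical values (0.99259) in the metadata.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, macro-F1 and micro-F1.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    num_classes = len(cm)
    total = sum(sum(row) for row in cm)
    if total == 0:
        raise ValueError("confusion matrix is empty")

    per_class_f1 = []
    tp_sum = fp_sum = fn_sum = 0
    for i in range(num_classes):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(num_classes)) - tp  # column total minus diagonal
        fn = sum(cm[i]) - tp                                  # row total minus diagonal
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    return {
        "accuracy": tp_sum / total,
        "macro_f1": sum(per_class_f1) / num_classes,          # unweighted mean over classes
        "micro_f1": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),  # pooled counts
    }


# Made-up two-class matrix, for illustration only.
example = metrics_from_confusion([[8, 2], [1, 9]])
```

Macro-F1 averages the per-class scores without regard to class size, so it is the more sensitive of the two to errors on rare classes.
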
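## Word Cloud

![Word cloud](https://cdn-uploads.huggingface.co/production/uploads/6925a8e7bd0bbe0bd7e1bc5e/gbEdYrxQWYOgvULdG9jdv.png)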