---
language:
- en
- ny
- bem
tags:
- text-classification
- multilingual
- transformer
- zambia
- lusaka
- code-switching
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model:
- Kelvinmbewe/mbert_Lusaka_Language_Analysis
- google-bert/bert-base-multilingual-cased
metrics:
- accuracy
- precision
- recall
- macro_f1
- micro_f1
- validation_loss
- confusion_matrix
model-index:
- name: LusakaLang
  results:
  - task:
      type: text-classification
      name: Topic Classification
    dataset:
      name: LusakaLang Topic Dataset
      type: lusakalang
      config: default
      split: validation
    metrics:
    - type: accuracy
      value: 0.99259
      name: accuracy
    - type: precision
      value: 0.98730
      name: precision
    - type: recall
      value: 0.99128
      name: recall
    - type: f1
      value: 0.98926
      name: macro_f1
    - type: f1
      value: 0.99259
      name: micro_f1
    - type: loss
      value: 0.05233
      name: validation_loss
---

# **LusakaLang Topic Analysis Model**



This model was fine‑tuned from its sister model, `mbert_LusakaLang_Sentiment_Analysis`, which was itself fine‑tuned on sentiment data
spanning English, Bemba, Nyanja, Zambian slang, and the mixed Zambian language varieties commonly used in everyday communication.



## Training Details

- Base model: `mbert_LusakaLang_Sentiment_Analysis`
- Epochs: 20
- Class weights: enabled (to correct class imbalance)
- Optimizer: AdamW
- Loss: Weighted cross‑entropy
- Temperature scaling: T = 2.3 (applied at inference time)
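The class weighting and weighted cross‑entropy listed above can be illustrated with a minimal, dependency‑free sketch. The inverse‑frequency weighting rule and the function names are assumptions made for illustration; the model card does not state how the class weights were actually computed. The weighted mean used here matches the reduction `torch.nn.CrossEntropyLoss` applies when a `weight` tensor is supplied.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed weighting rule: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


def weighted_cross_entropy(logits_batch, labels, weights):
    """Weighted mean of per-example negative log-likelihoods."""
    total_loss = 0.0
    total_weight = 0.0
    for logits, label in zip(logits_batch, labels):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        nll = log_norm - logits[label]
        total_loss += weights[label] * nll
        total_weight += weights[label]
    return total_loss / total_weight


# Hypothetical imbalanced labels: class 0 is three times as common as class 1.
toy_labels = [0, 0, 0, 1]
toy_weights = inverse_frequency_weights(toy_labels)

# Hypothetical logits; every example is maximally uncertain between two classes.
toy_logits = [[0.0, 0.0] for _ in toy_labels]
toy_loss = weighted_cross_entropy(toy_logits, toy_labels, toy_weights)
```

With this rule the rare class receives the larger weight, so mistakes on under‑represented topics contribute more to the loss than mistakes on common ones.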

## **Why Temperature Scaling?**
Class‑weighted training tends to sharpen logits, which makes the model overconfident.
Dividing the logits by T = 2.3 before the softmax softens the predicted probabilities and improves:

- Confidence calibration
- Noise robustness
- Handling of positive/neutral text
- Foreign‑language generalization
- Reduction of overconfident misclassifications
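The effect of temperature scaling can be sketched in a few lines of plain Python. This is an illustrative sketch, not the model's actual inference code: the function name and the example logits are hypothetical, and only the value T = 2.3 comes from this card.

```python
import math


def softmax_with_temperature(logits, temperature=2.3):
    """Divide logits by a temperature, then apply a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-class example (not real model output).
example_logits = [6.0, 1.5, 0.5]
plain = softmax_with_temperature(example_logits, temperature=1.0)
calibrated = softmax_with_temperature(example_logits, temperature=2.3)
```

The predicted class is unchanged because dividing by a positive constant preserves the ordering of the logits; only the confidence assigned to the top class is reduced.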

## Training Data
The dataset was primarily synthetic, generated to simulate realistic ride‑hailing feedback in Zambia.
To ensure authenticity:

- All samples were reviewed by a native Zambian speaker
- Mixed‑language and slang patterns were corrected
- Local idioms and slang were added
- Unnatural AI‑generated phrasing was removed
- Bemba and Nyanja grammar and tone were validated

This hybrid approach ensures that the dataset reflects real Zambian communication styles.


## Train and Validation Loss
![image](https://cdn-uploads.huggingface.co/production/uploads/674ed988f86d2ca07fa23abe/OnagZY8nhxv-bOejq2m0B.png)

## Confusion Matrix
![image](https://cdn-uploads.huggingface.co/production/uploads/674ed988f86d2ca07fa23abe/Qk6rvSrTyeWHl90BrpNQZ.png)

## Word Cloud
![image](https://cdn-uploads.huggingface.co/production/uploads/674ed988f86d2ca07fa23abe/dZb3Tq2FBAKztlIp9asCs.png)