metadata
language:
  - en
  - ny
  - bem
tags:
  - text-classification
  - multilingual
  - transformer
  - zambia
  - lusaka
  - code-switching
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model:
  - Kelvinmbewe/mbert_Lusaka_Language_Analysis
  - google-bert/bert-base-multilingual-cased
metrics:
  - accuracy
  - precision
  - recall
  - macro_f1
  - micro_f1
  - validation_loss
  - confusion_matrix
model-index:
  - name: LusakaLang
    results:
      - task:
          type: text-classification
          name: Topic Classification
        dataset:
          name: LusakaLang Topic Dataset
          type: lusakalang
          config: default
          split: validation
        metrics:
          - type: accuracy
            value: 0.99259
            name: accuracy
          - type: precision
            value: 0.9873
            name: precision
          - type: recall
            value: 0.99128
            name: recall
          - type: f1
            value: 0.98926
            name: macro_f1
          - type: f1
            value: 0.99259
            name: micro_f1
          - type: loss
            value: 0.05233
            name: validation_loss

LusakaLang Topic Analysis Model

This topic classification model was fine‑tuned from its sister model, `mbert_LusakaLang_Sentiment_Analysis`, which was itself fine‑tuned on sentiment data spanning English, Bemba, Nyanja, Zambian slang, and the mixed Zambian language varieties commonly used in everyday communication.

Training Details

- Base model: `mbert_LusakaLang_Sentiment_Analysis`
- Epochs: 20  
- Class weights: enabled (to correct class imbalance)  
- Optimizer: AdamW  
- Loss: Weighted cross‑entropy  
- Temperature scaling: T = 2.3 (applied at inference time)
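
The class-weighted loss listed above can be illustrated with a minimal, dependency-free sketch. The card does not say how the class weights were computed, so the inverse-frequency heuristic in `balanced_class_weights` is an assumption, and the labels and logits are hypothetical. The reduction (weighted sum divided by the sum of the weights) mirrors what `torch.nn.CrossEntropyLoss(weight=...)` does with its default `mean` reduction.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class).

    Assumption: the card only says class weights were enabled,
    not which formula was used.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted negative log-likelihood, normalised by the sum of weights."""
    total, norm = 0.0, 0.0
    for logits, y in zip(batch_logits, targets):
        w = weights[y]
        total += -w * log_softmax(logits)[y]
        norm += w
    return total / norm


# Hypothetical imbalanced label set: class 0 is three times as common as class 1.
train_labels = [0, 0, 0, 1]
weights = balanced_class_weights(train_labels)

# Hypothetical uninformative logits for a two-class problem.
batch_logits = [[0.0, 0.0]] * 4
loss = weighted_cross_entropy(batch_logits, train_labels, weights)
```

With these weights, an error on the rare class contributes three times as much to the loss as an error on the common class, which is the intended correction for imbalance.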

Why Temperature Scaling?

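Class‑weighted training tends to sharpen the logits, which makes the raw softmax confidences overconfident.
Temperature scaling divides the logits by T = 2.3 before the softmax. This softens the probabilities without changing the predicted class, and results in:

- Better confidence calibration
- Greater robustness to noise
- Better handling of positive/neutral text
- Better generalization to foreign‑language input
- Fewer overconfident misclassifications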
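
As a rough illustration of inference-time temperature scaling, the sketch below divides the logits by T before applying the softmax. Only the value T = 2.3 comes from this card; the function name and the example logits are hypothetical.

```python
import math

TEMPERATURE = 2.3  # value reported in this card


def softmax_with_temperature(logits, temperature=TEMPERATURE):
    """Softmax over logits / temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical logits for a three-class example.
logits = [6.0, 1.0, -2.0]

raw = softmax_with_temperature(logits, temperature=1.0)  # plain softmax
calibrated = softmax_with_temperature(logits)            # T = 2.3

# The predicted class is unchanged; only the confidence is softened.
predicted_raw = raw.index(max(raw))
predicted_calibrated = calibrated.index(max(calibrated))
```

Because dividing by a positive constant preserves the ordering of the logits, accuracy is unaffected. What changes is how much probability mass the top class receives.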

Training Data

The dataset was primarily synthetic, generated to simulate realistic ride‑hailing feedback in Zambia.  
To ensure authenticity:

- All samples were reviewed by a native Zambian speaker  
- Mixed language and slang patterns were corrected  
- Local idioms and slang were added  
- Unnatural AI‑generated phrasing was removed  
- Bemba and Nyanja grammar and tone were validated  

This hybrid approach ensures that the dataset reflects real Zambian communication style.

Train and Validation Loss

[Figure: training and validation loss curves]

Confusion Matrix

[Figure: confusion matrix on the validation split]
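
The accuracy, macro F1, and micro F1 reported for this model can all be derived from a confusion matrix. The sketch below shows one way to do that. The 3-class matrix is hypothetical and does not reproduce this model's results, and the function name is illustrative.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, micro F1 and macro F1.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    tp_sum = fp_sum = fn_sum = 0
    per_class_f1 = []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp  # predicted c, true class differs
        fn = sum(cm[c]) - tp                       # true c, predicted class differs
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    return {
        "accuracy": tp_sum / total,
        # Micro F1 pools the counts across classes before computing F1.
        "micro_f1": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),
        # Macro F1 averages the per-class F1 scores with equal weight.
        "macro_f1": sum(per_class_f1) / k,
    }


# Hypothetical confusion matrix for three topics (rows: true, columns: predicted).
example_cm = [
    [50, 1, 0],
    [2, 30, 1],
    [0, 0, 16],
]
metrics = metrics_from_confusion(example_cm)
```

In single-label classification every misclassified sample counts as one false positive and one false negative, so micro F1 equals accuracy. That is consistent with the identical accuracy and micro F1 values in the metadata. Macro F1 weights every class equally, so it is more sensitive to errors on rare classes.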

Word Cloud

[Figure: word cloud]