LusakaLang – Multilingual Topic Classification Model
🧠 Model Description
mbert_LusakaLang_Topic is a fine-tuned version of Kelvinmbewe/mbert_LusakaLang designed for topic classification in multilingual Zambian text.
The model focuses on Lusaka-style language, where English is frequently mixed with Bemba and Nyanja, particularly in informal digital communication such as ride-hailing reviews, customer feedback, and social media comments.
LusakaLang captures code-switching patterns, local idioms, and pragmatic expressions unique to Zambia’s urban linguistic environment, enabling accurate classification of real-world, mixed-language text.
🎯 Task
Text Classification (Topic Classification)
Supported topics include:
- customer_support
- driver_behaviour
- payment_issues
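A minimal inference sketch, assuming the checkpoint is available on the Hugging Face Hub under the ID `Kelvinmbewe/mbert_LusakaLang_Topic` and loads with the standard `transformers` text-classification pipeline. The helper names `load_classifier` and `classify_topics` are illustrative, and it is an unverified assumption that the model's config maps its outputs to the three topic names listed above; the `known` flag makes a mismatch visible.

```python
from typing import Callable, Iterable

# Topic names as listed in this model card.
SUPPORTED_TOPICS = ("customer_support", "driver_behaviour", "payment_issues")

# Assumed Hub ID, taken from the model name used in this card.
MODEL_ID = "Kelvinmbewe/mbert_LusakaLang_Topic"


def load_classifier(model_id: str = MODEL_ID) -> Callable:
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the rest of this sketch runs without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify_topics(texts: Iterable[str], classifier: Callable) -> list[dict]:
    """Return one {"text", "topic", "score", "known"} dict per input text."""
    texts = list(texts)
    results = []
    # For a list of inputs, the pipeline returns one {"label", "score"} dict per text.
    for text, pred in zip(texts, classifier(texts)):
        results.append(
            {
                "text": text,
                "topic": pred["label"],
                "score": float(pred["score"]),
                # False if the config uses other label names (e.g. generic LABEL_0).
                "known": pred["label"] in SUPPORTED_TOPICS,
            }
        )
    return results


# Usage (requires `pip install transformers torch` and network access):
#   clf = load_classifier()
#   print(classify_topics(["The driver was rude and my payment failed."], clf))
```

Passing the classifier in as an argument keeps the helper easy to test with a stub and avoids reloading the model on every call.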
🧪 Training Data Creation & Local Review Process
The training data used for LusakaLang was primarily AI-generated synthetic text, created to simulate ride-hailing user reviews and feedback common in Zambian contexts (e.g. complaints, compliments, and service issues).
To ensure linguistic authenticity and cultural relevance, all synthetic samples were reviewed, corrected, and refined by a native Zambian speaker. This human-in-the-loop review process focused on:
- Correcting unnatural or non-local phrasing introduced by AI generation
- Aligning expressions with Lusaka-style English, Bemba, and Nyanja usage
- Ensuring realistic code-switching patterns (English ↔ Bemba ↔ Nyanja)
- Improving local idioms, slang, and pragmatic meaning
This hybrid approach combines the scalability of AI-generated data with human linguistic expertise, resulting in training samples that better reflect real-world ride-hailing communication in Lusaka.
Note: While the dataset is synthetic, linguistic patterns were intentionally grounded in local Zambian speech norms through native-speaker validation.
📊 Evaluation Results (Validation Set)
The model was evaluated on a held-out validation set after fine-tuning.
| Metric | Score |
|---|---|
| Accuracy | 99.1% |
| Precision | 99.0% |
| Recall | 99.0% |
| Macro F1 | 99.0% |
| Micro F1 | 99.1% |
| Val Loss | 0.10 |
These results indicate strong performance on the held-out validation set.
Macro and Micro F1 scores are closely aligned, indicating balanced performance across all topic classes.
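To make the macro/micro distinction concrete, the following pure-Python sketch computes accuracy, macro F1, and micro F1 from gold and predicted labels. The toy labels are invented for illustration and are not drawn from the LusakaLang validation set; in practice a library such as scikit-learn would normally be used instead.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (accuracy, macro_f1, micro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[g] += 1  # the true class g was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Macro F1: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro F1: pool counts over all classes; equals accuracy for single-label tasks.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    accuracy = sum(tp.values()) / len(gold)
    return accuracy, macro, micro


# Hypothetical toy labels, for illustration only.
gold = ["payment_issues", "payment_issues", "driver_behaviour", "customer_support"]
pred = ["payment_issues", "driver_behaviour", "driver_behaviour", "customer_support"]
accuracy, macro_f1, micro_f1 = f1_scores(gold, pred)
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} micro_f1={micro_f1:.3f}")
# prints: accuracy=0.750 macro_f1=0.778 micro_f1=0.750
```

For single-label classification, micro F1 always equals accuracy, which is consistent with both being reported as 99.1% in the table above. A macro F1 far below micro F1 would suggest that one or more classes are handled poorly.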
💡 Why Kelvinmbewe/mbert_LusakaLang_Topic?
✅ Better Understanding of Zambian English
Examples:
- “I’m just there”
- “I’m not fine but I’m okay”
- “I’m feeling somehow”
- “Believe you me”
- “Me I tell you the truth”
- “It’s just temporal”
✅ Better Handling of Bemba & Nyanja Idioms
Examples:
- “Nimvela bwino” → positive context
- “Nimvelako bwino pangono pangono” → neutral context
- “Nima one naiwe” → negative context
- “Sima one naiwe” → positive context
✅ Strong Code-Switching Support
Common patterns:
- English + Bemba
- English + Nyanja
- English + slang
- English + Bemba + Nyanja
🚀 Intended Use
mbert_LusakaLang_Topic is intended for:
- Ride-hailing customer feedback analysis
- Topic classification of Zambian social media text
- Customer support automation
- Research on African multilingual NLP and code-switching
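For feedback analysis, per-review predictions are usually aggregated into topic shares. The sketch below assumes predictions have already been produced as (text, topic) pairs; the function name `topic_share` and the sample pairs are hypothetical.

```python
from collections import Counter


def topic_share(predictions):
    """Return {topic: fraction of reviews} from an iterable of (text, topic) pairs."""
    counts = Counter(topic for _, topic in predictions)
    total = sum(counts.values())
    if not total:
        return {}
    # most_common() orders topics from most to least frequent.
    return {topic: n / total for topic, n in counts.most_common()}


# Hypothetical predictions, for illustration only.
sample = [
    ("review 1", "driver_behaviour"),
    ("review 2", "payment_issues"),
    ("review 3", "driver_behaviour"),
    ("review 4", "customer_support"),
]
shares = topic_share(sample)
print(shares)
```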
⚠️ Limitations
- The training data is primarily synthetic, despite native-speaker review.
- Performance may degrade on:
  - Slang or expressions not represented in the dataset
  - Text from regions outside Lusaka
  - Domains unrelated to ride-hailing or customer feedback
Future versions aim to incorporate larger volumes of real-world annotated data.
🙌 Acknowledgements
Special thanks to native Zambian language contributors who helped ensure local linguistic accuracy and cultural relevance in the training data.
Base model: google-bert/bert-base-multilingual-cased



