---
language:
- en
- bem
- ny
tags:
- multi-task
- sentiment-analysis
- topic-classification
- language-identification
- multilingual
- transformer
- zambia
- lusaka
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
model-index:
- name: LusakaLang-MultiTask
  results:
  - task:
      type: text-classification
      name: Language Identification
    dataset:
      name: LusakaLang Language Data
      type: lusakalang
      split: test
    metrics:
    - type: accuracy
      value: 0.97
      name: accuracy
    - type: f1
      value: 0.96
      name: f1_macro
    - type: accuracy
      value: 0.9322
      name: accuracy
    - type: f1
      value: 0.9216
      name: f1_macro
    - type: f1
      value: 0.8649
      name: f1_negative
    - type: f1
      value: 0.95
      name: f1_neutral
    - type: f1
      value: 0.95
      name: f1_positive
    - type: accuracy
      value: 0.91
      name: accuracy
    - type: f1
      value: 0.9
      name: f1_macro
base_model:
- Kelvinmbewe/mbert_Lusaka_Language_Analysis
- Kelvinmbewe/mbert_LusakaLang_Sentiment_Analysis
- Kelvinmbewe/mbert_LusakaLang_Topic
---

## **LusakaLang MultiTask Model**

This model is a unified transformer built on top of `bert-base-multilingual-cased` that performs three tasks simultaneously:

1. Language Identification
2. Sentiment Analysis
3. Topic Classification

It integrates three fine-tuned LusakaLang checkpoints:

- mbert_Lusaka_Language_Analysis
- mbert_LusakaLang_Sentiment_Analysis
- mbert_LusakaLang_Topic

All tasks share a single mBERT encoder with three independent classifier heads. Because the encoder runs once per input rather than once per task, this design reduces compute and memory overhead, and all three heads classify the same representation of the text, which keeps predictions consistent across tasks.
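
The shared-encoder layout described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: `MultiTaskClassifier` and `TinyEncoder` are hypothetical names, the label counts (4, 3, 5) are assumptions read off the class names listed elsewhere in this card, and `TinyEncoder` is only a stand-in so the sketch runs without downloading mBERT weights.

```python
from types import SimpleNamespace

import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Stand-in for the mBERT encoder (assumption: used only to keep the sketch offline)."""

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        # Mimic the `.last_hidden_state` attribute of a Hugging Face encoder output.
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


class MultiTaskClassifier(nn.Module):
    """Hypothetical sketch: one shared encoder feeding one linear head per task."""

    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        # In practice this could be AutoModel.from_pretrained("bert-base-multilingual-cased").
        self.encoder = encoder
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in num_labels.items()}
        )

    def forward(self, input_ids, attention_mask=None):
        # Run the encoder once and reuse its output for every task.
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # representation of the first ([CLS]) token
        return {task: head(cls_vector) for task, head in self.heads.items()}


model = MultiTaskClassifier(
    TinyEncoder(),
    hidden_size=16,
    num_labels={"language": 4, "sentiment": 3, "topic": 5},  # assumed label counts
)
logits = model(torch.randint(0, 100, (2, 8)))  # batch of 2 sequences, 8 token ids each
print({task: tuple(t.shape) for task, t in logits.items()})
```

Each head returns one row of logits per input text, so a single encoder pass yields predictions for all three tasks.
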

---

## **Why This Model Matters**

Zambian communication is inherently multilingual, fluid, and deeply shaped by context. A single message may blend English, Bemba, Nyanja, and local slang, with frequent code-switching, culturally grounded idioms, and subtle emotional cues. This model is designed for that environment, where meaning depends not only on the words used but on how languages interact within a single utterance.

It is built to identify the dominant language or detect when several languages are mixed, to interpret sentiment even when it is conveyed indirectly or through culturally specific phrasing, and to classify text into practical topics such as driver behaviour, payment issues, app performance, customer support, and ride availability. Capturing these nuances gives a more accurate, context-aware reading of real Zambian communication.

---

## **How to Use This Model**

```python
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
import torch


class LusakaLangMultiTask:
    def __init__(self, path="Kelvinmbewe/LusakaLang-MultiTask"):
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # torch.load needs a local file, so fetch model.pt from the Hub repo first.
        model_file = hf_hub_download(repo_id=path, filename="model.pt")
        # Assumes model.pt is a fully pickled module whose class is importable here.
        self.model = torch.load(model_file, map_location="cpu", weights_only=False).eval()

    # Placeholders: each method should tokenize `texts` and decode the matching head.
    def predict_language(self, texts): pass
    def predict_sentiment(self, texts): pass
    def predict_topic(self, texts): pass


llm = LusakaLangMultiTask()

print(llm.predict_language([...]))
print(llm.predict_sentiment([...]))
print(llm.predict_topic([...]))
```

## Sample Output

```python
# Language Identification
[
    {"lang": "Bemba", "conf": 0.96},
    {"lang": "Nyanja", "conf": 0.95},
    {"lang": "English", "conf": 0.99}
]
# Sentiment ❤️
[
    {"sent": "Negative", "conf": 0.98},
    {"sent": "Positive", "conf": 0.95},
    {"sent": "Neutral", "conf": 0.87}
]
# Topic
[
    {"topic": "Payment Issue", "conf": 0.97},
    {"topic": "Customer Support", "conf": 0.95},
    {"topic": "Driver Behaviour", "conf": 0.96}
]
```
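
Records like the ones above could be produced from a classifier head's raw logits along these lines. This is a sketch under assumptions: that `conf` is the softmax probability of the top class, and that the label order is as written in the hypothetical `SENTIMENT_LABELS` list. Neither is confirmed by this card, and `to_prediction` is an illustrative helper, not part of the model's API.

```python
import math


def to_prediction(logits, labels, key):
    """Turn one row of logits into a {key: label, "conf": probability} record."""
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index of the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {key: labels[best], "conf": round(probs[best], 2)}


# Assumed label order, for illustration only.
SENTIMENT_LABELS = ["Negative", "Neutral", "Positive"]

pred = to_prediction([2.0, 0.1, -1.0], SENTIMENT_LABELS, "sent")
print(pred)
```

The same helper works for the language and topic heads by passing their label lists and the `"lang"` or `"topic"` key.
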

```
=========================== Training Architecture ===========================

📥 Input              →  🧠 Core Engine             →  Output
------------------------------------------------------------------------------------
Text (Any Language)   →  Tokenizer 🔤               →  Language
                      →  Shared mBERT Encoder 🧠    →  Bemba / Nyanja /
                      →  CLS Vector 🎯              →  English / Mixed
------------------------------------------------------------------------------------
User Feedback 💬      →  Tokenizer 🔤               →  Sentiment ❤️
                      →  Shared Encoder 🧠          →  Negative / Neutral /
                      →  CLS Vector 🎯              →  Positive
------------------------------------------------------------------------------------
Ride Context          →  Tokenizer 🔤               →  Topic
                      →  Shared Encoder 🧠          →  Driver / Payment /
                      →  CLS Vector 🎯              →  Support / App / Availability
------------------------------------------------------------------------------------
```