# mbert_LusakaLang Language Analysis Model

## Model Overview
LusakaLang is a multilingual sentiment classification model fine-tuned from `google-bert/bert-base-multilingual-cased` (mBERT).
It is designed specifically for Zambian language usage, with a focus on:
- Zambian English (Lusaka variety)
- Bemba
- Nyanja (Chichewa)
The model captures code-switching, local idioms, indirect expressions, and sarcasm commonly used in everyday communication and social media in Zambia.
## Supported Languages
- English (Zambian English)
- Bemba
- Nyanja (Chichewa)
## Task

Sentiment classification. The example below loads the model as a `text-classification` pipeline and classifies a few sample inputs:
```python
from transformers import pipeline

# Load the fine-tuned model as a text-classification pipeline.
classifier = pipeline(
    "text-classification",
    model="Kelvinmbewe/mbert_LusakaLang_Language_Analysis",
)

def classify_text(text):
    """
    Run inference on a single text input using the fine-tuned LusakaLang model.
    Returns the predicted label and confidence score.
    """
    result = classifier(text)[0]
    label = result["label"]
    score = round(result["score"], 4)
    return label, score

samples = [
    "Muli shani bane, nalishiba bwino.",
    "How are you doing today?",
    "Tili bwino, zikomo kwambiri."
]

for s in samples:
    label, score = classify_text(s)
    print(f"Text: {s}\nPrediction: {label} (confidence={score})\n")
```
## Training Data
LusakaLang was fine-tuned using Zambia-focused multilingual datasets:
- English–Chichewa Sentence Pairs (MT560)
- Code-170k-Bemba
- BEMBA_big_c
These datasets enable strong performance on:
- Informal and conversational text
- Code-switched language
- Culturally specific idioms and phrasing
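To assemble similar training data, a minimal sketch with the `datasets` library follows; the Hub IDs below are hypothetical placeholders, since the exact repositories for these corpora are not listed here:

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical Hub IDs for illustration only; substitute the real
# repositories for MT560, Code-170k-Bemba, and BEMBA_big_c.
sources = [
    "your-org/mt560-english-nyanja",
    "your-org/code-170k-bemba",
    "your-org/bemba-big-c",
]
# Note: column names/features must match across sources before concatenation.
parts = [load_dataset(name, split="train") for name in sources]
train_data = concatenate_datasets(parts)
```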
## 📊 Evaluation Results (mBERT)

The model was evaluated on a multilingual test split.
### Overall Performance
- Test Loss: 0.0039
- Accuracy: 99.73%
- Precision: 99.73%
- Recall: 99.73%
- Macro F1: 99.78%
### Language-Specific F1 Scores
- Bemba: 99.95%
- Nyanja: 99.69%
- English: 99.70%
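Per-language scores like these can be reproduced by scoring each language's subset separately; a sketch with scikit-learn, assuming `labels`, `preds`, and a parallel `langs` list (none of which are provided here), with macro averaging as an assumption:

```python
from sklearn.metrics import accuracy_score, f1_score

# Assumed inputs: gold labels, model predictions, and a language tag per example.
overall_acc = accuracy_score(labels, preds)
macro_f1 = f1_score(labels, preds, average="macro")

for lang in ["bemba", "nyanja", "english"]:
    idx = [i for i, l in enumerate(langs) if l == lang]
    lang_f1 = f1_score(
        [labels[i] for i in idx],
        [preds[i] for i in idx],
        average="macro",
    )
    print(f"{lang}: F1 = {lang_f1:.4f}")
```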
## Training Configuration
- num_train_epochs: 1
- learning_rate: 2e-5
- batch_size: 32
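A minimal sketch of a `Trainer` setup matching these hyperparameters (data preparation omitted; `train_dataset` is assumed to be a tokenized dataset, and the label count is an assumption):

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-multilingual-cased",
    num_labels=3,  # assumption: adjust to the actual sentiment label set
)

args = TrainingArguments(
    output_dir="mbert_lusakalang",
    num_train_epochs=1,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```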
## ⚡ Model Runtime and Performance Notes
**Model size:** mbert_LusakaLang is based on `google-bert/bert-base-multilingual-cased` (~178M parameters). Because of this, inference on large datasets can take longer than with smaller models.
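To verify the parameter count locally (assuming `model` is the loaded `AutoModelForSequenceClassification`):

```python
# Count the model's parameters (~178M expected for mBERT-base).
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```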
**Typical runtime:**
- CPU: 10–30 examples per second. Predicting thousands of examples may take 20–30+ minutes.
- GPU: 200–600 examples per second. Full test datasets usually process in under a minute.
**How to speed up inference:**

1. **Use a GPU.** Ensure PyTorch detects your GPU:

   ```python
   import torch

   device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
   model.to(device)
   ```

2. **Increase batch size.** Adjust the evaluation batch size to make better use of GPU memory:

   ```python
   from transformers import TrainingArguments

   eval_args = TrainingArguments(
       output_dir="eval_output",
       per_device_eval_batch_size=64,
   )
   ```

3. **Disable gradients during prediction.** This prevents unnecessary computation:

   ```python
   with torch.no_grad():
       predictions = trainer.predict(dataset["test"])
   ```

4. **Use mixed precision (fp16) on GPU.** This speeds up computation and reduces memory usage:

   ```python
   model.half()
   ```
**Summary:** If you see long inference times (e.g., 30 minutes), it's likely because predictions are running on CPU with small batch sizes. Using a GPU with an optimized batch size drastically reduces runtime.
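Putting these tips together, a minimal GPU inference loop might look like the sketch below (`texts` is an assumed list of inputs; the model ID is as in the Task section):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Kelvinmbewe/mbert_LusakaLang_Language_Analysis"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
if device.type == "cuda":
    model.half()  # fp16 only makes sense on GPU
model.eval()

texts = ["Muli shani bane, nalishiba bwino."] * 1000  # assumed workload

batch_size = 64
predictions = []
with torch.no_grad():
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(
            texts[i:i + batch_size],
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(device)
        logits = model(**batch).logits
        predictions.extend(logits.argmax(dim=-1).tolist())
```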