mBERT LusakaLang Language Analysis Model

Model Overview

LusakaLang is a multilingual sentiment classification model fine-tuned from
google-bert/bert-base-multilingual-cased (mBERT).

It is designed specifically for Zambian language usage, with a focus on:

  • Zambian English (Lusaka variety)
  • Bemba
  • Nyanja (Chichewa)

The model captures code-switching, local idioms, indirect expressions, and sarcasm commonly used in everyday communication and social media in Zambia.


Supported Languages

  • English (Zambian English)
  • Bemba
  • Nyanja (Chichewa)

Task

Sentiment classification of Zambian multilingual text. Example usage with the Transformers pipeline:

from transformers import pipeline

# Load the fine-tuned model from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="Kelvinmbewe/mbert_LusakaLang_Language_Analysis",
)

def classify_text(text):
    """
    Run inference on a single text input using the fine-tuned LusakaLang model.
    Returns the predicted label and confidence score.
    """
    result = classifier(text)[0]
    label = result["label"]
    score = round(result["score"], 4)
    return label, score


samples = [
    "Muli shani bane, nalishiba bwino.",
    "How are you doing today?",
    "Tili bwino, zikomo kwambiri."
]

for s in samples:
    label, score = classify_text(s)
    print(f"Text: {s}\nPrediction: {label} (confidence={score})\n")
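
The pipeline also accepts a list of inputs, which is faster than looping one text at a time. A small helper sketches batched use (the name classify_batch is illustrative, not part of the released model; it assumes a text-classification pipeline like the one above):

```python
def classify_batch(classifier, texts, batch_size=32):
    """Run a text-classification pipeline over a list of texts in batches.

    `batch_size` is forwarded to the pipeline call; larger values mainly
    help on GPU, where they amortize per-batch overhead.
    """
    results = classifier(texts, batch_size=batch_size)
    return [(r["label"], round(r["score"], 4)) for r in results]
```

With the classifier pipeline defined above, classify_batch(classifier, samples) returns one (label, score) pair per input.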

Training Data

LusakaLang was fine-tuned using Zambia-focused multilingual datasets:

  • English–Chichewa Sentence Pairs (MT560)
  • Code-170k-Bemba
  • BEMBA_big_c

These datasets enable strong performance on:

  • Informal and conversational text
  • Code-switched language
  • Culturally specific idioms and phrasing

📊 Evaluation Results (mBERT)

The model was evaluated on a held-out multilingual test split covering all three supported languages.

Overall Performance

  • Test Loss: 0.0039
  • Accuracy: 99.73%
  • Precision: 99.73%
  • Recall: 99.73%
  • Macro F1: 99.78%

Language-Specific F1 Scores

  • Bemba: 99.95%
  • Nyanja: 99.69%
  • English: 99.70%
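
As a quick arithmetic sanity check, the macro F1 reported above coincides with the unweighted mean of the three per-language F1 scores:

```python
# Per-language F1 scores reported above, in percent.
f1_per_language = {"Bemba": 99.95, "Nyanja": 99.69, "English": 99.70}

# Unweighted mean across the three languages.
mean_f1 = sum(f1_per_language.values()) / len(f1_per_language)
print(round(mean_f1, 2))  # → 99.78
```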

Training Configuration

  • num_train_epochs: 1
  • learning_rate: 2e-5
  • batch_size: 32
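
Expressed as Transformers TrainingArguments, the configuration above looks roughly as follows (a minimal sketch: the output_dir path is illustrative, and mapping "batch_size" to per_device_train_batch_size is an assumption):

```python
from transformers import TrainingArguments

# Hyperparameters from the Training Configuration above.
training_args = TrainingArguments(
    output_dir="./lusakalang-mbert",   # illustrative path, not from the model card
    num_train_epochs=1,
    learning_rate=2e-5,
    per_device_train_batch_size=32,    # assumes "batch_size: 32" is per device
)
```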



⚡ Model Runtime and Performance Notes

  • Model size: mbert_LusakaLang is based on "google-bert/bert-base-multilingual-cased" (~178M parameters). As a result, inference on large datasets can take longer than with smaller models.

  • Typical runtime:

    • CPU: 10–30 examples per second. Predicting thousands of examples may take 20–30+ minutes.
    • GPU: 200–600 examples per second. Full test datasets usually process in under a minute.
  • How to speed up inference:

    1. Use a GPU: Ensure PyTorch detects your GPU:

      import torch
      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
      model.to(device)
      
    2. Increase batch size: Adjust the evaluation batch size to make better use of GPU memory:

      from transformers import TrainingArguments
      eval_args = TrainingArguments(output_dir="eval_out",  # illustrative path
                                    per_device_eval_batch_size=64)
      
    3. Disable gradients during prediction: This prevents unnecessary computation:

      with torch.no_grad():
          predictions = trainer.predict(dataset["test"])
      
    4. Use mixed precision (fp16) on GPU: Speeds up computation and reduces memory usage:

      model.half()
      
  • Summary: If you see long inference times (e.g., 30 minutes), it’s likely because predictions are running on CPU with small batch sizes. Using a GPU with optimized batch size drastically reduces runtime.
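
The four tips above combine into a single inference path. A minimal sketch (the helper names load_for_inference and predict_batch are illustrative, not part of the released model):

```python
import torch

MODEL_ID = "Kelvinmbewe/mbert_LusakaLang_Language_Analysis"

def load_for_inference(model_id=MODEL_ID):
    """Tips 1 and 4: place the model on GPU and switch to fp16 when available."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
    if device.type == "cuda":
        model.half()  # fp16 is a GPU-only optimization
    return tokenizer, model.eval(), device

def predict_batch(texts, tokenizer, model, device, batch_size=64):
    """Tips 2 and 3: batch the inputs and disable gradient tracking."""
    labels = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            logits = model(**batch).logits
        labels.extend(logits.argmax(dim=-1).tolist())
    return labels
```

After tokenizer, model, device = load_for_inference(), calling predict_batch(texts, tokenizer, model, device) returns one predicted class index per input.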



Model repository: Kelvinmbewe/mbert_LusakaLang_Language_Analysis (Safetensors, ~0.2B parameters, F32 tensors)