ZambiaLLM

  • Version: ZambiaLLM_v2026_03_28
  • Base Model: bert-base-multilingual-cased
  • Architecture: mBERT (Transformer Encoder)
  • Fine-Tuning Method: QLoRA (4-bit Quantized Low-Rank Adaptation)
  • Author: Kelvin Mbewe
  • Framework: HuggingFace Transformers (PyTorch)
  • License: Apache 2.0


Model Overview

ZambiaLLM is a multilingual language representation and classification model developed to support and preserve languages spoken in Zambia, including low-resource and endangered languages.

The project is guided by the following long-term objectives:

  • To digitally represent all 73 recognized Zambian languages within a unified AI framework.
  • To build foundational natural language processing (NLP) infrastructure for underrepresented and low-resource languages.
  • To contribute to the long-term preservation and accessibility of Zambian linguistic heritage.
  • To ensure inclusion of minority languages with limited speaker populations, preventing digital extinction.
  • To incorporate and support Zambian Sign Language as a core component of the model.
  • To enable voice generation and speech technologies across all targeted languages to support learning and accessibility.
  • To expand support for multimodal communication, including sign language representation and integration.
  • To facilitate real-world applications, including mobile platforms (iOS and Android), enabling speech-to-text and text-to-speech capabilities for improved accessibility, particularly for deaf and hard-of-hearing users.

Languages Covered (Long-Term Target: 73)

ZambiaLLM is designed with the long-term objective of digitally representing all 73 recognized Zambian languages.

⚠ Current fine-tuning covers the following 29 languages:

  • Bemba (Cibemba)
  • English
  • Kaonde (Kikaonde)
  • Lozi (Silozi)
  • Lunda (Cilunda)
  • Luvale (Siluvale)
  • Nyanja (Cinyanja)
  • Shona (Cishona)
  • Tonga (Chitonga)
  • Tumbuka (Chitumbuka)
  • Zambian Slang
  • Chokwe
  • Congo-Swahili
  • Icikuhane (Subiya)
  • Kunda (Cikunda)
  • Lala (Bisa)
  • Lambya (Chilambya)
  • Lenje (Cilenje)
  • Luchazi (Ciluchazi)
  • Mambwe-Lungu (Cimambwe)
  • Mashi (Kwandu)
  • Mbukushu (Thimbukushu)
  • Mbunda
  • Nkoya
  • Nsenga (Cinsenga)
  • Namwanga (Cinamwanga)
  • Nyiha
  • Soli (Cisoli)
  • Taabwa

Population Distribution Reference (Estimated)

Language % of Population
Bemba (Cibemba) 22.5%
Tonga (Chitonga) 9.5%
Nyanja (Cinyanja) 8.0%
Tumbuka (Chitumbuka) 7.5%
Lozi (Silozi) 4.5%
Lunda (Cilunda) 3.5%
Luvale (Siluvale) 2.8%
Nsenga (Cinsenga) 2.5%
Namwanga (Cinamwanga) 2.3%
Kaonde (Kikaonde) 1.8%
Lenje (Cilenje) 1.5%
Lala (Bisa) 1.5%
Nkoya 1.2%
Mbunda 1.2%
Mambwe-Lungu (Cimambwe) 1.2%
Chokwe 1.0%
Lambya (Chilambya) 1.0%
Luchazi (Ciluchazi) 0.9%
Soli (Cisoli) 0.8%
Nyiha 0.8%
Congo-Swahili 0.7%
Taabwa 0.6%
Mbukushu (Thimbukushu) 0.6%
Mashi (Kwandu) 0.5%
Kunda (Cikunda) 0.5%
Kuhane (Subiya) 0.5%
Shona (Cishona) 0.2%
English 0.5%
Zambian Slang 0.9%

Total Population Reference: approximately 21,000,000. Percentages are rough estimates of each language's share of this total and do not sum to 100%.


Future Standardization Plan

Future releases will:

  • Align all languages with ISO 639-3 codes (a partial mapping sketch follows this list)
  • Merge dialect duplicates
  • Provide speaker population estimates
  • Document orthographic variants
  • Identify endangered vs stable language categories
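
As a rough illustration of that ISO 639-3 alignment, the partial mapping below pairs some of the currently covered languages with their ISO 639-3 codes. It is a minimal sketch: the dictionary name and the selection of languages are illustrative, and no such mapping ships with the current release.

# Partial, illustrative mapping of covered languages to ISO 639-3 codes.
# The dictionary name and coverage are examples only; the released model
# does not yet include this mapping.
ISO_639_3 = {
    "Bemba (Cibemba)": "bem",
    "Tonga (Chitonga)": "toi",
    "Nyanja (Cinyanja)": "nya",
    "Tumbuka (Chitumbuka)": "tum",
    "Lozi (Silozi)": "loz",
    "Lunda (Cilunda)": "lun",
    "Luvale (Siluvale)": "lue",
    "Kaonde (Kikaonde)": "kqn",
    "Nsenga (Cinsenga)": "nse",
    "English": "eng",
}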

Base Architecture

ZambiaLLM is built on:

bert-base-multilingual-cased

  • 12-layer Transformer encoder
  • 768 hidden dimensions
  • 12 attention heads
  • ~178M parameters (full model)
  • Pretrained on 104 languages using masked language modeling (MLM)

Instead of full fine-tuning, ZambiaLLM uses QLoRA (4-bit quantized training with low-rank adapters) for efficient adaptation.
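
The snippet below is a minimal sketch of how such a QLoRA setup can be configured with HuggingFace Transformers, PEFT, and bitsandbytes. The hyperparameters, target modules, and label count are assumptions for illustration, not the exact recipe used to train ZambiaLLM.

# Minimal QLoRA sketch: 4-bit quantized mBERT base plus low-rank adapters.
# r, lora_alpha, target_modules and num_labels are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=29,                          # one label per covered language (assumed)
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],      # BERT self-attention projections
    task_type="SEQ_CLS",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()          # only adapters and the classifier head train

Keeping the quantized base weights frozen and training only the low-rank adapters is what makes adaptation feasible on a single modest GPU.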


Intended Use

ZambiaLLM is intended for:

  • Language identification
  • Low-resource NLP experimentation
  • Linguistic preservation research
  • Academic study
  • Cultural documentation initiatives
  • Foundation model adaptation for Zambia-specific NLP tasks

ZambiaLLM is not designed for generative dialogue or translation in its current form.
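
For the language-identification use case, the sketch below shows one way to query the published checkpoint through the Transformers pipeline API. It assumes the checkpoint exposes a sequence-classification head whose labels name the covered languages; the example sentences are only indicative.

# Minimal inference sketch for language identification.
# Assumes the checkpoint's id2label mapping names the covered languages.
from transformers import pipeline

classifier = pipeline("text-classification", model="Kelvinmbewe/ZambiaLLM")

print(classifier("Muli bwanji?"))          # e.g. a Nyanja greeting
print(classifier("Mwashibukeni mukwai."))  # e.g. a Bemba greeting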

Sources

Additional training data was derived from:

@inproceedings{sikasote23_interspeech,
  author={Claytone Sikasote and Kalinda Siaminwe and Stanly Mwape and Bangiwe Zulu and Mofya Phiri and Martin Phiri and David Zulu and Mayumbo Nyirenda and Antonios Anastasopoulos},
  title={{Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={3984--3988},
  doi={10.21437/Interspeech.2023-1979}
}