ZambiaLLM

  • Version: ZambiaLLM_v2026_03_28
  • Base Model: bert-base-multilingual-cased
  • Architecture: mBERT (Transformer Encoder)
  • Fine-Tuning Method: QLoRA (4-bit Quantized Low-Rank Adaptation)
  • Author: Kelvin Mbewe
  • Framework: HuggingFace Transformers (PyTorch)
  • License: Apache 2.0


Model Overview

ZambiaLLM is a multilingual language representation and classification model developed to support and preserve languages spoken in Zambia, including low-resource and endangered languages.

The project is guided by the following long-term objectives:

  • To digitally represent all 73 recognized Zambian languages within a unified AI framework.
  • To build foundational natural language processing (NLP) infrastructure for underrepresented and low-resource languages.
  • To contribute to the long-term preservation and accessibility of Zambian linguistic heritage.
  • To ensure inclusion of minority languages with limited speaker populations, preventing digital extinction.
  • To incorporate and support Zambian Sign Language as a core component of the model.
  • To enable voice generation and speech technologies across all targeted languages to support learning and accessibility.
  • To expand support for multimodal communication, including sign language representation and integration.
  • To facilitate real-world applications, including mobile platforms (iOS and Android), enabling speech-to-text and text-to-speech capabilities for improved accessibility, particularly for deaf and hard-of-hearing users.

Languages Covered (Long-Term Target: 73)

ZambiaLLM is designed with the long-term objective of digitally representing all 73 recognized Zambian languages.

⚠ Current fine-tuning covers the following 29 languages:

  • Bemba (Cibemba)
  • English
  • Kaonde (Kikaonde)
  • Lozi (Silozi)
  • Lunda (Cilunda)
  • Luvale (Siluvale)
  • Nyanja (Cinyanja)
  • Shona (Cishona)
  • Tonga (Chitonga)
  • Tumbuka (Chitumbuka)
  • Zambian Slang
  • Chokwe
  • Congo-Swahili
  • Icikuhane (Subiya)
  • Kunda (Cikunda)
  • Lala (Bisa)
  • Lambya (Chilambya)
  • Lenje (Cilenje)
  • Luchazi (Ciluchazi)
  • Mambwe-Lungu (Cimambwe)
  • Mashi (Kwandu)
  • Mbukushu (Thimbukushu)
  • Mbunda
  • Nkoya
  • Nsenga (Cinsenga)
  • Namwanga (Cinamwanga)
  • Nyiha
  • Soli (Cisoli)
  • Taabwa

Population Distribution Reference (Estimated)

Language % of Population
Bemba (Cibemba) 22.5%
Tonga (Chitonga) 9.5%
Nyanja (Cinyanja) 8.0%
Tumbuka (Chitumbuka) 7.5%
Lozi (Silozi) 4.5%
Lunda (Cilunda) 3.5%
Luvale (Siluvale) 2.8%
Nsenga (Cinsenga) 2.5%
Namwanga (Cinamwanga) 2.3%
Kaonde (Kikaonde) 1.8%
Lenje (Cilenje) 1.5%
Lala (Bisa) 1.5%
Nkoya 1.2%
Mbunda 1.2%
Mambwe-Lungu (Cimambwe) 1.2%
Chokwe 1.0%
Lambya (Chilambya) 1.0%
Luchazi (Ciluchazi) 0.9%
Soli (Cisoli) 0.8%
Nyiha 0.8%
Congo-Swahili 0.7%
Taabwa 0.6%
Mbukushu (Thimbukushu) 0.6%
Mashi (Kwandu) 0.5%
Kunda (Cikunda) 0.5%
Kuhane (Subiya) 0.5%
Shona (Cishona) 0.2%
English 0.5%
Zambian Slang 0.9%

Total Population Reference: approximately 21,000,000. Percentages are rough estimates of each language's share of this total and do not sum to 100%.


Future Standardization Plan

Future releases will:

  • Align all languages with ISO 639-3 codes (a partial mapping sketch follows this list)
  • Merge dialect duplicates
  • Provide speaker population estimates
  • Document orthographic variants
  • Identify endangered vs stable language categories
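
As a rough illustration of that ISO 639-3 alignment, the partial mapping below pairs some of the currently covered languages with their ISO 639-3 codes. It is a minimal sketch: the dictionary name and the selection of languages are illustrative, and no such mapping ships with the current release.

# Partial, illustrative mapping of covered languages to ISO 639-3 codes.
# The dictionary name and coverage are examples only; the released model
# does not yet include this mapping.
ISO_639_3 = {
    "Bemba (Cibemba)": "bem",
    "Tonga (Chitonga)": "toi",
    "Nyanja (Cinyanja)": "nya",
    "Tumbuka (Chitumbuka)": "tum",
    "Lozi (Silozi)": "loz",
    "Lunda (Cilunda)": "lun",
    "Luvale (Siluvale)": "lue",
    "Kaonde (Kikaonde)": "kqn",
    "Nsenga (Cinsenga)": "nse",
    "English": "eng",
}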

Base Architecture

ZambiaLLM is built on:

bert-base-multilingual-cased

  • 12-layer Transformer encoder
  • 768 hidden dimensions
  • 12 attention heads
  • ~178M parameters (full model)
  • Pretrained on 104 languages using masked language modeling (MLM)

Instead of full fine-tuning, ZambiaLLM uses QLoRA (4-bit quantized training with low-rank adapters) for efficient adaptation.
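
The snippet below is a minimal sketch of how such a QLoRA setup can be configured with HuggingFace Transformers, PEFT, and bitsandbytes. The hyperparameters, target modules, and label count are assumptions for illustration, not the exact recipe used to train ZambiaLLM.

# Minimal QLoRA sketch: 4-bit quantized mBERT base plus low-rank adapters.
# r, lora_alpha, target_modules and num_labels are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=29,                          # one label per covered language (assumed)
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],      # BERT self-attention projections
    task_type="SEQ_CLS",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()          # only adapters and the classifier head train

Keeping the quantized base weights frozen and training only the low-rank adapters is what makes adaptation feasible on a single modest GPU.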


Intended Use

ZambiaLLM is intended for:

  • Language identification
  • Low-resource NLP experimentation
  • Linguistic preservation research
  • Academic study
  • Cultural documentation initiatives
  • Foundation model adaptation for Zambia-specific NLP tasks

ZambiaLLM is not designed for generative dialogue or translation in its current form.
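
For the language-identification use case, the sketch below shows one way to query the published checkpoint through the Transformers pipeline API. It assumes the checkpoint exposes a sequence-classification head whose labels name the covered languages; the example sentences are only indicative.

# Minimal inference sketch for language identification.
# Assumes the checkpoint's id2label mapping names the covered languages.
from transformers import pipeline

classifier = pipeline("text-classification", model="Kelvinmbewe/ZambiaLLM")

print(classifier("Muli bwanji?"))          # e.g. a Nyanja greeting
print(classifier("Mwashibukeni mukwai."))  # e.g. a Bemba greeting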

Sources

Additional training data was derived from:

@inproceedings{sikasote23_interspeech,
  author={Claytone Sikasote and Kalinda Siaminwe and Stanly Mwape and Bangiwe Zulu and Mofya Phiri and Martin Phiri and David Zulu and Mayumbo Nyirenda and Antonios Anastasopoulos},
  title={{Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={3984--3988},
  doi={10.21437/Interspeech.2023-1979}
}