ZambiaLLM
Version: ZambiaLLM_v2026_03_28
Base Model: bert-base-multilingual-cased
Architecture: mBERT (Transformer Encoder)
Fine-Tuning Method: QLoRA (4-bit Quantized Low-Rank Adaptation)
Author: Kelvin Mbewe
Framework: HuggingFace Transformers (PyTorch)
License: Apache 2.0
Model Overview
ZambiaLLM is a multilingual language representation and classification model developed to support and preserve languages spoken in Zambia, including low-resource and endangered languages.
The project is guided by the following long-term objectives:
- To digitally represent all 73 recognized Zambian languages within a unified AI framework.
- To build foundational natural language processing (NLP) infrastructure for underrepresented and low-resource languages.
- To contribute to the long-term preservation and accessibility of Zambian linguistic heritage.
- To ensure inclusion of minority languages with limited speaker populations, preventing digital extinction.
- To incorporate and support Zambian Sign Language as a core component of the model.
- To enable voice generation and speech technologies across all targeted languages to support learning and accessibility.
- To expand support for multimodal communication, including sign language representation and integration.
- To facilitate real-world applications, including mobile platforms (iOS and Android), enabling speech-to-text and text-to-speech capabilities for improved accessibility, particularly for deaf and hard-of-hearing users.
Languages Covered (Long-Term Target: 73)
ZambiaLLM is designed with the long-term objective of digitally representing all 73 recognized Zambian languages.
⚠ Current fine-tuning covers 29 languages:
- Bemba (Cibemba)
- English
- Kaonde (Kikaonde)
- Lozi (Silozi)
- Lunda (Cilunda)
- Luvale (Siluvale)
- Nyanja (Cinyanja)
- Shona (Cishona)
- Tonga (Chitonga)
- Tumbuka (Chitumbuka)
- Zambian Slang
- Chokwe
- Congo-Swahili
- Icikuhane (Subiya)
- Kunda (Cikunda)
- Lala (Bisa)
- Lambya (Chilambya)
- Lenje (Cilenje)
- Luchazi (Ciluchazi)
- Mambwe-Lungu (Cimambwe)
- Mashi (Kwandu)
- Mbukushu (Thimbukushu)
- Mbunda
- Nkoya
- Nsenga (Cinsenga)
- Namwanga (Cinamwanga)
- Nyiha
- Soli (Cisoli)
- Taabwa
Population Distribution Reference (Estimated)
| Language | % of Population |
|---|---|
| Bemba (Cibemba) | 22.5% |
| Tonga (Chitonga) | 9.5% |
| Nyanja (Cinyanja) | 8.0% |
| Tumbuka (Chitumbuka) | 7.5% |
| Lozi (Silozi) | 4.5% |
| Lunda (Cilunda) | 3.5% |
| Luvale (Siluvale) | 2.8% |
| Nsenga (Cinsenga) | 2.5% |
| Namwanga (Cinamwanga) | 2.3% |
| Kaonde (Kikaonde) | 1.8% |
| Lenje (Cilenje) | 1.5% |
| Lala (Bisa) | 1.5% |
| Nkoya | 1.2% |
| Mbunda | 1.2% |
| Mambwe-Lungu (Cimambwe) | 1.2% |
| Chokwe | 1.0% |
| Lambya (Chilambya) | 1.0% |
| Luchazi (Ciluchazi) | 0.9% |
| Soli (Cisoli) | 0.8% |
| Nyiha | 0.8% |
| Congo-Swahili | 0.7% |
| Taabwa | 0.6% |
| Mbukushu (Thimbukushu) | 0.6% |
| Mashi (Kwandu) | 0.5% |
| Kunda (Cikunda) | 0.5% |
| Kuhane (Subiya) | 0.5% |
| Shona (Cishona) | 0.2% |
| English | 0.5% |
| Zambian Slang | 0.9% |
Total population reference: approximately 21,000,000. The percentages are rough estimates and do not sum to 100%.
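As a rough illustration, the percentages above can be converted into absolute speaker estimates against the 21,000,000 reference population. This is a minimal sketch: only a few rows of the table are reproduced, and the figures are the document's own estimates, not census data.

```python
# Convert percentage shares from the table above into speaker estimates.
# Only a few rows are reproduced here for illustration.
TOTAL_POPULATION = 21_000_000  # reference total from the table

share_by_language = {
    "Bemba (Cibemba)": 22.5,
    "Tonga (Chitonga)": 9.5,
    "Nyanja (Cinyanja)": 8.0,
    "Lozi (Silozi)": 4.5,
}

for language, pct in share_by_language.items():
    speakers = round(TOTAL_POPULATION * pct / 100)
    print(f"{language}: ~{speakers:,} speakers")
```

For example, the 22.5% Bemba share corresponds to roughly 4.7 million speakers under this reference total.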
Future Standardization Plan
Future releases will:
- Align all languages with ISO 639-3 codes
- Merge dialect duplicates
- Provide speaker population estimates
- Document orthographic variants
- Identify endangered vs stable language categories
Base Architecture
ZambiaLLM is built on:
bert-base-multilingual-cased
- 12-layer Transformer encoder
- 768 hidden dimensions
- 12 attention heads
- ~178M parameters (full model)
- Pretrained on 104 languages using masked language modeling (MLM)
Instead of full fine-tuning, ZambiaLLM uses QLoRA (4-bit quantized training with low-rank adapters) for efficient adaptation.
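The adapter idea behind (Q)LoRA can be sketched in a few lines: the pretrained weight matrix W is frozen, and a trainable low-rank update (alpha/r) · B · A is added on top, so only the small A and B matrices are updated. The sketch below uses mBERT's 768-dimensional hidden size; the rank r = 8 and alpha = 16 are illustrative defaults, not ZambiaLLM's actual training configuration, and QLoRA's 4-bit quantization of the frozen weights is omitted.

```python
import numpy as np

d = 768      # mBERT hidden size
r = 8        # adapter rank (illustrative, not ZambiaLLM's setting)
alpha = 16   # LoRA scaling factor (illustrative)

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialised

def adapted_forward(x):
    """y = x @ (W + (alpha / r) * B @ A).T, without materialising the sum."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
y = adapted_forward(x)  # identical to x @ W.T at init, since B is zero

full = d * d           # parameters a full fine-tune would update in this matrix
lora = r * d + d * r   # parameters the LoRA adapter updates instead
print(f"trainable params: {lora:,} vs {full:,} ({100 * lora / full:.1f}%)")
```

With these toy numbers, the adapter trains about 2% of the parameters of a single 768x768 weight matrix, which is why 4-bit quantized adapter training fits on modest hardware.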
Intended Use
ZambiaLLM is intended for:
- Language identification
- Low-resource NLP experimentation
- Linguistic preservation research
- Academic study
- Cultural documentation initiatives
- Foundation model adaptation for Zambia-specific NLP tasks
ZambiaLLM is not designed for generative dialogue or translation in its current form.
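For language identification, the encoder's pooled output feeds a classification head whose labels are the covered languages. The post-processing step, mapping raw logits to a predicted language, can be sketched as follows; the label list is abbreviated and the logits are dummy values standing in for real model output.

```python
import numpy as np

# The covered languages act as classification labels; only a few are shown.
LABELS = ["Bemba", "English", "Nyanja", "Tonga", "Lozi"]

def predict_language(logits):
    """Apply a numerically stable softmax to the classification head's
    logits and return (label, probability) for the top language."""
    z = np.asarray(logits, dtype=float)
    z -= z.max()  # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    i = int(np.argmax(probs))
    return LABELS[i], float(probs[i])

# Dummy logits standing in for the model's output on a Bemba sentence.
label, p = predict_language([4.1, 0.3, 1.2, 0.5, -0.2])
print(label, round(p, 3))
```

The same pattern extends to the full 29-language label set once the fine-tuned checkpoint is loaded.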
Sources
Training data was derived in part from:
@inproceedings{sikasote23_interspeech,
  author={Claytone Sikasote and Kalinda Siaminwe and Stanly Mwape and Bangiwe Zulu and Mofya Phiri and Martin Phiri and David Zulu and Mayumbo Nyirenda and Antonios Anastasopoulos},
  title={{Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
  pages={3984--3988},
  doi={10.21437/Interspeech.2023-1979}
}