shona-mt5-small
A multilingual T5 language model pre-trained on Shona (chiShona) text, and one of the first publicly available Shona language models on HuggingFace.
"Language is the road map of a culture." - Rita Mae Brown
Shona is a Bantu language spoken by approximately 15 million people, primarily in Zimbabwe. Despite this, it remains severely underrepresented in NLP research and publicly available language models. This model is a step toward closing that gap.
Model Details
| Property | Details |
|---|---|
| Base Model | google/mt5-small |
| Model Type | Seq2Seq (Text-to-Text) |
| Language | Shona (sn) |
| License | Apache 2.0 |
| Developer | Mathias Kabango, African Leadership University, Kigali, Rwanda |
| Parameters | ~300M (mt5-small) |
| Framework | PyTorch + HuggingFace Transformers |
Model Description
shona-mt5-small is Google's mT5-small, a multilingual text-to-text transformer, with pre-training continued on a curated Shona text corpus. The goal of this work is to provide the NLP community with a foundational Shona language model that can be further fine-tuned for downstream tasks such as:
- Conversational AI / chatbots in Shona
- Machine translation (Shona-English)
- Text summarisation
- Question answering
- Named entity recognition
This model is the backbone of TauraBot, an open-source conversational AI system built for Shona speakers.
Intended Uses
Primary Use Cases
- Fine-tuning for Shona downstream NLP tasks (translation, dialogue, classification)
- Research into low-resource African language modeling
- Education: building tools that serve Shona-speaking communities
- Baseline for benchmarking future Shona language models
How to Use
⚠️ Important: This is a pre-trained base model, not a conversational model. Running inference directly will produce outputs with `<extra_id_N>` sentinel tokens rather than meaningful Shona text. This is expected behaviour for mT5-based models that have not yet been fine-tuned on a downstream task.
This model is intended to be fine-tuned. For a conversational Shona model built on top of this base, see mathiaskabango/taurabot-shona (coming soon).
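The sentinel tokens come from the span-corruption objective that mT5 models are pre-trained with: spans of the input are replaced by `<extra_id_N>` placeholders, and the model learns to emit the missing spans. The sketch below illustrates that input/target format in plain Python. It assumes the continued pre-training kept this standard objective, which this card does not state explicitly; `span_corrupt` and the English example sentence are hypothetical illustrations, not part of this repository.

```python
def span_corrupt(tokens, spans):
    """Build a T5-style (input, target) pair.

    tokens: list of word strings.
    spans:  sorted, non-overlapping (start, end) index pairs to mask (end exclusive).
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # replace the span with a sentinel
        targets.append(sentinel)             # the target names the sentinel...
        targets.extend(tokens[start:end])    # ...then lists the masked words
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the tail of the sentence
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


# Hypothetical example sentence, masked at two spans
words = "Thank you for inviting me to your party last week".split()
source, target = span_corrupt(words, [(2, 4), (8, 9)])
print(source)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

A base model trained this way answers any prompt in the target format, which is why raw generations contain sentinels until the model is fine-tuned on a task with ordinary text targets.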
For researchers who want to fine-tune this model on their own Shona task:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("mathiaskabango/shona-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("mathiaskabango/shona-mt5-small")
Load the model and fine-tune it on your labelled Shona dataset using `Seq2SeqTrainer`. See the roadmap below for the upcoming fine-tuned release.
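A fine-tuning run along those lines might look like the sketch below. It is not the author's recipe: `build_examples`, `fine_tune`, the output directory, and every hyperparameter value are hypothetical placeholders to tune for your task. It also assumes your data is a list of `(source, target)` string pairs and that your Transformers version accepts the `processing_class` argument (older releases take `tokenizer=` instead).

```python
def build_examples(pairs):
    """Turn (source, target) string pairs into column-oriented data."""
    return {
        "source": [src for src, _ in pairs],
        "target": [tgt for _, tgt in pairs],
    }


def fine_tune(pairs, output_dir="shona-mt5-finetuned", max_length=128):
    # Heavy imports live inside the function so the helper above has no dependencies
    from datasets import Dataset
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    name = "mathiaskabango/shona-mt5-small"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    def to_features(batch):
        # text_target tokenises the labels alongside the inputs
        return tokenizer(
            batch["source"],
            text_target=batch["target"],
            max_length=max_length,
            truncation=True,
        )

    dataset = Dataset.from_dict(build_examples(pairs))
    dataset = dataset.map(
        to_features, batched=True, remove_columns=["source", "target"]
    )

    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        learning_rate=3e-4,             # placeholder value
        per_device_train_batch_size=8,  # placeholder value
        num_train_epochs=3,             # placeholder value
        predict_with_generate=True,
        seed=42,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=dataset,
        # Pads inputs and labels per batch; label padding is ignored by the loss
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        processing_class=tokenizer,
    )
    trainer.train()
    trainer.save_model(output_dir)
    return trainer
```

Calling `fine_tune(pairs)` with your own list of pairs starts training; in practice you would also hold out an evaluation split and pass it as `eval_dataset`.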
Limitations
This model was developed under significant resource and data constraints. Users should be aware of the following limitations before deploying or building on top of it:
Compute Constraints
- Training was performed on a single consumer-grade GPU with limited VRAM. This restricted batch size, sequence length, and the total number of training steps possible.
- Only 2,000 training steps were completed, representing approximately 0.15 epochs over the training corpus, so the model has seen only a small fraction of the available data.
- Mixed precision training (AMP) was used to fit within GPU memory limits.
Data Constraints
- The training corpus is limited in size and domain diversity. Shona is a low-resource language and large-scale, high-quality Shona text data is scarce.
- The dataset is weighted toward written formal Shona and may not generalise well to spoken, dialectal, or regional variations of the language.
- Coverage of specialised domains (medicine, law, science) is minimal.
Performance Constraints
- The final evaluation loss of 2.6432, down from 3.0109 at step 500, indicates the model has learned some Shona language patterns, but it is not yet at the fluency level of well-resourced language models.
- The model was only trained for a small fraction of one epoch, meaning it has not converged and would benefit substantially from additional training.
- Outputs may occasionally produce grammatically incorrect Shona, mix in English tokens, or repeat phrases β common behaviour in under-trained sequence models.
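One rough way to flag the phrase repetition mentioned above is to measure how many word n-grams in a generated string duplicate an earlier one. The helper below is a hypothetical illustration, not a metric used in this model's evaluation; whitespace splitting and the default `n=3` are simplifying assumptions.

```python
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 means none)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


print(repeated_ngram_ratio("a b c a b c"))  # 0.25
print(repeated_ngram_ratio("a b c d"))      # 0.0
```

Tracking this ratio over a sample of generations before and after fine-tuning gives a quick, if crude, signal of whether repetition is improving.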
Scope Constraints
- This model covers Shona (chiShona) as spoken primarily in Zimbabwe. It is not intended to represent all Bantu or Zimbabwean languages (e.g. Ndebele is not covered).
- The model should not be used for high-stakes applications (medical advice, legal decisions, emergency services) without significant further development and evaluation.
Training and Evaluation Data
The model was trained on a curated Shona text corpus assembled from publicly available Shona language sources including web-scraped text, religious texts, and community-contributed writing. The dataset was preprocessed and tokenised using the mT5 tokenizer with Shona-specific filtering.
A dedicated dataset release, mathiaskabango/shona-corpus, is planned to accompany this model and will be made publicly available on HuggingFace Datasets.
Training Procedure
Hardware
- GPU: Single consumer-grade GPU (limited VRAM)
- Mixed Precision: Native AMP (float16)
- Gradient Accumulation: 4 steps (effective batch size: 64)
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-4 |
| Train Batch Size | 16 |
| Eval Batch Size | 16 |
| Gradient Accumulation Steps | 4 |
| Effective Batch Size | 64 |
| Warmup Steps | 500 |
| Total Training Steps | 2,000 |
| LR Scheduler | Linear |
| Optimizer | AdamW (fused) |
| Seed | 42 |
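The table's values combine as follows. This sketch assumes that "Linear" refers to the usual linear-warmup-then-linear-decay schedule in Transformers and that training used a single GPU, as the Hardware section states; the function names are illustrative.

```python
BASE_LR = 5e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 2_000
PER_DEVICE_BATCH = 16
GRAD_ACCUM = 4
NUM_GPUS = 1  # single GPU, per the Hardware section


def effective_batch_size():
    """Sequences consumed per optimizer step."""
    return PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS


def lr_at(step):
    """Linear warmup to BASE_LR, then linear decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(effective_batch_size())  # 64
for step in (0, 250, 500, 1000, 2000):
    print(step, lr_at(step))
```

Under these assumptions, warmup takes up the first quarter of the run (500 of 2,000 steps), and the learning rate reaches zero at the final step.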
Training Results
| Step | Epoch | Training Loss | Validation Loss |
|---|---|---|---|
| 500 | 0.039 | 3.6297 | 3.0109 |
| 1000 | 0.077 | 3.2255 | 2.7745 |
| 1500 | 0.116 | 3.0547 | 2.6556 |
| 2000 | 0.155 | 3.0060 | 2.6432 |
Both training and validation loss fall at every checkpoint, which indicates the model is still learning Shona language structure. Validation loss stays below training loss throughout, so there is no sign of overfitting at this early stage. The improvement does slow down: validation loss drops by about 0.24 between steps 500 and 1,000, but by only about 0.01 between steps 1,500 and 2,000.
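A few derived quantities help put the table in context. The arithmetic below is an estimate, not a reported result: it assumes each optimizer step consumes exactly 64 sequences, that the rounded epoch values are accurate, and that the loss is a mean per-token cross-entropy in nats, so its exponential can be read as a perplexity on the training objective.

```python
import math

# (step, epoch, training loss, validation loss), copied from the table above
RESULTS = [
    (500, 0.039, 3.6297, 3.0109),
    (1000, 0.077, 3.2255, 2.7745),
    (1500, 0.116, 3.0547, 2.6556),
    (2000, 0.155, 3.006, 2.6432),
]
EFFECTIVE_BATCH = 64  # 16 per step x 4 accumulation steps

final_step, final_epoch, _, final_val = RESULTS[-1]

sequences_seen = final_step * EFFECTIVE_BATCH       # sequences processed in total
steps_per_epoch = final_step / final_epoch          # implied by the rounded epoch value
implied_corpus = steps_per_epoch * EFFECTIVE_BATCH  # implied sequences per epoch
perplexity = math.exp(final_val)                    # about 14.1 under the stated assumption

# Change in validation loss between consecutive checkpoints
val_deltas = [
    round(later[3] - earlier[3], 4)
    for earlier, later in zip(RESULTS, RESULTS[1:])
]

print(sequences_seen, round(steps_per_epoch), round(implied_corpus))
print(round(perplexity, 2), val_deltas)
```

On these assumptions the run processed 128,000 sequences out of a corpus on the order of 800,000, which is consistent with the roughly 0.15 epochs reported.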
Framework Versions
| Library | Version |
|---|---|
| Transformers | 4.57.6 |
| PyTorch | 2.10.0+cu128 |
| Datasets | 2.21.0 |
| Tokenizers | 0.22.2 |
Roadmap
This model is the first release in a broader African language AI project:
- TauraBot: conversation fine-tuned Shona chatbot (mathiaskabango/taurabot-shona)
- Shona Corpus: public dataset release (mathiaskabango/shona-corpus)
- Shona Whisper: speech recognition benchmark for Shona
- TauraBot Gradio Space: interactive demo
Contact & Citation
- Developer: Mathias Kabango
- Institution: African Leadership University, Kigali, Rwanda
- Email: kabangomathias0@gmail.com
- GitHub: Mathias-Kabango3
If you use this model in your research or build on top of it, please consider citing it and linking back to this repository. Community contributions, corrections to the Shona training data, and fine-tuning experiments are warmly welcome.
Acknowledgements
This work was developed as part of a mission to build open-source AI infrastructure for African languages. Special thanks to the Masakhane community and all researchers working on low-resource African NLP.