
shona-mt5-small

A multilingual T5 language model pre-trained on Shona (chiShona) text, and one of the first publicly available Shona language models on HuggingFace.

"Language is the road map of a culture." (Rita Mae Brown)

Shona is a Bantu language spoken by approximately 15 million people, primarily in Zimbabwe. Despite this, it remains severely underrepresented in NLP research and publicly available language models. This model is a step toward closing that gap.


Model Details

| Property | Details |
| --- | --- |
| Base Model | google/mt5-small |
| Model Type | Seq2Seq (Text-to-Text) |
| Language | Shona (sn) |
| License | Apache 2.0 |
| Developer | Mathias Kabango, African Leadership University, Kigali, Rwanda |
| Parameters | ~300M (mt5-small) |
| Framework | PyTorch + HuggingFace Transformers |

Model Description

shona-mt5-small is Google's mT5-small, a multilingual text-to-text transformer, with continued pre-training on a curated Shona text corpus. The goal of this work is to give the NLP community a foundational Shona language model that can be fine-tuned for downstream tasks such as:

  • Conversational AI / chatbots in Shona
  • Machine translation (Shona ↔ English)
  • Text summarisation
  • Question answering
  • Named entity recognition

This model is the backbone of TauraBot, an open-source conversational AI system built for Shona speakers.


Intended Uses

Primary Use Cases

  • Fine-tuning for Shona downstream NLP tasks (translation, dialogue, classification)
  • Research into low-resource African language modeling
  • Education: building tools that serve Shona-speaking communities
  • Baseline for benchmarking future Shona language models

How to Use

⚠️ Important: This is a pre-trained base model, not a conversational model. Running inference directly will produce outputs with <extra_id> sentinel tokens rather than meaningful Shona text. This is expected behaviour for mT5-based models that have not yet been fine-tuned on a downstream task.
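
The sentinel tokens come from the span-corruption objective that mT5 is pre-trained with: spans of the input are replaced by `<extra_id_N>` placeholders, and the model learns to emit the missing spans. The sketch below shows that input/target format without needing a tokenizer. It assumes the continued pre-training on Shona kept the same objective, and `corrupt_spans` is a hypothetical helper written for illustration, not a Transformers function.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) token span with a sentinel.

    Returns the corrupted input string and the target string the model
    is trained to produce. Spans must not overlap; `end` is exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # untouched text before the span
        inputs.append(sentinel)              # placeholder in the input
        targets.append(sentinel)             # same placeholder opens the answer
        targets.extend(tokens[start:end])    # the text that was masked out
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week".split()
model_input, model_target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(model_input)
print(model_target)
```

A model trained only on pairs like these answers any prompt with sentinel-delimited fragments, which is why the base checkpoint needs task-specific fine-tuning before its output is usable.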

This model is intended to be fine-tuned. For a conversational Shona model built on top of this base, see mathiaskabango/taurabot-shona (coming soon).

For researchers who want to fine-tune this model on their own Shona task:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mathiaskabango/shona-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("mathiaskabango/shona-mt5-small")

Load the model and fine-tune it on your labelled Shona dataset using Seq2SeqTrainer. See the roadmap below for the upcoming fine-tuned release.
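
As a starting point, a fine-tuning setup with Seq2SeqTrainer might look like the sketch below. It is not the author's training script. The column names `source` and `target`, the output directory, and the helper names are hypothetical, and reusing the pre-training hyperparameters as starting values is an assumption rather than a recommendation. The argument names assume the Transformers 4.57.x release listed under Framework Versions.

```python
# Hypothetical starting values copied from the pre-training run; tune for your task.
FINETUNE_KWARGS = {
    "output_dir": "shona-mt5-finetuned",  # hypothetical output path
    "learning_rate": 5e-4,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "warmup_steps": 500,
    "max_steps": 2000,
    "lr_scheduler_type": "linear",
    "eval_strategy": "steps",
    "eval_steps": 500,
    "seed": 42,
}


def tokenize_pairs(batch, tokenizer, max_length=128):
    """Tokenise a batch that has hypothetical 'source' and 'target' text columns."""
    return tokenizer(
        batch["source"],
        text_target=batch["target"],  # tokenised targets are returned as "labels"
        max_length=max_length,
        truncation=True,
    )


def build_trainer(model, tokenizer, train_dataset, eval_dataset):
    """Assemble a Seq2SeqTrainer over datasets already mapped through tokenize_pairs."""
    # Imported lazily so the configuration above can be read without Transformers installed.
    from transformers import (
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    return Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(**FINETUNE_KWARGS),
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # Pads inputs and labels per batch; padded label positions are ignored by the loss.
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        processing_class=tokenizer,
    )
```

With `model` and `tokenizer` loaded as shown above, each dataset split can be prepared with `dataset.map(tokenize_pairs, batched=True, fn_kwargs={"tokenizer": tokenizer}, remove_columns=dataset.column_names)`, and training starts with `build_trainer(model, tokenizer, train_split, eval_split).train()`.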


Limitations

This model was developed under significant resource and data constraints. Users should be aware of the following limitations before deploying or building on top of it:

Compute Constraints

  • Training was performed on a single consumer-grade GPU with limited VRAM. This restricted batch size, sequence length, and the total number of training steps possible.
  • Only 2,000 training steps were completed, about 0.155 epochs over the training corpus, so the model has seen only around 15% of the available training data.
  • Mixed precision training (AMP) was used to fit within GPU memory limits.

Data Constraints

  • The training corpus is limited in size and domain diversity. Shona is a low-resource language and large-scale, high-quality Shona text data is scarce.
  • The dataset is weighted toward written formal Shona and may not generalise well to spoken, dialectal, or regional variations of the language.
  • Coverage of specialised domains (medicine, law, science) is minimal.

Performance Constraints

  • The final evaluation loss of 2.6432 indicates the model has learned meaningful Shona language patterns, but is not yet at the fluency level of well-resourced language models.
  • The model was only trained for a small fraction of one epoch, meaning it has not converged and would benefit substantially from additional training.
  • Outputs may occasionally contain grammatically incorrect Shona, mix in English tokens, or repeat phrases, all of which are common in under-trained sequence models.

Scope Constraints

  • This model covers Shona (chiShona) as spoken primarily in Zimbabwe. It is not intended to represent all Bantu or Zimbabwean languages (e.g. Ndebele is not covered).
  • The model should not be used for high-stakes applications (medical advice, legal decisions, emergency services) without significant further development and evaluation.

Training and Evaluation Data

The model was trained on a curated Shona text corpus assembled from publicly available Shona language sources including web-scraped text, religious texts, and community-contributed writing. The dataset was preprocessed and tokenised using the mT5 tokenizer with Shona-specific filtering.

A dedicated dataset release, mathiaskabango/shona-corpus, is planned to accompany this model and will be made publicly available on HuggingFace Datasets.


Training Procedure

Hardware

  • GPU: Single consumer-grade GPU (limited VRAM)
  • Mixed Precision: Native AMP (float16)
  • Gradient Accumulation: 4 steps (effective batch size: 64)
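
The effective batch size follows from the per-device batch size of 16 and the 4 accumulation steps. The short calculation below also estimates how many sequences the run processed, assuming a single GPU and that the 2,000 reported steps are optimizer updates (which is how the Transformers Trainer counts them). `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of sequences contributing to one optimizer update."""
    return per_device_batch * accumulation_steps * num_devices


batch = effective_batch_size(per_device_batch=16, accumulation_steps=4)
sequences_seen = batch * 2_000  # 2,000 optimizer steps reported for this run

print(batch)           # 64
print(sequences_seen)  # 128000
```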

Hyperparameters

| Parameter | Value |
| --- | --- |
| Learning Rate | 5e-4 |
| Train Batch Size | 16 |
| Eval Batch Size | 16 |
| Gradient Accumulation Steps | 4 |
| Effective Batch Size | 64 |
| Warmup Steps | 500 |
| Total Training Steps | 2,000 |
| LR Scheduler | Linear |
| Optimizer | AdamW (fused) |
| Seed | 42 |
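
To make the learning-rate settings concrete, here is a small sketch of the schedule they imply. It assumes "Linear" means the usual Transformers linear schedule with warmup: the rate rises linearly from 0 to the peak over the warmup steps, then decays linearly to 0 at the final step. `linear_schedule_lr` is a hypothetical re-implementation for illustration, not the library function.

```python
def linear_schedule_lr(step, peak_lr=5e-4, warmup_steps=500, total_steps=2_000):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 1250, 2000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at step 500 and has fallen back to half the peak by step 1,250, so the last checkpoints are trained with a small step size.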

Training Results

| Step | Epoch | Training Loss | Validation Loss |
| --- | --- | --- | --- |
| 500 | 0.039 | 3.6297 | 3.0109 |
| 1000 | 0.077 | 2.7745 | 2.7745 |
| 1500 | 0.116 | 3.0547 | 2.6556 |
| 2000 | 0.155 | 3.0060 | 2.6432 |

Training and validation loss both fall at every checkpoint, which indicates the model is still learning Shona language structure. Validation loss stays below training loss throughout, so there is no sign of overfitting at this early stage of training.
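
Two derived quantities help put these numbers in context: the perplexity implied by the final validation loss, and a rough count of optimizer steps in one full epoch. The sketch below assumes the reported loss is the mean cross-entropy in nats per target token, as the Transformers Trainer logs it. The epoch values are rounded to three decimals, so the steps-per-epoch figure is only an estimate.

```python
import math

# (step, epoch, validation loss) rows copied from the training results
checkpoints = [
    (500, 0.039, 3.0109),
    (1000, 0.077, 2.7745),
    (1500, 0.116, 2.6556),
    (2000, 0.155, 2.6432),
]


def perplexity(loss):
    """Perplexity implied by a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


final_step, final_epoch, final_loss = checkpoints[-1]
steps_per_epoch = final_step / final_epoch  # rough: epoch values are rounded

print(f"final validation perplexity ~ {perplexity(final_loss):.1f}")
print(f"estimated optimizer steps per epoch ~ {steps_per_epoch:,.0f}")
```

This gives a perplexity of about 14 and roughly 12,900 steps per epoch, so a full pass over the corpus would take more than six times the training completed here. If the loss was computed on span-corruption targets, the perplexity is not directly comparable with figures reported for ordinary left-to-right language models.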

Framework Versions

| Library | Version |
| --- | --- |
| Transformers | 4.57.6 |
| PyTorch | 2.10.0+cu128 |
| Datasets | 2.21.0 |
| Tokenizers | 0.22.2 |

Roadmap

This model is the first release in a broader African language AI project:

  • TauraBot: conversation fine-tuned Shona chatbot (mathiaskabango/taurabot-shona)
  • Shona Corpus: public dataset release (mathiaskabango/shona-corpus)
  • Shona Whisper: speech recognition benchmark for Shona
  • TauraBot Gradio Space: interactive demo

Contact & Citation

Developer: Mathias Kabango
Institution: African Leadership University, Kigali, Rwanda
Email: kabangomathias0@gmail.com
GitHub: Mathias-Kabango3

If you use this model in your research or build on top of it, please consider citing it and linking back to this repository. Community contributions, corrections to the Shona training data, and fine-tuning experiments are warmly welcome.


🤝 Acknowledgements

This work was developed as part of a mission to build open-source AI infrastructure for African languages. Special thanks to the Masakhane community and all researchers working on low-resource African NLP.

