TauraBot: Shona Conversational AI
"Taura" means "Speak" in Shona (chiShona)
TauraBot is the first open-source conversational AI model built specifically for Shona speakers. It is a fine-tuned version of mathiaskabango/shona-mt5-small, which is itself a continued pre-training of Google's mT5-small on a Shona text corpus.
Shona is spoken by approximately 15 million people, primarily in Zimbabwe, yet remains almost entirely absent from modern NLP research and tooling. TauraBot is a step toward changing that.
⚠️ Important: Please Read Before Using
This model is an early-stage research release and is not yet production-ready.
Due to significant GPU constraints during training, this model was fine-tuned on a limited dataset with restricted compute. As a result:
- Responses may be inconsistent or grammatically imperfect
- The model may repeat phrases or produce generic outputs
- It performs best on simple conversational exchanges similar to its training data
- It will not handle complex or domain-specific Shona well yet
If you want to use this model in a real application, we strongly recommend further fine-tuning on your own Shona conversational data. See the fine-tuning guide below.
This model is actively being improved. A better version with more training data and compute is planned for release. Watch this repo for updates.
Model Details
| Property | Details |
|---|---|
| Base Model | mathiaskabango/shona-mt5-small |
| Model Type | Seq2Seq Conversational (Text-to-Text) |
| Language | Shona (sn) |
| License | Apache 2.0 |
| Developer | Mathias Kabango, African Leadership University, Kigali, Rwanda |
| Training Data | 500 curated Shona conversation pairs |
| Task Prefix | taura: |
How to Use
The model requires a taura: prefix on all inputs. Without this prefix it will not behave conversationally.
Basic inference
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("mathiaskabango/taurabot-shona")
model = AutoModelForSeq2SeqLM.from_pretrained("mathiaskabango/taurabot-shona")
def chat(message):
    # Always include the task prefix
    input_text = "taura: " + message.strip()
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=60,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=2.0,
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Example conversations
print(chat("Mhoro, makadii?"))
# Expected: "Ndiripo mazvita, imi makadii?"
print(chat("Zita rako ndiani?"))
# Expected: "Zita rangu ndiTauraBot."
print(chat("Unoda kudya chii?"))
# Expected: "Ndinoda sadza nemufushwa."
Simple chat loop
print("TauraBot - Taura neni! (type 'exit' to quit)\n")
while True:
    user = input("Iwe: ")
    # Ignore surrounding whitespace and letter case when checking for the quit command
    if user.strip().lower() == "exit":
        break
    print(f"TauraBot: {chat(user)}\n")
How to Fine-Tune Further (Recommended)
Because this model was trained under compute constraints, further fine-tuning on your own data will significantly improve quality. Here is a minimal script to continue training:
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainer, Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)
from datasets import Dataset
MODEL = "mathiaskabango/taurabot-shona"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)
# Your conversation pairs: the more the better
# Format: input is the human turn, target is the bot response
my_conversations = [
    {"input": "taura: Mhoro!", "target": "Mhoro! Makadii?"},
    {"input": "taura: Ndiri kuneta.", "target": "Zorora zvishoma. Unokwanisa!"},
    # add as many as you have (1000+ pairs recommended)
]
dataset = Dataset.from_list(my_conversations)
def preprocess(batch):
    inputs = tokenizer(batch["input"], max_length=64,
                       truncation=True, padding="max_length")
    labels = tokenizer(batch["target"], max_length=64,
                       truncation=True, padding="max_length")
    # Replace padding token ids with -100 so they are ignored by the loss
    labels["input_ids"] = [
        [(t if t != tokenizer.pad_token_id else -100) for t in label]
        for label in labels["input_ids"]
    ]
    inputs["labels"] = labels["input_ids"]
    return inputs
tokenized = dataset.map(preprocess, batched=True)
args = Seq2SeqTrainingArguments(
    output_dir="taurabot-finetuned",
    num_train_epochs=20,            # increase for better results
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,             # lower LR when continuing from checkpoint
    warmup_steps=50,
    predict_with_generate=True,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,                      # requires a CUDA GPU; set False when training on CPU
    push_to_hub=False,              # set True to push to your own HF repo
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
# Save your improved model
model.save_pretrained("taurabot-finetuned")
tokenizer.save_pretrained("taurabot-finetuned")
print("Done! Test your improved model.")
Tips for better fine-tuning results
- More data is the single biggest improvement: aim for 1,000 to 5,000 conversation pairs
- Use native speaker corrections if possible
- Keep conversations short and natural: 1 to 2 sentences per turn
- Always use the taura: prefix in your input column
- A lower learning rate (1e-4 or 5e-5) prevents overwriting what the model already knows
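The prefix tip is easy to get wrong when pairs are collected from several sources. Below is a minimal sketch of a normalisation pass, assuming each pair is a dictionary with the `input` and `target` keys used in the fine-tuning script. The helper name `normalize_pairs` is hypothetical and not part of any library.

```python
TASK_PREFIX = "taura: "  # task prefix documented in this model card


def normalize_pairs(pairs):
    """Return cleaned pairs: whitespace stripped, empty pairs dropped, prefix added once."""
    cleaned = []
    for pair in pairs:
        source = pair.get("input", "").strip()
        target = pair.get("target", "").strip()
        # Remove an existing prefix first so it is never duplicated.
        if source.lower().startswith("taura:"):
            source = source[len("taura:"):].strip()
        # Skip pairs with an empty side; they carry no training signal.
        if not source or not target:
            continue
        cleaned.append({"input": TASK_PREFIX + source, "target": target})
    return cleaned


raw_pairs = [
    {"input": "Mhoro!", "target": "Mhoro! Makadii?"},
    {"input": "taura:Ndiri kuneta.", "target": " Zorora zvishoma. Unokwanisa! "},
    {"input": "taura: ", "target": "Mhoro!"},  # empty human turn, will be dropped
]
clean_pairs = normalize_pairs(raw_pairs)
```

The cleaned list can be passed to `Dataset.from_list` in place of `my_conversations`.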
⚠️ Limitations
| Limitation | Detail |
|---|---|
| Compute constraints | Trained on a single consumer GPU with limited VRAM. Training was stopped after 18 of the 200 planned epochs, once overfitting was evident. |
| Small training set | Fine-tuned on 500 conversation pairs, significantly below the recommended minimum for production conversational models |
| Early overfitting | Validation loss stopped improving after epoch 8 (2.33) and then began rising, a sign that the model needs more diverse training data |
| Hallucinated prefixes | May occasionally output "Mubvunzo:" or similar artefacts inherited from pre-training data |
| Limited domain coverage | Trained primarily on everyday conversational Shona; will not handle medical, legal, or technical topics |
| Dialect coverage | Covers standard Shona as spoken in Zimbabwe; may not generalise to regional dialects |
| Not for high-stakes use | Should not be used for medical advice, legal decisions, or any critical application without significant further development |
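The "Mubvunzo:" artefact listed in the table can be removed after decoding. The sketch below assumes the artefact appears at the start of the reply. The function name `strip_artefacts` and the label tuple are illustrative; only "Mubvunzo:" is documented in this card.

```python
import re

# Only "Mubvunzo:" is documented in this card; extend the tuple with
# any other leading labels you observe in your own outputs.
ARTEFACT_LABELS = ("Mubvunzo:",)

_LABEL_PATTERN = re.compile(
    r"^\s*(?:" + "|".join(re.escape(label) for label in ARTEFACT_LABELS) + r")\s*",
    flags=re.IGNORECASE,
)


def strip_artefacts(reply):
    """Remove leading artefact labels (possibly repeated) from a decoded reply."""
    previous = None
    # Keep stripping until the text stops changing, to handle repeated labels.
    while previous != reply:
        previous = reply
        reply = _LABEL_PATTERN.sub("", reply, count=1)
    return reply.strip()
```

It would typically wrap the decoded output, for example `strip_artefacts(chat("Mhoro, makadii?"))`. Labels that occur in the middle of a reply are left untouched.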
Training Details
What the loss curve tells us
- Epoch 8: validation loss 2.33 (best checkpoint)
- Epoch 9: validation loss 2.35 (started rising, overfitting)
- Epoch 18: validation loss 2.58 (continued rising)
The model began overfitting after epoch 8 because 500 conversation pairs is a small dataset for a seq2seq model. The best weights are from around epoch 8. More diverse training data would push the validation loss lower before overfitting begins.
Training hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 3e-4 |
| Train Batch Size | 8 |
| Gradient Accumulation | 2 (effective batch = 16) |
| Warmup Ratio | 0.03 |
| Epochs | 18 (of 200 planned) |
| Mixed Precision | fp16 |
| Optimizer | AdamW (fused) |
| Seed | 42 |
Training results
| Epoch | Step | Training Loss | Validation Loss |
|---|---|---|---|
| 2 | 80 | 11.7061 | 4.9347 |
| 3 | 120 | 6.0485 | 3.3783 |
| 4 | 160 | 3.8857 | 2.8899 |
| 5 | 200 | 2.9967 | 2.5140 |
| 6 | 240 | 2.5225 | 2.4059 |
| 7 | 280 | 2.2792 | 2.3723 |
| 8 | 320 | 2.071 | 2.3340 (best) |
| 9 | 360 | 1.9104 | 2.3476 |
| 18 | 720 | 1.2119 | 2.5784 |
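The overfitting pattern can be read directly from these numbers. This short sketch uses only the values copied from the table, picks the epoch with the lowest validation loss, and reports the gap between validation and training loss. The variable names are illustrative.

```python
# (epoch, training_loss, validation_loss) copied from the table above.
results = [
    (2, 11.7061, 4.9347),
    (3, 6.0485, 3.3783),
    (4, 3.8857, 2.8899),
    (5, 2.9967, 2.5140),
    (6, 2.5225, 2.4059),
    (7, 2.2792, 2.3723),
    (8, 2.071, 2.3340),
    (9, 1.9104, 2.3476),
    (18, 1.2119, 2.5784),
]

# The best checkpoint is the one with the lowest validation loss.
best_epoch, _, best_val = min(results, key=lambda row: row[2])

# A growing positive gap (validation minus training) indicates overfitting.
gaps = {epoch: round(val - train, 4) for epoch, train, val in results}

print(f"Best epoch: {best_epoch} (validation loss {best_val:.4f})")
for epoch, gap in gaps.items():
    print(f"epoch {epoch:>2}: val - train = {gap:+.4f}")
```

In these figures the gap is negative through epoch 6, turns positive at epoch 7, and widens to roughly 1.37 by epoch 18, while the validation loss bottoms out at epoch 8.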
Framework versions
| Library | Version |
|---|---|
| Transformers | 4.57.6 |
| PyTorch | 2.10.0+cu128 |
| Datasets | 2.21.0 |
| Tokenizers | 0.22.2 |
Roadmap
- TauraBot v2: retrain base model with more steps and larger corpus
- Larger conversation dataset: expanding beyond 500 pairs
- Shona corpus public release: mathiaskabango/shona-corpus
- Gradio demo space: interactive TauraBot demo
- Shona Whisper: speech recognition for Shona
Contact
- Developer: Mathias Kabango
- Institution: African Leadership University, Kigali, Rwanda
- Email: kabangomathias0@gmail.com
- GitHub: Mathias-Kabango3
- Base model: mathiaskabango/shona-mt5-small
If you fine-tune this model and get good results, please open a discussion on this repo and share what worked. It will help everyone building Shona NLP tools.
Acknowledgements
Built as part of a mission to create open-source AI infrastructure for African languages. If you are working on Shona, Ndebele, or related Bantu languages and want to collaborate, please reach out.
*Built with ❤️*