TauraBot: Shona Conversational AI

"Taura" means "Speak" in Shona (chiShona)

TauraBot is the first open-source conversational AI model built specifically for Shona speakers. It is a fine-tuned version of mathiaskabango/shona-mt5-small, which is itself a continued pre-training of Google's mT5-small on a Shona text corpus.

Shona is spoken by approximately 15 million people, primarily in Zimbabwe, yet remains almost entirely absent from modern NLP research and tooling. TauraBot is a step toward changing that.


⚠️ Important: Please Read Before Using

This model is an early-stage research release and is not yet production-ready.

Due to significant GPU constraints during training, this model was fine-tuned on a limited dataset with restricted compute. As a result:

  • Responses may be inconsistent or grammatically imperfect
  • The model may repeat phrases or produce generic outputs
  • It performs best on simple conversational exchanges similar to its training data
  • It will not handle complex or domain-specific Shona well yet

If you want to use this model in a real application, we strongly recommend further fine-tuning on your own Shona conversational data. See the fine-tuning guide below.

This model is actively being improved. A better version with more training data and compute is planned for release. Watch this repo for updates.


Model Details

| Property | Details |
|---|---|
| Base Model | mathiaskabango/shona-mt5-small |
| Model Type | Seq2Seq Conversational (Text-to-Text) |
| Language | Shona (sn) |
| License | Apache 2.0 |
| Developer | Mathias Kabango, African Leadership University, Kigali, Rwanda |
| Training Data | 500 curated Shona conversation pairs |
| Task Prefix | taura: |

How to Use

The model requires a taura: prefix on all inputs. Without this prefix it will not behave conversationally.

Basic inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mathiaskabango/taurabot-shona")
model = AutoModelForSeq2SeqLM.from_pretrained("mathiaskabango/taurabot-shona")

def chat(message):
    # Always include the task prefix
    input_text = "taura: " + message.strip()
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=60,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=2.0,
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example conversations
print(chat("Mhoro, makadii?"))
# Expected: "Ndiripo mazvita, imi makadii?"

print(chat("Zita rako ndiani?"))
# Expected: "Zita rangu ndiTauraBot."

print(chat("Unoda kudya chii?"))
# Expected: "Ndinoda sadza nemufushwa."

Simple chat loop

print("TauraBot - Taura neni! (type 'exit' to quit)\n")
while True:
    user = input("Iwe:      ").strip()
    if user.lower() == "exit":
        break
    if not user:
        continue  # ignore empty input instead of sending it to the model
    print(f"TauraBot: {chat(user)}\n")

How to Fine-Tune Further (Recommended)

Because this model was trained under compute constraints, further fine-tuning on your own data is the most direct way to improve quality. The following minimal script continues training from the released checkpoint:

from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainer, Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)
from datasets import Dataset

MODEL = "mathiaskabango/taurabot-shona"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# Your conversation pairs: the more the better
# Format: input is the human turn, target is the bot response
my_conversations = [
    {"input": "taura: Mhoro!", "target": "Mhoro! Makadii?"},
    {"input": "taura: Ndiri kuneta.", "target": "Zorora zvishoma. Unokwanisa!"},
    # add as many as you have; 1000+ pairs recommended
]

dataset = Dataset.from_list(my_conversations)

def preprocess(batch):
    inputs = tokenizer(batch["input"],  max_length=64,
                       truncation=True, padding="max_length")
    labels = tokenizer(batch["target"], max_length=64,
                       truncation=True, padding="max_length")
    labels["input_ids"] = [
        [(t if t != tokenizer.pad_token_id else -100) for t in label]
        for label in labels["input_ids"]
    ]
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="taurabot-finetuned",
    num_train_epochs=20,              # tune to your data; small datasets overfit quickly
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,               # lower LR when continuing from checkpoint
    warmup_steps=50,
    predict_with_generate=True,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,                        # requires a CUDA GPU; set to False on CPU
    push_to_hub=False,                # set True to push to your own HF repo
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

trainer.train()

# Save your improved model
model.save_pretrained("taurabot-finetuned")
tokenizer.save_pretrained("taurabot-finetuned")
print("Done! Test your improved model.")

Tips for better fine-tuning results

  • More data is the single biggest improvement: aim for 1,000 to 5,000 conversation pairs
  • Use native speaker corrections if possible
  • Keep conversations short and natural, with 1 to 2 sentences per turn
  • Always use the taura: prefix in your input column
  • A learning rate lower than the original 3e-4 (for example 1e-4 or 5e-5) reduces the risk of overwriting what the model already knows
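
The data tips above can be partly automated before training. The following is a minimal sketch, not part of TauraBot's own pipeline: the helper names `with_prefix` and `prepare_pairs`, the duplicate filter, and the validation fraction are all illustrative assumptions. It applies the taura: prefix exactly once, drops empty and duplicated pairs, and holds out a small validation set so that overfitting can be monitored.

```python
import random

PREFIX = "taura: "  # task prefix stated in this model card


def with_prefix(text: str) -> str:
    """Return the text with exactly one leading task prefix."""
    text = text.strip()
    if text.lower().startswith("taura:"):
        text = text[len("taura:"):].strip()
    return PREFIX + text


def prepare_pairs(pairs, val_fraction=0.1, seed=42):
    """Clean (human, bot) pairs and split them into train and validation lists.

    Hypothetical helper: the cleaning rules are assumptions, not the
    procedure used to build the original 500-pair dataset.
    """
    seen = set()
    cleaned = []
    for human, bot in pairs:
        human, bot = human.strip(), bot.strip()
        if not human or not bot:
            continue  # skip pairs with an empty turn
        row = {"input": with_prefix(human), "target": bot}
        key = (row["input"], row["target"])
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append(row)
    random.Random(seed).shuffle(cleaned)
    n_val = max(1, int(len(cleaned) * val_fraction)) if len(cleaned) > 1 else 0
    return cleaned[n_val:], cleaned[:n_val]


raw = [
    ("Mhoro!", "Mhoro! Makadii?"),
    ("taura: Ndiri kuneta.", "Zorora zvishoma. Unokwanisa!"),
    ("Mhoro!", "Mhoro! Makadii?"),  # duplicate, dropped
    ("", "Ndiripo."),               # empty human turn, dropped
]
train, val = prepare_pairs(raw, val_fraction=0.5)
print(len(train), len(val))
```

Each returned list has the same `input` / `target` dictionary format that the fine-tuning script passes to `Dataset.from_list`.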

⚠️ Limitations

| Limitation | Detail |
|---|---|
| Compute constraints | Trained on a single consumer GPU with limited VRAM. Training was stopped after 18 of the 200 planned epochs, by which point the model was already overfitting. |
| Small training set | Fine-tuned on 500 conversation pairs, significantly below the recommended minimum for production conversational models |
| Early overfitting | Validation loss stopped improving after epoch 8 (2.33) and began rising, a sign that the model needs more diverse training data |
| Hallucinated prefixes | May occasionally output "Mubvunzo:" or similar artefacts inherited from pre-training data |
| Limited domain coverage | Trained primarily on everyday conversational Shona; will not handle medical, legal, or technical topics |
| Dialect coverage | Covers standard Shona as spoken in Zimbabwe; may not generalise to regional dialects |
| Not for high-stakes use | Should not be used for medical advice, legal decisions, or any critical application without significant further development |
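
The hallucinated-prefix limitation can be worked around with light post-processing of the generated text. The following is a minimal sketch under stated assumptions: `clean_reply` and `ARTEFACT_LABELS` are hypothetical names, and "Mubvunzo" is the only label taken from this card. Any other labels would need to be added from your own observations.

```python
import re

# "Mubvunzo" is the artefact named in this card. Extending the tuple with
# other labels is an assumption about what your own outputs may contain.
ARTEFACT_LABELS = ("Mubvunzo",)

_LABEL_RE = re.compile(
    r"^\s*(?:(?:%s)\s*:\s*)+" % "|".join(re.escape(label) for label in ARTEFACT_LABELS),
    flags=re.IGNORECASE,
)


def clean_reply(text: str) -> str:
    """Remove one or more leading artefact labels such as 'Mubvunzo:'."""
    return _LABEL_RE.sub("", text).strip()


print(clean_reply("Mubvunzo: Ndiripo mazvita, imi makadii?"))
```

Only labels at the very start of the reply are removed, so the same word appearing later in a sentence is left untouched.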

Training Details

What the loss curve tells us

Epoch 8:  Validation loss 2.33 ← best checkpoint
Epoch 9:  Validation loss 2.35 ← started rising (overfitting)
Epoch 18: Validation loss 2.58 ← continued rising

The model began overfitting after epoch 8 because 500 conversation pairs is a small dataset for a seq2seq model. The best weights are from around epoch 8. More diverse training data would push the validation loss lower before overfitting begins.
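
The reasoning above amounts to a simple early-stopping rule: keep the checkpoint with the lowest validation loss and stop once the loss fails to improve. The sketch below applies that rule to the validation losses reported in this card. The function names and the `patience` parameter are illustrative, and this card does not state whether early stopping was actually used during training.

```python
# Validation losses reported in this model card (epoch -> loss).
# Epochs 1 and 10 to 17 are not listed in the card, so they are absent here.
val_loss = {2: 4.9347, 3: 3.3783, 4: 2.8899, 5: 2.5140,
            6: 2.4059, 7: 2.3723, 8: 2.3340, 9: 2.3476, 18: 2.5784}


def best_epoch(losses):
    """Return the epoch with the lowest validation loss."""
    return min(losses, key=losses.get)


def stop_epoch(losses, patience=1):
    """Return the first logged epoch at which `patience` consecutive
    evaluations failed to improve on the best loss so far, or None."""
    best = float("inf")
    bad = 0
    for epoch in sorted(losses):
        if losses[epoch] < best:
            best, bad = losses[epoch], 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return None


print(best_epoch(val_loss))               # 8
print(stop_epoch(val_loss, patience=1))   # 9
```

With a patience of 1, training on this curve would have stopped at epoch 9 and kept the epoch 8 weights.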

Training hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 3e-4 |
| Train Batch Size | 8 |
| Gradient Accumulation | 2 (effective batch = 16) |
| Warmup Ratio | 0.03 |
| Epochs | 18 (of 200 planned) |
| Mixed Precision | fp16 |
| Optimizer | AdamW (fused) |
| Seed | 42 |

Training results

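| Epoch | Step | Training Loss | Validation Loss |
|---|---|---|---|
| 2 | 80 | 11.7061 | 4.9347 |
| 3 | 120 | 6.0485 | 3.3783 |
| 4 | 160 | 3.8857 | 2.8899 |
| 5 | 200 | 2.9967 | 2.5140 |
| 6 | 240 | 2.5225 | 2.4059 |
| 7 | 280 | 2.2792 | 2.3723 |
| 8 | 320 | 2.071 | 2.3340 ← best |
| 9 | 360 | 1.9104 | 2.3476 |
| 18 | 720 | 1.2119 | 2.5784 |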

Framework versions

| Library | Version |
|---|---|
| Transformers | 4.57.6 |
| PyTorch | 2.10.0+cu128 |
| Datasets | 2.21.0 |
| Tokenizers | 0.22.2 |

Roadmap

  • TauraBot v2: retrain the base model with more steps and a larger corpus
  • Larger conversation dataset: expanding beyond 500 pairs
  • Shona corpus public release: mathiaskabango/shona-corpus
  • Gradio demo space: interactive TauraBot demo
  • Shona Whisper: speech recognition for Shona

Contact

Developer: Mathias Kabango
Institution: African Leadership University, Kigali, Rwanda
Email: kabangomathias0@gmail.com
GitHub: Mathias-Kabango3
Base model: mathiaskabango/shona-mt5-small

If you fine-tune this model and get good results, please open a discussion on this repo and share what worked. It will help everyone building Shona NLP tools.


Acknowledgements

Built as part of a mission to create open-source AI infrastructure for African languages. If you are working on Shona, Ndebele, or related Bantu languages and want to collaborate, please reach out.


*Built with ❤️*
