TauraBot — Shona Conversational AI

"Taura" means "Speak" in Shona (chiShona)

TauraBot is, to our knowledge, the first open-source conversational AI model built specifically for Shona speakers. It is a fine-tuned version of mathiaskabango/shona-mt5-small, which is itself Google's mT5-small with continued pre-training on a Shona text corpus.

Shona is spoken by approximately 15 million people, primarily in Zimbabwe, yet remains almost entirely absent from modern NLP research and tooling. TauraBot is a step toward changing that.

⚠️ Important — Please Read Before Using

This model is an early-stage research release and is not yet production-ready.

Because of tight GPU constraints, this model was fine-tuned on a small dataset (500 curated conversation pairs) with limited compute. As a result:

Responses may be inconsistent or grammatically imperfect
The model may repeat phrases or produce generic outputs
It performs best on simple conversational exchanges similar to its training data
It will not handle complex or domain-specific Shona well yet

If you want to use this model in a real application, we strongly recommend further fine-tuning on your own Shona conversational data. See the fine-tuning guide below.

This model is actively being improved. A better version with more training data and compute is planned for release. Watch this repo for updates.

Model Details

Property	Details
Base Model	mathiaskabango/shona-mt5-small
Model Type	Seq2Seq Conversational (Text-to-Text)
Language	Shona (`sn`)
License	Apache 2.0
Developer	Mathias Kabango — African Leadership University, Kigali, Rwanda
Training Data	500 curated Shona conversation pairs
Task Prefix	`taura:`

How to Use

The model requires a `taura:` prefix on all inputs. Without this prefix it will not behave conversationally.

Basic inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mathiaskabango/taurabot-shona")
model = AutoModelForSeq2SeqLM.from_pretrained("mathiaskabango/taurabot-shona")

def chat(message):
    # Always include the task prefix
    input_text = "taura: " + message.strip()
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=60,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=2.0,
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example conversations (replies are illustrative; actual output may vary)
print(chat("Mhoro, makadii?"))
# Example reply: "Ndiripo mazvita, imi makadii?"

print(chat("Zita rako ndiani?"))
# Example reply: "Zita rangu ndiTauraBot."

print(chat("Unoda kudya chii?"))
# Example reply: "Ndinoda sadza nemufushwa."

Simple chat loop

print("TauraBot  — Taura neni! (type 'exit' to quit)\n")
while True:
    user = input("Iwe:      ")
    if user.strip().lower() == "exit":  # ignore stray spaces around the command
        break
    print(f"TauraBot: {chat(user)}\n")

How to Fine-Tune Further (Recommended)

Because this model was trained under compute constraints, further fine-tuning on your own data is likely to improve quality noticeably. The following minimal script continues training from the released checkpoint:

from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainer, Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)
from datasets import Dataset

MODEL = "mathiaskabango/taurabot-shona"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# Your conversation pairs — the more the better
# Format: input is the human turn, target is the bot response
my_conversations = [
    {"input": "taura: Mhoro!", "target": "Mhoro! Makadii?"},
    {"input": "taura: Ndiri kuneta.", "target": "Zorora zvishoma. Unokwanisa!"},
    # add as many as you have — 1000+ pairs recommended
]

dataset = Dataset.from_list(my_conversations)

def preprocess(batch):
    inputs = tokenizer(batch["input"],  max_length=64,
                       truncation=True, padding="max_length")
    labels = tokenizer(batch["target"], max_length=64,
                       truncation=True, padding="max_length")
    labels["input_ids"] = [
        [(t if t != tokenizer.pad_token_id else -100) for t in label]
        for label in labels["input_ids"]
    ]
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="taurabot-finetuned",
    num_train_epochs=20,              # tune for your data; many epochs on a small set risks overfitting
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,               # lower LR when continuing from checkpoint
    warmup_steps=50,
    predict_with_generate=True,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,                        # mixed precision needs a CUDA GPU; set False on CPU
    push_to_hub=False,                # set True to push to your own HF repo
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

trainer.train()

# Save your improved model
model.save_pretrained("taurabot-finetuned")
tokenizer.save_pretrained("taurabot-finetuned")
print("Done! Test your improved model.")
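The script above trains on every pair, so it cannot reveal the kind of overfitting this card reports for the released model. One option is to hold out a validation split before building the dataset. The sketch below uses only the standard library; the function name `split_pairs`, the 10% default fraction, and the placeholder examples are assumptions for illustration, not part of the original training setup (the seed of 42 matches the value listed in the hyperparameters).

```python
import random

def split_pairs(pairs, val_fraction=0.1, seed=42):
    """Shuffle a copy of `pairs` and return (train, validation) lists.

    Hypothetical helper: the split ratio is an assumption, not the
    ratio used to train the released model.
    """
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(pairs)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)       # seeded, so the split is reproducible
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Placeholder data standing in for real conversation pairs
examples = [{"input": f"taura: {i}", "target": str(i)} for i in range(20)]
train_pairs, val_pairs = split_pairs(examples)
print(len(train_pairs), len(val_pairs))
```

The validation list could then be tokenized with the same `preprocess` function and passed to `Seq2SeqTrainer` as `eval_dataset`, so that validation loss is available during training.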

Tips for better fine-tuning results

More data is the single biggest improvement — aim for 1,000 to 5,000 conversation pairs
Use native speaker corrections if possible
Keep conversations short and natural — 1 to 2 sentences per turn
Always use the `taura:` prefix in your input column
A lower learning rate (1e-4 or 5e-5, versus the 3e-4 used for this release) helps avoid overwriting what the model already knows
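Some of the tips above, in particular the consistent `taura:` prefix and clean turns, can be enforced before training. The following is a minimal sketch; `prepare_pairs` is a hypothetical helper, not something shipped with the model, and it assumes your raw data is a list of (human turn, bot reply) tuples.

```python
PREFIX = "taura:"

def prepare_pairs(raw_pairs, prefix=PREFIX):
    """Convert (human, bot) tuples into {"input", "target"} dicts.

    Ensures each input carries the task prefix exactly once and
    drops pairs where either side is empty.
    """
    prepared = []
    for human, bot in raw_pairs:
        human, bot = human.strip(), bot.strip()
        if human.lower().startswith(prefix):
            human = human[len(prefix):].strip()   # avoid a doubled prefix
        if not human or not bot:
            continue                              # skip incomplete pairs
        prepared.append({"input": f"{prefix} {human}", "target": bot})
    return prepared

pairs = prepare_pairs([
    ("Mhoro!", "Mhoro! Makadii?"),
    ("taura: Ndiri kuneta.", "Zorora zvishoma. Unokwanisa!"),
    ("", "Mhoro!"),                               # dropped: empty human turn
])
print(pairs)
```

The resulting list has the same shape as `my_conversations` in the fine-tuning script, so it can be passed straight to `Dataset.from_list`.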

⚠️ Limitations

Limitation	Detail
Compute constraints	Trained on a single consumer GPU with limited VRAM. Training stopped after 18 of the 200 planned epochs, by which point the model was already overfitting.
Small training set	Fine-tuned on 500 conversation pairs — significantly below the recommended minimum for production conversational models
Early overfitting	Validation loss stopped improving after epoch 8 (2.33) and began rising — a sign the model needs more diverse training data
Hallucinated prefixes	May occasionally output "Mubvunzo:" or similar artefacts inherited from pre-training data
Limited domain coverage	Trained primarily on everyday conversational Shona — will not handle medical, legal, or technical topics
Dialect coverage	Covers standard Shona as spoken in Zimbabwe — may not generalise to regional dialects
Not for high-stakes use	Should not be used for medical advice, legal decisions, or any critical application without significant further development
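The hallucinated-prefix limitation listed above can be softened with light post-processing of generated replies. The sketch below is one possible approach, not part of the released model: `strip_artifacts` and `ARTIFACT_LABELS` are hypothetical names, and "Mubvunzo" is the only artefact label this card names, so any other labels would need to be added from your own observations.

```python
import re

# Assumption: only "Mubvunzo" is documented; extend this tuple as needed.
ARTIFACT_LABELS = ("Mubvunzo",)

def strip_artifacts(reply, labels=ARTIFACT_LABELS):
    """Remove one leading 'Label:' artefact (case-insensitive) from a reply."""
    if not labels:
        return reply.strip()
    alternatives = "|".join(re.escape(label) for label in labels)
    pattern = r"^\s*(?:" + alternatives + r")\s*:\s*"
    return re.sub(pattern, "", reply, flags=re.IGNORECASE).strip()

print(strip_artifacts("Mubvunzo: Ndiripo mazvita."))
print(strip_artifacts("Ndiripo mazvita."))
```

Both calls print the same cleaned reply. In practice the function would wrap the return value of `chat`, for example `strip_artifacts(chat(message))`.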

Training Details

What the loss curve tells us

Epoch 8: Validation loss 2.33 ← best checkpoint
Epoch 9: Validation loss 2.35 ← started rising (overfitting)
Epoch 18: Validation loss 2.58 ← continued rising

The model began overfitting after epoch 8 because 500 conversation pairs is a small dataset for a seq2seq model. The best weights are from around epoch 8. More diverse training data would push the validation loss lower before overfitting begins.
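The checkpoint choice described above can be checked directly against the logged values. In the sketch below, the validation losses are copied from the training results reported in this card; the function names are assumptions introduced for illustration and are not part of the original training script.

```python
# Validation loss per epoch, copied from the training results in this card.
# Epochs 1 and 10-17 are not reported there, so they are omitted here too.
val_loss = {
    2: 4.9347, 3: 3.3783, 4: 2.8899, 5: 2.5140,
    6: 2.4059, 7: 2.3723, 8: 2.3340, 9: 2.3476, 18: 2.5784,
}

def best_epoch(losses):
    """Return the epoch with the lowest validation loss."""
    return min(losses, key=losses.get)

def first_rise(losses):
    """Return the first reported epoch whose loss exceeds the previous reported one."""
    epochs = sorted(losses)
    for prev, cur in zip(epochs, epochs[1:]):
        if losses[cur] > losses[prev]:
            return cur
    return None  # loss never rose in the reported epochs

print(best_epoch(val_loss))   # 8
print(first_rise(val_loss))   # 9
```

The same rule (keep the checkpoint with the lowest validation loss and stop once the loss starts rising) is what an early-stopping setup would automate during further fine-tuning.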

Training hyperparameters

Parameter	Value
Learning Rate	3e-4
Train Batch Size	8
Gradient Accumulation	2 (effective batch = 16)
Warmup Ratio	0.03
Epochs	18 (of 200 planned)
Mixed Precision	fp16
Optimizer	AdamW (fused)
Seed	42

Training results

Epoch	Step	Training Loss	Validation Loss
2	80	11.7061	4.9347
3	120	6.0485	3.3783
4	160	3.8857	2.8899
5	200	2.9967	2.5140
6	240	2.5225	2.4059
7	280	2.2792	2.3723
8	320	2.0710	2.3340 ← best
9	360	1.9104	2.3476
18	720	1.2119	2.5784

Framework versions

Library	Version
Transformers	4.57.6
PyTorch	2.10.0+cu128
Datasets	2.21.0
Tokenizers	0.22.2

Roadmap

TauraBot v2 — retrain base model with more steps and larger corpus
Larger conversation dataset — expanding beyond 500 pairs
Shona corpus public release — mathiaskabango/shona-corpus
Gradio demo space — interactive TauraBot demo
Shona Whisper — speech recognition for Shona

Contact

Developer: Mathias Kabango
Institution: African Leadership University, Kigali, Rwanda
Email: kabangomathias0@gmail.com
GitHub: Mathias-Kabango3
Base model: mathiaskabango/shona-mt5-small

If you fine-tune this model and get good results, please open a discussion on this repo and share what worked — it will help everyone building Shona NLP tools.

Acknowledgements

Built as part of a mission to create open-source AI infrastructure for African languages. If you are working on Shona, Ndebele, or related Bantu languages and want to collaborate, please reach out.

*Built with ❤️*
