🌍 Multilingual SLM — Ateso · Luganda · English · Runyankore · Japadhola

A lightweight, multilingual Small Language Model (SLM) fine-tuned for question-and-answer tasks across five languages spoken in Uganda and East Africa. Built on top of CohereLabs/tiny-aya-global using LoRA (PEFT), this model is optimized for low-resource, local-language understanding.


Model Details

| Field | Details |
|---|---|
| Base Model | CohereLabs/tiny-aya-global |
| Fine-tuning Method | LoRA (PEFT) |
| Task | Question Answering (QA) |
| Languages | Ateso, Luganda, English, Runyankore, Japadhola |
| Training Samples | 90K custom QA pairs |
| Framework | Transformers + PEFT 0.18.1 |
| License | Apache 2.0 |

Supported Languages

| Language | Code | Region |
|---|---|---|
| English | en | International |
| Luganda | lug | Central Uganda |
| Runyankore | nyn | Western Uganda |
| Ateso | teo | Eastern Uganda / Northern Kenya |
| Japadhola | dho | Eastern Uganda |
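For programmatic language selection, the codes above can be captured in a small lookup (a minimal sketch; the names and codes simply mirror the table):

```python
# Language-name -> code mapping, mirroring the Supported Languages table.
LANGUAGE_CODES = {
    "English": "en",
    "Luganda": "lug",
    "Runyankore": "nyn",
    "Ateso": "teo",
    "Japadhola": "dho",
}

def code_for(language: str) -> str:
    """Return the code for a supported language, or raise for unsupported ones."""
    try:
        return LANGUAGE_CODES[language]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None
```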

Intended Use

✅ Direct Use

This model is designed for question-and-answer inference in multilingual East African contexts. It is suitable for:

  • Building local-language chatbots and virtual assistants
  • Educational tools for Ugandan language communities
  • Research into low-resource NLP for African languages
  • Prototyping QA systems before scaling to larger datasets

🔧 Downstream Use

The model can be further fine-tuned or integrated into:

  • Mobile or web-based community knowledge bases
  • Agricultural, health, or civic information systems in local languages
  • Language learning applications

❌ Out-of-Scope Use

  • High-stakes or safety-critical applications without additional evaluation
  • Languages not covered in training (the model may produce low-quality outputs)
  • Tasks beyond question-answering (e.g., code generation, summarization) without further fine-tuning

How to Get Started

Installation

pip install transformers peft torch

Inference

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model_id = "CohereLabs/tiny-aya-global"
adapter_id = "Bateesa/tiny-aya-global-lora-qa"

# Load the base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

def ask(question: str) -> str:
    prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    answer_ids = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(answer_ids, skip_special_tokens=True).strip()

# English
print(ask("What is the capital of Uganda?"))

# Luganda
print(ask("Ekibuga ekikulembera Uganda kye ki?"))

# Runyankore
print(ask("Obwakabaka bw'Uganda nibuki?"))

Training Details

Training Data

  • Dataset size: 90K custom QA pairs
  • Format: Instruction-style prompt/response pairs (Question: ... \nAnswer: ...)
  • Languages: Balanced across Ateso, Luganda, English, Runyankore, and Japadhola
  • Source: Manually curated domain-specific questions and answers relevant to East African contexts
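The instruction-style format described above can be produced with a small helper (a sketch of the formatting convention only; the exact preprocessing script is not published):

```python
def format_qa_pair(question: str, answer: str) -> str:
    """Render one QA pair in the training prompt format described above."""
    return f"Question: {question}\nAnswer: {answer}"

def build_corpus(pairs):
    """Format an iterable of (question, answer) tuples into training strings."""
    return [format_qa_pair(q, a) for q, a in pairs]
```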

Training Procedure

Fine-tuned using LoRA (Low-Rank Adaptation) via the HuggingFace PEFT library on top of CohereLabs/tiny-aya-global.

Training Hyperparameters

| Parameter | Value |
|---|---|
| Method | LoRA |
| PEFT Version | 0.18.1 |
| Training regime | fp16 mixed precision |
| LoRA rank (r) | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, v_proj |
| Epochs | 3 |
| Batch size | 4 |
| Learning rate | 2e-4 |
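The hyperparameters above map onto a PEFT configuration roughly as follows (a sketch, not the exact training script; `TaskType.CAUSAL_LM` is an assumption based on the causal base model):

```python
from peft import LoraConfig, TaskType

# Assumed reconstruction of the LoRA setup from the hyperparameter table.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # assumption: causal LM objective
    r=8,                           # LoRA rank
    lora_alpha=16,                 # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
# The config is then passed to get_peft_model(base_model, lora_config) and
# trained with fp16 mixed precision for 3 epochs (batch size 4, lr 2e-4).
```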

Evaluation

Testing Data

Held-out subset from the 90K custom QA samples, with manual review of responses across all five languages.

Metrics

  • Qualitative review: Human evaluation of answer relevance and fluency per language
  • BLEU / ROUGE: Planned for future evaluation with expanded dataset
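Until BLEU/ROUGE scoring is in place, a rough unigram-recall check (in the spirit of ROUGE-1 recall, but not the official scorer) can flag obviously off-target answers during manual review:

```python
def unigram_recall(reference: str, candidate: str) -> float:
    """Fraction of reference words found in the candidate (rough ROUGE-1-recall sketch)."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    hits = sum(1 for w in ref_words if w in cand_words)
    return hits / len(ref_words)
```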

Results

⚠️ This model was trained on a relatively small dataset of 90K samples. Performance may vary across languages and domains, so it is best used as a baseline or proof of concept. Expanding the training dataset is strongly recommended before production use.


Bias, Risks, and Limitations

  • Small dataset (90K samples): The model may hallucinate or give incorrect answers, particularly for rare or complex questions.
  • Language imbalance: Even with training data balanced across the five languages, the base model's pretraining coverage differs by language, so some languages may perform better than others.
  • Cultural context: The model may not capture nuanced cultural meanings or idiomatic expressions in all five languages.
  • No safety fine-tuning: This model has not been RLHF-tuned or filtered for harmful outputs.

Recommendations

Users should validate model outputs before deploying them in community-facing applications. Additional data collection and evaluation are recommended, especially for Ateso and Japadhola, which have fewer NLP resources available.


Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact Calculator.

| Field | Details |
|---|---|
| Hardware Type | GPU (e.g., T4 / A100) |
| Training Duration | ~1–2 hours (estimated for 90K samples) |
| Cloud Provider | TBD |
| Carbon Emitted | Low (small dataset + LoRA adapter only) |

Citation

If you use this model in your research or application, please cite:

@misc{multilingual-slm-ug,
  title     = {Multilingual SLM for Ugandan Languages: Ateso, Luganda, English, Runyankore, Japadhola},
  author    = {PhosAI},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Bateesa/tiny-aya-global-lora-qa}
}

Model Card Contact

For questions, feedback, or collaboration inquiries, please open an issue on the model repository or contact [your contact info].


Framework Versions

  • PEFT 0.18.1
  • Transformers ≥ 4.38.0
  • PyTorch ≥ 2.0