Sepedi-Llama v1

First open-source continued pre-training of XLM-RoBERTa on Sepedi (Northern Sotho)

Developed by Sediba AI / TSEBO SOVEREIGN TECH (Pty) Ltd
Mankweng, Limpopo, South Africa


Model Description

Sepedi-Llama v1 is xlm-roberta-base further pre-trained on a clean Sepedi corpus of 376,803 records. Despite the name, it is an encoder-only masked language model, not a Llama-style generative model. It is the first publicly available language model specifically trained on Sepedi (ISO 639-3: nso) at this scale.

Sepedi (Northern Sotho / Sesotho sa Leboa) is spoken by approximately 4.6 million people in the Limpopo, Gauteng, and Mpumalanga provinces of South Africa. Despite being one of South Africa's 12 official languages, it has near-zero representation in existing NLP infrastructure.

This model is the foundation layer of the Sediba AI intelligence platform — a community-sovereign AI system serving Sepedi-speaking communities in Limpopo.


Intended Uses

Primary uses

  • Fine-tuning for downstream Sepedi NLP tasks:
    • Sentiment classification
    • Named Entity Recognition (NER)
    • Text classification
    • Question answering
  • EduIntel — curriculum AI assistant for Grade 10-12 Sepedi teachers (Morutabana platform)
  • Research — low-resource African language NLP

Out-of-scope uses

  • Production text generation without further fine-tuning
  • Languages other than Sepedi/Northern Sotho
  • Commercial use without review of the CC BY-NC-SA 4.0 license terms

Training Details

Training Data

| Source | Records | License | Notes |
|---|---|---|---|
| Autshumato Monolingual Sepedi v2.1 | 91,347 | CC BY | SADiLaR |
| Autshumato Bilingual Sepedi-English (NSO side) | 122,665 | CC BY | SADiLaR |
| NCHLT RAW Sepedi corpus | 63,984 | CC BY 4.0 | CTexT/NWU |
| FineWeb-2 NSO subset | 144,611 | CC BY 4.0 | HuggingFaceFW |
| Sepedi Bible | 30,616 | Public domain | |
| Wikipedia NSO | 18,347 | CC BY-SA | wikimedia |
| DBE NSC exam papers 2017-2025 | 195 documents | Public domain | Grade 10-12 |
| DBE Annual Teaching Plans 2023 | 9 documents | Public domain | Grade 10-12 |
| Total | 376,803 records | | |

Full dataset: Sediba-AI/sepedi-training-v1
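
To see how the corpus is weighted across sources, the record counts listed above can be turned into shares. The sketch below uses only the figures from the table. The two DBE sources are given in documents rather than records, so they are left out. Note that the listed per-source counts sum to 471,570, which is more than the reported total of 376,803. The card does not explain the gap. One possible reading is that the per-source figures predate cleaning or deduplication, but that is an assumption.

```python
# Record counts per source, copied from the training data table.
# The DBE exam papers and teaching plans are listed in documents, not records,
# so they are not included here.
SOURCE_RECORDS = {
    "Autshumato Monolingual Sepedi v2.1": 91_347,
    "Autshumato Bilingual Sepedi-English (NSO side)": 122_665,
    "NCHLT RAW Sepedi corpus": 63_984,
    "FineWeb-2 NSO subset": 144_611,
    "Sepedi Bible": 30_616,
    "Wikipedia NSO": 18_347,
}
REPORTED_TOTAL = 376_803  # total stated in the table


def source_shares(counts):
    """Return each source's share of the listed records, as a fraction."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


listed_total = sum(SOURCE_RECORDS.values())
shares = source_shares(SOURCE_RECORDS)

# Print sources from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<50} {share:6.1%}")
print(f"listed: {listed_total:,}  reported: {REPORTED_TOTAL:,}")
```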

Training Procedure

| Parameter | Value |
|---|---|
| Base model | xlm-roberta-base |
| Training objective | Masked Language Modeling (MLM, 15% mask rate) |
| Epochs | 1 |
| Batch size (effective) | 32 (2 per device × 16 accumulation steps) |
| Max sequence length | 64 tokens |
| Learning rate | 5e-5 with linear warmup (200 steps) |
| Weight decay | 0.01 |
| Hardware | Tesla T4 (16 GB VRAM) |
| Training time | 10 hours 3 minutes |
| Framework | HuggingFace Transformers 4.x |
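
The MLM objective with a 15% mask rate can be illustrated with a minimal, library-free sketch. It follows the standard BERT-style recipe: of the selected positions, 80% become the mask token, 10% a random token, and 10% stay unchanged. Whether this run used exactly that split is an assumption, since the card states only the 15% rate. The function name `mask_tokens` and the toy ids in the usage lines are hypothetical. A real data collator would also avoid masking special tokens such as `<s>` and `</s>`.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    Each position is selected with probability mask_prob. Selected positions
    keep their original id as the label. Unselected positions get IGNORE_INDEX
    so they do not contribute to the loss.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return inputs, labels


# Toy usage with made-up ids (not the real XLM-RoBERTa vocabulary).
toy_ids = list(range(10, 74))  # 64 tokens, matching the max sequence length
masked_inputs, labels = mask_tokens(toy_ids, mask_id=4, vocab_size=100, rng=random.Random(0))
num_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{num_selected} of {len(toy_ids)} positions selected for prediction")
```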

Training Loss

| Epoch | Step | Loss |
|---|---|---|
| 0.00 | 0 | |
| 0.004 | 50 | 83.27 |
| 0.085 | 1000 | ~35 |
| 0.25 | 3000 | ~26 |
| 0.50 | 5900 | ~22 |
| 0.75 | 8800 | ~20 |
| 1.00 | 11776 | 19.04 |

Training loss: 25.45 averaged over the epoch; 19.04 at the final step.
Loss reduction from step 50 to the final step: 83.27 → 19.04 (77%).
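
The step counts and the loss reduction reported above can be cross-checked with a few lines of arithmetic. This sketch assumes a single GPU, one training example per corpus record (no sequence packing), and a final partial batch that still counts as a step. The card does not state these details.

```python
import math

NUM_RECORDS = 376_803   # corpus size from the training data table
PER_DEVICE_BATCH = 2    # from the training procedure table
GRAD_ACCUM_STEPS = 16   # from the training procedure table

# Effective batch size = per-device batch × gradient accumulation steps.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

# Optimizer steps in one epoch, rounding the last partial batch up.
steps_per_epoch = math.ceil(NUM_RECORDS / effective_batch)


def epoch_fraction(step):
    """Fraction of the epoch completed at a given optimizer step."""
    return step / steps_per_epoch


def relative_reduction(start, end):
    """Relative drop from start to end, as a fraction of start."""
    return (start - end) / start


print(effective_batch)                                # 32
print(steps_per_epoch)                                # 11776
print(f"{epoch_fraction(5900):.2f}")                  # 0.50
print(f"{relative_reduction(83.27, 19.04):.0%}")      # 77%
```

Under those assumptions, the computed 11,776 steps match the final step in the loss table.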


Evaluation

Formal evaluation benchmarks (FLORES-200, MasakhaNER) are pending. This is a v1 release intended to establish a baseline and receive community feedback.

Downstream fine-tuning experiments:

  • Sepedi sentiment classification (in progress — Sediba-AI/sepedi-sentiment-classifier)
  • EduIntel curriculum QA (deployed to Morutabana dashboard)

Limitations

  • Short sequence length: Trained with max_length=64 due to GPU memory constraints. May underperform on longer documents. Future versions will train at max_length=512 with larger GPU allocation.
  • 1 epoch only: Single-epoch training is a starting point. Further training is expected to improve performance.
  • Informal register: Training data is primarily formal text (government, education, religious). Performance on informal/conversational Sepedi may be limited.
  • Dialect coverage: Sepedi has multiple dialect zones (Sekhukhune, Balobedu, Batlokwa, and others). The current corpus does not tag dialect zones.
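
One common workaround for the short sequence length is to split longer inputs into overlapping windows and run the model on each window. The helper below is a minimal sketch of that idea on a plain list of token ids. The name `chunk_token_ids` and the default overlap of 16 tokens are illustrative choices, not part of this release. It reserves two positions per window for the `<s>` and `</s>` tokens that the tokenizer adds around a single sequence.

```python
def chunk_token_ids(token_ids, max_length=64, stride=16, num_special=2):
    """Split token ids into overlapping windows.

    Each window holds at most max_length - num_special ids, leaving room for
    the special tokens. Consecutive windows share `stride` ids.
    """
    window = max_length - num_special
    if window <= 0:
        raise ValueError("max_length must exceed num_special")
    if not 0 <= stride < window:
        raise ValueError("stride must be in [0, window)")

    step = window - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks


# Toy usage: 150 made-up token ids split into windows of up to 62 ids.
example_chunks = chunk_token_ids(list(range(150)))
print([len(chunk) for chunk in example_chunks])
```

Hugging Face tokenizers offer similar behaviour through `truncation=True`, `max_length`, `stride`, and `return_overflowing_tokens=True`, which may be more convenient than a hand-written helper.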

Ethical Considerations

Data sovereignty

This model was developed under the Sediba Sovereignty Framework:

  • Training data sourced from publicly licensed corpora
  • Community data collected via Leotša la Sepedi with contributor consent (FPIC: free, prior and informed consent)
  • Commercial use governed by TSEBO SOVEREIGN TECH (Pty) Ltd — 25% of commercial revenue flows to Sediba AI NPC (community non-profit)

Language sovereignty

Sepedi speakers have historically been excluded from AI development. This model is built by and for the Sepedi-speaking community in Limpopo, with the explicit goal of returning AI capability to the community that speaks the language.

Bias and risks

  • Religious text (Sepedi Bible) may introduce theological framing in certain contexts
  • Government/educational text may reflect formal register bias
  • Model reflects biases present in web-crawled text (FineWeb-2)

How to Use

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Sediba-AI/sepedi-llama-v1")
model = AutoModelForMaskedLM.from_pretrained("Sediba-AI/sepedi-llama-v1")

# Example: fill-mask in Sepedi
text = "Baithuti ba <mask> go bala dipuku tša Sepedi."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Find the position of the <mask> token, then rank the vocabulary by logit at that position
mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
logits = outputs.logits[0, mask_idx]
top_tokens = tokenizer.convert_ids_to_tokens(logits.topk(5).indices.tolist())
print("Top predictions:", top_tokens)
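
The snippet above ranks candidates by raw logit. To report a probability for each candidate, apply a softmax over the vocabulary first. The sketch below shows the computation on a plain list of floats. The helper name `top_k_probabilities` and the toy logits are hypothetical. With PyTorch tensors, `logits.softmax(dim=-1)` gives the same probabilities.

```python
import math


def top_k_probabilities(logits, k=5):
    """Return the k highest-scoring (index, probability) pairs under a softmax."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    normalizer = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(i, exps[i] / normalizer) for i in ranked]


# Toy usage with made-up logits over a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.1, -1.0]
for index, prob in top_k_probabilities(toy_logits, k=2):
    print(index, f"{prob:.3f}")
```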

Fine-tuning for classification

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Sediba-AI/sepedi-llama-v1",
    num_labels=3  # e.g. positive/neutral/negative
)
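
When fine-tuning for classification, it usually helps to store readable label names in the model config so that predictions come back as names rather than indices. The sketch below builds the two mappings. The label set and the helper name `build_label_maps` are hypothetical and should be replaced with the labels of the actual dataset.

```python
def build_label_maps(labels):
    """Build id-to-label and label-to-id mappings from an ordered list of names."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


# Hypothetical label set for a three-way sentiment task.
LABELS = ["negative", "neutral", "positive"]
id2label, label2id = build_label_maps(LABELS)

# Keyword arguments that can be forwarded to the model config.
classifier_kwargs = {
    "num_labels": len(LABELS),
    "id2label": id2label,
    "label2id": label2id,
}
print(classifier_kwargs)
```

These can be passed along with the model name, for example `AutoModelForSequenceClassification.from_pretrained("Sediba-AI/sepedi-llama-v1", **classifier_kwargs)`.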

Citation

@misc{sepedi-llama-v1-2026,
  author       = {Lehlohonolo Lemekoana and Sediba AI},
  title        = {Sepedi-Llama v1: Continued Pre-training of XLM-RoBERTa on Sepedi},
  year         = {2026},
  publisher    = {HuggingFace},
  url          = {https://huggingface.co/Sediba-AI/sepedi-llama-v1},
  note         = {TSEBO SOVEREIGN TECH (Pty) Ltd, Mankweng, Limpopo, South Africa}
}

Related Resources

| Resource | Link |
|---|---|
| Training dataset | Sediba-AI/sepedi-training-v1 |
| Sentiment classifier | Sediba-AI/sepedi-sentiment-classifier |
| Data collection platform | Leotša la Sepedi |
| Organisation | Sediba-AI on HuggingFace |
| GitHub | github.com/Sediba-AI |

Acknowledgements

  • SADiLaR (South African Centre for Digital Language Resources) — Autshumato and NCHLT corpora
  • CTexT, NWU — NCHLT speech and text resources
  • Masakhane — African NLP community and benchmarks
  • Department of Basic Education, South Africa — NSC exam papers and Annual Teaching Plans (public domain)
  • HuggingFaceFW — FineWeb-2 NSO subset
  • Anri Lombard — MzansiLM reference architecture

Built in Mankweng, Limpopo. Powering the invisible foundations.
Sediba AI — Making the unseen ours and seen.

Model size: 0.3B params (Safetensors, tensor type F32)