Sepedi-Llama v1
First open-source continued pre-training of XLM-RoBERTa on Sepedi (Northern Sotho)
Developed by Sediba AI / TSEBO SOVEREIGN TECH (Pty) Ltd
Mankweng, Limpopo, South Africa
Model Description
Sepedi-Llama v1 is a continued pre-training of xlm-roberta-base on a 376,803-record clean Sepedi corpus. It is the first publicly available language model specifically trained on Sepedi (ISO 639-3: nso) at this scale.
Sepedi (Northern Sotho / Sesotho sa Leboa) is spoken by approximately 4.6 million people in the Limpopo, Gauteng, and Mpumalanga provinces of South Africa. Despite being one of South Africa's 12 official languages, it has near-zero representation in existing NLP infrastructure.
This model is the foundation layer of the Sediba AI intelligence platform — a community-sovereign AI system serving Sepedi-speaking communities in Limpopo.
Intended Uses
Primary uses
- Fine-tuning for downstream Sepedi NLP tasks:
  - Sentiment classification
  - Named Entity Recognition (NER)
  - Text classification
  - Question answering
- EduIntel — curriculum AI assistant for Grade 10-12 Sepedi teachers (Morutabana platform)
- Research — low-resource African language NLP
Out-of-scope uses
- Text generation: as a masked language model (encoder-only), this checkpoint fills in masked tokens and is not designed to generate free-form text
- Languages other than Sepedi/Northern Sotho
- Commercial use without review of the CC BY-NC-SA 4.0 license terms
Training Details
Training Data
| Source | Records | License | Notes |
|---|---|---|---|
| Autshumato Monolingual Sepedi v2.1 | 91,347 | CC BY | SADiLaR |
| Autshumato Bilingual Sepedi-English (NSO side) | 122,665 | CC BY | SADiLaR |
| NCHLT RAW Sepedi corpus | 63,984 | CC BY 4.0 | CTexT/NWU |
| FineWeb-2 NSO subset | 144,611 | CC BY 4.0 | HuggingFaceFW |
| Sepedi Bible | 30,616 | Public domain | |
| Wikipedia NSO | 18,347 | CC BY-SA | wikimedia |
| DBE NSC exam papers 2017-2025 | 195 documents | Public domain | Grade 10-12 |
| DBE Annual Teaching Plans 2023 | 9 documents | Public domain | Grade 10-12 |
| Total (clean corpus) | 376,803 records | | |
Full dataset: Sediba-AI/sepedi-training-v1
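As a quick consistency check on the table above, the per-source counts can be tallied in a few lines. This is a minimal sketch: the dictionary keys are shorthand labels chosen for this example, not official dataset names, and the two DBE rows are left out because they are reported in documents rather than records.

```python
# Per-source record counts copied from the training data table.
# Keys are shorthand labels for this sketch, not official dataset names.
source_records = {
    "autshumato_monolingual": 91_347,
    "autshumato_bilingual_nso": 122_665,
    "nchlt_raw": 63_984,
    "fineweb2_nso": 144_611,
    "sepedi_bible": 30_616,
    "wikipedia_nso": 18_347,
}

listed_total = sum(source_records.values())
clean_total = 376_803  # total reported for the clean corpus

# Share of the listed records contributed by each source, largest first.
shares = {
    name: count / listed_total
    for name, count in sorted(source_records.items(), key=lambda kv: -kv[1])
}

for name, share in shares.items():
    print(f"{name:26s} {share:6.1%}")
print(f"listed records: {listed_total:,}  clean corpus: {clean_total:,}")
```

The listed rows add up to more than the reported clean total. That would be consistent with records being dropped during cleaning, although the table does not say how the reduction was made.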
Training Procedure
| Parameter | Value |
|---|---|
| Base model | xlm-roberta-base |
| Training objective | Masked Language Modeling (MLM, 15% mask rate) |
| Epochs | 1 |
| Batch size (effective) | 32 (2 per device × 16 accumulation steps) |
| Max sequence length | 64 tokens |
| Learning rate | 5e-5 with linear warmup (200 steps) |
| Weight decay | 0.01 |
| Hardware | Tesla T4 (16GB VRAM) |
| Training time | 10 hours 3 minutes |
| Framework | HuggingFace Transformers 4.x |
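The numbers in this table determine how many optimizer steps one epoch takes and what the learning rate looks like over time. The sketch below works through that arithmetic under two assumptions that the table does not state: each record is one training sequence on a single GPU, and the learning rate decays linearly to zero after warmup (the default `Trainer` scheduler behaves this way, but the run may have used something else). The function name `lr_at` is illustrative.

```python
import math

# Values from the training procedure table.
num_records = 376_803
per_device_batch = 2
grad_accum_steps = 16
base_lr = 5e-5
warmup_steps = 200
train_seconds = 10 * 3600 + 3 * 60  # 10 hours 3 minutes

effective_batch = per_device_batch * grad_accum_steps
# One optimizer step per effective batch; the last partial batch still counts.
total_steps = math.ceil(num_records / effective_batch)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Linear warmup is stated in the table. The linear decay to zero
    afterwards is an assumption made for this sketch.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


print("effective batch:", effective_batch)
print("optimizer steps per epoch:", total_steps)
print(f"{train_seconds / total_steps:.2f} s per optimizer step")
for step in (0, 100, 200, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the step count comes out at 11,776, which matches the final step reported in the training loss table.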
Training Loss
| Epoch | Step | Loss |
|---|---|---|
| 0.00 | 0 | — |
| 0.004 | 50 | 83.27 |
| 0.085 | 1000 | ~35 |
| 0.25 | 3000 | ~26 |
| 0.50 | 5900 | ~22 |
| 0.75 | 8800 | ~20 |
| 1.00 | 11776 | 19.04 |
Training loss: 25.45 averaged over the epoch; 19.04 at the final step.
Loss reduction from step 50 to the final step: 83.27 → 19.04 (77%)
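The reduction figure and the epoch column can be re-derived from the logged steps. This is a small sketch using only the values in the table; entries marked "~" are entered as the rounded numbers shown, so anything computed from them is approximate.

```python
# Logged points from the training loss table: (optimizer step, loss).
# Approximate ("~") entries are kept as the rounded values shown.
logged = [(50, 83.27), (1000, 35.0), (3000, 26.0),
          (5900, 22.0), (8800, 20.0), (11776, 19.04)]
total_steps = 11776

first_loss = logged[0][1]
last_loss = logged[-1][1]
reduction = (first_loss - last_loss) / first_loss

# For a single-epoch run, the epoch fraction is step / total_steps.
epoch_fraction = {step: step / total_steps for step, _ in logged}

print(f"reduction: {reduction:.1%}")
for step, loss in logged:
    print(f"epoch {epoch_fraction[step]:.3f}  step {step:>5}  loss {loss}")
```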
Evaluation
Formal evaluation benchmarks (FLORES-200, MasakhaNER) are pending. This is a v1 release intended to establish a baseline and receive community feedback.
Downstream fine-tuning experiments:
- Sepedi sentiment classification (in progress — Sediba-AI/sepedi-sentiment-classifier)
- EduIntel curriculum QA (deployed to Morutabana dashboard)
Limitations
- Short sequence length: Trained with max_length=64 because of GPU memory constraints, so the model may underperform on longer documents. Future versions are planned to train at max_length=512 on larger GPUs.
- Single epoch: The model was trained for one epoch only, as a starting point. Further training is expected to improve performance.
- Formal register: Training data is primarily formal text (government, education, religious), so performance on informal or conversational Sepedi may be limited.
- Dialect coverage: Sepedi has multiple dialect zones (Sekhukhune, Balobedu, Batlokwa, etc.). The current corpus does not tag dialect zones.
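One way to work around the 64-token limit at inference time is to split a long document into overlapping windows and run the model on each window. The sketch below shows the idea on plain lists of token ids. The helper name `chunk_token_ids`, the stride of 16, and the choice to reserve two positions for the start and end special tokens are assumptions made for illustration; this is not part of the released model.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], max_length: int = 64,
                    stride: int = 16, num_special: int = 2) -> List[List[int]]:
    """Split token ids (without special tokens) into overlapping windows.

    Each window holds at most max_length - num_special ids so that the
    start and end special tokens still fit. Consecutive windows share
    `stride` ids of context.
    """
    window = max_length - num_special
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    if not token_ids:
        return []
    step = window - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


# Hypothetical ids standing in for a tokenized 150-token document.
doc_ids = list(range(150))
chunks = chunk_token_ids(doc_ids)
print([len(c) for c in chunks])
```

Fast tokenizers in Transformers expose comparable behaviour through `truncation=True`, `max_length`, `stride` and `return_overflowing_tokens=True`, which may be preferable to a hand-rolled helper.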
Ethical Considerations
Data sovereignty
This model was developed under the Sediba Sovereignty Framework:
- Training data is sourced from publicly licensed corpora
- Community data is collected via Leotša la Sepedi with contributor consent under a Free, Prior and Informed Consent (FPIC) framework
- Commercial use is governed by TSEBO SOVEREIGN TECH (Pty) Ltd; 25% of commercial revenue flows to Sediba AI NPC (community non-profit)
Language sovereignty
Sepedi speakers have historically been excluded from AI development. This model is built by and for the Sepedi-speaking community in Limpopo, with the explicit goal of returning AI capability to the community that speaks the language.
Bias and risks
- Religious text (Sepedi Bible) may introduce theological framing in certain contexts
- Government/educational text may reflect formal register bias
- Model reflects biases present in web-crawled text (FineWeb-2)
How to Use
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Sediba-AI/sepedi-llama-v1")
model = AutoModelForMaskedLM.from_pretrained("Sediba-AI/sepedi-llama-v1")

# Example: fill-mask in Sepedi
text = "Baithuti ba <mask> go bala dipuku tša Sepedi."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get top predictions for the masked token.
# nonzero() yields (batch, position) pairs; take the position of the first mask.
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0][1]
logits = outputs.logits[0, mask_idx]
top_tokens = tokenizer.convert_ids_to_tokens(logits.topk(5).indices.tolist())
print("Top predictions:", top_tokens)
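The snippet above prints the top tokens without scores. To attach probabilities, the logits at the mask position can be passed through a softmax. The sketch below shows that step on plain Python lists so it runs without a model download; the five-entry vocabulary and the logit values are made up for illustration, and with real tensors `torch.softmax(logits, dim=-1)` does the same job.

```python
import math
from typing import List, Tuple


def top_k_probs(logits: List[float], vocab: List[str],
                k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most likely tokens with their softmax probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up vocabulary and logits, for illustration only.
toy_vocab = ["rata", "bala", "nyaka", "ya", "tla"]
toy_logits = [2.0, 0.5, 1.0, -1.0, 0.0]
for token, prob in top_k_probs(toy_logits, toy_vocab, k=3):
    print(f"{token}\t{prob:.3f}")
```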
Fine-tuning for classification
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Sediba-AI/sepedi-llama-v1",
    num_labels=3,  # e.g. positive/neutral/negative
)
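Passing explicit label names alongside `num_labels` makes the classifier's predictions readable. The sketch below builds `id2label` and `label2id` mappings, which `from_pretrained` accepts as configuration overrides. The label order is an assumption and has to match how the fine-tuning data is encoded; `load_classifier` is a hypothetical helper, with the `transformers` import kept inside it so the mappings can be reused on their own.

```python
# Label names follow the "positive/neutral/negative" example above.
# The order is an assumption for this sketch.
labels = ["negative", "neutral", "positive"]
id2label = dict(enumerate(labels))
label2id = {name: idx for idx, name in id2label.items()}


def load_classifier(model_name: str = "Sediba-AI/sepedi-llama-v1"):
    """Hypothetical helper: load the checkpoint with a fresh 3-way head."""
    # Imported lazily so the label maps can be used without transformers.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )


print(id2label)
```

The classification head is newly initialised when loaded this way, so the model needs fine-tuning on labelled data before its predictions are meaningful.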
Citation
@misc{sepedi-llama-v1-2026,
  author    = {Lehlohonolo Lemekoana and Sediba AI},
  title     = {Sepedi-Llama v1: Continued Pre-training of XLM-RoBERTa on Sepedi},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Sediba-AI/sepedi-llama-v1},
  note      = {TSEBO SOVEREIGN TECH (Pty) Ltd, Mankweng, Limpopo, South Africa}
}
Related Resources
| Resource | Link |
|---|---|
| Training dataset | Sediba-AI/sepedi-training-v1 |
| Sentiment classifier | Sediba-AI/sepedi-sentiment-classifier |
| Data collection platform | Leotša la Sepedi |
| Organisation | Sediba-AI on HuggingFace |
| GitHub | github.com/Sediba-AI |
Acknowledgements
- SADiLaR (South African Centre for Digital Language Resources) — Autshumato and NCHLT corpora
- CTexT, NWU — NCHLT speech and text resources
- Masakhane — African NLP community and benchmarks
- Department of Basic Education, South Africa — NSC exam papers and Annual Teaching Plans (public domain)
- HuggingFaceFW — FineWeb-2 NSO subset
- Anri Lombard — MzansiLM reference architecture
Built in Mankweng, Limpopo. Powering the invisible foundations.
Sediba AI — Making the unseen ours and seen.