Sepedi-Llama v1

First open-source continued pre-training of XLM-RoBERTa on Sepedi (Northern Sotho)

Developed by Sediba AI / TSEBO SOVEREIGN TECH (Pty) Ltd
Mankweng, Limpopo, South Africa

Model Description

Sepedi-Llama v1 is a continued pre-training of xlm-roberta-base on a 376,803-record clean Sepedi corpus. It is the first publicly available language model specifically trained on Sepedi (ISO 639-3: nso) at this scale.

Sepedi (Northern Sotho / Sesotho sa Leboa) is spoken by approximately 4.6 million people in the Limpopo, Gauteng, and Mpumalanga provinces of South Africa. Despite being one of South Africa's official languages, it has near-zero representation in existing NLP infrastructure.

This model is the foundation layer of the Sediba AI intelligence platform — a community-sovereign AI system serving Sepedi-speaking communities in Limpopo.

Intended Uses

Primary uses

- Fine-tuning for downstream Sepedi NLP tasks:
  - Sentiment classification
  - Named Entity Recognition (NER)
  - Text classification
  - Question answering
- EduIntel: curriculum AI assistant for Grade 10-12 Sepedi teachers (Morutabana platform)
- Research: low-resource African language NLP

Out-of-scope uses

Production text generation without further fine-tuning
Languages other than Sepedi/Northern Sotho
Commercial use without review of the CC BY-NC-SA 4.0 license terms

Training Details

Training Data

Source	Records	License	Notes
Autshumato Monolingual Sepedi v2.1	91,347	CC BY	SADiLaR
Autshumato Bilingual Sepedi-English (NSO side)	122,665	CC BY	SADiLaR
NCHLT RAW Sepedi corpus	63,984	CC BY 4.0	CTexT/NWU
FineWeb-2 NSO subset	144,611	CC BY 4.0	HuggingFaceFW
Sepedi Bible	30,616	Public domain
Wikipedia NSO	18,347	CC BY-SA	wikimedia
DBE NSC exam papers 2017-2025	195 documents	Public domain	Grade 10-12
DBE Annual Teaching Plans 2023	9 documents	Public domain	Grade 10-12
Total	376,803 records

Full dataset: Sediba-AI/sepedi-training-v1

Training Procedure

Parameter	Value
Base model	`xlm-roberta-base`
Training objective	Masked Language Modeling (MLM, 15% mask rate)
Epochs	1
Batch size (effective)	32 (2 per device × 16 accumulation steps)
Max sequence length	64 tokens
Learning rate	5e-5 with linear warmup (200 steps)
Weight decay	0.01
Hardware	Tesla T4 (16GB VRAM)
Training time	10 hours 3 minutes
Framework	HuggingFace Transformers 4.x
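
The table gives only the 15% mask rate for the MLM objective. As a rough illustration of what that objective involves, here is a minimal sketch of BERT-style dynamic masking in plain Python. The 80/10/10 split (mask token / random token / unchanged) is the usual default in HuggingFace's `DataCollatorForLanguageModeling` and is an assumption here, as are the helper name `mask_tokens`, the toy token IDs, and the vocabulary size; none of them are taken from the actual training script.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(token_ids, mask_token_id, vocab_size, special_ids=(),
                mask_rate=0.15, seed=None):
    """Return (inputs, labels) for one sequence under BERT-style masking.

    Hypothetical helper for illustration; not the project's training code.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; other tokens are selected
        # with probability mask_rate.
        if tok in special_ids or rng.random() >= mask_rate:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token_id              # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return inputs, labels


# Toy example: IDs 0 and 2 stand in for sentence-boundary tokens.
# All values below are illustrative.
toy_ids = [0] + list(range(100, 130)) + [2]
inputs, labels = mask_tokens(toy_ids, mask_token_id=250001, vocab_size=250002,
                             special_ids={0, 2}, seed=13)
selected = sum(label != IGNORE_INDEX for label in labels)
print(selected, "of", len(toy_ids), "positions selected for prediction")
```

Only the selected positions contribute to the loss; every other position carries the ignore label.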

Training Loss

Epoch	Step	Loss
0.00	0	—
0.004	50	83.27
0.085	1000	~35
0.25	3000	~26
0.50	5900	~22
0.75	8800	~20
1.00	11776	19.04

Training loss: 25.45 (average over the epoch) | 19.04 (final step)
Loss reduction from the first logged step to the final step: 83.27 → 19.04 (77%)
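
The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes one training example per corpus record and one optimizer step per effective batch, neither of which the card states explicitly. Under those assumptions it reproduces the 11,776 steps and the 77% reduction reported above.

```python
import math

records = 376_803            # corpus size from the model description
per_device_batch = 2
accumulation_steps = 16
effective_batch = per_device_batch * accumulation_steps

# Assumption: one example per record; the last partial batch still counts as a step.
steps_per_epoch = math.ceil(records / effective_batch)

first_logged_loss = 83.27    # step 50
final_step_loss = 19.04      # step 11776
reduction = (first_logged_loss - final_step_loss) / first_logged_loss

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"loss reduction:       {reduction:.1%}")

# Epoch fraction for each logged step in the table.
for step in (50, 1000, 3000, 5900, 8800):
    print(f"step {step:>5} -> epoch {step / steps_per_epoch:.3f}")
```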

Evaluation

Formal evaluation on standard benchmarks (FLORES-200, MasakhaNER) is pending. This v1 release is intended to establish a baseline and gather community feedback.

Downstream fine-tuning experiments:

Sepedi sentiment classification (in progress — Sediba-AI/sepedi-sentiment-classifier)
EduIntel curriculum QA (deployed to Morutabana dashboard)

Limitations

Short sequence length: Trained with max_length=64 due to GPU memory constraints, so the model may underperform on longer documents. Future versions are planned to train at max_length=512 on a larger GPU allocation.
Single epoch: The model was trained for one epoch only, which is a starting point. Further training is expected to improve performance.
Informal register: Training data is primarily formal text (government, education, religious), so performance on informal or conversational Sepedi may be limited.
Dialect coverage: Sepedi has multiple dialect zones (Sekhukhune, Balobedu, Batlokwa, etc.). The current corpus does not tag dialect zones.
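
One common workaround for the 64-token limit at fine-tuning or inference time is to split long inputs into overlapping windows and aggregate the per-window outputs. A minimal sketch, assuming the text has already been tokenized into a list of IDs without special tokens. The helper name `chunk_token_ids`, the stride of 16, and the two positions reserved for boundary tokens are illustrative choices, not part of the released model.

```python
def chunk_token_ids(token_ids, max_length=64, stride=16, num_special=2):
    """Split token IDs into overlapping windows that fit within max_length.

    num_special reserves room for tokens the tokenizer adds to each window.
    Hypothetical helper for illustration only.
    """
    window = max_length - num_special
    if window <= 0 or not 0 <= stride < window:
        raise ValueError("need 0 <= stride < max_length - num_special")
    if not token_ids:
        return []
    step = window - stride  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window reached the end of the sequence
        start += step
    return chunks


# A made-up 150-token document.
doc = list(range(150))
chunks = chunk_token_ids(doc)
print([(c[0], c[-1], len(c)) for c in chunks])
```

With a HuggingFace fast tokenizer, passing `truncation=True`, `max_length=64`, `stride=16`, and `return_overflowing_tokens=True` produces comparable windows directly from raw text.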

Ethical Considerations

Data sovereignty

This model was developed under the Sediba Sovereignty Framework:

Training data sourced from publicly licensed corpora
Community data collected via Leotša la Sepedi with contributor consent (FPIC framework)
Commercial use governed by TSEBO SOVEREIGN TECH (Pty) Ltd — 25% of commercial revenue flows to Sediba AI NPC (community non-profit)

Language sovereignty

Sepedi speakers have historically been excluded from AI development. This model is built by and for the Sepedi-speaking community in Limpopo, with the explicit goal of returning AI capability to the community that speaks the language.

Bias and risks

Religious text (Sepedi Bible) may introduce theological framing in certain contexts
Government/educational text may reflect formal register bias
Model reflects biases present in web-crawled text (FineWeb-2)

How to Use

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Sediba-AI/sepedi-llama-v1")
model = AutoModelForMaskedLM.from_pretrained("Sediba-AI/sepedi-llama-v1")

# Example: fill-mask in Sepedi
text = "Baithuti ba <mask> go bala dipuku tša Sepedi."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Locate the first <mask> position in the (single) input sequence
mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
logits = outputs.logits[0, mask_idx]
# Convert the top-5 token IDs to a plain list before decoding
top_tokens = tokenizer.convert_ids_to_tokens(logits.topk(5).indices.tolist())
print("Top predictions:", top_tokens)

Fine-tuning for classification

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Sediba-AI/sepedi-llama-v1",
    num_labels=3  # e.g. positive/neutral/negative
)
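
When fine-tuning for classification, it is usually worth attaching human-readable label names to the model config so that predictions decode to names rather than generic `LABEL_0`-style identifiers. A minimal sketch, assuming a three-way sentiment label set. The label names and both helper names are illustrative, and the checkpoint is only downloaded if `load_classifier` is actually called.

```python
def build_label_maps(labels):
    """Return (id2label, label2id) dictionaries for an ordered list of label names."""
    if len(set(labels)) != len(labels):
        raise ValueError("label names must be unique")
    id2label = dict(enumerate(labels))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_classifier(labels, model_id="Sediba-AI/sepedi-llama-v1"):
    """Load the checkpoint with a fresh classification head (downloads weights)."""
    from transformers import AutoModelForSequenceClassification

    id2label, label2id = build_label_maps(labels)
    return AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )


# Assumed label set for illustration.
id2label, label2id = build_label_maps(["negative", "neutral", "positive"])
print(id2label)
print(label2id)
```

The classification head is randomly initialized in either case, so the model needs fine-tuning on labeled data before its predictions are meaningful.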

Citation

@misc{sepedi-llama-v1-2026,
  author       = {Lehlohonolo Lemekoana and Sediba AI},
  title        = {Sepedi-Llama v1: Continued Pre-training of XLM-RoBERTa on Sepedi},
  year         = {2026},
  publisher    = {HuggingFace},
  url          = {https://huggingface.co/Sediba-AI/sepedi-llama-v1},
  note         = {TSEBO SOVEREIGN TECH (Pty) Ltd, Mankweng, Limpopo, South Africa}
}

Related Resources

Resource	Link
Training dataset	Sediba-AI/sepedi-training-v1
Sentiment classifier	Sediba-AI/sepedi-sentiment-classifier
Data collection platform	Leotša la Sepedi
Organisation	Sediba-AI on HuggingFace
GitHub	github.com/Sediba-AI

Acknowledgements

SADiLaR (South African Centre for Digital Language Resources) — Autshumato and NCHLT corpora
CTexT, NWU — NCHLT speech and text resources
Masakhane — African NLP community and benchmarks
Department of Basic Education, South Africa — NSC exam papers and Annual Teaching Plans (public domain)
HuggingFaceFW — FineWeb-2 NSO subset
Anri Lombard — MzansiLM reference architecture

Built in Mankweng, Limpopo. Powering the invisible foundations.
Sediba AI — Making the unseen ours and seen.
