dondza-xitsonga-asr-wav2vec2

Dondza-Xitsonga Wav2Vec2 is a Xitsonga Automatic Speech Recognition (ASR) model, fine-tuned from the pretrained checkpoint facebook/wav2vec2-xls-r-300m as part of the Dondza project (Raggio AI).

To our knowledge, this is among the first Mozambican-developed and publicly released end-to-end ASR models for Xitsonga.
(If you are aware of earlier Mozambican releases, please share them; we welcome corrections.)


License (READ FIRST): Dual licensing for the model checkpoint

This repository provides a fine-tuned ASR checkpoint and related files.

1) Non-commercial use (default, free)

The model weights/checkpoint and associated files in this repo are released under the Dondza Non-Commercial Model License.

  • ✅ Allowed: research, academic use, personal projects, demos, non-profit experiments
  • ❌ Not allowed without a commercial license: use in a paid product, paid service, internal business operations, monetized apps, paid APIs, SaaS, enterprise deployments, or any other commercial exploitation
  • Full terms: see LICENSE in this repository.

2) Commercial use (paid)

If you want to use this model commercially (including deploying it in a commercial product/service or using it to power a paid offering), you must obtain a Commercial License from Raggio AI.

To request a commercial license, contact Raggio AI. We can offer:

  • Commercial license for self-hosting the model weights, and/or
  • Subscription access to a hosted ASR API (commercial terms/SLA available upon request)

Upstream / third-party licenses & attribution (important)

Although the fine-tuned checkpoint in this repo is under a custom license, the repo also depends on upstream components with their own licenses.

Base model and libraries

  • Base checkpoint: facebook/wav2vec2-xls-r-300m (Apache-2.0)
  • Transformers library: Apache-2.0

When redistributing or using this repo, you must also comply with any applicable upstream notice and attribution requirements.

Training dataset (not redistributed here)

This model was fine-tuned on the NCHLT Xitsonga Speech Corpus (approx. 56 hours of read speech).

  • We do not redistribute the dataset in this repository.
  • You must obtain the dataset from its official source.
  • Dataset license: CC BY 3.0 (attribution required).

Required attribution (dataset):
The Department of Arts and Culture of the government of the Republic of South Africa (DAC), the Council for Scientific and Industrial Research (CSIR), and North-West University (NWU), for the NCHLT Speech Corpus.


Base model

  • Base checkpoint: facebook/wav2vec2-xls-r-300m
  • Architecture: Wav2Vec2 + CTC
  • Note: the CTC output head (lm_head.weight, lm_head.bias) is newly initialized for the Xitsonga vocabulary and learned during fine-tuning. This is expected for CTC fine-tuning.
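To make the CTC note above concrete: at inference, greedy CTC decoding takes the argmax token per frame, merges consecutive repeats, and drops the blank token (this is what `processor.batch_decode` does internally). A minimal sketch of the collapse step, with illustrative token ids and blank index rather than this model's actual vocabulary:

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev:          # a new run starts here
            if t != blank_id:  # the CTC blank is never emitted
                out.append(t)
        prev = t
    return out

# Blank = 0. The run "7 7 0 7" decodes to two 7s: the blank
# separates repeats, which is how CTC represents doubled tokens.
print(ctc_greedy_collapse([0, 5, 5, 0, 7, 7, 0, 7, 3]))  # [5, 7, 7, 3]
```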

Dataset

This model was fine-tuned on the NCHLT Speech Corpus (Xitsonga).

The NCHLT corpus is primarily read speech. Real-world conversational audio (noise, code-switching, phone recordings) may produce higher error rates.


Data splits (utterances, speakers, duration)

| split | utterances | unique speakers | duration (hours) | duration (hh:mm:ss) |
|-------|-----------:|----------------:|-----------------:|--------------------:|
| train | 44,924     | 190             | 52.658           | 52:39:29            |
| val   | 2,247      | 188             | 2.595            | 02:35:42            |
| test  | 2,905      | 8               | 3.609            | 03:36:31            |

Speaker separation

The train and validation speaker sets are disjoint (no overlap), so validation results reflect generalisation to unseen speakers.
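A disjointness check like the one described here is straightforward if you keep per-utterance speaker IDs. The metadata layout below (utterance/speaker pairs) is an illustrative assumption, not the corpus's actual file format:

```python
# Hypothetical split metadata: (utterance_id, speaker_id) pairs.
train_meta = [("utt_0001", "spk_012"), ("utt_0002", "spk_045")]
val_meta = [("utt_9001", "spk_101"), ("utt_9002", "spk_102")]

train_speakers = {spk for _, spk in train_meta}
val_speakers = {spk for _, spk in val_meta}

# Disjoint speaker sets mean validation WER measures
# generalisation to voices the model never saw in training.
assert train_speakers.isdisjoint(val_speakers)
print("no speaker overlap between train and val")
```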


Results

Best checkpoint (selected by lowest validation WER):

  • Validation WER: 0.0719 (≈ 7.19%)
  • Validation loss: 0.0721
  • Step: 26,500
  • Epoch: 29.7757
  • Eval runtime: 27.787s
  • Eval throughput: 80.865 samples/s
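For reference, WER is the word-level edit distance (substitutions + insertions + deletions) between the hypothesis and the reference transcript, divided by the number of reference words. A minimal self-contained implementation, with made-up example sentences rather than NCHLT data:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (subs + ins + dels) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # deletion
                       d[j - 1] + 1,        # insertion
                       prev + (r != h))     # substitution / match
            prev = cur
    return d[-1] / len(ref)

# One substitution ("too") and one deletion ("four"): 2 / 4 = 0.5
print(wer("one two three four", "one too three"))  # 0.5
```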

Intended use

  • Voice input for the Dondza app (Xitsonga ASR)
  • Research and prototyping for Xitsonga speech technology
  • Batch transcription of read or clean speech recordings in Xitsonga

Limitations

  • Domain mismatch: trained on read speech; conversational or noisy audio may degrade performance.
  • Dialect/region: NCHLT Xitsonga data is largely from South Africa; Mozambican varieties (e.g., Changana/Ronga/Tswa influences) may differ in pronunciation and orthography.
  • Not for high-stakes use: not validated for medical, legal, or other safety-critical transcription tasks.

How to use

import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "Raggio/dondza-xitsonga-asr-wav2vec2"

processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# The model expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: argmax per frame, then collapse repeats and blanks.
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]
print(text)
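For longer recordings, one practical approach (our suggestion, not something the model prescribes) is to split the waveform into fixed-length 16 kHz chunks and transcribe each chunk separately, since memory use grows with input length. A minimal chunking sketch over a raw sample array:

```python
def chunk_audio(samples, sr=16000, chunk_seconds=20.0):
    """Split a 1-D sequence of samples into fixed-length chunks."""
    size = int(sr * chunk_seconds)
    return [samples[i:i + size] for i in range(0, len(samples), size)]

# 50 s of (silent) dummy audio at 16 kHz -> chunks of 20 s, 20 s, 10 s.
dummy = [0.0] * (16000 * 50)
chunks = chunk_audio(dummy)
print([len(c) / 16000 for c in chunks])  # [20.0, 20.0, 10.0]
```

Note that naive fixed-length cuts can split a word at a chunk boundary; overlapping chunks or silence-based segmentation mitigate this.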