dondza-xitsonga-asr-wav2vec2

Dondza-Xitsonga Wav2Vec2 is a Xitsonga Automatic Speech Recognition (ASR) model, fine-tuned from the pretrained checkpoint facebook/wav2vec2-xls-r-300m as part of the Dondza project (Raggio AI).

To our knowledge, this is among the first Mozambican-developed and publicly released end-to-end ASR models for Xitsonga.
(If you are aware of earlier Mozambican releases, please share them; we welcome corrections.)


License (READ FIRST): Dual licensing for the model checkpoint

This repository provides a fine-tuned ASR checkpoint and related files.

1) Non-commercial use (default, free)

The model weights/checkpoint and associated files in this repo are released under the Dondza Non-Commercial Model License.

  • ✅ Allowed: research, academic use, personal projects, demos, non-profit experiments
  • ❌ Not allowed without a commercial license: use in a paid product, paid service, internal business operations, monetized apps, paid APIs, SaaS, enterprise deployments, or any other commercial exploitation
  • Full terms: see LICENSE in this repository.

2) Commercial use (paid)

If you want to use this model commercially (including deploying it in a commercial product/service or using it to power a paid offering), you must obtain a Commercial License from Raggio AI.

To request a commercial license, contact Raggio AI. We can offer:

  • Commercial license for self-hosting the model weights, and/or
  • Subscription access to a hosted ASR API (commercial terms/SLA available upon request)

Upstream / third-party licenses & attribution (important)

Although the fine-tuned checkpoint in this repo is under a custom license, the repo also depends on upstream components with their own licenses.

Base model and libraries

  • Base checkpoint: facebook/wav2vec2-xls-r-300m (Apache-2.0)
  • Transformers library: Apache-2.0

When redistributing or using this repo, you must also comply with any applicable upstream notice and attribution requirements.

Training dataset (not redistributed here)

This model was fine-tuned on the NCHLT Xitsonga Speech Corpus (approx. 56 hours of read speech).

  • We do not redistribute the dataset in this repository.
  • You must obtain the dataset from its official source.
  • Dataset license: CC BY 3.0 (attribution required).

Required attribution (dataset):
The Department of Arts and Culture of the government of the Republic of South Africa (DAC), the Council for Scientific and Industrial Research (CSIR), and North-West University (NWU), for the NCHLT Speech Corpus.


Base model

  • Base checkpoint: facebook/wav2vec2-xls-r-300m
  • Architecture: Wav2Vec2 + CTC
  • Note: the CTC output head (lm_head.weight, lm_head.bias) is newly initialized for the Xitsonga vocabulary and learned during fine-tuning. This is expected for CTC fine-tuning.
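To make the CTC note above concrete: at inference, greedy CTC decoding takes the argmax token per frame, merges consecutive repeats, and drops the blank token (this is what `processor.batch_decode` does internally). A minimal sketch of the collapse step, with illustrative token ids and blank index rather than this model's actual vocabulary:

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev:          # a new run starts here
            if t != blank_id:  # the CTC blank is never emitted
                out.append(t)
        prev = t
    return out

# Blank = 0. The run "7 7 0 7" decodes to two 7s: the blank
# separates repeats, which is how CTC represents doubled tokens.
print(ctc_greedy_collapse([0, 5, 5, 0, 7, 7, 0, 7, 3]))  # [5, 7, 7, 3]
```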

Dataset

This model was fine-tuned on the NCHLT Speech Corpus (Xitsonga).

The NCHLT corpus is primarily read speech. Real-world conversational audio (noise, code-switching, phone recordings) may produce higher error rates.


Data splits (utterances, speakers, duration)

| split | utterances | unique speakers | duration (hours) | duration (hh:mm:ss) |
|-------|-----------:|----------------:|-----------------:|--------------------:|
| train | 44,924     | 190             | 52.658           | 52:39:29            |
| val   | 2,247      | 188             | 2.595            | 02:35:42            |
| test  | 2,905      | 8               | 3.609            | 03:36:31            |

Speaker separation

The train and validation speaker sets are disjoint (no overlap), so validation results reflect generalisation to unseen speakers.
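A disjointness check like the one described here is straightforward if you keep per-utterance speaker IDs. The metadata layout below (utterance/speaker pairs) is an illustrative assumption, not the corpus's actual file format:

```python
# Hypothetical split metadata: (utterance_id, speaker_id) pairs.
train_meta = [("utt_0001", "spk_012"), ("utt_0002", "spk_045")]
val_meta = [("utt_9001", "spk_101"), ("utt_9002", "spk_102")]

train_speakers = {spk for _, spk in train_meta}
val_speakers = {spk for _, spk in val_meta}

# Disjoint speaker sets mean validation WER measures
# generalisation to voices the model never saw in training.
assert train_speakers.isdisjoint(val_speakers)
print("no speaker overlap between train and val")
```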


Results

Best checkpoint (selected by lowest validation WER):

  • Validation WER: 0.0719 (≈ 7.19%)
  • Validation loss: 0.0721
  • Step: 26,500
  • Epoch: 29.7757
  • Eval runtime: 27.787s
  • Eval throughput: 80.865 samples/s
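For reference, WER is the word-level edit distance (substitutions + insertions + deletions) between the hypothesis and the reference transcript, divided by the number of reference words. A minimal self-contained implementation, with made-up example sentences rather than NCHLT data:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (subs + ins + dels) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # deletion
                       d[j - 1] + 1,        # insertion
                       prev + (r != h))     # substitution / match
            prev = cur
    return d[-1] / len(ref)

# One substitution ("too") and one deletion ("four"): 2 / 4 = 0.5
print(wer("one two three four", "one too three"))  # 0.5
```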

Intended use

  • Voice input for the Dondza app (Xitsonga ASR)
  • Research and prototyping for Xitsonga speech technology
  • Batch transcription of read or clean speech recordings in Xitsonga

Limitations

  • Domain mismatch: trained on read speech; conversational or noisy audio may degrade performance.
  • Dialect/region: NCHLT Xitsonga data is largely from South Africa; Mozambican varieties (e.g., Changana/Ronga/Tswa influences) may differ in pronunciation and orthography.
  • Not for high-stakes use: not validated for medical, legal, or other safety-critical transcription tasks.

How to use

import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "Raggio/dondza-xitsonga-asr-wav2vec2"

processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# The model expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: argmax per frame, then collapse repeats and blanks.
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]
print(text)
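For longer recordings, one practical approach (our suggestion, not something the model prescribes) is to split the waveform into fixed-length 16 kHz chunks and transcribe each chunk separately, since memory use grows with input length. A minimal chunking sketch over a raw sample array:

```python
def chunk_audio(samples, sr=16000, chunk_seconds=20.0):
    """Split a 1-D sequence of samples into fixed-length chunks."""
    size = int(sr * chunk_seconds)
    return [samples[i:i + size] for i in range(0, len(samples), size)]

# 50 s of (silent) dummy audio at 16 kHz -> chunks of 20 s, 20 s, 10 s.
dummy = [0.0] * (16000 * 50)
chunks = chunk_audio(dummy)
print([len(c) / 16000 for c in chunks])  # [20.0, 20.0, 10.0]
```

Note that naive fixed-length cuts can split a word at a chunk boundary; overlapping chunks or silence-based segmentation mitigate this.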