dondza-xitsonga-asr-wav2vec2
Dondza-Xitsonga Wav2Vec2 is a Xitsonga Automatic Speech Recognition (ASR) model fine-tuned from the pretrained checkpoint facebook/wav2vec2-xls-r-300m as part of the Dondza project (Raggio AI).
To our knowledge, this is among the first Mozambican-developed and publicly released end-to-end ASR models for Xitsonga.
(If you are aware of earlier Mozambican releases, please share them; we welcome corrections.)
License (READ FIRST): dual licensing for the model checkpoint
This repository provides a fine-tuned ASR checkpoint and related files.
1) Non-commercial use (default, free)
The model weights/checkpoint and associated files in this repo are released under the Dondza Non-Commercial Model License.
- ✅ Allowed: research, academic use, personal projects, demos, non-profit experiments
- ❌ Not allowed without a commercial license: use in a paid product, paid service, internal business operations, monetized apps, paid APIs, SaaS, enterprise deployments, or any other commercial exploitation
- Full terms: see LICENSE in this repository.
2) Commercial use (paid)
If you want to use this model commercially (including deploying it in a commercial product/service or using it to power a paid offering), you must obtain a Commercial License from Raggio AI.
How to request a commercial license:
- Open a discussion on this repo: https://huggingface.co/Raggio/dondza-xitsonga-asr-wav2vec2/discussions
- Or open an issue on the Dondza project tracker (if provided by the team)
We can offer:
- Commercial license for self-hosting the model weights, and/or
- Subscription access to a hosted ASR API (commercial terms/SLA available upon request)
Upstream / third-party licenses & attribution (important)
Even though this repo uses a custom license for the fine-tuned checkpoint, it also depends on upstream components with their own licenses.
Base model and libraries
- Base checkpoint: facebook/wav2vec2-xls-r-300m (Apache-2.0)
- Transformers library: Apache-2.0
When redistributing or using this repo, you must also comply with any applicable upstream notice and attribution requirements.
Training dataset (not redistributed here)
This model was fine-tuned on the NCHLT Xitsonga Speech Corpus (approx. 56 hours of read speech).
- We do not redistribute the dataset in this repository.
- You must obtain the dataset from its official source.
- Dataset license: CC BY 3.0 (attribution required).
Required attribution (dataset):
The Department of Arts and Culture of the government of the Republic of South Africa (DAC), the Council for Scientific and Industrial Research (CSIR), and North-West University (NWU), for the NCHLT Speech Corpus.
Base model
- Base checkpoint: facebook/wav2vec2-xls-r-300m
- Architecture: Wav2Vec2 + CTC
- Note: the CTC output head (lm_head.weight, lm_head.bias) is newly initialized for the Xitsonga vocabulary and learned during fine-tuning. This is expected for CTC fine-tuning.
Dataset
This model was fine-tuned on the NCHLT Speech Corpus (Xitsonga):
- Dataset page: https://repo.sadilar.org/items/5a6587ad-3067-49ff-bf2a-05ab171f4807
- Project page: https://sites.google.com/site/nchltspeechcorpus/home/xitsonga
The NCHLT corpus is primarily read speech. Real-world conversational audio (noise, code-switching, phone recordings) may produce higher error rates.
Data splits (utterances, speakers, duration)
| split | utterances | unique_speakers | duration_hours | duration_hms |
|---|---|---|---|---|
| train | 44924 | 190 | 52.658 | 52:39:29 |
| val | 2247 | 188 | 2.595 | 02:35:42 |
| test | 2905 | 8 | 3.609 | 03:36:31 |
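As a quick consistency check, the decimal-hour column above follows directly from the hh:mm:ss column. A minimal sketch of the conversion (split names and durations copied from the table):

```python
def hms_to_hours(hms: str) -> float:
    """Convert an H:MM:SS duration string to decimal hours, rounded to 3 places."""
    h, m, s = (int(x) for x in hms.split(":"))
    return round(h + m / 60 + s / 3600, 3)

# Durations from the splits table
for split, hms in [("train", "52:39:29"), ("val", "02:35:42"), ("test", "03:36:31")]:
    print(split, hms_to_hours(hms))
```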
Speaker separation
The train and validation speaker sets are disjoint (no overlap), so validation results reflect generalisation to unseen speakers.
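A disjointness check of this kind can be sketched as follows; the helper and the speaker IDs are illustrative, not the project's actual tooling:

```python
def assert_speaker_disjoint(train_speakers, val_speakers):
    """Fail loudly if any speaker appears in both splits (speaker leakage)."""
    overlap = set(train_speakers) & set(val_speakers)
    if overlap:
        raise ValueError(f"Speaker leakage between splits: {sorted(overlap)}")
    return True

# Toy example with made-up speaker IDs
assert_speaker_disjoint(["spk_001", "spk_002", "spk_003"], ["spk_190", "spk_191"])
```

Running this check before training guards against inflated validation scores from speaker overlap.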
Results
Best checkpoint (selected by lowest validation WER):
- Validation WER: 0.0719 (≈ 7.19%)
- Validation loss: 0.0721
- Step: 26,500
- Epoch: 29.7757
- Eval runtime: 27.787s
- Eval throughput: 80.865 samples/s
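Word error rate, the selection metric above, is the word-level edit distance between reference and hypothesis divided by the reference word count. A minimal self-contained sketch (libraries such as jiwer compute the same quantity; this assumes a non-empty reference):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("a b c d", "a x c"))  # one substitution + one deletion over 4 words -> 0.5
```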
Intended use
- Voice input for the Dondza app (Xitsonga ASR)
- Research and prototyping for Xitsonga speech technology
- Batch transcription of read or clean speech recordings in Xitsonga
Limitations
- Domain mismatch: trained on read speech; conversational or noisy audio may degrade performance.
- Dialect/region: NCHLT Xitsonga data is largely from South Africa; Mozambican varieties (e.g., Changana/Ronga/Tswa influences) may differ in pronunciation and orthography.
- Not for high-stakes use: not validated for medical, legal, or other safety-critical transcription tasks.
How to use
```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "Raggio/dondza-xitsonga-asr-wav2vec2"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Load audio and resample to the 16 kHz rate the model expects
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

# Greedy CTC decoding: argmax over each frame, then the processor
# collapses repeated tokens and removes blanks
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]
print(text)
```