- Nigeria's Voice in AI. Now Sharper.
- V1 → V2 Improvement
- Languages Supported
- Quick Start
- Benchmark Results
- Robustness Improvements over V1
- Sample Transcriptions
- Model Architecture
- Training Details
- Intended Use
- Prohibited Use
- Creator
- About Axiveri
- Citation
- License
| Language | WER | vs V1 |
|---|---|---|
| 🇳🇬 Pidgin | 14.7% | ↓ 2.1pp |
| 🇳🇬 Nigerian English | 19.6% | ↓ 1.5pp |
| 🇳🇬 Yoruba | 22.3% | ↓ 6.5pp |
| 🇳🇬 Hausa | 25.8% | ↓ 5.2pp |
| 🇳🇬 Igbo | 30.5% | ↓ 11.4pp |
Nigeria's Voice in AI. Now Sharper.
NaijaVox-2.0 is the second generation of Axiveri's open-weight automatic speech recognition model for Nigerian languages: Yoruba (with full diacritics), Hausa, Igbo, Nigerian Pidgin, and Nigerian-accented English. Built on OpenAI Whisper-large-v3 with PEFT LoRA fine-tuning, NaijaVox-2.0 lowers WER on every supported language relative to V1 (by 1.5 to 11.4 percentage points). The gains come from a larger and more diverse training corpus (25,866 samples across 7 datasets), deeper LoRA adaptation (r=64 targeting attention and feed-forward layers), SpecAugment, and realistic noise augmentation for real-world robustness.
"Every Nigerian deserves to be heard and understood by AI, in their own language, with their own voice."
Previous version: NaijaVox-V1, the original model
V1 → V2 Improvement
Evaluated on identical test sets with identical methodology (50 samples/language, strict WER, no normalization):
| Language | V1 WER | V2 WER | Absolute Δ | Relative Gain |
|---|---|---|---|---|
| 🇳🇬 Yoruba | 28.8% | 22.3% | −6.5pp | +22.6% |
| 🇳🇬 Hausa | 31.0% | 25.8% | −5.2pp | +16.8% |
| 🇳🇬 Igbo | 41.9% | 30.5% | −11.4pp | +27.2% |
| 🇳🇬 Nigerian English | 21.1% | 19.6% | −1.5pp | +7.1% |
| 🇳🇬 Nigerian Pidgin | 16.8% | 14.7% | −2.1pp | +12.5% |
| Average | 27.9% | 22.6% | −5.3pp | +19.1% |
Igbo sees the largest jump (+27.2% relative), driven by WaxalNLP Igbo TTS data and Nigerian Common Voice Igbo samples, combined with SpecAugment frequency masking.
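The two gain columns can be reproduced from the WER values alone. The sketch below is only an illustration of that arithmetic (the helper name `wer_gain` is hypothetical, not part of any NaijaVox tooling): the absolute change is the difference in percentage points, and the relative gain is that difference divided by the V1 WER.

```python
# WER values (%) copied from the V1 vs V2 table above: (V1, V2).
WER = {
    "Yoruba": (28.8, 22.3),
    "Hausa": (31.0, 25.8),
    "Igbo": (41.9, 30.5),
    "Nigerian English": (21.1, 19.6),
    "Nigerian Pidgin": (16.8, 14.7),
}


def wer_gain(v1, v2):
    """Return (absolute drop in percentage points, relative gain in %)."""
    absolute = v1 - v2
    relative = 100.0 * absolute / v1
    return round(absolute, 1), round(relative, 1)


for language, (v1, v2) in WER.items():
    absolute, relative = wer_gain(v1, v2)
    print(f"{language}: -{absolute}pp, +{relative}% relative")

# Macro-average over the five languages (each language weighted equally).
avg_v1 = sum(v1 for v1, _ in WER.values()) / len(WER)
avg_v2 = sum(v2 for _, v2 in WER.values()) / len(WER)
print(f"Average: {avg_v1:.2f}% -> {avg_v2:.2f}%")
```

Because each language contributes 50 samples, the macro-average here weights every language equally.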
Languages Supported
| Language | ISO Code | Script | Token |
|---|---|---|---|
| Yoruba | yo | Latin + full diacritics (ẹ, ọ, ṣ, à, á, etc.) | `<\|yo\|>` |
| Hausa | ha | Latin + special chars (ƙ, ƴ, ɓ, ɗ) | `<\|ha\|>` |
| Igbo | ig | Latin + diacritics | `<\|ig\|>` |
| Nigerian Pidgin | pcm | Latin | `<\|pcm\|>` |
| Nigerian English | en | Latin | `<\|en\|>` |
Note: `<|ig|>` and `<|pcm|>` are custom language tokens added to the Whisper vocabulary. The extended tokenizer is included in this repository.
Quick Start
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="Axiveri/NaijaVox-2.0",
    device=0,  # use GPU, or remove for CPU
)

result = pipe("your_audio.wav")
print(result["text"])
```
Specifying Language
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

model = WhisperForConditionalGeneration.from_pretrained("Axiveri/NaijaVox-2.0")
processor = WhisperProcessor.from_pretrained("Axiveri/NaijaVox-2.0")
vocab = processor.tokenizer.get_vocab()

# Maps a language name to its Whisper language token.
LANG_TOKENS = {
    "yoruba": "<|yo|>",
    "hausa": "<|ha|>",
    "igbo": "<|ig|>",
    "nigerian_english": "<|en|>",
    "pidgin": "<|pcm|>",
}


def transcribe(audio_array, sampling_rate, language="yoruba"):
    """Transcribe a mono waveform (16 kHz expected) in the given language."""
    lang_id = vocab[LANG_TOKENS[language]]
    # Named *_id so the local variable does not shadow this function.
    transcribe_id = vocab["<|transcribe|>"]
    notimestamps_id = vocab["<|notimestamps|>"]
    # Decoder positions 1-3 are forced to: language, task, no-timestamps.
    forced_ids = [[1, lang_id], [2, transcribe_id], [3, notimestamps_id]]

    inputs = processor.feature_extractor(
        audio_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features

    with torch.no_grad():
        generated = model.generate(
            input_features=inputs,
            forced_decoder_ids=forced_ids,
            # Whisper's decoder holds at most 448 positions, prompt tokens
            # included, so stay slightly under that limit.
            max_new_tokens=440,
        )
    return processor.tokenizer.decode(generated[0], skip_special_tokens=True)
```
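To make the forced-token mechanism easier to inspect without downloading the model, here is a minimal sketch that builds the same `[position, token_id]` pairs from any vocabulary dictionary. The helper `build_forced_ids` and the toy vocabulary with made-up IDs are illustrative assumptions; with the real model you would pass `processor.tokenizer.get_vocab()` instead.

```python
LANG_TOKENS = {
    "yoruba": "<|yo|>",
    "hausa": "<|ha|>",
    "igbo": "<|ig|>",
    "nigerian_english": "<|en|>",
    "pidgin": "<|pcm|>",
}


def build_forced_ids(vocab, language):
    """Return [[1, lang], [2, transcribe], [3, notimestamps]] token-ID pairs."""
    if language not in LANG_TOKENS:
        raise ValueError(
            f"Unknown language {language!r}; expected one of {sorted(LANG_TOKENS)}"
        )
    names = [LANG_TOKENS[language], "<|transcribe|>", "<|notimestamps|>"]
    missing = [name for name in names if name not in vocab]
    if missing:
        # Typically means the extended tokenizer was not loaded.
        raise KeyError(f"Tokens missing from vocabulary: {missing}")
    return [[position, vocab[name]] for position, name in enumerate(names, start=1)]


# Stand-in vocabulary with made-up IDs, for illustration only.
toy_vocab = {"<|ig|>": 101, "<|transcribe|>": 7, "<|notimestamps|>": 9}
print(build_forced_ids(toy_vocab, "igbo"))  # [[1, 101], [2, 7], [3, 9]]
```

Failing early on a missing token gives a clearer error than a bare `KeyError` deep inside a transcription call, which is useful when the base Whisper tokenizer is loaded by mistake.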
Benchmark Results
Evaluated on FLEURS test splits (Yoruba, Hausa, Igbo), Nigerian Pidgin ASR test set, and Nigerian Accented English dataset. 50 samples per language, greedy decoding, strict WER via jiwer (no text normalization). Identical methodology to V1 for direct comparison.
| Language | WER (%) | Accuracy (%) | Test Set | Samples |
|---|---|---|---|---|
| 🇳🇬 Nigerian Pidgin | 14.7 | 85.3 | asr-nigerian-pidgin/nigerian-pidgin-1.0 | 50 |
| 🇳🇬 Nigerian English | 19.6 | 80.4 | benjaminogbonna/nigerian_accented_english | 50 |
| 🇳🇬 Yoruba | 22.3 | 77.7 | google/fleurs yo_ng | 50 |
| 🇳🇬 Hausa | 25.8 | 74.2 | google/fleurs ha_ng | 50 |
| 🇳🇬 Igbo | 30.5 | 69.5 | google/fleurs ig_ng | 50 |
| Average | 22.58 | 77.42 | n/a | 250 |
Lower WER is better. Accuracy is reported as 100 − WER. Human-level transcription is roughly 5–10% WER.
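For readers unfamiliar with the metric, the following is a minimal sketch of strict word error rate: word-level edit distance divided by the number of reference words, with no lowercasing or punctuation stripping. It is a standard-library stand-in written for illustration, not the evaluation script itself; jiwer's implementation and its corpus-level aggregation may differ in detail.

```python
def word_error_rate(reference, hypothesis):
    """Strict WER: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Pidgin sample pair from this model card: one inserted word ("and").
reference = "on top di injury her uncle no even carry her go hospital for treatment"
hypothesis = "on top di injury and her uncle no even carry her go hospital for treatment"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 7.1%
```

Because no normalization is applied, a pair such as "Yes" and "yes" counts as a substitution, which is one reason strict WER reads higher than normalized WER.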
Robustness Improvements over V1
SpecAugment
Frequency masking (up to 27 mel bins) and time masking (up to 100 time steps) applied to mel spectrograms during training. This prevents over-reliance on specific frequency bands or time positions, improving generalization to real-world recordings.
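As a rough illustration of the masking described above, the sketch below zeroes one random band of mel bins and one random span of time steps in a spectrogram stored as a list of rows. The function name, the single-mask-per-axis policy, and the fill value of 0.0 are assumptions made for this example; the actual training code may use different choices.

```python
import random


def spec_augment(mel, max_freq_mask=27, max_time_mask=100, rng=None):
    """Return a masked copy of `mel`, a list of mel-bin rows of equal length."""
    rng = rng or random.Random()
    n_mels, n_frames = len(mel), len(mel[0])
    out = [row[:] for row in mel]  # copy so the input is left untouched

    # Frequency mask: blank a band of up to `max_freq_mask` consecutive bins.
    width = rng.randint(0, min(max_freq_mask, n_mels))
    start = rng.randint(0, n_mels - width)
    for m in range(start, start + width):
        out[m] = [0.0] * n_frames

    # Time mask: blank up to `max_time_mask` consecutive frames in every bin.
    length = rng.randint(0, min(max_time_mask, n_frames))
    begin = rng.randint(0, n_frames - length)
    for row in out:
        row[begin:begin + length] = [0.0] * length
    return out


# Dummy all-ones input shaped like a 128-bin, 3000-frame spectrogram.
mel = [[1.0] * 3000 for _ in range(128)]
masked = spec_augment(mel, rng=random.Random(0))
print(len(masked), len(masked[0]))  # 128 3000
```

Masking is applied to a copy and only during training; evaluation inputs are left unmodified.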
Noise Augmentation
30% of training samples received realistic background noise injection at random SNR levels before mel extraction. This directly trains the model for common Nigerian recording conditions: market noise, phone compression artifacts, outdoor ambient sound, and crowd audio.
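A minimal sketch of noise injection at a chosen signal-to-noise ratio follows. The noise is scaled so that speech power divided by noise power matches the target SNR in decibels, then added to the waveform with probability 0.3. The function names and the 5 to 20 dB SNR range are assumptions for illustration; the source does not state the range actually used.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the result has the requested SNR (dB)."""
    if not speech or len(speech) != len(noise):
        raise ValueError("speech and noise must be non-empty and equally long")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        return list(speech)  # silent noise clip: nothing to add
    # SNR_dB = 10 * log10(speech_power / noise_power)  =>  solve for noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def maybe_add_noise(speech, noise, rng, probability=0.3, snr_range_db=(5.0, 20.0)):
    """Apply noise to roughly `probability` of samples at a random SNR.

    The SNR range is a hypothetical default, not a documented training value.
    """
    if rng.random() >= probability:
        return list(speech)
    return mix_at_snr(speech, noise, rng.uniform(*snr_range_db))


rng = random.Random(0)
speech = [math.sin(0.05 * i) for i in range(2000)]   # stand-in for a speech clip
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # stand-in for market noise
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Mixing in the waveform domain, before mel extraction, matches the order described above.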
Code-Switching Robustness
The model is trained on Nigerian Pidgin and Nigerian English together with Yoruba, Hausa, and Igbo, all of which contain the natural code-switching patterns found in everyday Nigerian speech, media, and social content.
Sample Transcriptions
Real audio samples from the FLEURS test, Nigerian English, and Pidgin datasets: data the model never saw during training. Transcriptions were generated by the published merged model.
Yoruba
| Reference | NaijaVox-2.0 Output |
|---|---|
| àwọn èyàn ti mọ̀ nípa àwọn kemika pepe bí wúrà fàdákà àti kọ́pa àtijọ́ torípé a lè rí wọn | àwọn èèyàn ti mọ̀ nípa àwọn kẹmíkà pèèpèé bí wúrà fàdákà àti kọpa àtijọ́ torí pé a lè rí wọn |
| àwọn ara ìrano lo kọ́kọ́ bẹ̀rẹ̀ si ni sin ewure ní bíi ọdún 15,0000 sẹ́yìn ní oke sagrosi | àwọn ará ìrànÑà ló kọ́kọ́ bẹ̀rẹ̀ sí ní sin ewúrẹ́ ní bí ọdún 1500 sẹ́yìn ní òkè sagrosi |
Hausa
| Reference | NaijaVox-2.0 Output |
|---|---|
| an kwatanta faretin gine-ginen da ke yin sararin samaniyar hong kong da ginshiƙi mai walƙi | an kwatanta feretin gine-ginen da ke yin sararin samaniya hong kong da ginshiki mai walƙiy |
| aristotle masanin falsafa ne yayi tunanin cewa komai ya kunshi cakuda daya ko fiye daga ab | aristotle masanin falsafani ya yi tunanin cewa kome ya kunshi ca kuda daya ko fiye daga ab |
Igbo
| Reference | NaijaVox-2.0 Output |
|---|---|
| ka akara rossby na-adị obere karịa ka arụmarụ na-adịkwu obere nke kpakpando n'ikwanye ugwu | akara rossby na-adị obere karịa ka arụmarụ na-adịkwa obere nke kpakpando n'ịkwànye monto |
| ka agha dara mba britenị jiri ndị agha elu mmiri gbochie ndị jamani inweta enyemaka | ka agha adara mba briten jiri ndị agha elu mmiri gbochie ndị jamanị inweta enyemaka |
Nigerian English
| Reference | NaijaVox-2.0 Output |
|---|---|
| Did it change plain? Yes. yes. Ok that means he was correct so this is if he's right that | Did it change green? Yes. Ok that means she was correct. So this is if its red then its no |
| Ebube Nwagbo studied Mass Communication at Nnamdi Azikiwe University. | Ebube Nwagbo studied Mass Communication at Nnamdi Azikiwe University. |
Nigerian Pidgin
| Reference | NaijaVox-2.0 Output |
|---|---|
| on top di injury her uncle no even carry her go hospital for treatment | on top di injury and her uncle no even carry her go hospital for treatment |
| she tell don jazzy for december 2016 say as she be | she tell don jazzy for december 2016 say i should be |
Model Architecture
```
Input Audio (16kHz)
        │
        ▼
Whisper-large-v3 Encoder (frozen during fine-tuning)
        │  1500 × 1280 features
        ▼
Whisper Decoder + LoRA (r=64, alpha=128, fine-tuned)
   target modules: q_proj, k_proj, v_proj, out_proj, fc1, fc2
   V1: attention only (q/k/v/out) → V2: adds feed-forward (fc1/fc2)
        │
        ▼
Extended Tokenizer (vocab: 51,868 tokens)
   + <|ig|>  Igbo token
   + <|pcm|> Nigerian Pidgin token
        │
        ▼
Transcript
```
V2 publishes a fully merged standalone model, so no PEFT dependency is required. Load it directly with `transformers`.
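To give a feel for what the LoRA settings imply, the sketch below computes the extra trainable parameters that a rank-r adapter adds to a single linear layer, plus the alpha/r scaling applied to the low-rank update. The 1280 width comes from the diagram above; the 5120 feed-forward width is an assumption (four times the model width, as in Whisper-large), and the helper names are hypothetical.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters LoRA adds to one linear layer: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


def lora_scaling(alpha, rank):
    """Factor applied to the low-rank update B @ A before adding it to W."""
    return alpha / rank


D_MODEL = 1280  # matches the 1500 x 1280 encoder features above
FFN_DIM = 5120  # assumption: feed-forward width of 4 x D_MODEL

MODULE_SHAPES = {
    "q_proj": (D_MODEL, D_MODEL),
    "k_proj": (D_MODEL, D_MODEL),
    "v_proj": (D_MODEL, D_MODEL),
    "out_proj": (D_MODEL, D_MODEL),
    "fc1": (D_MODEL, FFN_DIM),
    "fc2": (FFN_DIM, D_MODEL),
}

for name, (d_in, d_out) in MODULE_SHAPES.items():
    print(f"{name}: {lora_param_count(d_in, d_out, rank=64):,} LoRA parameters")

print("V1 scaling:", lora_scaling(64, 32))   # 2.0
print("V2 scaling:", lora_scaling(128, 64))  # 2.0
```

Under these assumptions an adapted feed-forward layer carries 409,600 adapter parameters against 163,840 for an attention projection. Both versions keep alpha/r at 2, so V2 raises capacity without changing the update scale.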
Training Details
| Parameter | V1 | V2 |
|---|---|---|
| Base model | openai/whisper-large-v3 | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT) | LoRA (PEFT) |
| LoRA rank | 32 | 64 |
| LoRA alpha | 64 | 128 |
| Target modules | q/k/v/out_proj | q/k/v/out_proj + fc1/fc2 |
| LoRA dropout | 0.05 | 0.05 |
| Training precision | fp16 | fp16 |
| Effective batch size | 16 | 32 |
| Learning rate | 1e-3 | 5e-4 |
| Warmup steps | 50 | 200 |
| Epochs (best) | 2 | 3 of 5 |
| SpecAugment | No | Yes |
| Noise augmentation | No | Yes (30% of samples) |
| Total training samples | 13,866 | 25,866 |
| GPU | Tesla T4 × 2 (Kaggle) | Tesla T4 × 2 (Kaggle) |
| Total training time | ~20 hours | ~40 hours |
Training Datasets
| Dataset | Language(s) | Samples | New in V2 |
|---|---|---|---|
| google/fleurs (yo_ng, ha_ng, ig_ng) | Yoruba, Hausa, Igbo | 8,437 | No |
| benjaminogbonna/nigerian_accented_english_dataset | Nigerian English | 2,721 | No |
| asr-nigerian-pidgin/nigerian-pidgin-1.0 | Nigerian Pidgin | 2,708 | No |
| Tundragoon/IroyinSpeech | Yoruba | 2,500 | Yes |
| google/WaxalNLP (ha/ig/yo/pcm) | Hausa, Igbo, Yoruba, Pidgin | 6,000 | Yes |
| benjaminogbonna/nigerian_common_voice_dataset | en/ha/ig/yo | 2,000 | Yes |
| vpetukhov/bible_tts_hausa | Hausa | 1,500 | Yes |
| Total | 5 languages | 25,866 | |
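The totals can be cross-checked against the Training Details table with a few lines of arithmetic. This sketch assumes the first three datasets are the ones carried over from V1, which is consistent with their sample counts adding up to V1's 13,866 total.

```python
# (dataset, samples, new_in_v2); the new_in_v2 flags are an assumption
# inferred from the V1 total of 13,866 samples.
DATASETS = [
    ("google/fleurs", 8437, False),
    ("benjaminogbonna/nigerian_accented_english_dataset", 2721, False),
    ("asr-nigerian-pidgin/nigerian-pidgin-1.0", 2708, False),
    ("Tundragoon/IroyinSpeech", 2500, True),
    ("google/WaxalNLP", 6000, True),
    ("benjaminogbonna/nigerian_common_voice_dataset", 2000, True),
    ("vpetukhov/bible_tts_hausa", 1500, True),
]

v2_total = sum(samples for _, samples, _ in DATASETS)
v1_total = sum(samples for _, samples, is_new in DATASETS if not is_new)
added_in_v2 = v2_total - v1_total

print(f"V1 corpus: {v1_total:,}")       # V1 corpus: 13,866
print(f"Added in V2: {added_in_v2:,}")  # Added in V2: 12,000
print(f"V2 corpus: {v2_total:,}")       # V2 corpus: 25,866
```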
Intended Use
- Fintech & banking: voice transactions and customer service in Nigerian languages
- Mobile apps: voice input for Yoruba, Hausa, Igbo, and Pidgin speakers
- Media & journalism: transcribing interviews and broadcasts
- Healthcare: patient intake and medical documentation
- Education: language learning tools and accessibility
- Research: low-resource ASR study for West African languages
- Accessibility: assistive technology for Nigerians with disabilities
Prohibited Use
- Non-consensual surveillance: transcribing calls without consent of all parties
- Fraud facilitation: forging spoken statements or supporting advance-fee fraud
- Deepfake pipelines: combining with TTS to fake audio attributed to real people
- Discriminatory systems: denying services based on language or accent identification
- Political disinformation: generating or verifying false transcripts of political speech
Creator
Emmanuel Ariyo (Ememzyvisuals), Founder, Axiveri
NaijaVox was conceived, built, and trained by Emmanuel Ariyo, combining ML engineering with a Nigerian cultural design identity to bring open-weight speech recognition to Nigerian language speakers.
About Axiveri
Axiveri is building Africa's AI infrastructure: open models, open data, and open tools for African languages and developers.
- Axiveri on HuggingFace
- NaijaVox Collection
Citation
```bibtex
@misc{naijavox2026,
  title        = {NaijaVox-2.0: Open-Weight Speech Recognition for Nigerian Languages},
  author       = {Ariyo, Emmanuel (Ememzyvisuals)},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Axiveri/NaijaVox-2.0}}
}
```
License
This model is released under the Apache License 2.0 with the following additional behavioral restrictions. Use of this model constitutes acceptance of both the Apache 2.0 terms and these conditions.
Apache 2.0 Terms
Free to use, modify, distribute, and use commercially with attribution. Full terms: apache.org/licenses/LICENSE-2.0
Additional Conditions (Binding)
The following uses are explicitly prohibited regardless of the Apache 2.0 permissions:
- Non-consensual surveillance: transcribing private calls or conversations without the informed consent of all parties involved.
- Fraud and impersonation: using transcription output to forge or misrepresent spoken statements, support advance-fee fraud, or impersonate individuals.
- Synthetic media abuse: combining this model with TTS systems to fabricate audio-text pairs attributed to real, identifiable people without their consent.
- Discriminatory gatekeeping: using the model's language or accent detection to deny individuals access to services, employment, housing, or legal rights.
- Political disinformation: generating, falsifying, or selectively editing transcripts of political speech to deceive or manipulate public opinion.
These restrictions constitute a binding behavioral license condition under Axiveri's Responsible AI Use Policy. Violation of these conditions terminates your license to use this model.
Built in Nigeria 🇳🇬, for Nigeria and the world.
Created by Emmanuel Ariyo (Ememzyvisuals)
Second model in the NaijaVox series by Axiveri