Language	WER	vs V1
🇳🇬 Pidgin	14.7%	↓ 2.1pp
🇳🇬 Nigerian English	19.6%	↓ 1.5pp
🇳🇬 Yoruba	22.3%	↓ 6.5pp
🇳🇬 Hausa	25.8%	↓ 5.2pp
🇳🇬 Igbo	30.5%	↓ 11.4pp

Nigeria's Voice in AI. Now Sharper.

NaijaVox-2.0 is the second generation of Axiveri's open-weight automatic speech recognition model for Nigerian languages: Yoruba (with full diacritics), Hausa, Igbo, Nigerian Pidgin, and Nigerian-accented English. Built on OpenAI Whisper-large-v3 with PEFT LoRA fine-tuning, NaijaVox-2.0 lowers average WER from 27.9% in V1 to 22.6% through a larger and more diverse training corpus (25,866 samples across 7 datasets), deeper LoRA adaptation (r=64 targeting attention and feed-forward layers), SpecAugment, and realistic noise augmentation for real-world robustness.

"Every Nigerian deserves to be heard and understood by AI — in their own language, with their own voice."

← NaijaVox-V1 — the original model

📈 V1 → V2 Improvement

Evaluated on identical test sets with identical methodology (50 samples/language, strict WER, no normalization):

Language	V1 WER	V2 WER	Absolute Δ	Relative Gain
🇳🇬 Yoruba	28.8%	22.3%	−6.5pp	+22.6%
🇳🇬 Hausa	31.0%	25.8%	−5.2pp	+16.8%
🇳🇬 Igbo	41.9%	30.5%	−11.4pp	+27.2%
🇳🇬 Nigerian English	21.1%	19.6%	−1.5pp	+7.1%
🇳🇬 Nigerian Pidgin	16.8%	14.7%	−2.1pp	+12.5%
Average	27.9%	22.6%	−5.3pp	+19.1%

Igbo sees the largest jump (+27.2% relative), likely driven by the added WaxalNLP Igbo TTS data and Nigerian Common Voice Igbo samples, combined with SpecAugment frequency masking.
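The Absolute Δ and Relative Gain columns can be recomputed from the two WER columns. The short sketch below does that arithmetic; the helper name `wer_gains` is illustrative and not part of the NaijaVox codebase.

```python
# WER figures (percent) copied from the V1 → V2 table.
V1_WER = {"Yoruba": 28.8, "Hausa": 31.0, "Igbo": 41.9,
          "Nigerian English": 21.1, "Nigerian Pidgin": 16.8}
V2_WER = {"Yoruba": 22.3, "Hausa": 25.8, "Igbo": 30.5,
          "Nigerian English": 19.6, "Nigerian Pidgin": 14.7}


def wer_gains(v1, v2):
    """Return (absolute drop in percentage points, relative gain in percent)."""
    absolute = v1 - v2
    relative = 100.0 * absolute / v1
    return absolute, relative


for language in V1_WER:
    absolute, relative = wer_gains(V1_WER[language], V2_WER[language])
    print(f"{language}: -{absolute:.1f}pp, +{relative:.1f}% relative")

# Unweighted means across the five languages give the "Average" row.
mean_v1 = sum(V1_WER.values()) / len(V1_WER)
mean_v2 = sum(V2_WER.values()) / len(V2_WER)
print(f"Average: {mean_v1:.2f}% -> {mean_v2:.2f}%")
```

The relative gain is measured against the V1 error rate, which is why Igbo's 11.4pp drop from a high baseline still yields the largest relative figure.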

🗣️ Languages Supported

Language	ISO Code	Script	Token
Yoruba	`yo`	Latin + full diacritics (ẹ, ọ, ṣ, à, á, etc.)	`<|yo|>`
Hausa	`ha`	Latin + special chars (ƙ, ƴ, ɗ, etc.)	`<|ha|>`
Igbo	`ig`	Latin + diacritics	`<|ig|>`
Nigerian Pidgin	`pcm`	Latin	`<|pcm|>`
Nigerian English	`en`	Latin	`<|en|>`

Note: `<|ig|>` and `<|pcm|>` are custom language tokens added to the Whisper vocabulary. The extended tokenizer is included in this repository.

🚀 Quick Start

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="Axiveri/NaijaVox-2.0",
    device=0  # use GPU, or remove for CPU
)

result = pipe("your_audio.wav")
print(result["text"])

Specifying Language

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

model     = WhisperForConditionalGeneration.from_pretrained("Axiveri/NaijaVox-2.0")
processor = WhisperProcessor.from_pretrained("Axiveri/NaijaVox-2.0")
vocab     = processor.tokenizer.get_vocab()

LANG_TOKENS = {
    "yoruba":           "<|yo|>",
    "hausa":            "<|ha|>",
    "igbo":             "<|ig|>",
    "nigerian_english": "<|en|>",
    "pidgin":           "<|pcm|>",
}

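def transcribe(audio_array, sampling_rate, language="yoruba"):
    """Transcribe a mono waveform in the given language (a key of LANG_TOKENS)."""
    # IDs of the special tokens that steer the decoder prompt.
    lang_id         = vocab[LANG_TOKENS[language]]
    task_id         = vocab["<|transcribe|>"]
    notimestamps_id = vocab["<|notimestamps|>"]
    # [position, token_id] pairs; position 0 is the start-of-transcript token.
    forced_ids      = [[1, lang_id], [2, task_id], [3, notimestamps_id]]

    # Whisper's feature extractor expects 16 kHz audio; resample first if needed.
    inputs = processor.feature_extractor(
        audio_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features

    with torch.no_grad():
        generated = model.generate(
            input_features=inputs,
            forced_decoder_ids=forced_ids,
            # 4 prompt tokens + 444 new tokens = Whisper's 448-token decoder limit
            max_new_tokens=444,
        )
    return processor.tokenizer.decode(generated[0], skip_special_tokens=True)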

📊 Benchmark Results

Evaluated on FLEURS test splits (Yoruba, Hausa, Igbo), Nigerian Pidgin ASR test set, and Nigerian Accented English dataset. 50 samples per language, greedy decoding, strict WER via jiwer (no text normalization). Identical methodology to V1 for direct comparison.

Language	WER (%)	Accuracy (%)	Test Set	Samples
🇳🇬 Nigerian Pidgin	14.7	85.3	asr-nigerian-pidgin/nigerian-pidgin-1.0	50
🇳🇬 Nigerian English	19.6	80.4	benjaminogbonna/nigerian_accented_english	50
🇳🇬 Yoruba	22.3	77.7	google/fleurs yo_ng	50
🇳🇬 Hausa	25.8	74.2	google/fleurs ha_ng	50
🇳🇬 Igbo	30.5	69.5	google/fleurs ig_ng	50
Average	22.58	77.42	—	250

Lower WER = better. Human-level transcription ≈ 5–10%.
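To make "strict WER" concrete, here is a minimal pure-Python sketch of word error rate with no text normalization: the word-level edit distance divided by the number of reference words. The reported numbers were produced with jiwer; this function is only an illustration of the metric, and it scores a single reference/hypothesis pair rather than a whole test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits to turn the reference words seen so far
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Pidgin sample from this card: the output inserts one extra word ("and").
reference = "on top di injury her uncle no even carry her go hospital for treatment"
hypothesis = "on top di injury and her uncle no even carry her go hospital for treatment"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.1f}%")
```

Because nothing is normalized, differences in casing, punctuation, or diacritics each count as a full word error. That makes strict WER a conservative measure, particularly for Yoruba, where a single missing tone mark changes the token.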

🛡️ Robustness Improvements over V1

SpecAugment

Frequency masking (up to 27 mel bins) and time masking (up to 100 time steps) are applied to mel spectrograms during training. This prevents over-reliance on specific frequency bands or time positions, improving generalization to real-world recordings.
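The masking described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the training code: it applies one frequency mask and one time mask per call, draws mask widths uniformly, and fills masked cells with 0.0. The function name `spec_augment` and its parameters are hypothetical; only the two upper bounds (27 and 100) come from this card.

```python
import random


def spec_augment(mel, max_freq_mask=27, max_time_mask=100, rng=random):
    """Return a copy of `mel` (a list of rows, one per mel bin) with one
    frequency mask and one time mask set to 0.0."""
    n_mels, n_frames = len(mel), len(mel[0])
    out = [row[:] for row in mel]  # copy so the input is left untouched

    # Frequency mask: blank f consecutive mel bins, 0 <= f <= max_freq_mask.
    f = rng.randint(0, min(max_freq_mask, n_mels))
    f0 = rng.randint(0, n_mels - f)
    for m in range(f0, f0 + f):
        out[m] = [0.0] * n_frames

    # Time mask: blank t consecutive frames, 0 <= t <= max_time_mask.
    t = rng.randint(0, min(max_time_mask, n_frames))
    t0 = rng.randint(0, n_frames - t)
    for row in out:
        row[t0:t0 + t] = [0.0] * t
    return out


# Toy spectrogram: 128 mel bins (as in Whisper-large-v3) by 300 frames.
mel = [[1.0] * 300 for _ in range(128)]
augmented = spec_augment(mel, rng=random.Random(0))
```

In a real pipeline the same idea would run on tensors inside the data collator, so that each epoch sees a different mask for the same utterance.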

Noise Augmentation

30% of training samples received realistic background noise injection at random SNR levels before mel extraction. This directly trains the model for common Nigerian recording conditions — market noise, phone compression artifacts, outdoor ambient sound, and crowd audio.
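As a rough sketch of noise injection at a chosen signal-to-noise ratio, the code below scales a noise clip so that the speech-to-noise power ratio matches a target in dB, then adds it to the waveform. The function names, the 5–20 dB range, and the noise source are all assumptions made for illustration; the card states only that 30% of samples received noise at random SNR levels.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        return list(speech)  # nothing to scale against
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise gain.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def maybe_add_noise(speech, noise, p=0.3, snr_range_db=(5.0, 20.0), rng=random):
    """Augment roughly a fraction `p` of samples; the SNR range is hypothetical."""
    if rng.random() >= p:
        return list(speech)
    return mix_at_snr(speech, noise, rng.uniform(*snr_range_db))


# Toy signals standing in for a speech clip and a background recording.
speech = [math.sin(0.01 * i) for i in range(1000)]
noise_rng = random.Random(0)
noise = [noise_rng.uniform(-1.0, 1.0) for _ in range(1000)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

Mixing on the raw waveform, before mel extraction, means the noise passes through the same feature pipeline as real background sound would at inference time.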

Code-Switching Robustness

The model is trained on Nigerian Pidgin and Nigerian English together with Yoruba, Hausa, and Igbo, all of which contain the natural code-switching found in everyday Nigerian speech, media, and social content.

🎙️ Sample Transcriptions

Real audio samples from FLEURS test, Nigerian English, and Pidgin datasets — data the model never saw during training. Transcriptions generated by the published merged model.

Yoruba

Reference	Audio	NaijaVox-2.0 Output
àwọn èyàn ti mọ̀ nípa àwọn kemika pepe bí wúrà fàdákà àti kọ́pa àtijọ́ torípé a lè rí wọn		àwọn èèyàn ti mọ̀ nípa àwọn kẹmíkà pèèpèé bí wúrà fàdákà àti kọpa àtijọ́ torí pé a lè rí wọn
àwọn ara ìrano lo kọ́kọ́ bẹ̀rẹ̀ si ni sin ewure ní bíi ọdún 15,0000 sẹ́yìn ní oke sagrosi		àwọn ará ìrà náà ló kọ́kọ́ bẹ̀rẹ̀ sí ní sin ewúrẹ́ ní bí ọdún 1500 sẹ́yìn ní òkè sagrosi

Hausa

Reference	Audio	NaijaVox-2.0 Output
an kwatanta faretin gine-ginen da ke yin sararin samaniyar hong kong da ginshiƙi mai walƙi		an kwatanta feretin gine-ginen da ke yin sararin samaniya hong kong da ginshiki mai walƙiy
aristotle masanin falsafa ne yayi tunanin cewa komai ya kunshi cakuda daya ko fiye daga ab		aristotle masanin falsafani ya yi tunanin cewa kome ya kunshi ca kuda daya ko fiye daga ab

Igbo

Reference	Audio	NaijaVox-2.0 Output
ka akara rossby na-adị obere karịa ka arụmarụ na-adịkwu obere nke kpakpando n'ikwanye ugwu		akara rossby na-adị obere karịa ka arụmarụ na-adịkwa obere nke kpakpando n'ịkwà nye monto
ka agha dara mba britenị jiri ndị agha elu mmiri gbochie ndị jamani inweta enyemaka		ka agha adara mba briten jiri ndị agha elu mmiri gbochie ndị jamanị inweta enyemaka

Nigerian English

Reference	Audio	NaijaVox-2.0 Output
Did it change plain? Yes. yes. Ok that means he was correct so this is if he's right that		Did it change green? Yes. Ok that means she was correct. So this is if its red then its no
Ebube Nwagbo studied Mass Communication at Nnamdi Azikiwe University.		Ebube Nwagbo studied Mass Communication at Nnamdi Azikiwe University.

Nigerian Pidgin

Reference	Audio	NaijaVox-2.0 Output
on top di injury her uncle no even carry her go hospital for treatment		on top di injury and her uncle no even carry her go hospital for treatment
she tell don jazzy for december 2016 say as she be		she tell don jazzy for december 2016 say i should be

🏗️ Model Architecture

Input Audio (16kHz)
        │
        ▼
Whisper-large-v3 Encoder  (frozen during fine-tuning)
        │  1500 × 1280 features
        ▼
Whisper Decoder + LoRA    (r=64, alpha=128, fine-tuned)
  target modules: q_proj, k_proj, v_proj, out_proj, fc1, fc2
  V1: attention only (q/k/v/out) — V2: adds feed-forward (fc1/fc2)
        │
        ▼
Extended Tokenizer         (vocab: 51,868 tokens)
  + <|ig|> Igbo token
  + <|pcm|> Nigerian Pidgin token
        │
        ▼
Transcript

V2 publishes a fully merged standalone model — no PEFT dependency required. Load directly with transformers.
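For readers unfamiliar with what "merged" means here, the sketch below shows the standard LoRA merge on a toy layer: the low-rank update is scaled by alpha / r and folded into the dense weight, after which the adapter matrices are no longer needed. The helper names are illustrative, and the pure-Python matrices stand in for the real 1280-wide tensors. In the PEFT library this step is normally performed by `merge_and_unload()`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(weight, lora_a, lora_b, r, alpha):
    """Fold a LoRA update into a dense weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with rank 1; the real adapters use r=64, alpha=128.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # r x in
B = [[0.5], [0.0]]    # out x r
merged = merge_lora(W, A, B, r=1, alpha=2)
print(merged)
```

Note that V1 (r=32, alpha=64) and V2 (r=64, alpha=128) share the same scaling factor, alpha / r = 2, so the deeper adaptation in V2 comes from the higher rank and the extra target modules rather than from a larger update scale.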

📦 Training Details

Parameter	V1	V2
Base model	openai/whisper-large-v3	openai/whisper-large-v3
Fine-tuning method	LoRA (PEFT)	LoRA (PEFT)
LoRA rank	32	64
LoRA alpha	64	128
Target modules	q/k/v/out_proj	q/k/v/out_proj + fc1/fc2
LoRA dropout	0.05	0.05
Training precision	fp16	fp16
Effective batch size	16	32
Learning rate	1e-3	5e-4
Warmup steps	50	200
Epochs (best)	2	3 of 5
SpecAugment	❌	✅
Noise augmentation	❌	✅ (30% of samples)
Total training samples	13,866	25,866
GPU	Tesla T4 × 2 (Kaggle)	Tesla T4 × 2 (Kaggle)
Total training time	~20 hours	~40 hours

Training Datasets

Dataset	Language(s)	Samples	New in V2
google/fleurs (yo_ng, ha_ng, ig_ng)	Yoruba, Hausa, Igbo	8,437	—
benjaminogbonna/nigerian_accented_english_dataset	Nigerian English	2,721	—
asr-nigerian-pidgin/nigerian-pidgin-1.0	Nigerian Pidgin	2,708	—
Tundragoon/IroyinSpeech	Yoruba	2,500	✅
google/WaxalNLP (ha/ig/yo/pcm)	Hausa, Igbo, Yoruba, Pidgin	6,000	✅
benjaminogbonna/nigerian_common_voice_dataset	en/ha/ig/yo	2,000	✅
vpetukhov/bible_tts_hausa	Hausa	1,500	✅
Total	5 languages	25,866	—

✅ Intended Use

🏦 Fintech & banking — voice transactions and customer service in Nigerian languages
📱 Mobile apps — voice input for Yoruba, Hausa, Igbo, and Pidgin speakers
🎙️ Media & journalism — transcribing interviews and broadcasts
🏥 Healthcare — patient intake and medical documentation
📚 Education — language learning tools and accessibility
🔬 Research — low-resource ASR study for West African languages
♿ Accessibility — assistive technology for Nigerians with disabilities

🚫 Prohibited Use

❌ Non-consensual surveillance — transcribing calls without consent of all parties
❌ Fraud facilitation — forging spoken statements or supporting advance-fee fraud
❌ Deepfake pipelines — combining with TTS to fake audio attributed to real people
❌ Discriminatory systems — denying services based on language or accent identification
❌ Political disinformation — generating or verifying false transcripts of political speech

👤 Creator

Emmanuel Ariyo (Ememzyvisuals) — Founder, Axiveri

NaijaVox was conceived, built, and trained by Emmanuel Ariyo, combining ML engineering with a Nigerian cultural design identity to bring open-weight speech recognition to Nigerian language speakers.

👥 About Axiveri

Axiveri is building Africa's AI infrastructure — open models, open data, and open tools for African languages and developers.

🌍 Axiveri on HuggingFace
🗣️ NaijaVox Collection

📄 Citation

@misc{naijavox2026,
  title        = {NaijaVox-2.0: Open-Weight Speech Recognition for Nigerian Languages},
  author       = {Ariyo, Emmanuel (Ememzyvisuals)},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Axiveri/NaijaVox-2.0}}
}

📜 License

This model is released under the Apache License 2.0 with the following additional behavioral restrictions. Use of this model constitutes acceptance of both the Apache 2.0 terms and these conditions.

Apache 2.0 Terms

Free to use, modify, distribute, and use commercially with attribution. Full terms: apache.org/licenses/LICENSE-2.0

Additional Conditions (Binding)

The following uses are explicitly prohibited regardless of the Apache 2.0 permissions:

Non-consensual surveillance — transcribing private calls or conversations without the informed consent of all parties involved.
Fraud and impersonation — using transcription output to forge or misrepresent spoken statements, support advance-fee fraud, or impersonate individuals.
Synthetic media abuse — combining this model with TTS systems to fabricate audio-text pairs attributed to real, identifiable people without their consent.
Discriminatory gatekeeping — using the model's language or accent detection to deny individuals access to services, employment, housing, or legal rights.
Political disinformation — generating, falsifying, or selectively editing transcripts of political speech to deceive or manipulate public opinion.

These restrictions constitute a binding behavioral license condition under Axiveri's Responsible AI Use Policy. Violation of these conditions terminates your license to use this model.

Built in Nigeria 🇳🇬 — for Nigeria and the world.
Created by Emmanuel Ariyo (Ememzyvisuals)
Second model in the NaijaVox series by Axiveri