🎙️ BaltiVoice ASR — Whisper Small Fine-Tuned for Balti (bft)

First public Automatic Speech Recognition model for Balti, a critically low-resource Tibetic language spoken in Gilgit-Baltistan, Pakistan.

Model Details

Model Description

This model is a fine-tuned version of openai/whisper-small for Automatic Speech Recognition (ASR) in the Balti language (bft).

Balti is a Tibetic language with roughly 400,000 speakers, written in Nastaliq (Arabic-based) script. Before this work, no publicly available ASR models or datasets existed for Balti. This model transcribes Balti speech into native Nastaliq text.

Developed by: Muhammad Ali, Independent Researcher, Gilgit-Baltistan, Pakistan. Alumnus, The Islamia University of Bahawalpur (IUB).
Model type: Sequence-to-sequence ASR (Whisper architecture)
Language: Balti (bft)
License: Apache 2.0
Base model: openai/whisper-small

Model Sources

Repository: github.com/mohdali-dev/BaltiVoice-ASR
Demo: HuggingFace Spaces
Paper: arXiv:2606.03504

Results

Model	WER (%)	CER (%)
Whisper-small (zero-shot)	159.19	152.52
Whisper-base (fine-tuned)	44.54	15.61
Whisper-small (fine-tuned, this model)	26.74	8.67

A zero-shot WER above 100% means the errors outnumber the reference words, which can only happen when the model inserts words that are not in the reference (hallucination). Fine-tuning on 16.8 hours of Balti speech brings this down to 26.74% WER and 8.67% CER on the 538-utterance speaker-disjoint validation set.
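As a rough illustration of how the error rates in the table are defined, the sketch below computes WER and CER from a plain Levenshtein distance. It is not the evaluation script used for this model: the function names and the Latin placeholder tokens are assumptions made for the example, and this version counts spaces as characters in CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Placeholder tokens, not real Balti text.
# Two reference words, three hallucinated insertions: 3 / 2 = 1.5.
print(wer("a b", "a b c d e"))   # 1.5, i.e. 150%
print(cer("abcd", "abxd"))       # 0.25
```

Because the denominator is the reference length only, a hypothesis much longer than the reference pushes WER past 100%, which matches the zero-shot row.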

How to Get Started

Installation

pip install transformers torch librosa

Inference

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="mohdali1/whisper-small-balti",
    generate_kwargs={"language": "urdu", "task": "transcribe"}
)

result = asr("your_balti_audio.wav")
print(result["text"])

Manual inference

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

model_id = "mohdali1/whisper-small-balti"
processor = WhisperProcessor.from_pretrained(
    model_id, language="urdu", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("your_balti_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Use the same language and task settings as the pipeline example above.
with torch.no_grad():
    generated_ids = model.generate(
        inputs.input_features, language="urdu", task="transcribe"
    )

transcription = processor.batch_decode(
    generated_ids, skip_special_tokens=True
)[0]
print(transcription)

Uses

Direct Use

Transcription: Convert Balti audio into native Nastaliq text
Research: Study low-resource ASR and transfer learning for Tibetic languages
Education: Build tools for Balti literacy and pronunciation

Downstream Use

Voice assistants for Balti speakers
Media archiving of radio broadcasts, folk stories, oral histories
Healthcare documentation in rural Gilgit-Baltistan settings

Out-of-Scope Use

High-stakes decisions (legal, medical, safety-critical) without human verification — WER is ~27%, not production-ready
Other languages — performance on non-Balti input is not guaranteed
Commercial deployment without further domain-specific evaluation

Training Details

Training Data

Dataset: BaltiVoice ASR Dataset
Total clips: 10,060 validated utterances (~16.8 hours)
Format: 16 kHz mono WAV, native Nastaliq transcriptions
Split method: Speaker-disjoint (GroupShuffleSplit on client_id, seed 42)

Split	Samples	Speakers
Train	9,519	122
Validation	538	14
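The card names scikit-learn's GroupShuffleSplit on client_id as the split method. The sketch below reproduces only the idea with the standard library, so it runs without extra dependencies and will not reproduce the actual split. The clip schema, the toy data, and val_fraction=0.1 are assumptions (the table implies 14 of 136 speakers, about 10%, in validation).

```python
import random


def speaker_disjoint_split(clips, val_fraction=0.1, seed=42):
    """Split clips so that no speaker appears in both train and validation.

    `clips` is a list of dicts with a "client_id" key (hypothetical schema).
    """
    speakers = sorted({clip["client_id"] for clip in clips})
    rng = random.Random(seed)
    rng.shuffle(speakers)

    # Hold out whole speakers, never individual clips.
    n_val = max(1, round(len(speakers) * val_fraction))
    val_speakers = set(speakers[:n_val])

    train = [c for c in clips if c["client_id"] not in val_speakers]
    val = [c for c in clips if c["client_id"] in val_speakers]
    return train, val


# Toy data: 10 hypothetical speakers with 3 clips each.
clips = [
    {"client_id": f"spk{s}", "path": f"spk{s}_{i}.wav"}
    for s in range(10)
    for i in range(3)
]
train, val = speaker_disjoint_split(clips)
train_ids = {c["client_id"] for c in train}
val_ids = {c["client_id"] for c in val}

print(len(train), len(val))          # 27 3
print(train_ids.isdisjoint(val_ids))  # True
```

Splitting by speaker rather than by clip is what makes the validation numbers a measure of performance on unseen voices.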

Training Hyperparameters

Parameter	Value
Base model	openai/whisper-small
Language token	urdu (closest Whisper-supported language written in Nastaliq)
Task	transcribe
Learning rate	1e-5
Effective batch size	16 (8 × 2 gradient accumulation)
Max steps	1,000
Optimizer	AdamW
Precision	fp16
Gradient checkpointing	Enabled
Hardware	NVIDIA Tesla T4 (Google Colab)
Training time	1h 54m
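To put the step count in perspective, the sketch below works out how much data 1,000 steps cover. It assumes a single GPU, as the hardware row suggests; the class and field names are hypothetical, and the train split size is taken from the Training Data table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingBudget:
    per_device_batch_size: int = 8        # from the table: 8 x 2 accumulation
    gradient_accumulation_steps: int = 2
    max_steps: int = 1_000
    train_samples: int = 9_519            # size of the train split

    @property
    def effective_batch_size(self) -> int:
        return self.per_device_batch_size * self.gradient_accumulation_steps

    @property
    def samples_seen(self) -> int:
        return self.effective_batch_size * self.max_steps

    @property
    def epochs(self) -> float:
        return self.samples_seen / self.train_samples


budget = TrainingBudget()
print(budget.effective_batch_size)   # 16
print(budget.samples_seen)           # 16000
print(f"{budget.epochs:.2f}")        # 1.68
```

Under these assumptions the run amounts to roughly 1.7 passes over the training split.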

Training Curve

Step	Train Loss	Val Loss	Raw WER (%)
250	0.7905	0.4037	40.19
500	0.5968	0.3208	33.37
750	0.4542	0.2963	31.37
1000	0.4652	0.2830	30.07

Note: The Raw WER column is the unnormalized WER logged during training, which reached 30.07% at step 1,000. The final evaluation on the speaker-disjoint held-out set removes punctuation before scoring and yields the reported 26.74% WER and 8.67% CER. All speakers in that set are unseen during training, so these figures measure performance on new speakers.
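The note above says the final evaluation removes punctuation before scoring. The exact normalization rules are not given here, so the following is only a minimal sketch of one way to do it: it drops every character in a Unicode punctuation category, which covers the Arabic comma (U+060C) and the Urdu full stop (U+06D4). The function name and the placeholder tokens are assumptions.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Replace Unicode punctuation with spaces, then collapse whitespace."""
    kept = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return " ".join(kept.split())


# Placeholder tokens with Arabic-script punctuation attached.
reference = "wordA\u060c wordB\u06d4"
hypothesis = "wordA wordB"

# Without normalization both reference words differ from the hypothesis words.
print(reference.split() == hypothesis.split())                # False
print(strip_punctuation(reference) == strip_punctuation(hypothesis))  # True
```

This shows why a raw WER can sit a few points above the normalized one: punctuation attached to a word turns an otherwise correct word into a substitution.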

Bias, Risks, and Limitations

Technical Limitations

WER of 26.74% — roughly one word in four may be incorrect. Not suitable for critical applications without human review.
Read speech only — trained on short read clips (avg 6 seconds). Performance on spontaneous conversational speech will likely be lower.
No Unicode normalization — Nastaliq script Unicode ambiguities (e.g., Arabic Yeh vs. Farsi Yeh) may affect output consistency.
Speaker diversity — 136 speakers, mostly from Gilgit-Baltistan. Dialectal variation from other regions may affect accuracy.
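The Unicode ambiguity mentioned in the limitations above can be handled as a post-processing step. The sketch below is an assumption, not something the model or dataset applies: it folds Arabic Yeh and Arabic Kaf onto the Farsi Yeh and Keheh code points. Whether this mapping is right for Balti orthography should be checked with native speakers.

```python
import unicodedata

# Hypothetical mapping; the code point names are from the Unicode standard.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_nastaliq(text: str) -> str:
    """Apply NFC, then fold look-alike letters onto one code point each."""
    return unicodedata.normalize("NFC", text).translate(CHAR_MAP)


with_arabic_yeh = "\u0628\u064A"  # BEH + ARABIC YEH
with_farsi_yeh = "\u0628\u06CC"   # BEH + FARSI YEH

# The two strings look alike in many fonts but compare as different.
print(with_arabic_yeh == with_farsi_yeh)   # False
print(normalize_nastaliq(with_arabic_yeh) == normalize_nastaliq(with_farsi_yeh))  # True
```

Standard Unicode normalization forms do not merge these letters, so an explicit mapping like this one is needed if references and hypotheses are to be compared consistently.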

Sociotechnical Considerations

Balti is an endangered language. Mis-transcriptions could distort cultural meaning. Native speaker validation is recommended.
The dataset represents a specific regional subset of Balti speakers and may not capture all dialectal variation.

Recommendations

Use human review for sensitive or important content
Encourage Balti speakers to report errors via GitHub Issues
Consider extended training or Whisper-medium for higher accuracy

Environmental Impact

Estimated using the ML Impact Calculator (Lacoste et al., 2019).

Hardware: NVIDIA Tesla T4
Training time: ~1.9 hours
Cloud provider: Google Colab
Carbon emitted: ~0.1 kg CO₂eq (estimated)
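For readers who want to sanity-check the figure, the sketch below uses the usual energy-times-intensity estimate. The 70 W value is NVIDIA's published board power for the T4; the grid carbon intensity is a placeholder and does not come from this card, so the result is only an order-of-magnitude check, not a reproduction of the calculator's output.

```python
def estimate_co2_kg(power_watts: float, hours: float, kg_co2_per_kwh: float) -> float:
    """Energy used (kWh) multiplied by grid carbon intensity (kg CO2eq/kWh)."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh


T4_POWER_WATTS = 70      # published T4 board power
TRAINING_HOURS = 1.9     # from the card
GRID_INTENSITY = 0.5     # kg CO2eq per kWh: placeholder, NOT from the card

estimate = estimate_co2_kg(T4_POWER_WATTS, TRAINING_HOURS, GRID_INTENSITY)
print(f"{estimate:.3f} kg CO2eq")
```

With these placeholder inputs the result lands in the same order of magnitude as the ~0.1 kg reported above.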

Citation

If you use this model or the associated dataset in your research, please cite:

@misc{ali2026baltivoice,
  author    = {Muhammad Ali},
  title     = {BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language},
  year      = {2026},
  eprint    = {2606.03504},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url       = {https://arxiv.org/abs/2606.03504}
}

Glossary

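WER: Word Error Rate = (Substitutions + Deletions + Insertions) / Number of words in the reference. Lower is better; values can exceed 100% when insertions are frequent.
CER: Character Error Rate, the same ratio computed over characters. It is a useful complement for Nastaliq text, where a single ambiguous code point marks the whole word wrong under WER.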
Nastaliq: A calligraphic style of the Perso-Arabic script, used to write Urdu and Balti and, historically, Persian.
Low-resource language: A language with limited digital data, tools, and models available for NLP/ASR.
Speaker-disjoint split: Train and validation sets share no speakers, so validation scores reflect performance on unseen voices rather than memorized speaker acoustics.

More Information

Dataset: mohdali1/baltivoice-asr
Demo: baltivoice-demo
GitHub: BaltiVoice-ASR
Paper: arXiv:2606.03504
Author: Muhammad Ali
Contact: s22bseen1m01052@iub.edu.pk | ORCID | LinkedIn
