🎙️ BaltiVoice ASR — Whisper Small Fine-Tuned for Balti (bft)
First public Automatic Speech Recognition model for Balti, a critically low-resource Tibetic language spoken in Gilgit-Baltistan, Pakistan.
📊 Dataset • 🎧 Live Demo • 🐦 Twitter Thread
Model Details
Model Description
This model is a fine-tuned version of openai/whisper-small for Automatic Speech Recognition (ASR) in the Balti language (bft).
Balti is a Tibetic language with ~400,000 speakers, written primarily in the Nastaliq (Arabic-based) script. Prior to this work, there were no publicly available ASR models or standardized datasets for Balti. This model enables transcription of Balti speech into text, supporting cultural preservation and digital accessibility for Balti speakers.
- Developed by: Mohammad Ali
- Model type: Sequence-to-sequence ASR (Whisper architecture)
- Language(s): Balti (bft)
- License: Apache 2.0
- Finetuned from model: openai/whisper-small
Model Sources
- Repository: github.com/mohdali-dev/baltivoice-asr
- Demo: HuggingFace Spaces
- Paper: Draft in progress (arXiv submission planned)
Uses
Direct Use
- Transcription: Convert Balti audio (speech) into native Balti text (Nastaliq script).
- Research: Study low-resource ASR techniques and transfer learning for Tibetic languages.
- Education: Assist in creating educational tools for Balti literacy and pronunciation.
Downstream Use
- Voice Assistants: Integrate into voice-enabled applications for Balti speakers.
- Media Archiving: Transcribe local radio broadcasts, folk stories, and oral histories.
- Healthcare: Support voice-to-text documentation in rural healthcare settings where Balti is spoken.
Out-of-Scope Use
- High-Stakes Applications: Do not use for legal, medical, or safety-critical decisions without human verification; the model's Word Error Rate (WER) is about 30%.
- Other Languages: Performance on non-Balti languages (e.g., Urdu, English) is not guaranteed and may be poor.
- Commercial Deployment: Requires further evaluation and potential fine-tuning for specific commercial domains.
Bias, Risks, and Limitations
Technical Limitations
- Word Error Rate (WER): The model achieves a WER of 30.07% on the validation set. This means approximately 3 out of 10 words may be incorrect. It is not yet production-ready for critical tasks.
- Script Output: The model outputs text in Nastaliq (Arabic-based) script. It does not support Romanized Balti.
- Audio Quality: Trained on short clips (5–8 seconds). Performance may degrade on longer, continuous speech or noisy environments.
- Speaker Diversity: The training data has limited speaker diversity. The model may underperform on unseen accents, dialects, or recording conditions.
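Because the model was trained on 5–8 second clips, one possible workaround for longer recordings is to cut the waveform into short windows before transcription. The helper below is a minimal sketch and not part of the released code: the name `split_into_chunks` and the 8-second default are assumptions chosen to match the training clip length. Fixed-length cuts can split a word in two, so splitting at pauses would likely work better in practice.

```python
def split_into_chunks(samples, sample_rate=16000, chunk_seconds=8.0):
    """Split a 1-D audio sequence into consecutive fixed-length chunks.

    Works on any sliceable sequence (list, NumPy array). The last chunk
    may be shorter than the others.
    """
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("chunk_seconds must be positive")
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


# 20 seconds of silence at 16 kHz, split into windows of at most 8 seconds
chunks = split_into_chunks([0.0] * (20 * 16000))
chunk_lengths = [len(c) / 16000 for c in chunks]
print(chunk_lengths)
```

Each chunk can then be passed through the processor and model in the same way as a single short clip, and the resulting transcriptions joined in order.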
Sociotechnical Considerations
- Cultural Sensitivity: Balti is an endangered language. Mis-transcriptions could distort meaning or cultural context. Always involve native speakers in validation.
- Data Representation: The dataset represents a specific subset of Balti speakers (Gilgit-Baltistan region). It may not capture all dialectal variations.
Recommendations
- Human-in-the-Loop: Always review transcriptions for accuracy, especially for sensitive content.
- Community Feedback: Encourage Balti speakers to report errors and contribute to dataset improvement.
- Further Training: Consider extended training or larger models (Whisper-medium/large) for improved accuracy.
How to Get Started with the Model
Installation
pip install transformers torch librosa
Inference Code
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa
# Load model and processor
model_id = "mohdali1/whisper-small-balti"
processor = WhisperProcessor.from_pretrained(model_id, language="urdu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_id)
# Load audio (ensure 16kHz mono)
audio_path = "your_balti_audio.wav"
audio, sr = librosa.load(audio_path, sr=16000)
# Transcribe
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")
Training Details
Training Data
- Dataset: BaltiVoice ASR Dataset
- Size: 10,060 validated audio clips (~16.8 hours total)
- Format: 16kHz mono WAV
- Script: Native Balti (Nastaliq/Arabic-based)
- Splits:
- Train: 9,051 samples
- Validation: 1,006 samples
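As a quick sanity check on these figures, the short calculation below derives the average clip length implied by the dataset totals and the share of the listed split samples used for training. It uses only the numbers stated above; the variable names are illustrative.

```python
total_clips = 10_060
total_hours = 16.8
train_clips = 9_051
val_clips = 1_006

# Average clip duration implied by the dataset totals
avg_clip_seconds = total_hours * 3600 / total_clips

# Share of the listed split samples that goes to training
train_fraction = train_clips / (train_clips + val_clips)

print(f"Average clip length: {avg_clip_seconds:.2f} s")  # 6.01 s
print(f"Train fraction: {train_fraction:.1%}")           # 90.0%
```

The roughly 6-second average is consistent with the 5–8 second clip length reported elsewhere in this card.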
Training Procedure
Preprocessing
- Audio resampled to 16kHz mono.
- Text normalized to standard Nastaliq script.
- Clips whose transcripts contain fewer than 3 words were removed to ensure quality.
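The minimum-length filter can be sketched as follows. This is an illustration under assumptions, not the author's preprocessing script: the record fields `audio` and `text`, the file names, and the placeholder transcripts are hypothetical, and words are assumed to be separated by whitespace.

```python
def has_enough_words(text, min_words=3):
    """Return True if the transcript has at least `min_words` whitespace-separated tokens."""
    return len(text.split()) >= min_words


# Hypothetical records; real entries would hold Balti text in Nastaliq script.
records = [
    {"audio": "clip_001.wav", "text": "w1 w2 w3 w4"},
    {"audio": "clip_002.wav", "text": "w1 w2"},
    {"audio": "clip_003.wav", "text": "  w1   w2 w3 "},
]

# Keep only clips whose transcript has three or more words
kept = [r for r in records if has_enough_words(r["text"])]
print([r["audio"] for r in kept])
```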
Training Hyperparameters
- Base Model: openai/whisper-small
- Language Token: urdu (closest supported language to Balti in Whisper)
- Task: transcribe
- Learning Rate: 1e-5
- Batch Size: 8 (with gradient accumulation steps = 2)
- Max Steps: 1000
- Optimizer: AdamW
- Precision: fp16 (mixed precision)
- Gradient Checkpointing: Enabled
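To make the batch settings concrete, the sketch below works out the effective batch size and the approximate number of passes over the training set. The dictionary keys follow `Seq2SeqTrainingArguments` field names in Hugging Face Transformers, but the author's training script is not shown, so treat this mapping (and the single-GPU assumption) as illustrative.

```python
# Keys mirror Seq2SeqTrainingArguments field names; the mapping to the
# author's actual training script is an assumption.
training_config = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-5,
    "max_steps": 1000,
    "fp16": True,
    "gradient_checkpointing": True,
}
train_samples = 9_051  # from the Training Data section

# Samples contributing to each optimizer update (single GPU assumed)
effective_batch_size = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
)
samples_seen = effective_batch_size * training_config["max_steps"]
approx_epochs = samples_seen / train_samples

print(effective_batch_size, samples_seen, round(approx_epochs, 2))  # 16 16000 1.77
```

Under these assumptions the run covers a little under two epochs of the training split.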
Speeds, Sizes, Times
- Hardware: Google Colab T4 GPU (16GB VRAM)
- Training Time: ~1 hour 53 minutes
- Total FLOPs: ~4.6e18
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Validation Set: 1,006 unseen Balti audio clips from the BaltiVoice dataset.
Factors
- Language: Balti (bft)
- Script: Nastaliq
- Audio Length: 5–8 seconds per clip
Metrics
- Word Error Rate (WER): Primary metric for ASR performance. Lower is better.
- Training Loss: Cross-entropy loss during training.
- Validation Loss: Cross-entropy loss on unseen data.
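For readers unfamiliar with WER, the following is a minimal reference sketch that computes it as word-level edit distance divided by the number of reference words. It is not the evaluation code used for this card (the reported numbers presumably come from the Evaluate/Jiwer libraries listed under Software), and it applies no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over 4 reference words
example_wer = word_error_rate("a b c d", "a x c")
print(example_wer)  # 0.5
```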
Results
| Step | Training Loss | Validation Loss | WER |
|---|---|---|---|
| 250 | 0.7905 | 0.4037 | 40.19% |
| 500 | 0.5968 | 0.3208 | 33.37% |
| 750 | 0.4542 | 0.2963 | 31.37% |
| 1000 | 0.4652 | 0.2830 | 30.07% |
Summary
The model achieved a WER of 30.07% after 1000 training steps, a significant improvement over the zero-shot baseline (~95%+). Validation loss and WER decreased at every checkpoint, and training loss fell overall despite a slight rise between steps 750 and 1000 (0.4542 to 0.4652), which suggests healthy learning without overfitting.
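The checkpoint-to-checkpoint trend can be quantified directly from the results table. The short calculation below uses only the WER values reported above; the variable names are illustrative.

```python
# (step, WER in percent) taken from the results table
checkpoints = [(250, 40.19), (500, 33.37), (750, 31.37), (1000, 30.07)]

first_wer = checkpoints[0][1]
last_wer = checkpoints[-1][1]
absolute_drop = first_wer - last_wer        # percentage points
relative_drop = absolute_drop / first_wer   # fraction of the step-250 WER

# Change in WER between consecutive checkpoints (percentage points)
deltas = [
    (step, round(wer - prev_wer, 2))
    for (_, prev_wer), (step, wer) in zip(checkpoints, checkpoints[1:])
]

print(f"Drop from step 250 to 1000: {absolute_drop:.2f} points ({relative_drop:.1%} relative)")
print(deltas)
```

The per-checkpoint gains shrink from about 6.8 points to 1.3 points, so improvement was slowing by step 1000 but had not stopped.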
Environmental Impact
Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA Tesla T4 GPU
- Hours used: ~2 hours
- Cloud Provider: Google Colab
- Compute Region: US-Central (approximate)
- Carbon Emitted: ~0.1 kg CO2eq (estimated)
Training was performed on shared free-tier infrastructure to minimize environmental footprint.
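The order of magnitude of the estimate can be reproduced with a back-of-the-envelope calculation: energy used multiplied by grid carbon intensity. In the sketch below, the 70 W figure is the T4's rated board power, while the carbon intensity is an assumed illustrative value that does not come from this card; the calculator's actual inputs may differ.

```python
gpu_power_watts = 70     # NVIDIA T4 rated board power (TDP)
hours_used = 2.0         # from this card
carbon_intensity = 0.57  # kg CO2eq per kWh; ASSUMED illustrative value, not from the card

# Upper-bound energy estimate: GPU running at full rated power the whole time
energy_kwh = gpu_power_watts / 1000 * hours_used
emissions_kg = energy_kwh * carbon_intensity

print(f"Energy: {energy_kwh:.2f} kWh, emissions: {emissions_kg:.2f} kg CO2eq")
```

With these inputs the result lands close to the ~0.1 kg CO2eq reported above.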
Technical Specifications
Model Architecture and Objective
- Architecture: Whisper (Encoder-Decoder Transformer)
- Parameters: ~244 million (Whisper-small)
- Objective: Sequence-to-sequence speech recognition
Compute Infrastructure
Hardware
- GPU: NVIDIA Tesla T4 (16GB VRAM)
- CPU: Intel Xeon @ 2.20GHz (Colab standard)
Software
- Framework: PyTorch, HuggingFace Transformers 4.40+
- Libraries: Librosa, Datasets, Evaluate, Jiwer
Citation
If you use this model or dataset in your research, please cite:
@misc{ali2026baltivoice,
author = {Ali, Mohammad},
title = {BaltiVoice: First Public ASR Dataset and Model for the Low-Resource Tibetic Language Balti (bft)},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/mohdali1/baltivoice-asr},
note = {Model: https://huggingface.co/mohdali1/whisper-small-balti}
}
Glossary
- WER (Word Error Rate): A metric for evaluating ASR systems, calculated as (Substitutions + Deletions + Insertions) / Total Words.
- Nastaliq: A style of Islamic calligraphy used for writing Persian, Urdu, and Balti.
- Low-Resource Language: A language with limited digital resources (data, tools, models) available for NLP/ASR tasks.
More Information
- Project GitHub: github.com/mohdali-dev/baltivoice-asr
- Live Demo: HuggingFace Spaces
Model Card Authors
- Mohammad Ali (mohdali1)
Model Card Contact
- Email: alisdkse@gmail.com
- LinkedIn: https://linkedin.com/in/mohdali1
- GitHub Issues: Report Bugs/Feedback