# 🪶 Katib ASR: State-of-the-Art Pashto Speech Recognition
> *Listening to the voices that the AI boom forgot.*
Katib ASR is the most capable open-source Automatic Speech Recognition (ASR) model for the Pashto language (پښتو). Built on top of Whisper Large v3 and fine-tuned on the largest curated Pashto speech corpus assembled to date, Katib ASR brings real-time, highly accurate speech-to-text capabilities to millions of Pashto speakers.
---
## 🩸 The Story Behind Katib
Building state-of-the-art AI usually takes massive corporate research labs, entire teams of engineers, and unlimited compute. **Katib ASR was built entirely solo.**
The generative AI revolution is moving fast, but regional languages are being left behind. While developing voice-activated AI agents for medical clinics in Pakistan, the bottleneck became painfully clear: there was no reliable, high-fidelity transcription for Pashto.
Training an ASR model for a low-resource language is a grind. It meant hunting down scarce, fragmented audio datasets, writing custom text normalizers to fix broken Arabic-script transcriptions, and making the most of the available A100 GPU compute so that the model could learn the complex phonetics of native Pashto speakers.
Katib ASR is the result of that struggle — a dedicated, open-source model designed to give Pashto speakers a voice in the digital age.
---
## 🏆 Model Architecture & Performance
This is not a generic multilingual model. Katib ASR is a **dedicated, purpose-built Pashto ASR system** — the only model of its kind at this scale.
| Feature | Detail |
|---|---|
| 🧠 Base Model | Whisper Large v3 (1.55B parameters) |
| 🗣️ Language | Pashto (پښتو) — Afghan & Pakistani dialects |
| ⚡ Hardware | NVIDIA A100 80GB |
| 🔢 WER | **28.23%** — best published result for open Pashto ASR |
### Evaluation Results
Evaluated on a held-out Pashto test set not seen during training:
| Metric | Score |
|---|---|
| Word Error Rate (WER) | **28.23%** |
| Evaluation Loss | 0.3011 |
> 💡 **For context:** The base `whisper-large-v3` model — with no Pashto fine-tuning — produces largely garbled or Arabic-language output on Pashto audio. Katib ASR delivers coherent, structured transcriptions where the base model fails entirely.
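For readers new to the metric: WER is the number of word-level substitutions, deletions, and insertions needed to turn the model output into the reference, divided by the number of reference words. The sketch below is a minimal, dependency-free illustration of that definition. It is not the evaluation script behind the score above, and the function name `word_error_rate` is made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "زه غواړم چی ښار ته لاړ کړم"
hypothesis = "زه غواړم چی ښار لاړ کړم"  # one word dropped out of seven
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

A corpus-level WER is normally computed as total errors over total reference words across all utterances, rather than as the mean of per-utterance scores.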
---
## 📚 Datasets & Text Normalization
Katib ASR was trained on a multi-source, multi-dialect Pashto speech corpus carefully assembled and preprocessed from:
- Common Voice Pashto 24
- FLEURS Pashto
- A custom curated Pashto corpus of in-house recordings
### Custom Pashto Text Normalization
A key contribution of this model is a dedicated **Pashto text normalization pipeline** applied consistently to both training labels and inference output. It handles script variant inconsistencies across sources:
- Arabic Kaf (ك) → Pashto Kaf (ک)
- ݢ / گ → Pashto Gaf (ګ)
- Arabic Yey / Alef Maqsura variants → Pashto Yey (ی)
- All non-Arabic-script noise and punctuation removed
This ensures the model produces clean, standardized Pashto script regardless of the source audio's original transcription.
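The rules listed above can be sketched in a few lines of Python. This is an illustrative approximation, not the pipeline shipped with the model: the name `normalize_pashto`, the exact code points in the mapping table, and the decision to drop digits and diacritics along with punctuation are all assumptions made for this example.

```python
import unicodedata

# Assumed code points for the variants named in the list above.
CHAR_MAP = str.maketrans({
    "\u0643": "\u06A9",  # Arabic Kaf (ك)  -> Pashto Kaf (ک)
    "\u0762": "\u06AB",  # ݢ               -> Pashto Gaf (ګ)
    "\u06AF": "\u06AB",  # گ               -> Pashto Gaf (ګ)
    "\u064A": "\u06CC",  # Arabic Yey (ي)  -> Pashto Yey (ی)
    "\u0649": "\u06CC",  # Alef Maqsura (ى) -> Pashto Yey (ی)
})


def _is_arabic_letter(ch: str) -> bool:
    """True for letters in the Arabic and Arabic Supplement blocks."""
    cp = ord(ch)
    in_block = 0x0600 <= cp <= 0x06FF or 0x0750 <= cp <= 0x077F
    return in_block and unicodedata.category(ch).startswith("L")


def normalize_pashto(text: str) -> str:
    """Unify script variants, then keep only Arabic-script letters."""
    text = text.translate(CHAR_MAP)
    # Anything that is not an Arabic-script letter (Latin text, digits,
    # punctuation, diacritics) becomes a space.
    kept = "".join(ch if _is_arabic_letter(ch) else " " for ch in text)
    # Collapse the resulting runs of whitespace.
    return " ".join(kept.split())


print(normalize_pashto("hello، گل!"))
```

Applying the same function to training labels and to inference output keeps the two comparable, so that spelling variants are not counted as recognition errors.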
---
## 🚀 Quick Start
### Using the Pipeline (Recommended)
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="uzair0/Katib-ASR",
    torch_dtype="auto",
    device="cuda",       # use "cpu" if no GPU is available
    chunk_length_s=30,   # split long audio into 30-second chunks
)

result = asr("pashto_audio.wav")
print(result["text"])
# Example output: "زه غواړم چی ښار ته لاړ کړم"
```
### Direct Model Loading
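```python
import torch
import librosa  # assumption: any loader that returns 16 kHz mono float audio works
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("uzair0/Katib-ASR")
model = WhisperForConditionalGeneration.from_pretrained(
    "uzair0/Katib-ASR",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Pin the language and task instead of relying on auto-detection.
model.generation_config.language = "pashto"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = []

# Whisper expects 16 kHz mono audio.
audio, _ = librosa.load("pashto_audio.wav", sr=16000)
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
ids = model.generate(features.to("cuda", dtype=torch.bfloat16))
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```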
---
## ⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Base model | whisper-large-v3 |
| Precision | bfloat16 + TF32 |
| Effective batch size | 128 (64 × 2 grad accumulation) |
| Learning rate | 1e-5 (linear schedule) |
| Warmup steps | 92 |
| Epochs | 3 |
| Optimizer | AdamW Fused |
| Gradient checkpointing | ✅ Enabled |
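
For anyone trying to reproduce a similar run, the table can be written down as keyword arguments in the style of `transformers.Seq2SeqTrainingArguments`. This is a sketch under assumptions: the original training script is not published here, so the use of that class, a single GPU, and the `total_steps` value in the usage lines are all hypothetical. The `linear_lr` helper mirrors the usual linear warmup-then-decay rule to show what the schedule does.

```python
BASE_LR = 1e-5
WARMUP_STEPS = 92
PER_DEVICE_BATCH = 64
GRAD_ACCUM_STEPS = 2

# Values from the table, keyed by the argument names that
# Seq2SeqTrainingArguments accepts (assumed trainer, see lead-in).
training_kwargs = {
    "per_device_train_batch_size": PER_DEVICE_BATCH,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "learning_rate": BASE_LR,
    "lr_scheduler_type": "linear",
    "warmup_steps": WARMUP_STEPS,
    "num_train_epochs": 3,
    "bf16": True,
    "tf32": True,
    "optim": "adamw_torch_fused",
    "gradient_checkpointing": True,
}


def effective_batch_size(num_devices: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * num_devices


def linear_lr(step: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - WARMUP_STEPS)


print(effective_batch_size())              # 64 x 2 on one device
print(linear_lr(46, total_steps=1000))     # halfway through warmup
```

With `transformers` installed on a GPU machine, the dictionary could be passed as `Seq2SeqTrainingArguments(output_dir="out", **training_kwargs)`, where `output_dir` is a placeholder.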
---
## 👨‍💻 Author & Citation
Built from the ground up by **Muhammad Uzair** at the University of Peshawar.
If you use Katib ASR in your research or applications, please consider citing it:
```bibtex
@misc{katibasr2026,
title = {Katib ASR: State-of-the-Art Pashto Automatic Speech Recognition},
author = {Muhammad Uzair},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/uzair0/Katib-ASR}
}
```
---
*Built with ❤️ for the Pashto-speaking world.*