# 🪶 Katib ASR: State-of-the-Art Pashto Speech Recognition
> *Listening to the voices that the AI boom forgot.*
Katib ASR is the most capable open-source Automatic Speech Recognition (ASR) model for the Pashto language (پښتو). Built on top of Whisper Large v3 and fine-tuned on the largest curated Pashto speech corpus assembled to date, Katib ASR brings real-time, highly accurate speech-to-text capabilities to millions of Pashto speakers.
---
## 🩸 The Story Behind Katib
Building state-of-the-art AI usually takes massive corporate research labs, entire teams of engineers, and unlimited compute. **Katib ASR was built entirely solo.**
The generative AI revolution is moving fast, but regional languages are being left behind. While developing voice-activated AI agents for medical clinics in Pakistan, the bottleneck became painfully clear: there was no reliable, high-fidelity transcription for Pashto.
Training an ASR model for a low-resource language is a massive grind. It meant hunting down scarce, fragmented audio datasets, writing custom text normalizers to fix broken Arabic-script transcriptions, and maximizing A100 GPU compute to ensure the architecture could handle the complex phonetics of native Pashto speakers.
Katib ASR is the result of that struggle — a dedicated, open-source model designed to give Pashto speakers a voice in the digital age.
---
## 🏆 Model Architecture & Performance
This is not a generic multilingual model. Katib ASR is a **dedicated, purpose-built Pashto ASR system** — the only model of its kind at this scale.
| Feature | Detail |
|---|---|
| 🧠 Base Model | Whisper Large v3 (1.55B parameters) |
| 🗣️ Language | Pashto (پښتو) — Afghan & Pakistani dialects |
| ⚡ Hardware | NVIDIA A100 80GB |
| 🔢 WER | **28.23%** — best published result for open Pashto ASR |
### Evaluation Results
Evaluated on a held-out Pashto test set not seen during training:
| Metric | Score |
|---|---|
| Word Error Rate (WER) | **28.23%** |
| Evaluation Loss | 0.3011 |
> 💡 **For context:** The base `whisper-large-v3` model — with no Pashto fine-tuning — produces largely garbled or Arabic-language output on Pashto audio. Katib ASR delivers coherent, structured transcriptions where the base model fails entirely.
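The reported WER is the standard word-level edit distance: substitutions, insertions, and deletions, divided by the number of reference words. A minimal, self-contained sketch of the metric (illustrative only, not the project's actual evaluation script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-array Levenshtein distance over word sequences
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (r != h))     # substitution (or match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)

# One substituted word out of three reference words -> WER ≈ 0.333
print(wer("زه ښار ته", "زه کور ته"))
```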
---
## 📚 Datasets & Text Normalization
Katib ASR was trained on a multi-source, multi-dialect Pashto speech corpus carefully assembled and preprocessed from:
- Common Voice Pashto 24
- FLEURS Pashto
- A custom-curated corpus of in-house Pashto recordings
### Custom Pashto Text Normalization
A key contribution of this model is a dedicated **Pashto text normalization pipeline** applied consistently to both training labels and inference output. It handles script variant inconsistencies across sources:
- Arabic Kaf (ك) → Pashto Kaf (ک)
- ݢ / گ → Pashto Gaf (ګ)
- Arabic Yey / Alef Maqsura variants → Pashto Yey (ی)
- All non-Arabic-script noise and punctuation removed
This ensures the model produces clean, standardized Pashto script regardless of the source audio's original transcription.
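The rules above amount to a character-level mapping followed by a cleanup pass. A hypothetical sketch of such a normalizer (the model repo's actual pipeline may differ in its exact character set and regexes):

```python
import re

# Map Arabic-script variants to their standard Pashto forms
CHAR_MAP = str.maketrans({
    "\u0643": "\u06A9",  # Arabic Kaf (ك)      -> Pashto Kaf (ک)
    "\u0762": "\u06AB",  # Keheh with dot (ݢ)  -> Pashto Gaf (ګ)
    "\u06AF": "\u06AB",  # Persian Gaf (گ)     -> Pashto Gaf (ګ)
    "\u064A": "\u06CC",  # Arabic Yeh (ي)      -> Pashto Yey (ی)
    "\u0649": "\u06CC",  # Alef Maqsura (ى)    -> Pashto Yey (ی)
})

def normalize_pashto(text: str) -> str:
    text = text.translate(CHAR_MAP)
    # Drop everything outside the Arabic / Arabic Supplement blocks and whitespace
    text = re.sub(r"[^\u0600-\u06FF\u0750-\u077F\s]", "", text)
    # Remove Arabic punctuation (؟ ، ؛), which falls inside the blocks above
    text = re.sub(r"[\u061F\u060C\u061B]", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same function to training labels and inference output keeps the two distributions consistent, which is the point of the pipeline described above.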
---
## 🚀 Quick Start
### Using the Pipeline (Recommended)
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="uzair0/Katib-ASR",
    torch_dtype="auto",
    device="cuda",
    chunk_length_s=30,
)

result = asr("pashto_audio.wav")
print(result["text"])
# Example output: "زه غواړم چی ښار ته لاړ کړم"
```
### Direct Model Loading
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

processor = WhisperProcessor.from_pretrained("uzair0/Katib-ASR")
model = WhisperForConditionalGeneration.from_pretrained(
    "uzair0/Katib-ASR",
    torch_dtype=torch.bfloat16,
).to("cuda")

model.generation_config.language = "pashto"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = []

# Whisper expects 16 kHz mono audio
audio, _ = librosa.load("pashto_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

predicted_ids = model.generate(features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```
---
## ⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Base model | whisper-large-v3 |
| Precision | bfloat16 + TF32 |
| Effective batch size | 128 (64 × 2 grad accumulation) |
| Learning rate | 1e-5 (linear schedule) |
| Warmup steps | 92 |
| Epochs | 3 |
| Optimizer | AdamW Fused |
| Gradient checkpointing | ✅ Enabled |
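The table above maps directly onto Hugging Face `Seq2SeqTrainingArguments`. A hypothetical reconstruction of the equivalent setup (`output_dir` is a placeholder, and the actual run may have used additional arguments):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="katib-asr",          # placeholder path
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,   # effective batch size 128
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=92,
    num_train_epochs=3,
    bf16=True,                       # bfloat16 mixed precision
    tf32=True,                       # TF32 matmuls on Ampere (A100)
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
)
```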
---
## 👨‍💻 Author & Citation
Built from the ground up by **Muhammad Uzair** at the University of Peshawar.
If you use Katib ASR in your research or applications, please consider citing it:
```bibtex
@misc{katibasr2026,
  title     = {Katib ASR: State-of-the-Art Pashto Automatic Speech Recognition},
  author    = {Muhammad Uzair},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/uzair0/Katib-ASR}
}
```
---
*Built with ❤️ for the Pashto-speaking world.*