# 🪶 Katib ASR: State-of-the-Art Pashto Speech Recognition

> *Listening to the voices that the AI boom forgot.*

Katib ASR is the most capable open-source Automatic Speech Recognition (ASR) model for the Pashto language (پښتو). Built on top of Whisper Large v3 and fine-tuned on the largest curated Pashto speech corpus assembled to date, Katib ASR brings real-time, highly accurate speech-to-text capabilities to millions of Pashto speakers.

---

## 🩸 The Story Behind Katib

Building state-of-the-art AI usually takes massive corporate research labs, entire teams of engineers, and unlimited compute. **Katib ASR was built entirely solo.**

The generative AI revolution is moving fast, but regional languages are being left behind. While developing voice-activated AI agents for medical clinics in Pakistan, the bottleneck became painfully clear: there was no reliable, high-fidelity transcription for Pashto.

Training an ASR model for a low-resource language is a massive grind. It meant hunting down scarce, fragmented audio datasets, writing custom text normalizers to fix broken Arabic-script transcriptions, and squeezing every bit of A100 GPU compute out of the training runs so the model could handle the complex phonetics of native Pashto speech.

Katib ASR is the result of that struggle — a dedicated, open-source model designed to give Pashto speakers a voice in the digital age.

---

## 🏆 Model Architecture & Performance

This is not a generic multilingual model. Katib ASR is a **dedicated, purpose-built Pashto ASR system** — the only model of its kind at this scale.
| Feature | Detail |
|---|---|
| 🧠 Base Model | Whisper Large v3 (1.55B parameters) |
| 🗣️ Language | Pashto (پښتو) — Afghan & Pakistani dialects |
| ⚡ Hardware | NVIDIA A100 80GB |
| 🔢 WER | **28.23%** — best published result for open Pashto ASR |

### Evaluation Results

Evaluated on a held-out Pashto test set not seen during training:

| Metric | Score |
|---|---|
| Word Error Rate (WER) | **28.23%** |
| Evaluation Loss | 0.3011 |

> 💡 **For context:** The base `whisper-large-v3` model — with no Pashto fine-tuning — produces largely garbled or Arabic-language output on Pashto audio. Katib ASR delivers coherent, structured transcriptions where the base model fails entirely.

---

## 📚 Datasets & Text Normalization

Katib ASR was trained on a multi-source, multi-dialect Pashto speech corpus carefully assembled and preprocessed from:

- Common Voice Pashto 24
- FLEURS Pashto
- A custom curated corpus of in-house Pashto recordings

### Custom Pashto Text Normalization

A key contribution of this model is a dedicated **Pashto text normalization pipeline**, applied consistently to both training labels and inference output. It resolves script-variant inconsistencies across sources:

- Arabic Kaf (ك) → Pashto Kaf (ک)
- ݢ / گ → Pashto Gaf (ګ)
- Arabic Yey / Alef Maqsura variants → Pashto Yey (ی)
- All non-Arabic-script noise and punctuation removed

This ensures the model produces clean, standardized Pashto script regardless of the source audio's original transcription conventions.
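The mappings above can be sketched as a small normalization function. This is a minimal illustration of the idea, not the model's actual pipeline: the `normalize_pashto` name, the exact punctuation list, and the Unicode ranges used to filter non-Arabic-script noise are all assumptions.

```python
import re

# Unify script variants: map Arabic/ambiguous codepoints to standard Pashto letters.
CHAR_MAP = str.maketrans({
    "\u0643": "\u06A9",  # Arabic Kaf (ك)   -> Pashto Kaf (ک)
    "\u06AF": "\u06AB",  # Gaf (گ)          -> Pashto Gaf (ګ)
    "\u0762": "\u06AB",  # ݢ                -> Pashto Gaf (ګ)
    "\u064A": "\u06CC",  # Arabic Yeh (ي)   -> Pashto Yey (ی)
    "\u0649": "\u06CC",  # Alef Maqsura (ى) -> Pashto Yey (ی)
})

# Arabic-script punctuation to drop (comma, semicolon, question mark, full stop).
_ARABIC_PUNCT = re.compile(r"[\u060C\u061B\u061F\u06D4]")

# Everything outside the core Arabic blocks and whitespace counts as noise.
_NON_ARABIC = re.compile(r"[^\u0600-\u06FF\u0750-\u077F\s]")

def normalize_pashto(text: str) -> str:
    text = text.translate(CHAR_MAP)     # unify script variants
    text = _ARABIC_PUNCT.sub("", text)  # strip Arabic punctuation
    text = _NON_ARABIC.sub("", text)    # strip Latin noise, digits, symbols
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same function to both training labels and decoded hypotheses keeps the WER computation consistent, so the metric measures recognition quality rather than orthographic disagreement between sources.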
---

## 🚀 Quick Start

### Using the Pipeline (Recommended)

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="uzair0/Katib-ASR",
    torch_dtype="auto",
    device="cuda",
    chunk_length_s=30,
)

result = asr("pashto_audio.wav")
print(result["text"])
# Example output: "زه غواړم چی ښار ته لاړ کړم"
```

### Direct Model Loading

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("uzair0/Katib-ASR")
model = WhisperForConditionalGeneration.from_pretrained(
    "uzair0/Katib-ASR", torch_dtype=torch.bfloat16
).to("cuda")

model.generation_config.language = "pashto"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = []

# Transcribe a 16 kHz mono waveform (e.g. loaded with librosa or torchaudio)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)
predicted_ids = model.generate(input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```

---

## ⚙️ Training Configuration

| Parameter | Value |
|---|---|
| Base model | whisper-large-v3 |
| Precision | bfloat16 + TF32 |
| Effective batch size | 128 (64 × 2 gradient accumulation) |
| Learning rate | 1e-5 (linear schedule) |
| Warmup steps | 92 |
| Epochs | 3 |
| Optimizer | AdamW (fused) |
| Gradient checkpointing | ✅ Enabled |

---

## 👨‍💻 Author & Citation

Built from the ground up by **Muhammad Uzair** at the University of Peshawar. If you use Katib ASR in your research or applications, please consider citing it:

```bibtex
@misc{katibasr2026,
  title     = {Katib ASR: State-of-the-Art Pashto Automatic Speech Recognition},
  author    = {Muhammad Uzair},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/uzair0/Katib-ASR}
}
```

---

*Built with ❤️ for the Pashto-speaking world.*