# 🪶 Katib ASR: State-of-the-Art Pashto Speech Recognition

> *Listening to the voices that the AI boom forgot.*

Katib ASR is the most capable open-source Automatic Speech Recognition (ASR) model for the Pashto language (پښتو). Built on top of Whisper Large v3 and fine-tuned on the largest curated Pashto speech corpus assembled to date, Katib ASR brings real-time, highly accurate speech-to-text capabilities to millions of Pashto speakers.

---

## 🩸 The Story Behind Katib

Building state-of-the-art AI usually takes massive corporate research labs, entire teams of engineers, and unlimited compute. **Katib ASR was built entirely solo.**

The generative AI revolution is moving fast, but regional languages are being left behind. While developing voice-activated AI agents for medical clinics in Pakistan, one bottleneck became painfully clear: there was no reliable, high-fidelity transcription option for Pashto.

Training an ASR model for a low-resource language is a massive grind. It meant hunting down scarce, fragmented audio datasets, writing custom text normalizers to fix broken Arabic-script transcriptions, and stretching limited A100 GPU compute to ensure the architecture could handle the complex phonetics of native Pashto speech.

Katib ASR is the result of that struggle: a dedicated, open-source model designed to give Pashto speakers a voice in the digital age.

---

## 🏆 Model Architecture & Performance

This is not a generic multilingual model. Katib ASR is a **dedicated, purpose-built Pashto ASR system**, and to our knowledge the only open model of its kind at this scale.

| Feature | Detail |
|---|---|
| 🧠 Base Model | Whisper Large v3 (1.55B parameters) |
| 🗣️ Language | Pashto (پښتو), Afghan & Pakistani dialects |
| ⚡ Hardware | NVIDIA A100 80GB |
| 🔢 WER | **28.23%**, the best published result for open Pashto ASR |

### Evaluation Results

Evaluated on a held-out Pashto test set not seen during training:

| Metric | Score |
|---|---|
| Word Error Rate (WER) | **28.23%** |
| Evaluation Loss | 0.3011 |

> 💡 **For context:** The base `whisper-large-v3` model, with no Pashto fine-tuning, produces largely garbled or Arabic-language output on Pashto audio. Katib ASR delivers coherent, structured transcriptions where the base model fails entirely.

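WER is the word-level edit distance (substitutions + insertions + deletions) between the reference and the hypothesis, divided by the number of reference words. A minimal pure-Python sketch of the metric, using hypothetical example strings rather than anything from the actual test set:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: 1 substitution over 4 reference words
print(wer("a b c d", "a b x d"))  # 0.25
```

So a WER of 28.23% means roughly 28 word-level errors per 100 reference words.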
---

## 📚 Datasets & Text Normalization

Katib ASR was trained on a multi-source, multi-dialect Pashto speech corpus carefully assembled and preprocessed from:

- Common Voice Pashto 24
- FLEURS Pashto
- A custom-curated corpus of in-house Pashto recordings

### Custom Pashto Text Normalization

A key contribution of this model is a dedicated **Pashto text normalization pipeline** applied consistently to both training labels and inference output. It handles script-variant inconsistencies across sources:

- Arabic Kaf (ك) → Pashto Kaf (ک)
- ݢ / گ → Pashto Gaf (ګ)
- Arabic Yeh / Alef Maqsura variants → Pashto Yeh (ی)
- All non-Arabic-script noise and punctuation removed

This ensures the model produces clean, standardized Pashto script regardless of each source's original transcription conventions.

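The mappings above can be sketched as a small normalizer. This is an illustrative reconstruction from the listed rules, not the exact pipeline shipped with the model; the character table and the "keep Arabic-script only" regex are assumptions:

```python
import re

# Illustrative character mappings following the rules listed above
CHAR_MAP = {
    "\u0643": "\u06a9",  # Arabic Kaf ك  -> Pashto Kaf ک
    "\u0762": "\u06ab",  # ݢ             -> Pashto Gaf ګ
    "\u06af": "\u06ab",  # Persian Gaf گ -> Pashto Gaf ګ
    "\u064a": "\u06cc",  # Arabic Yeh ي  -> Yeh ی
    "\u0649": "\u06cc",  # Alef Maqsura ى -> Yeh ی
}

def normalize_pashto(text: str) -> str:
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)
    # Keep Arabic-script characters and whitespace; drop everything else
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_pashto("كتاب!"))  # کتاب (Arabic Kaf normalized, punctuation dropped)
```

Applying the same normalizer to training labels and decoded output keeps the WER measurement from penalizing pure script-variant differences.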
---

## 🚀 Quick Start

### Using the Pipeline (Recommended)

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="uzair0/Katib-ASR",
    torch_dtype="auto",
    device="cuda",
    chunk_length_s=30,
)

result = asr("pashto_audio.wav")
print(result["text"])
# Example output: "زه غواړم چی ښار ته لاړ کړم" ("I want to go to the city")
```

### Direct Model Loading

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa  # one convenient way to load audio at Whisper's expected 16 kHz

processor = WhisperProcessor.from_pretrained("uzair0/Katib-ASR")
model = WhisperForConditionalGeneration.from_pretrained(
    "uzair0/Katib-ASR",
    torch_dtype=torch.bfloat16,
).to("cuda")

model.generation_config.language = "pashto"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = []

# Load and resample the audio to 16 kHz, then extract log-mel features
audio, _ = librosa.load("pashto_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Generate and decode the transcription
predicted_ids = model.generate(input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```

---

## ⚙️ Training Configuration

| Parameter | Value |
|---|---|
| Base model | whisper-large-v3 |
| Precision | bfloat16 + TF32 |
| Effective batch size | 128 (64 × 2 gradient accumulation) |
| Learning rate | 1e-5 (linear schedule) |
| Warmup steps | 92 |
| Epochs | 3 |
| Optimizer | AdamW (fused) |
| Gradient checkpointing | ✅ Enabled |

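For reproducibility, the table above maps roughly onto a `Seq2SeqTrainingArguments` configuration like the following. This is a sketch under the stated hyperparameters, not the actual training script; the `output_dir` name is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="katib-asr",             # placeholder output path
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,      # effective batch size: 64 * 2 = 128
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=92,
    num_train_epochs=3,
    bf16=True,                          # bfloat16 mixed precision
    tf32=True,                          # TF32 matmuls on Ampere GPUs (A100)
    optim="adamw_torch_fused",          # fused AdamW
    gradient_checkpointing=True,        # trade compute for memory
)
```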
---

## 👨‍💻 Author & Citation

Built from the ground up by **Muhammad Uzair** at the University of Peshawar.

If you use Katib ASR in your research or applications, please consider citing it:

```bibtex
@misc{katibasr2026,
  title     = {Katib ASR: State-of-the-Art Pashto Automatic Speech Recognition},
  author    = {Muhammad Uzair},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/uzair0/Katib-ASR}
}
```

---

*Built with ❤️ for the Pashto-speaking world.*