ZarosASR — Whisper Small fine-tuned on Central Kurdish (Sorani)
ZarosASR is a fine-tuned version of openai/whisper-small for Central Kurdish (Sorani / CKB) automatic speech recognition, developed as part of a thesis project on low-resource Kurdish ASR.
The model was fine-tuned using LoRA (PEFT) on Mozilla Common Voice 24.0 CKB, then the adapter was merged into the base Whisper small weights for standalone inference.
Language token note: Since Whisper has no native CKB language token, training uses a Persian token hijack strategy: the `<|fa|>` decoder prompt token is repurposed to condition the model on Central Kurdish audio. This is a standard technique for extending Whisper to unsupported languages using a phonologically adjacent language's token.
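Concretely, the hijack only affects the special-token prompt the decoder is conditioned on. The sketch below shows that prompt layout as token strings (illustrative only; real inference uses the tokenizer's token ids, as in the usage examples later in this card):

```python
def build_whisper_prompt(language: str = "fa", task: str = "transcribe") -> list[str]:
    """Return the decoder prompt token strings Whisper is conditioned on."""
    return [
        "<|startoftranscript|>",
        f"<|{language}|>",   # language slot -- hijacked: fa stands in for ckb
        f"<|{task}|>",
        "<|notimestamps|>",  # default when timestamp prediction is disabled
    ]

print(build_whisper_prompt())
```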
Model Details
- Developed by: Section (thesis project)
- Model type: Automatic Speech Recognition (Seq2Seq Transformer)
- Language(s): Central Kurdish — Sorani (`ckb`)
- License: Apache 2.0
- Fine-tuned from: openai/whisper-small (multilingual)
Performance
| Split | WER (%) |
|---|---|
| Test | 7.96 |
Training converged over ~8,300 steps across ~11 epochs, with eval WER dropping from ~35% at the first checkpoint to 7.96% at step 8,250.
| Step | Train Loss | Eval Loss | Eval WER (%) |
|---|---|---|---|
| 750 | 0.211 | 0.177 | 35.26 |
| 1500 | 0.157 | 0.139 | 29.11 |
| 2250 | 0.141 | 0.129 | 26.97 |
| 3000 | 0.122 | 0.119 | 25.26 |
| 3750 | 0.102 | 0.104 | 23.24 |
| 4500 | 0.084 | 0.086 | 19.06 |
| 5250 | 0.068 | 0.072 | 15.54 |
| 6000 | 0.050 | 0.063 | 13.18 |
| 6750 | 0.034 | 0.053 | 10.70 |
| 7500 | 0.021 | 0.047 | 9.26 |
| 8250 | 0.016 | 0.043 | 7.96 |
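WER here is the word-level edit distance between hypothesis and reference, divided by the number of reference words. For reference, a minimal pure-Python implementation (evaluation pipelines more commonly use a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over word sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,         # deletion
                d[j - 1] + 1,     # insertion
                prev + (r != h),  # substitution (cost 0 if words match)
            )
    return d[-1] / len(ref)
```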
How to Get Started
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="SECT19N/whisper-small-ckb-merged",
    # "persian" selects the hijacked <|fa|> token; output is Central Kurdish
    generate_kwargs={"language": "persian", "task": "transcribe"},
)

result = pipe("audio.wav")
print(result["text"])
```
Or using the model and processor directly:
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained(
    "SECT19N/whisper-small-ckb-merged", language="Persian", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("SECT19N/whisper-small-ckb-merged")
model.eval()

# Use the Persian token (<|fa|>) as the decoder prompt — CKB hijack
forced_decoder_ids = processor.get_decoder_prompt_ids(language="fa", task="transcribe")
model.generation_config.forced_decoder_ids = forced_decoder_ids
model.generation_config.language = "fa"
model.generation_config.task = "transcribe"

# audio_array: np.ndarray at 16000 Hz sample rate
inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    return_attention_mask=True,
)

with torch.no_grad():
    predicted_ids = model.generate(
        inputs["input_features"],
        attention_mask=inputs["attention_mask"],
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
Training Details
Training Data
Common Voice Scripted Speech 24.0 - Central Kurdish
| Split | Samples |
|---|---|
| Train | 95,895 |
| Validation | 11,987 |
| Test | 11,987 |
Preprocessing & Augmentation
Audio was resampled to 16 kHz mono and processed through Whisper's log-mel feature extractor
with return_attention_mask=True. Training samples were augmented on-the-fly with:
- Volume perturbation (scale ×0.7–1.3, p=0.4)
- Gaussian noise (σ=0.001–0.006, p=0.3)
- Speed perturbation via resampling (factor ×0.95–1.05, p=0.2)
No augmentation was applied to validation or test sets.
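As a sketch, the on-the-fly augmentation above can be expressed in NumPy. Only the ranges and probabilities come from the list above; the exact sampling of the random parameters and the resampling method used in training are assumptions (linear interpolation is used here as a stand-in):

```python
import numpy as np

def augment(wave: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply the three waveform augmentations with the stated probabilities."""
    # Volume perturbation: random gain in [0.7, 1.3], applied with p=0.4.
    if rng.random() < 0.4:
        wave = wave * rng.uniform(0.7, 1.3)
    # Additive Gaussian noise with sigma drawn from [0.001, 0.006], p=0.3.
    if rng.random() < 0.3:
        wave = wave + rng.normal(0.0, rng.uniform(0.001, 0.006), size=wave.shape)
    # Speed perturbation via resampling by a factor in [0.95, 1.05], p=0.2.
    if rng.random() < 0.2:
        factor = rng.uniform(0.95, 1.05)
        n_out = max(1, int(round(len(wave) / factor)))
        wave = np.interp(
            np.linspace(0, len(wave) - 1, n_out), np.arange(len(wave)), wave
        )
    return wave.astype(np.float32)
```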
Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Precision | BF16 + TF32 |
| Per-device batch size (train) | 128 |
| Per-device batch size (eval) | 128 |
| Gradient accumulation steps | 1 |
| Learning rate | 1e-3 |
| LR scheduler | Cosine |
| Warmup steps | 375 |
| Max epochs | 15 |
| Eval & save interval | 750 steps |
| Best model metric | WER (lower is better) |
| Gradient checkpointing | Enabled |
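These numbers are mutually consistent: with 95,895 training samples and an effective batch size of 128 (no gradient accumulation, assuming single-GPU training), one epoch is about 750 optimizer steps, matching the 750-step eval interval and putting the best checkpoint at step 8,250 at roughly 11 epochs:

```python
train_samples = 95_895
batch_size = 128                                   # per-device, grad accumulation = 1
steps_per_epoch = -(-train_samples // batch_size)  # ceiling division -> 750
best_step = 8_250
epochs_at_best = best_step / steps_per_epoch       # ~11 epochs
print(steps_per_epoch, round(epochs_at_best, 1))
```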
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target modules | q_proj, v_proj, k_proj, out_proj, fc1, fc2 |
| Bias | none |
| Task type | SEQ_2_SEQ_LM |
| Adapter size | ~52 MB (pre-merge) |
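The ~52 MB adapter size is consistent with this configuration. Whisper-small has d_model = 768, FFN width 3072, and 12 encoder plus 12 decoder layers (the decoder adds a cross-attention block), and each targeted projection gains r × (d_in + d_out) LoRA parameters. A back-of-the-envelope check, assuming the adapter weights are stored in fp32:

```python
r = 32
d_model, d_ffn = 768, 3072           # whisper-small dimensions

def lora_params(d_in: int, d_out: int) -> int:
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

attn = 4 * lora_params(d_model, d_model)                    # q/k/v/out_proj
ffn = lora_params(d_model, d_ffn) + lora_params(d_ffn, d_model)  # fc1, fc2

encoder = 12 * (attn + ffn)          # self-attention only
decoder = 12 * (2 * attn + ffn)      # self- + cross-attention
total = encoder + decoder            # ~13.0M trainable parameters

print(total, total * 4 / 1e6)        # fp32: ~51.9 MB
```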
Training was run in multiple resumed sessions on Modal (A10G GPU).
Uses
Direct Use
Transcribing Central Kurdish (Sorani) speech to text. Suitable for:
- Voice-to-text applications for Sorani Kurdish speakers
- Subtitle and caption generation for CKB audio/video content
- Research and downstream NLP tasks on Kurdish text
Out-of-Scope Use
- Not suitable for Kurmanji (`kmr`), Zazaki, or other Kurdish dialects without further fine-tuning.
- Not evaluated on telephone-quality, heavily noisy, or far-field audio.
- Not intended for real-time streaming ASR without additional latency optimization.
Bias, Risks, and Limitations
- Training data is crowdsourced from Common Voice and may not represent all regional accents, age groups, or speaking styles within the Sorani-speaking community.
- The Persian token hijack (`<|fa|>`) is a pragmatic workaround; the model has no explicit linguistic knowledge that it is processing Kurdish rather than Persian.
- Performance may degrade on accents, domains, or recording conditions outside Common Voice.
- The model inherits any biases present in the Whisper small multilingual base model.
Environmental Impact
Training was performed on Modal cloud infrastructure.
- Hardware: NVIDIA A100 40GB GPU
- Carbon emissions can be estimated using the Machine Learning Impact calculator.
Citation
```bibtex
@misc{zaros_asr_2026,
  title        = {ZarosASR: Fine-tuning Whisper for Central Kurdish (Sorani) Speech Recognition},
  author       = {Yusf Idres},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/SECT19N/whisper-small-ckb-merged}},
  note         = {LoRA-adapted Whisper small on Mozilla Common Voice 24.0 CKB}
}
```
Model Card Contact
For questions or issues, open a discussion on the model repository. If you are building a commercial product on this model, we would appreciate you reaching out.