ZarosASR — Whisper Small fine-tuned on Central Kurdish (Sorani)
ZarosASR is a fine-tuned version of openai/whisper-small for Central Kurdish (Sorani / CKB) automatic speech recognition, developed as part of a thesis project on low-resource Kurdish ASR.
The model was fine-tuned using LoRA (PEFT) on Mozilla Common Voice 24.0 CKB, then the adapter was merged into the base Whisper small weights for standalone inference.
Language token note: Since Whisper has no native CKB language token, training uses a Persian token hijack strategy: the `<|fa|>` decoder prompt token is repurposed to condition the model on Central Kurdish audio. This is a standard technique for extending Whisper to unsupported languages using a phonologically adjacent language's token.
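Concretely, the hijack only affects the special-token prompt the decoder is conditioned on. The sketch below shows that prompt layout as token strings (illustrative only; real inference uses the tokenizer's token ids, as in the usage examples later in this card):

```python
def build_whisper_prompt(language: str = "fa", task: str = "transcribe") -> list[str]:
    """Return the decoder prompt token strings Whisper is conditioned on."""
    return [
        "<|startoftranscript|>",
        f"<|{language}|>",   # language slot -- hijacked: fa stands in for ckb
        f"<|{task}|>",
        "<|notimestamps|>",  # default when timestamp prediction is disabled
    ]

print(build_whisper_prompt())
```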
Model Details
- Developed by: Section (thesis project)
- Model type: Automatic Speech Recognition (Seq2Seq Transformer)
- Language(s): Central Kurdish — Sorani (`ckb`)
- License: Apache 2.0
- Fine-tuned from: openai/whisper-small (multilingual)
Performance
| Split | WER (%) |
|---|---|
| Test | 7.96 |
Training converged over ~8,300 steps across ~11 epochs, with eval WER dropping from ~35% at the first checkpoint to 7.96% at step 8,250.
| Step | Train Loss | Eval Loss | Eval WER (%) |
|---|---|---|---|
| 750 | 0.211 | 0.177 | 35.26 |
| 1500 | 0.157 | 0.139 | 29.11 |
| 2250 | 0.141 | 0.129 | 26.97 |
| 3000 | 0.122 | 0.119 | 25.26 |
| 3750 | 0.102 | 0.104 | 23.24 |
| 4500 | 0.084 | 0.086 | 19.06 |
| 5250 | 0.068 | 0.072 | 15.54 |
| 6000 | 0.050 | 0.063 | 13.18 |
| 6750 | 0.034 | 0.053 | 10.70 |
| 7500 | 0.021 | 0.047 | 9.26 |
| 8250 | 0.016 | 0.043 | 7.96 |
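WER here is the word-level edit distance between hypothesis and reference, divided by the number of reference words. For reference, a minimal pure-Python implementation (evaluation pipelines more commonly use a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over word sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,         # deletion
                d[j - 1] + 1,     # insertion
                prev + (r != h),  # substitution (cost 0 if words match)
            )
    return d[-1] / len(ref)
```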
How to Get Started
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="SECT19N/whisper-small-ckb-merged",
    # "persian" selects the hijacked <|fa|> token; output is Central Kurdish
    generate_kwargs={"language": "persian", "task": "transcribe"},
)

result = pipe("audio.wav")
print(result["text"])
```
Or using the model and processor directly:
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained(
    "SECT19N/whisper-small-ckb-merged", language="Persian", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("SECT19N/whisper-small-ckb-merged")
model.eval()

# Use the Persian token (<|fa|>) as the decoder prompt — CKB hijack
forced_decoder_ids = processor.get_decoder_prompt_ids(language="fa", task="transcribe")
model.generation_config.forced_decoder_ids = forced_decoder_ids
model.generation_config.language = "fa"
model.generation_config.task = "transcribe"

# audio_array: np.ndarray at 16000 Hz sample rate
inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    return_attention_mask=True,
)

with torch.no_grad():
    predicted_ids = model.generate(
        inputs["input_features"],
        attention_mask=inputs["attention_mask"],
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
Training Details
Training Data
Common Voice Scripted Speech 24.0 - Central Kurdish
| Split | Samples |
|---|---|
| Train | 95,895 |
| Validation | 11,987 |
| Test | 11,987 |
Preprocessing & Augmentation
Audio was resampled to 16 kHz mono and processed through Whisper's log-mel feature extractor
with return_attention_mask=True. Training samples were augmented on-the-fly with:
- Volume perturbation (scale ×0.7–1.3, p=0.4)
- Gaussian noise (σ=0.001–0.006, p=0.3)
- Speed perturbation via resampling (factor ×0.95–1.05, p=0.2)
No augmentation was applied to validation or test sets.
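As a sketch, the on-the-fly augmentation above can be expressed in NumPy. Only the ranges and probabilities come from the list above; the exact sampling of the random parameters and the resampling method used in training are assumptions (linear interpolation is used here as a stand-in):

```python
import numpy as np

def augment(wave: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply the three waveform augmentations with the stated probabilities."""
    # Volume perturbation: random gain in [0.7, 1.3], applied with p=0.4.
    if rng.random() < 0.4:
        wave = wave * rng.uniform(0.7, 1.3)
    # Additive Gaussian noise with sigma drawn from [0.001, 0.006], p=0.3.
    if rng.random() < 0.3:
        wave = wave + rng.normal(0.0, rng.uniform(0.001, 0.006), size=wave.shape)
    # Speed perturbation via resampling by a factor in [0.95, 1.05], p=0.2.
    if rng.random() < 0.2:
        factor = rng.uniform(0.95, 1.05)
        n_out = max(1, int(round(len(wave) / factor)))
        wave = np.interp(
            np.linspace(0, len(wave) - 1, n_out), np.arange(len(wave)), wave
        )
    return wave.astype(np.float32)
```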
Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Precision | BF16 + TF32 |
| Per-device batch size (train) | 128 |
| Per-device batch size (eval) | 128 |
| Gradient accumulation steps | 1 |
| Learning rate | 1e-3 |
| LR scheduler | Cosine |
| Warmup steps | 375 |
| Max epochs | 15 |
| Eval & save interval | 750 steps |
| Best model metric | WER (lower is better) |
| Gradient checkpointing | Enabled |
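These numbers are mutually consistent: with 95,895 training samples and an effective batch size of 128 (no gradient accumulation, assuming single-GPU training), one epoch is about 750 optimizer steps, matching the 750-step eval interval and putting the best checkpoint at step 8,250 at roughly 11 epochs:

```python
train_samples = 95_895
batch_size = 128                                   # per-device, grad accumulation = 1
steps_per_epoch = -(-train_samples // batch_size)  # ceiling division -> 750
best_step = 8_250
epochs_at_best = best_step / steps_per_epoch       # ~11 epochs
print(steps_per_epoch, round(epochs_at_best, 1))
```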
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.1 |
| Target modules | q_proj, v_proj, k_proj, out_proj, fc1, fc2 |
| Bias | none |
| Task type | SEQ_2_SEQ_LM |
| Adapter size | ~52 MB (pre-merge) |
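The ~52 MB adapter size is consistent with this configuration. Whisper-small has d_model = 768, FFN width 3072, and 12 encoder plus 12 decoder layers (the decoder adds a cross-attention block), and each targeted projection gains r × (d_in + d_out) LoRA parameters. A back-of-the-envelope check, assuming the adapter weights are stored in fp32:

```python
r = 32
d_model, d_ffn = 768, 3072           # whisper-small dimensions

def lora_params(d_in: int, d_out: int) -> int:
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

attn = 4 * lora_params(d_model, d_model)                    # q/k/v/out_proj
ffn = lora_params(d_model, d_ffn) + lora_params(d_ffn, d_model)  # fc1, fc2

encoder = 12 * (attn + ffn)          # self-attention only
decoder = 12 * (2 * attn + ffn)      # self- + cross-attention
total = encoder + decoder            # ~13.0M trainable parameters

print(total, total * 4 / 1e6)        # fp32: ~51.9 MB
```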
Training was run in multiple resumed sessions on Modal (A10G GPU).
Uses
Direct Use
Transcribing Central Kurdish (Sorani) speech to text. Suitable for:
- Voice-to-text applications for Sorani Kurdish speakers
- Subtitle and caption generation for CKB audio/video content
- Research and downstream NLP tasks on Kurdish text
Out-of-Scope Use
- Not suitable for Kurmanji (`kmr`), Zazaki, or other Kurdish dialects without further fine-tuning.
- Not evaluated on telephone-quality, heavily noisy, or far-field audio.
- Not intended for real-time streaming ASR without additional latency optimization.
Bias, Risks, and Limitations
- Training data is crowdsourced from Common Voice and may not represent all regional accents, age groups, or speaking styles within the Sorani-speaking community.
- The Persian token hijack (`<|fa|>`) is a pragmatic workaround; the model has no explicit linguistic knowledge that it is processing Kurdish rather than Persian.
- Performance may degrade on accents, domains, or recording conditions outside Common Voice.
- The model inherits any biases present in the Whisper small multilingual base model.
Environmental Impact
Training was performed on Modal cloud infrastructure.
- Hardware: NVIDIA A100 40GB GPU
- Carbon emissions can be estimated using the Machine Learning Impact calculator.
Citation
```bibtex
@misc{zaros_asr_2026,
  title        = {ZarosASR: Fine-tuning Whisper for Central Kurdish (Sorani) Speech Recognition},
  author       = {Yusf Idres},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/SECT19N/whisper-small-ckb-merged}},
  note         = {LoRA-adapted Whisper small on Mozilla Common Voice 24.0 CKB}
}
```
Model Card Contact
For questions or issues, open a discussion on the model repository. If you are building a commercial product on this model, we would appreciate you reaching out.