---
base_model: openai/whisper-small
library_name: peft
tags:
- whisper
- lora
- transformers
- speech-to-text
- persian
- stt
- fine-tune
- adapter
datasets:
- vhdm/persian-voice-v1
language:
- fa
metrics:
- cer
- wer
---

# Whisper-Small Persian STT: LoRA Fine-Tuned

A fine-tuned version of **openai/whisper-small** for **Persian speech-to-text (ASR)** using **LoRA**.
This model is optimized for Persian conversational speech and clean, dataset-quality audio.

---

## Model Details

### Model Description

This model is a **LoRA fine-tuned Whisper Small** focused on **Persian (fa)** speech recognition.
It improves transcription accuracy on standard Persian audio segments (16 kHz, mono, normalized WAV).

- **Developed by:** *Mehdi Pouladrag*
- **Model type:** Speech-to-Text (ASR), Whisper Small (Seq2Seq Transformer)
- **Language(s):** Persian (fa)
- **License:** MIT
- **Finetuned from:** `openai/whisper-small`
- **Dataset:** `vhdm/persian-voice-v1` (single dataset)
- **Training technique:** LoRA (Low-Rank Adaptation)
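
For readers unfamiliar with the technique, the LoRA update can be written as below. This is the general formulation of the method, not a statement of this adapter's settings: the rank $r$ and scaling factor $\alpha$ used for this model are not given in this card, and the $r = 8$ in the worked count is purely an illustrative assumption.

```latex
W = W_0 + \frac{\alpha}{r} B A, \qquad
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

% Trainable parameters per adapted matrix: r (d + k) instead of d k.
% Illustrative count for one 768 x 768 projection (Whisper Small's hidden size)
% with an ASSUMED rank r = 8:
r (d + k) = 8 \cdot (768 + 768) = 12\,288
\quad \text{vs.} \quad
d k = 768^2 = 589\,824 \;(\approx 2.1\%)
```

Only $A$ and $B$ are trained while the base weights $W_0$ stay frozen, which is why the adapter can be distributed separately from `openai/whisper-small`.
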

### Model Sources

- **Repository:** https://github.com/Mehdipoladrag/Fine-tuning-Whisper-Model

---

## Uses

### Direct Use

- Convert Persian speech to text
- Subtitle generation for Persian audio
- Conversational ASR
- Podcast / video transcription
- General-purpose Persian speech recognition
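
A minimal inference sketch for the uses listed above, assuming the adapter is published as a standard PEFT repository and that the processor is taken from the base model. `ADAPTER_ID` is a hypothetical placeholder (this card does not state the adapter's repository id), and the function names `load_model` and `transcribe` are illustrative. `librosa` is assumed to be installed for audio loading.

```python
BASE_MODEL_ID = "openai/whisper-small"
# Hypothetical placeholder: replace with the actual adapter repository id.
ADAPTER_ID = "your-username/whisper-small-persian-lora"


def load_model(adapter_id=ADAPTER_ID):
    """Load Whisper Small, attach the LoRA adapter, and merge it for inference."""
    # Heavy imports live inside the function so this file can be imported cheaply.
    from peft import PeftModel
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    # Assumption: the processor (feature extractor + tokenizer) is unchanged
    # from the base model, so it is loaded from there.
    processor = WhisperProcessor.from_pretrained(BASE_MODEL_ID)
    base = WhisperForConditionalGeneration.from_pretrained(BASE_MODEL_ID)

    # Attach the LoRA weights, then fold them into the base weights so the
    # result behaves like a plain Whisper model at inference time.
    model = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()
    model.eval()
    return processor, model


def transcribe(wav_path, processor, model):
    """Transcribe one audio file and return the decoded Persian text."""
    import librosa
    import torch

    # Resample to 16 kHz and downmix to mono, the format Whisper expects.
    audio, _ = librosa.load(wav_path, sr=16_000, mono=True)
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

    with torch.no_grad():
        token_ids = model.generate(
            input_features=inputs.input_features,
            language="fa",
            task="transcribe",
        )
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0]
```

Calling `processor, model = load_model()` once and then `transcribe("sample.wav", processor, model)` for each file (where `sample.wav` is a hypothetical path) avoids reloading the weights per request. Merging the adapter is a design choice for simpler inference; keep the unmerged `PeftModel` instead if further fine-tuning is planned.
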

### Downstream Use

- Integrate into ASR pipelines
- Use in real-time Persian voice applications
- Further fine-tuning on custom Persian domains (medical, legal, etc.)

### Out-of-Scope Use

- Non-Persian audio
- Low-quality or noisy audio, and overlapping multi-speaker speech
- Surveillance or other unethical monitoring

---

## Bias, Risks, and Limitations

- Whisper may still struggle with dialect-heavy, noisy, or low-quality audio.
- The training dataset is relatively small (~6,099 audio–subtitle pairs), so:
  - Certain accents may be underrepresented.
  - The model may hallucinate or mis-transcribe in rare cases.
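
One way to quantify the limitations above on your own audio is to compute the two metrics this card declares, WER and CER, against a small set of hand-checked transcripts. The following is a dependency-free sketch, not the evaluation script used for this model. The function names are illustrative, and counting spaces as characters in CER (after collapsing whitespace) is an assumption, since conventions differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over whitespace-normalized text (spaces counted)."""
    ref_chars = " ".join(reference.split())
    hyp_chars = " ".join(hypothesis.split())
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Example pair: a formal reference and a colloquial model output.
reference = "سلام حال شما چطور است"
hypothesis = "سلام حال شما چطوره"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is a typical signature of hallucinated output.
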

### Recommendations

Users should:

- Provide clean 16 kHz mono WAV audio
- Use domain-specific fine-tuning if necessary
- Validate outputs before critical use
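
The first recommendation can be checked automatically before audio reaches the model. The sketch below uses only Python's standard `wave` module and inspects the file header; the function name `wav_format_problems` is illustrative. It cannot judge how clean the recording is, only whether the sample rate and channel count match.

```python
import wave

EXPECTED_RATE_HZ = 16_000
EXPECTED_CHANNELS = 1


def wav_format_problems(path):
    """Return a list of format problems; an empty list means the header is fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
        if channels != EXPECTED_CHANNELS:
            problems.append(f"{channels} channels found, expected mono")
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

Files that fail the check can be resampled and downmixed with any audio tool before transcription, rather than being rejected outright.
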