---
language: tr
license: mit
tags:
- audio
- speech-recognition
- whisper
- turkish
- asr
datasets:
- Codyfederer/tr-full-dataset
model-index:
- name: whisper-small-tr
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: Codyfederer/tr-full-dataset
      name: Codyfederer/tr-full-dataset
    metrics:
    - type: wer
      value: 7.75
      name: Word Error Rate
    - type: cer
      value: 1.95
      name: Character Error Rate
---

# whisper-small-tr - Fine-tuned Whisper Small for Turkish ASR

This model is a fine-tuned version of `openai/whisper-small` optimized for Turkish Automatic Speech Recognition (ASR).

## Model Description

Whisper is a pre-trained model for automatic speech recognition and speech translation. This version has been fine-tuned on Turkish audio data to improve performance on Turkish speech recognition tasks.

- **Base Model:** openai/whisper-small
- **Language:** Turkish (tr)
- **Task:** Automatic Speech Recognition
- **Dataset:** Codyfederer/tr-full-dataset

## Training Data

The model was fine-tuned on `Codyfederer/tr-full-dataset`, which consists of 3,000 Turkish audio-transcription pairs, split into 90% training and 10% testing.
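The card does not state how the split was produced; a seeded shuffle like the sketch below (an illustrative assumption, not the authors' code) reproduces the 2,700/300 proportions:

```python
import random

def split_dataset(samples, test_fraction=0.1, seed=42):
    """Shuffle deterministically and split into train/test lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# With 3,000 samples this yields 2,700 training and 300 test examples.
samples = list(range(3000))  # stand-ins for (audio, transcription) pairs
train, test = split_dataset(samples)
print(len(train), len(test))  # 2700 300
```

In practice the same split is usually done with `datasets`' `train_test_split(test_size=0.1, seed=...)`.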
## Training Parameters

Training used the Hugging Face `Seq2SeqTrainer` with the following `Seq2SeqTrainingArguments`:

- `output_dir`: `./whisper-small-tr`
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 1
- `learning_rate`: 3e-5
- `warmup_steps`: 50
- `num_train_epochs`: 3
- `weight_decay`: 0.005
- `gradient_checkpointing`: True
- `fp16`: True
- `eval_strategy`: "steps"
- `per_device_eval_batch_size`: 8
- `predict_with_generate`: True
- `generation_max_length`: 225
- `save_steps`: 200
- `eval_steps`: 200
- `logging_steps`: 25
- `report_to`: ["tensorboard"]
- `load_best_model_at_end`: True
- `metric_for_best_model`: "wer"
- `greater_is_better`: False
- `push_to_hub`: True
- `hub_model_id`: "whisper-small-tr"
- `optim`: "adamw_torch"
- `dataloader_num_workers`: 4
- `dataloader_pin_memory`: True
- `save_total_limit`: 2
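As a quick sanity check on the schedule above (assuming a single GPU and the 2,700-example training split implied by the 90/10 split; the exact step count is not stated in the card):

```python
import math

# Hyperparameters as listed above.
args = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "learning_rate": 3e-5,
    "num_train_epochs": 3,
    "warmup_steps": 50,
    "eval_steps": 200,
}

# Effective batch size per optimizer step (single device assumed).
effective_batch = args["per_device_train_batch_size"] * args["gradient_accumulation_steps"]

# Rough optimizer-step count for a 2,700-example training split.
steps_per_epoch = math.ceil(2700 / effective_batch)
total_steps = steps_per_epoch * args["num_train_epochs"]
print(effective_batch, total_steps)  # 16 507
```

With roughly 507 total steps, `eval_steps=200` yields two mid-training evaluations plus the final one, and the 50 warmup steps cover about 10% of training.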
## Performance

Evaluation results on the held-out test set:

- **Word Error Rate (WER):** 7.75%
- **Character Error Rate (CER):** 1.95%
- **Loss:** 0.1321

The fine-tuned model shows a marked improvement in Turkish ASR performance over the base `openai/whisper-small` model.
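WER and CER are word- and character-level edit distances normalized by reference length. The card presumably computed them with a library such as `jiwer` or `evaluate` (an assumption); a minimal self-contained sketch of the metric itself:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted word out of three.
print(wer("merhaba nasılsın bugün", "merhaba nasılsın dün"))  # ≈ 0.333
```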

## Usage

### Basic Usage

```python
from transformers import pipeline
import torch

# Load the fine-tuned pipeline; chunking handles audio longer than 30 s.
pipe = pipeline(
    task="automatic-speech-recognition",
    model="emredeveloper/whisper-small-tr",
    chunk_length_s=30,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

audio_file = "path/to/your/audio.mp3"
result = pipe(audio_file)
print(result["text"])
```
### Gradio Demo

```python
import gradio as gr
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="emredeveloper/whisper-small-tr",
)

def transcribe(audio):
    # Gradio passes None when no audio was recorded or uploaded.
    if audio is None:
        return ""
    return pipe(audio)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs="text",
    title="Turkish Speech Recognition",
    description="Upload or record Turkish audio to transcribe.",
)

demo.launch(share=True)
```
### Advanced Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

processor = WhisperProcessor.from_pretrained("emredeveloper/whisper-small-tr")
model = WhisperForConditionalGeneration.from_pretrained("emredeveloper/whisper-small-tr")

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("audio.mp3", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription[0])
```
## Limitations

- Trained on only 3,000 samples, which may limit generalization
- Performance may vary on noisy audio or non-standard dialects
- Best results are obtained with clear audio at a 16 kHz sampling rate
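Audio at other sample rates should be resampled to 16 kHz before inference; `librosa.load(..., sr=16000)` in the advanced example does this for you. As an illustration of what resampling involves, a naive linear-interpolation sketch (illustration only; a real pipeline should use librosa or torchaudio, which also low-pass filter to avoid aliasing):

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional index in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# 44.1 kHz -> 16 kHz shortens a 1-second signal to 16,000 samples.
one_second = [0.0] * 44100
print(len(resample_linear(one_second, 44100, 16000)))  # 16000
```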
## Citation

```bibtex
@misc{whisper-small-tr,
  author       = {emredeveloper},
  title        = {whisper-small-tr: Fine-tuned Whisper Small for Turkish ASR},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/emredeveloper/whisper-small-tr}}
}
```
## Acknowledgments

- Base model: [openai/whisper-small](https://huggingface.co/openai/whisper-small)
- Dataset: [Codyfederer/tr-full-dataset](https://huggingface.co/datasets/Codyfederer/tr-full-dataset)
- Built with [Hugging Face Transformers](https://github.com/huggingface/transformers)