Whisper Medium Fine-Tuned on Custom English Dataset

This model is a fine-tuned version of OpenAI's whisper-medium, optimized for transcribing English speech from a custom dataset.

πŸ› οΈ Model Details

  • Base Model: openai/whisper-medium
  • Fine-tuned by: Winardi (Research by Ms. Tong Rong)
  • Language: English (monolingual)
  • Framework: PyTorch, Hugging Face Transformers

πŸ“š Training Data

The model was fine-tuned on a proprietary English audio dataset, indexed by the manifest file metadata(clean1).csv. Corrupted and low-quality audio files were excluded before training. The data was split as follows:

  • Training: 80%
  • Validation: 10%
  • Testing: 10% (used only for evaluation, not during training)
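For reference, a deterministic 80/10/10 split like the one above can be sketched in plain Python. The manifest filename and column layout in the commented usage are assumptions for illustration, not the exact preprocessing pipeline used for this model:

```python
import csv
import random

def split_manifest(rows, seed=42):
    """Shuffle rows deterministically, then carve out an 80/10/10 split."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)
    n = len(rows)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    # The test set takes the remainder, so no row is dropped.
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

# Hypothetical usage with a CSV manifest of (audio_path, transcript) rows:
# with open("metadata(clean1).csv", newline="") as f:
#     rows = list(csv.DictReader(f))
# train, val, test = split_manifest(rows)
```

Fixing the shuffle seed makes the split reproducible, so the held-out test rows stay identical across training runs.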

🎯 Intended Use

This model is intended for automatic speech recognition (ASR) in English, especially for environments similar to the training dataset (e.g., single-speaker, clean audio).

πŸ“‰ Performance

  • Metric: Word Error Rate (WER), self-reported on the custom test set
  • WER: 2.07%
  • WER with limited vocabulary: 3.23%
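WER counts word-level substitutions, insertions, and deletions against a reference transcript, divided by the number of reference words. A minimal sketch of the metric (a plain edit-distance implementation, not necessarily the exact scoring script behind the numbers above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

A WER of 2.07% therefore means roughly two word errors for every 100 reference words.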

🚫 Limitations

  • Not robust to heavy background noise or overlapping speech
  • May not perform well on dialects or accents not represented in training data
  • Only supports English input

πŸ’¬ How to Use

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub
# (requires the transformers and torch packages)
asr = pipeline("automatic-speech-recognition", model="Pengwin30/whisper-medium-fine-tuned")

# Transcribe a local audio file and print the text
result = asr("path/to/audio.wav")
print(result["text"])
```

πŸ“œ License

This model is licensed under the MIT License.

πŸ™ Citation

If you use this model in your work, please cite:

```bibtex
@misc{Pengwin30/whisper-medium-fine-tuned,
  author = {Tong Rong, Winardi},
  title = {Whisper Medium Fine-Tuned on Custom Dataset},
  year = {2025},
  url = {https://huggingface.co/Pengwin30/whisper-medium-fine-tuned}
}
```
Model size: 0.8B parameters Β· Tensor type: F32 Β· Format: Safetensors