---
library_name: transformers
language:
  - pt
license: apache-2.0
base_model: openai/whisper-small
tags:
  - generated_from_trainer
datasets:
  - mozilla-foundation/common_voice_17_0
metrics:
  - wer
model-index:
  - name: Whisper Small Pt-Br - RFard
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 17.0
          type: mozilla-foundation/common_voice_17_0
          config: pt
          split: None
          args: 'config: pt, split: test'
        metrics:
          - name: Wer
            type: wer
            value: 17.37912830747441
---

# Whisper Small Pt-Br - RFard

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the [Common Voice 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) dataset. It achieves the following results on the evaluation set:

- Loss: 0.2593
- Wer: 17.3791

## Model description

Whisper Small Pt-Br - RFard is an automatic speech recognition (ASR) model based on Whisper Small by OpenAI. It has been fine-tuned on Mozilla's Common Voice 17.0 dataset for Brazilian Portuguese (pt-BR), making it more accurate at transcribing speech to text in this language.

The model uses a transformer-based encoder-decoder architecture optimized for ASR tasks, leveraging Whisper's structure to improve transcription accuracy across various audio sources, including different accents and regional variations of Brazilian Portuguese.
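
As a concrete illustration of that encoder-decoder flow, the sketch below loads the processor and model and pins decoding to Portuguese transcription. The repo id is an assumption inferred from this card's title, and the audio is a placeholder; substitute your own.

```python
# A minimal sketch, assuming the repo id RodrigoFardin/whisper-small-pt-br
# (inferred from this card's title) and a placeholder 16 kHz waveform.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

repo_id = "RodrigoFardin/whisper-small-pt-br"  # assumed repo id
processor = WhisperProcessor.from_pretrained(repo_id)
model = WhisperForConditionalGeneration.from_pretrained(repo_id)

audio = torch.zeros(16_000)  # placeholder: one second of silence at 16 kHz
inputs = processor(audio.numpy(), sampling_rate=16_000, return_tensors="pt")

# The encoder consumes log-Mel features; the decoder generates text tokens.
# language/task pin decoding to Portuguese transcription instead of
# letting Whisper auto-detect the language.
ids = model.generate(inputs.input_features, language="portuguese", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```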

With a Word Error Rate (WER) of 17.38%, the model performs well in transcription tasks but may struggle with noisy audio or overlapping speech.

## Intended uses & limitations

This model is designed for automatic speech recognition (ASR) in Brazilian Portuguese, making it suitable for tasks such as speech-to-text transcription, voice assistants, automatic subtitles, and other applications that require converting spoken language into written text.
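
For the plain speech-to-text use case, a minimal inference sketch with the transformers pipeline API could look like the following; the repo id and the audio file name are assumptions, not part of this card.

```python
# A minimal transcription sketch; the repo id and audio file are assumptions.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="RodrigoFardin/whisper-small-pt-br",  # assumed repo id
)

# Any audio file ffmpeg can decode works here; "exemplo.wav" is a placeholder.
print(asr("exemplo.wav")["text"])
```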

### Limitations

The main limitation of this model is that training was cut short at two epochs due to hardware constraints. Extending the training process could further improve its accuracy and robustness, especially in challenging audio conditions such as noisy environments or overlapping speech.

If you are interested in collaborating on further development and improving the model’s performance, feel free to reach out—I am open to cooperation!

## Training and evaluation data

The experiments used the Portuguese portion of the Common Voice 17.0 dataset, split into two subsets: a training set of 31,432 samples and a test set of 9,467 samples.
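
A sketch of loading that data with the datasets library is shown below. The exact split arguments used for this model are not documented in this card, so they are assumptions, and the dataset is gated on the Hub (you must accept Mozilla's terms and authenticate first).

```python
# A sketch, not the exact recipe used for this model: the split names are
# assumptions. The dataset is gated, so log in to the Hub before loading.
from datasets import load_dataset, Audio

train = load_dataset("mozilla-foundation/common_voice_17_0", "pt", split="train")
test = load_dataset("mozilla-foundation/common_voice_17_0", "pt", split="test")

# Whisper models expect 16 kHz input; Common Voice audio ships at 48 kHz.
train = train.cast_column("audio", Audio(sampling_rate=16_000))
test = test.cast_column("audio", Audio(sampling_rate=16_000))
```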

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: AdamW (torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 2
- mixed_precision_training: Native AMP
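
As a hedged sketch, the hyperparameters above map onto Seq2SeqTrainingArguments roughly as follows; output_dir is an assumption, and evaluation/logging settings not listed in this card are omitted.

```python
# A sketch mapping the listed hyperparameters to Seq2SeqTrainingArguments.
# output_dir is an assumption; unlisted evaluation/logging knobs are omitted.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-pt-br",  # assumed
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,       # 16 x 2 = effective batch size of 32
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=2,
    seed=42,
    fp16=True,                           # "Native AMP" mixed precision
)
```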

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Wer     |
|:-------------:|:------:|:----:|:---------------:|:-------:|
| 0.1831        | 1.0173 | 1000 | 0.2593          | 17.3791 |
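
The Wer column is a word error rate expressed as a percentage. As an illustration only (the strings below are made up), it is typically computed with the evaluate library like so:

```python
# Illustrative only: how a percentage WER like the one above is typically
# computed with the evaluate library. The example strings are made up.
import evaluate

wer_metric = evaluate.load("wer")
predictions = ["o gato subiu no telhado"]
references = ["o gato subiu no telhado ontem"]
print(100 * wer_metric.compute(predictions=predictions, references=references))
```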

### Framework versions

- Transformers 4.50.0
- Pytorch 2.6.0+cu124
- Datasets 3.4.1
- Tokenizers 0.21.1

## Contact

For any inquiries, collaborations, or contributions to the model’s development, feel free to reach out:

📧 Email: rodrigo.correa.fardin@gmail.com

I am open to discussions and potential improvements to the model! 🚀