
whisper-accent-medium.en

This model is a fine-tuned version of openai/whisper-medium.en on the westbrook/English_Accent_DataSet dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2273
  • Wer: 0.0939
  • Accent Accuracy: 0.9615
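For reference, WER (word error rate) is the word-level edit distance between a hypothesis and its reference transcript, normalized by the reference length. A minimal pure-Python sketch of the metric (illustrative only, not the evaluation code used for this card):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table over hypothesis prefixes
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

print(wer("a b c d", "a x c d"))  # one substitution out of four words -> 0.25
```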

Model description

This model makes Whisper better at transcribing diverse English accents by conditioning the decoder on learnt accent embeddings via Adaptive Layer Normalization (AdaLN). It is built on top of OpenAI Whisper using Hugging Face Transformers.

  • Extends Whisper with per-accent conditioning via AdaLN in every decoder layer: the modulation weights are trained from zero-initialization, while the biases are initialized to the pretrained LayerNorm gamma and beta values and kept frozen.
  • Accent embeddings are learnt independently for each accent and used to condition the decoder hidden states.
  • Accents predicted from encoder hidden states via a classifier head:
    • Learnable weighted sum across all layers + input embeddings
    • Projection layer
    • Multi-head attention pooling over time
  • Encoder & decoder remain completely frozen, preserving the original generalization capability
  • Only <10% of parameters are trainable (AdaLN modulation weights, accent embeddings, accent classifier)
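The AdaLN conditioning described above can be sketched as follows. This is an illustrative PyTorch module, not the repository's actual implementation; the class and argument names are hypothetical. The key property is that the projection weight is zero-initialized (and trainable) while its bias is set to the pretrained LayerNorm gamma/beta and frozen, so at initialization the module reproduces the original LayerNorm exactly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AccentAdaLN(nn.Module):
    """Adaptive LayerNorm conditioned on an accent embedding (illustrative sketch)."""

    def __init__(self, hidden_size: int, accent_dim: int,
                 pretrained_gamma: torch.Tensor, pretrained_beta: torch.Tensor):
        super().__init__()
        self.hidden_size = hidden_size
        # Maps an accent embedding to per-channel (gamma, beta) modulation
        self.proj = nn.Linear(accent_dim, 2 * hidden_size)
        nn.init.zeros_(self.proj.weight)            # trainable, starts at zero
        with torch.no_grad():
            self.proj.bias.copy_(torch.cat([pretrained_gamma, pretrained_beta]))
        self.proj.bias.requires_grad_(False)        # bias stays frozen

    def forward(self, x: torch.Tensor, accent_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); accent_emb: (batch, accent_dim)
        gamma, beta = self.proj(accent_emb).chunk(2, dim=-1)
        normed = F.layer_norm(x, (self.hidden_size,))   # normalization without affine params
        return gamma.unsqueeze(1) * normed + beta.unsqueeze(1)
```

Because the weight starts at zero, training begins from the original Whisper behaviour and the accent conditioning is learnt as a deviation from it.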

Parameter Count

  • Total: 839,897,904
  • Original: 763,856,896
  • Added: 76,041,008 (9.05%)
    • Accent Classifier: 531,760
    • AdaLN + Accent Embeddings: 75,509,248
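The figures above are internally consistent and can be checked with simple arithmetic:

```python
# Component counts as reported in this card
classifier_params = 531_760
adaln_and_embedding_params = 75_509_248
original_params = 763_856_896

added = classifier_params + adaln_and_embedding_params
total = original_params + added
added_pct = 100 * added / total

print(added, total, round(added_pct, 2))  # 76041008 839897904 9.05
```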

Training procedure

Training follows a two-stage optimization scheme:

  1. Stage 1: Accent Classifier Training (mavleo96/whisper-accent-medium.en-accent-head-only)

    • Initialization: The model is initialized from the pretrained English Whisper checkpoint openai/whisper-medium.en, and all encoder/decoder weights are kept fixed.
    • Trainable components:
      • Accent classification stack: layer-fusion weights over encoder representations, projection layer, multi-head attention pooling, and the final accent classifier.
    • Learning objective:
      • The model is optimized solely for accent classification with respect to the ground-truth accent labels (lambda_ce = 0.0, lambda_accent = 1.0).
  2. Stage 2: Decoder AdaLN + Accent Embeddings Training

    • Initialization: The checkpoint obtained from Stage 1 is used as base_model_name_or_path.
    • Trainable components:
      • Decoder-side AdaLN modulation parameters
      • Accent embeddings, updated with a dedicated embedding_learning_rate
    • Learning objective:
      • The model is optimized only for automatic speech recognition using cross-entropy on reference transcripts (lambda_ce = 1.0, lambda_accent = 0.0).
      • Ground-truth accent labels are used to condition the decoder during training; predicted accent labels are used at evaluation time.
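Both stages can be viewed as the same combined objective with stage-specific weights. A minimal sketch (the function name is hypothetical and the loss values are placeholders):

```python
def combined_loss(ce_loss: float, accent_loss: float,
                  lambda_ce: float, lambda_accent: float) -> float:
    """Weighted sum of the ASR cross-entropy and accent-classification losses."""
    return lambda_ce * ce_loss + lambda_accent * accent_loss

# Stage 1: accent classifier only (ASR objective switched off)
stage1 = combined_loss(ce_loss=2.0, accent_loss=0.7, lambda_ce=0.0, lambda_accent=1.0)
# Stage 2: ASR only (accent objective switched off; decoder conditioned on ground-truth accents)
stage2 = combined_loss(ce_loss=2.0, accent_loss=0.7, lambda_ce=1.0, lambda_accent=0.0)
```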

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • total_eval_batch_size: 8
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 0.05
  • num_epochs: 2.0
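The effective batch sizes listed above follow from the per-device batch size, the number of devices, and gradient accumulation:

```python
train_batch_size = 4
eval_batch_size = 4
num_devices = 2
gradient_accumulation_steps = 4

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices  # no accumulation at eval time

print(total_train_batch_size, total_eval_batch_size)  # 32 8
```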

Training results

| Training Loss | Epoch  | Step | Validation Loss | Wer    |
|:-------------:|:------:|:----:|:---------------:|:------:|
| No log        | 0      | 0    | 1.3238          | 0.1259 |
| 0.5056        | 0.1270 | 200  | 0.3784          | 0.1101 |
| 0.3482        | 0.2541 | 400  | 0.2783          | 0.1450 |
| 0.3338        | 0.3811 | 600  | 0.2603          | 0.1324 |
| 0.2383        | 0.5082 | 800  | 0.2497          | 0.0970 |
| 0.2660        | 0.6352 | 1000 | 0.2434          | 0.0959 |
| 0.2869        | 0.7623 | 1200 | 0.2398          | 0.0969 |
| 0.2723        | 0.8893 | 1400 | 0.2344          | 0.0944 |
| 0.2628        | 1.0159 | 1600 | 0.2324          | 0.0942 |
| 0.3397        | 1.1429 | 1800 | 0.2304          | 0.0944 |
| 0.2329        | 1.2700 | 2000 | 0.2273          | 0.0939 |
| 0.3111        | 1.3970 | 2200 | 0.2264          | 0.0943 |
| 0.1920        | 1.5241 | 2400 | 0.2262          | 0.0940 |
| 0.2996        | 1.6511 | 2600 | 0.2241          | 0.0942 |
| 0.2438        | 1.7781 | 2800 | 0.2239          | 0.0939 |
| 0.2148        | 1.9052 | 3000 | 0.2232          | 0.0939 |

Framework versions

  • Transformers 5.2.0
  • Pytorch 2.10.0+cu128
  • Datasets 4.5.0
  • Tokenizers 0.22.2