# whisper-accent-small.en
This model is a fine-tuned version of openai/whisper-small.en on the westbrook/English_Accent_DataSet dataset. It achieves the following results on the evaluation set:
- Loss: 0.2671
- Wer: 0.1030
- Accent Accuracy: 0.8668
## Model description
The goal is to make Whisper better at transcribing diverse English accents by conditioning the decoder on learnt accent embeddings via Adaptive Layer Normalization (AdaLN). Built on top of OpenAI Whisper using Hugging Face Transformers.
- Extends Whisper with per-accent conditioning via AdaLN in every decoder layer: the modulation weights are zero-initialized, while the bias is initialized to the pretrained LayerNorm gamma and beta values and kept frozen.
- Accent embeddings are learnt independently for each accent and used to condition the decoder hidden states.
- Accents are predicted from encoder hidden states via a classifier head:
  - Learnable weighted sum across all layers + input embeddings
  - Projection layer
  - Multi-head attention pooling over time
- Encoder & decoder remain completely frozen, preserving the original generalization capability.
- Only <10% of parameters are trainable (AdaLN modulation weights, accent embeddings, accent classifier).
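The AdaLN conditioning described above can be sketched in PyTorch as follows. This is a minimal illustration, not the model's actual implementation; the class name `AccentAdaLN` and the tensor shapes are assumptions. The key property shown is that, because the modulation weights start at zero and the bias is set to the pretrained gamma/beta (and frozen), the layer behaves exactly like the original LayerNorm at initialization:

```python
import torch
import torch.nn as nn

class AccentAdaLN(nn.Module):
    """Adaptive LayerNorm: per-feature scale and shift are produced from an
    accent embedding. Weights are zero-initialized; the bias holds the
    pretrained LayerNorm gamma/beta and is frozen, so the module reproduces
    the pretrained LayerNorm exactly at initialization."""

    def __init__(self, hidden_size: int, accent_dim: int, pretrained_ln: nn.LayerNorm):
        super().__init__()
        # Normalization without its own affine parameters.
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        # Maps the accent embedding to concatenated (gamma, beta).
        self.to_scale_shift = nn.Linear(accent_dim, 2 * hidden_size)
        nn.init.zeros_(self.to_scale_shift.weight)  # zero-init modulation weights
        with torch.no_grad():
            # Bias = pretrained gamma and beta, then frozen.
            self.to_scale_shift.bias[:hidden_size].copy_(pretrained_ln.weight)
            self.to_scale_shift.bias[hidden_size:].copy_(pretrained_ln.bias)
        self.to_scale_shift.bias.requires_grad_(False)

    def forward(self, hidden_states: torch.Tensor, accent_embedding: torch.Tensor):
        # hidden_states: (batch, seq, hidden); accent_embedding: (batch, accent_dim)
        gamma, beta = self.to_scale_shift(accent_embedding).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * self.norm(hidden_states) + beta.unsqueeze(1)
```

At initialization the output is identical to the pretrained LayerNorm; during training, only the accent-dependent deviation from it is learnt.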
## Parameter Count
- Total: 263,442,852
- Original: 241,734,144
- Added: 21,708,708 (8.24%)
- Accent Classifier: 466,212
- AdaLN + Accent Embeddings: 21,242,496
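A quick arithmetic check of the counts above: the two added components sum to the difference between the total and the original parameter count, and that difference is 8.24% of the full model.

```python
total = 263_442_852
original = 241_734_144
accent_classifier = 466_212
adaln_plus_embeddings = 21_242_496

# The added components account for the full gap over the base model.
added = accent_classifier + adaln_plus_embeddings
print(added)                           # 21708708
print(original + added == total)       # True
print(round(100 * added / total, 2))   # 8.24
```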
## Training Procedure
Training follows a two-stage optimization scheme:
**Stage 1: Accent Classifier Training** (mavleo96/whisper-accent-small.en-accent-head-only)
- Initialization: The model is initialized from a pretrained English Whisper checkpoint (e.g. openai/whisper-small.en), and all encoder/decoder weights are kept fixed.
- Trainable components:
  - Accent classification stack: layer-fusion weights over encoder representations, projection layer, multi-head attention pooling, and the final accent classifier.
- Learning objective:
  - The model is optimized solely for accent classification with respect to the ground-truth accent labels (`lambda_ce = 0.0`, `lambda_accent = 1.0`).
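The role of the two lambda weights can be summarized as a single weighted objective. The sketch below is illustrative (the helper name `combined_loss` and the sample loss values are made up); only the weighting scheme comes from the description above:

```python
def combined_loss(ce_loss: float, accent_loss: float,
                  lambda_ce: float, lambda_accent: float) -> float:
    """Weighted sum of ASR cross-entropy and accent-classification loss."""
    return lambda_ce * ce_loss + lambda_accent * accent_loss

# Stage 1: only the accent classifier is trained.
stage1 = combined_loss(ce_loss=2.0, accent_loss=0.7, lambda_ce=0.0, lambda_accent=1.0)
# Stage 2: only the ASR objective is active.
stage2 = combined_loss(ce_loss=2.0, accent_loss=0.7, lambda_ce=1.0, lambda_accent=0.0)
print(stage1, stage2)  # 0.7 2.0
```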
**Stage 2: Decoder AdaLN + Accent Embeddings Training**
- Initialization: The checkpoint obtained from Stage 1 is used as `base_model_name_or_path`.
- Trainable components:
  - Decoder-side AdaLN modulation parameters
  - Accent embeddings, updated with a dedicated `embedding_learning_rate`
- Learning objective:
  - The model is optimized only for automatic speech recognition using cross-entropy on reference transcripts (`lambda_ce = 1.0`, `lambda_accent = 0.0`).
  - Ground-truth accent labels are used to condition the decoder during training; predicted accent labels are used at evaluation time.
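The dedicated `embedding_learning_rate` suggests separate optimizer parameter groups. A sketch of how this could be wired with AdamW follows; the module stand-ins, the accent count, and the embedding learning-rate value are assumptions, not values from this model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two trainable component groups in Stage 2.
adaln_params = [nn.Parameter(torch.zeros(768, 2 * 768))]  # AdaLN modulation weights
accent_embeddings = nn.Embedding(16, 768)                 # e.g. 16 accents

learning_rate = 5e-5            # from the hyperparameters section
embedding_learning_rate = 1e-4  # assumed value, for illustration only

# Two parameter groups: AdaLN weights at the base LR, embeddings at their own LR.
optimizer = torch.optim.AdamW(
    [
        {"params": adaln_params, "lr": learning_rate},
        {"params": accent_embeddings.parameters(), "lr": embedding_learning_rate},
    ],
    betas=(0.9, 0.999),
    eps=1e-8,
)
print([group["lr"] for group in optimizer.param_groups])  # [5e-05, 0.0001]
```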
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 0.05
- num_epochs: 2.0
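The effective (total) train batch size follows directly from the per-device batch size and gradient accumulation:

```python
train_batch_size = 4
gradient_accumulation_steps = 8

# Optimizer steps see this many examples per update.
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 32
```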
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer |
|---|---|---|---|---|
| No log | 0 | 0 | 1.3076 | 0.1243 |
| 0.6113 | 0.1271 | 200 | 0.4731 | 0.1175 |
| 0.3969 | 0.2541 | 400 | 0.3300 | 0.1107 |
| 0.3838 | 0.3812 | 600 | 0.3000 | 0.1088 |
| 0.2818 | 0.5082 | 800 | 0.2886 | 0.1428 |
| 0.2988 | 0.6353 | 1000 | 0.2811 | 0.1044 |
| 0.3329 | 0.7623 | 1200 | 0.2764 | 0.1038 |
| 0.3173 | 0.8894 | 1400 | 0.2731 | 0.1395 |
| 0.3062 | 1.0159 | 1600 | 0.2710 | 0.1034 |
| 0.3839 | 1.1429 | 1800 | 0.2694 | 0.1030 |
| 0.2717 | 1.2700 | 2000 | 0.2671 | 0.1030 |
| 0.3633 | 1.3970 | 2200 | 0.2654 | 0.1038 |
| 0.2349 | 1.5241 | 2400 | 0.2644 | 0.1034 |
| 0.3616 | 1.6512 | 2600 | 0.2634 | 0.1025 |
| 0.2775 | 1.7782 | 2800 | 0.2632 | 0.1031 |
| 0.2561 | 1.9053 | 3000 | 0.2625 | 0.1028 |
### Framework versions
- Transformers 5.2.0
- Pytorch 2.10.0+cu128
- Datasets 4.5.0
- Tokenizers 0.22.2