# whisper-accent-small.en
This model is a fine-tuned version of openai/whisper-small.en on the westbrook/English_Accent_DataSet dataset. It achieves the following results on the evaluation set:
- Loss: 0.2671
- Wer: 0.1030
- Accent Accuracy: 0.8668
## Model description
The goal is to make Whisper better at transcribing diverse English accents by conditioning the decoder on learnt accent embeddings via Adaptive Layer Normalization (AdaLN). Built on top of OpenAI Whisper using Hugging Face Transformers.
- Extends Whisper with per-accent conditioning via AdaLN in every decoder layer: the modulation weights are zero-initialized, while the bias is initialized to the pretrained LayerNorm gamma and beta values and kept frozen.
- Accent embeddings are learnt independently for each accent and used to condition the decoder hidden states.
- Accents are predicted from encoder hidden states via a classifier head:
  - Learnable weighted sum across all layers + input embeddings
  - Projection layer
  - Multi-head attention pooling over time
- Encoder & decoder remain completely frozen, preserving the original generalization capability.
- Only <10% of parameters are trainable (AdaLN modulation weights, accent embeddings, accent classifier).
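The AdaLN conditioning described above can be sketched in PyTorch as follows. This is a minimal illustration, not the model's actual implementation; the class name `AccentAdaLN` and the tensor shapes are assumptions. The key property shown is that, because the modulation weights start at zero and the bias is set to the pretrained gamma/beta (and frozen), the layer behaves exactly like the original LayerNorm at initialization:

```python
import torch
import torch.nn as nn

class AccentAdaLN(nn.Module):
    """Adaptive LayerNorm: per-feature scale and shift are produced from an
    accent embedding. Weights are zero-initialized; the bias holds the
    pretrained LayerNorm gamma/beta and is frozen, so the module reproduces
    the pretrained LayerNorm exactly at initialization."""

    def __init__(self, hidden_size: int, accent_dim: int, pretrained_ln: nn.LayerNorm):
        super().__init__()
        # Normalization without its own affine parameters.
        self.norm = nn.LayerNorm(hidden_size, elementwise_affine=False)
        # Maps the accent embedding to concatenated (gamma, beta).
        self.to_scale_shift = nn.Linear(accent_dim, 2 * hidden_size)
        nn.init.zeros_(self.to_scale_shift.weight)  # zero-init modulation weights
        with torch.no_grad():
            # Bias = pretrained gamma and beta, then frozen.
            self.to_scale_shift.bias[:hidden_size].copy_(pretrained_ln.weight)
            self.to_scale_shift.bias[hidden_size:].copy_(pretrained_ln.bias)
        self.to_scale_shift.bias.requires_grad_(False)

    def forward(self, hidden_states: torch.Tensor, accent_embedding: torch.Tensor):
        # hidden_states: (batch, seq, hidden); accent_embedding: (batch, accent_dim)
        gamma, beta = self.to_scale_shift(accent_embedding).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * self.norm(hidden_states) + beta.unsqueeze(1)
```

At initialization the output is identical to the pretrained LayerNorm; during training, only the accent-dependent deviation from it is learnt.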
## Parameter Count
- Total: 263,442,852
- Original: 241,734,144
- Added: 21,708,708 (8.24%)
- Accent Classifier: 466,212
- AdaLN + Accent Embeddings: 21,242,496
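A quick arithmetic check of the counts above: the two added components sum to the difference between the total and the original parameter count, and that difference is 8.24% of the full model.

```python
total = 263_442_852
original = 241_734_144
accent_classifier = 466_212
adaln_plus_embeddings = 21_242_496

# The added components account for the full gap over the base model.
added = accent_classifier + adaln_plus_embeddings
print(added)                           # 21708708
print(original + added == total)       # True
print(round(100 * added / total, 2))   # 8.24
```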
## Training Procedure
Training follows a two-stage optimization scheme:
**Stage 1: Accent Classifier Training** (mavleo96/whisper-accent-small.en-accent-head-only)
- Initialization: The model is initialized from a pretrained English Whisper checkpoint (e.g. openai/whisper-small.en), and all encoder/decoder weights are kept fixed.
- Trainable components:
  - Accent classification stack: layer-fusion weights over encoder representations, projection layer, multi-head attention pooling, and the final accent classifier.
- Learning objective:
  - The model is optimized solely for accent classification with respect to the ground-truth accent labels (`lambda_ce = 0.0`, `lambda_accent = 1.0`).
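The role of the two lambda weights can be summarized as a single weighted objective. The sketch below is illustrative (the helper name `combined_loss` and the sample loss values are made up); only the weighting scheme comes from the description above:

```python
def combined_loss(ce_loss: float, accent_loss: float,
                  lambda_ce: float, lambda_accent: float) -> float:
    """Weighted sum of ASR cross-entropy and accent-classification loss."""
    return lambda_ce * ce_loss + lambda_accent * accent_loss

# Stage 1: only the accent classifier is trained.
stage1 = combined_loss(ce_loss=2.0, accent_loss=0.7, lambda_ce=0.0, lambda_accent=1.0)
# Stage 2: only the ASR objective is active.
stage2 = combined_loss(ce_loss=2.0, accent_loss=0.7, lambda_ce=1.0, lambda_accent=0.0)
print(stage1, stage2)  # 0.7 2.0
```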
**Stage 2: Decoder AdaLN + Accent Embeddings Training**
- Initialization: The checkpoint obtained from Stage 1 is used as `base_model_name_or_path`.
- Trainable components:
  - Decoder-side AdaLN modulation parameters
  - Accent embeddings, updated with a dedicated `embedding_learning_rate`
- Learning objective:
  - The model is optimized only for automatic speech recognition using cross-entropy on reference transcripts (`lambda_ce = 1.0`, `lambda_accent = 0.0`).
  - Ground-truth accent labels are used to condition the decoder during training; predicted accent labels are used at evaluation time.
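The dedicated `embedding_learning_rate` suggests separate optimizer parameter groups. A sketch of how this could be wired with AdamW follows; the module stand-ins, the accent count, and the embedding learning-rate value are assumptions, not values from this model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two trainable component groups in Stage 2.
adaln_params = [nn.Parameter(torch.zeros(768, 2 * 768))]  # AdaLN modulation weights
accent_embeddings = nn.Embedding(16, 768)                 # e.g. 16 accents

learning_rate = 5e-5            # from the hyperparameters section
embedding_learning_rate = 1e-4  # assumed value, for illustration only

# Two parameter groups: AdaLN weights at the base LR, embeddings at their own LR.
optimizer = torch.optim.AdamW(
    [
        {"params": adaln_params, "lr": learning_rate},
        {"params": accent_embeddings.parameters(), "lr": embedding_learning_rate},
    ],
    betas=(0.9, 0.999),
    eps=1e-8,
)
print([group["lr"] for group in optimizer.param_groups])  # [5e-05, 0.0001]
```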
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 0.05
- num_epochs: 2.0
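The effective (total) train batch size follows directly from the per-device batch size and gradient accumulation:

```python
train_batch_size = 4
gradient_accumulation_steps = 8

# Optimizer steps see this many examples per update.
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 32
```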
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer |
|---|---|---|---|---|
| No log | 0 | 0 | 1.3076 | 0.1243 |
| 0.6113 | 0.1271 | 200 | 0.4731 | 0.1175 |
| 0.3969 | 0.2541 | 400 | 0.3300 | 0.1107 |
| 0.3838 | 0.3812 | 600 | 0.3000 | 0.1088 |
| 0.2818 | 0.5082 | 800 | 0.2886 | 0.1428 |
| 0.2988 | 0.6353 | 1000 | 0.2811 | 0.1044 |
| 0.3329 | 0.7623 | 1200 | 0.2764 | 0.1038 |
| 0.3173 | 0.8894 | 1400 | 0.2731 | 0.1395 |
| 0.3062 | 1.0159 | 1600 | 0.2710 | 0.1034 |
| 0.3839 | 1.1429 | 1800 | 0.2694 | 0.1030 |
| 0.2717 | 1.2700 | 2000 | 0.2671 | 0.1030 |
| 0.3633 | 1.3970 | 2200 | 0.2654 | 0.1038 |
| 0.2349 | 1.5241 | 2400 | 0.2644 | 0.1034 |
| 0.3616 | 1.6512 | 2600 | 0.2634 | 0.1025 |
| 0.2775 | 1.7782 | 2800 | 0.2632 | 0.1031 |
| 0.2561 | 1.9053 | 3000 | 0.2625 | 0.1028 |
### Framework versions
- Transformers 5.2.0
- Pytorch 2.10.0+cu128
- Datasets 4.5.0
- Tokenizers 0.22.2