
whisper-accent-medium.en

This model is a fine-tuned version of openai/whisper-medium.en on the westbrook/English_Accent_DataSet dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2273
  • Wer: 0.0939
  • Accent Accuracy: 0.9615
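For reference, WER (word error rate) is the word-level edit distance between a hypothesis and its reference transcript, normalized by the reference length. A minimal pure-Python sketch of the metric (illustrative only, not the evaluation code used for this card):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table over hypothesis prefixes
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

print(wer("a b c d", "a x c d"))  # one substitution out of four words -> 0.25
```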

Model description

This model makes Whisper better at transcribing diverse English accents by conditioning the decoder on learnt accent embeddings via Adaptive Layer Normalization (AdaLN). It is built on top of OpenAI Whisper using Hugging Face Transformers.

  • Extends Whisper with per-accent conditioning via AdaLN in every decoder layer: the modulation weights are trained from zero-initialization, while the biases are initialized to the pretrained LayerNorm gamma and beta values and kept frozen.
  • Accent embeddings are learnt independently for each accent and used to condition the decoder hidden states.
  • Accents predicted from encoder hidden states via a classifier head:
    • Learnable weighted sum across all layers + input embeddings
    • Projection layer
    • Multi-head attention pooling over time
  • Encoder & decoder remain completely frozen, preserving the original generalization capability
  • Only <10% of parameters are trainable (AdaLN modulation weights, accent embeddings, accent classifier)
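The AdaLN conditioning described above can be sketched as follows. This is an illustrative PyTorch module, not the repository's actual implementation; the class and argument names are hypothetical. The key property is that the projection weight is zero-initialized (and trainable) while its bias is set to the pretrained LayerNorm gamma/beta and frozen, so at initialization the module reproduces the original LayerNorm exactly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AccentAdaLN(nn.Module):
    """Adaptive LayerNorm conditioned on an accent embedding (illustrative sketch)."""

    def __init__(self, hidden_size: int, accent_dim: int,
                 pretrained_gamma: torch.Tensor, pretrained_beta: torch.Tensor):
        super().__init__()
        self.hidden_size = hidden_size
        # Maps an accent embedding to per-channel (gamma, beta) modulation
        self.proj = nn.Linear(accent_dim, 2 * hidden_size)
        nn.init.zeros_(self.proj.weight)            # trainable, starts at zero
        with torch.no_grad():
            self.proj.bias.copy_(torch.cat([pretrained_gamma, pretrained_beta]))
        self.proj.bias.requires_grad_(False)        # bias stays frozen

    def forward(self, x: torch.Tensor, accent_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); accent_emb: (batch, accent_dim)
        gamma, beta = self.proj(accent_emb).chunk(2, dim=-1)
        normed = F.layer_norm(x, (self.hidden_size,))   # normalization without affine params
        return gamma.unsqueeze(1) * normed + beta.unsqueeze(1)
```

Because the weight starts at zero, training begins from the original Whisper behaviour and the accent conditioning is learnt as a deviation from it.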

Parameter Count

  • Total: 839,897,904
  • Original: 763,856,896
  • Added: 76,041,008 (9.05%)
    • Accent Classifier: 531,760
    • AdaLN + Accent Embeddings: 75,509,248
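The figures above are internally consistent and can be checked with simple arithmetic:

```python
# Component counts as reported in this card
classifier_params = 531_760
adaln_and_embedding_params = 75_509_248
original_params = 763_856_896

added = classifier_params + adaln_and_embedding_params
total = original_params + added
added_pct = 100 * added / total

print(added, total, round(added_pct, 2))  # 76041008 839897904 9.05
```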

Training procedure

Training follows a two-stage optimization scheme:

  1. Stage 1: Accent Classifier Training (mavleo96/whisper-accent-medium.en-accent-head-only)

    • Initialization: The model is initialized from the pretrained English Whisper checkpoint openai/whisper-medium.en, and all encoder/decoder weights are kept fixed.
    • Trainable components:
      • Accent classification stack: layer-fusion weights over encoder representations, projection layer, multi-head attention pooling, and the final accent classifier.
    • Learning objective:
      • The model is optimized solely for accent classification with respect to the ground-truth accent labels (lambda_ce = 0.0, lambda_accent = 1.0).
  2. Stage 2: Decoder AdaLN + Accent Embeddings Training

    • Initialization: The checkpoint obtained from Stage 1 is used as base_model_name_or_path.
    • Trainable components:
      • Decoder-side AdaLN modulation parameters
      • Accent embeddings, updated with a dedicated embedding_learning_rate
    • Learning objective:
      • The model is optimized only for automatic speech recognition using cross-entropy on reference transcripts (lambda_ce = 1.0, lambda_accent = 0.0).
      • Ground-truth accent labels are used to condition the decoder during training; predicted accent labels are used at evaluation time.
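Both stages can be viewed as the same combined objective with stage-specific weights. A minimal sketch (the function name is hypothetical and the loss values are placeholders):

```python
def combined_loss(ce_loss: float, accent_loss: float,
                  lambda_ce: float, lambda_accent: float) -> float:
    """Weighted sum of the ASR cross-entropy and accent-classification losses."""
    return lambda_ce * ce_loss + lambda_accent * accent_loss

# Stage 1: accent classifier only (ASR objective switched off)
stage1 = combined_loss(ce_loss=2.0, accent_loss=0.7, lambda_ce=0.0, lambda_accent=1.0)
# Stage 2: ASR only (accent objective switched off; decoder conditioned on ground-truth accents)
stage2 = combined_loss(ce_loss=2.0, accent_loss=0.7, lambda_ce=1.0, lambda_accent=0.0)
```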

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • total_eval_batch_size: 8
  • optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 0.05
  • num_epochs: 2.0
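The effective batch sizes listed above follow from the per-device batch size, the number of devices, and gradient accumulation:

```python
train_batch_size = 4
eval_batch_size = 4
num_devices = 2
gradient_accumulation_steps = 4

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices  # no accumulation at eval time

print(total_train_batch_size, total_eval_batch_size)  # 32 8
```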

Training results

| Training Loss | Epoch  | Step | Validation Loss | Wer    |
|:-------------:|:------:|:----:|:---------------:|:------:|
| No log        | 0      | 0    | 1.3238          | 0.1259 |
| 0.5056        | 0.1270 | 200  | 0.3784          | 0.1101 |
| 0.3482        | 0.2541 | 400  | 0.2783          | 0.1450 |
| 0.3338        | 0.3811 | 600  | 0.2603          | 0.1324 |
| 0.2383        | 0.5082 | 800  | 0.2497          | 0.0970 |
| 0.2660        | 0.6352 | 1000 | 0.2434          | 0.0959 |
| 0.2869        | 0.7623 | 1200 | 0.2398          | 0.0969 |
| 0.2723        | 0.8893 | 1400 | 0.2344          | 0.0944 |
| 0.2628        | 1.0159 | 1600 | 0.2324          | 0.0942 |
| 0.3397        | 1.1429 | 1800 | 0.2304          | 0.0944 |
| 0.2329        | 1.2700 | 2000 | 0.2273          | 0.0939 |
| 0.3111        | 1.3970 | 2200 | 0.2264          | 0.0943 |
| 0.1920        | 1.5241 | 2400 | 0.2262          | 0.0940 |
| 0.2996        | 1.6511 | 2600 | 0.2241          | 0.0942 |
| 0.2438        | 1.7781 | 2800 | 0.2239          | 0.0939 |
| 0.2148        | 1.9052 | 3000 | 0.2232          | 0.0939 |

Framework versions

  • Transformers 5.2.0
  • Pytorch 2.10.0+cu128
  • Datasets 4.5.0
  • Tokenizers 0.22.2