Model Card for whisper-medium-ht

Model Details

Model Description

This model is a fine-tuned version of Whisper medium, trained on General Conference talks from the Church of Jesus Christ of Latter-day Saints.

Developed by: Zachary Clement
Model type: Automatic Speech Recognition
Language(s): Haitian Creole
License: Apache 2.0
Finetuned from model: openai/whisper-medium

Model Sources

Repository: See the fine_tuning directory in my GitHub. Note that this model was the phonetic-only candidate.

Uses

This model is intended for converting Haitian Creole audio to text. I did not check for catastrophic forgetting and do not recommend using it for other languages.
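A minimal inference sketch, assuming the checkpoint is available on the Hugging Face Hub as clementzach/whisper-medium-ht (the ID that appears in the results table) and that the transformers library with a PyTorch backend is installed. The helper names, the 30-second chunking, and the file name in the usage comment are illustrative assumptions, not part of the repository.

```python
MODEL_ID = "clementzach/whisper-medium-ht"  # ID as listed in this card's results


def build_generate_kwargs():
    """Decoding options that pin Whisper to Haitian Creole transcription."""
    return {"language": "ht", "task": "transcribe"}


def transcribe(audio_path):
    """Transcribe one audio file; the model is downloaded on first use."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=30,  # assumption: split long audio into 30 s windows
    )
    result = asr(audio_path, generate_kwargs=build_generate_kwargs())
    return result["text"]


# Usage (hypothetical file name):
# text = transcribe("sample.wav")
```

Pinning the language and task avoids relying on Whisper's automatic language detection, which may not pick Haitian Creole reliably.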

Bias, Risks, and Limitations

There is a risk of catastrophic forgetting for languages other than Haitian Creole.

Training Details

Training Data

This model was trained using Haitian Creole transcriptions of General Conference meetings for the Church of Jesus Christ of Latter-day Saints.

Audio from General Conference talks was broken into segments, an ASR model produced a garbled transcription of each segment, and an LLM matched those garbled ASR outputs to the corresponding text in the transcriptions.
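To illustrate the matching idea only, here is a small stand-in that uses string similarity from Python's standard library instead of an LLM. It is not the author's pipeline: the function name and the example sentences are invented for illustration, and a real alignment would also have to handle segments that span several sentences.

```python
from difflib import SequenceMatcher


def best_match(garbled, reference_sentences):
    """Return (score, sentence) for the reference closest to the ASR output."""
    scored = [
        (SequenceMatcher(None, garbled.lower(), ref.lower()).ratio(), ref)
        for ref in reference_sentences
    ]
    return max(scored)  # highest similarity ratio wins


# Invented example: a rough ASR guess and two candidate transcript sentences.
references = ["Mwen renmen nou anpil.", "Bondye beni nou."]
score, sentence = best_match("mwen renmen nou anpi", references)
print(sentence)
```

A similarity score like this can also serve as a filter: segments whose best match scores poorly can be discarded rather than paired with the wrong text.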

In total, 9,838 training samples were used, comprising 41 hours of labeled data.
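As a quick sanity check on these totals, the average segment length can be derived from the two figures above. This treats the 41 hours as exact, so the result is approximate.

```python
TRAIN_HOURS = 41        # labeled audio, as reported above
TRAIN_SAMPLES = 9_838   # training samples, as reported above

avg_seconds = TRAIN_HOURS * 3600 / TRAIN_SAMPLES
print(f"{avg_seconds:.1f} s per sample on average")
```

That works out to roughly 15 seconds per sample, comfortably inside Whisper's 30-second context window.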

Training Procedure

Preprocessing

Audio is resampled to 16 kHz mono. Each sample is transformed into an 80-bin log-mel spectrogram using Whisper's WhisperFeatureExtractor (30-second context window).
Transcriptions are tokenized with WhisperTokenizer configured for Haitian Creole (ht) transcription. Samples whose tokenized label exceeds 448 tokens are dropped before training. No audio augmentation is applied to the phonetic candidate — it trains on original phonetic-alignment segments only.
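The preprocessing described above could look roughly like the following sketch, assuming the transformers library. The function names and the example dictionary layout are assumptions for illustration; the actual code lives in the repository's fine_tuning directory and may differ.

```python
MAX_LABEL_TOKENS = 448   # label-length cutoff stated above
SAMPLING_RATE = 16_000   # 16 kHz mono


def load_processors(base_model="openai/whisper-medium"):
    """Load the feature extractor and tokenizer (downloads on first use)."""
    from transformers import WhisperFeatureExtractor, WhisperTokenizer

    feature_extractor = WhisperFeatureExtractor.from_pretrained(base_model)
    tokenizer = WhisperTokenizer.from_pretrained(
        base_model, language="ht", task="transcribe"
    )
    return feature_extractor, tokenizer


def prepare_example(waveform, text, feature_extractor, tokenizer):
    """Turn one (16 kHz mono waveform, transcript) pair into model inputs."""
    features = feature_extractor(waveform, sampling_rate=SAMPLING_RATE)
    return {
        # Log-mel spectrogram, padded or truncated to the 30 s window.
        "input_features": features.input_features[0],
        "labels": tokenizer(text).input_ids,
    }


def keep_example(example, max_tokens=MAX_LABEL_TOKENS):
    """Keep a sample only if its tokenized label fits within the cutoff."""
    return len(example["labels"]) <= max_tokens
```

With a datasets.Dataset, prepare_example would typically be applied through map and keep_example through filter, though that wiring is also an assumption here.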

The dataset is split by talk ID: 15 specific talks are held out as a fixed evaluation set and fully excluded from all training splits (no segment from an eval talk appears in any training candidate, including the synthetic ones).
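Splitting by talk rather than by segment is what prevents leakage: segments from one talk share a speaker and topic, so they must all land on the same side. A minimal sketch, assuming each segment is a dictionary with a hypothetical talk_id field:

```python
def split_by_talk(segments, eval_talk_ids):
    """Route every segment to train or eval based on its parent talk."""
    eval_ids = set(eval_talk_ids)
    train = [s for s in segments if s["talk_id"] not in eval_ids]
    held_out = [s for s in segments if s["talk_id"] in eval_ids]
    return train, held_out


# Toy data: "talk_A" is a made-up ID; the other is one of the held-out talks.
segments = [
    {"talk_id": "talk_A", "text": "..."},
    {"talk_id": "2012_10_Nourise_a", "text": "..."},
    {"talk_id": "talk_A", "text": "..."},
]
train, held_out = split_by_talk(segments, ["2012_10_Nourise_a"])
```

The same filter has to run on every training candidate, since a single leaked segment would make the evaluation numbers optimistic.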

Training Hyperparameters

Training regime: bf16 mixed precision
Base model: openai/whisper-medium
Task/language: transcribe / Haitian Creole (ht)
Encoder: fully trainable (no freezing)
Max steps: 3,000
Per-device train batch size: 16
Gradient accumulation steps: 2 → effective batch size: 32
Per-device eval batch size: 8
Learning rate: 1e-5 with cosine decay
Warmup steps: 500
Gradient checkpointing: enabled
Eval/save strategy: every 500 steps (max steps // 6), best checkpoint retained by WER
Save total limit: 3 checkpoints
Generation max length: 225 tokens
Optimizer: AdamW (HuggingFace Trainer default)
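For readers who want to reproduce this setup, the values above map onto Hugging Face Seq2SeqTrainingArguments roughly as follows. This is a sketch under assumptions: the output directory is hypothetical, the logging values come from this card's logging entry, and the eval_strategy argument is named evaluation_strategy in older transformers releases.

```python
MAX_STEPS = 3_000
EVAL_EVERY = MAX_STEPS // 6  # evaluate and save every 500 steps

training_kwargs = dict(
    output_dir="./whisper-medium-ht",  # hypothetical path
    max_steps=MAX_STEPS,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",  # "evaluation_strategy" in older versions
    eval_steps=EVAL_EVERY,
    save_strategy="steps",
    save_steps=EVAL_EVERY,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,  # lower WER is better
    predict_with_generate=True,
    generation_max_length=225,
    logging_steps=25,
    report_to=["tensorboard"],
)

# Effective batch size on a single device.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)


def build_training_args():
    """Construct the arguments object (requires transformers and PyTorch)."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**training_kwargs)
```

No optimizer is set explicitly because the Trainer default is AdamW, and nothing is frozen, so the encoder trains along with the decoder.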

Speeds, Sizes, Times

Hardware: single NVIDIA L4 (24 GB VRAM), 16 GB RAM, 4 vCPUs
Training duration: 3,000 steps
Logging: every 25 steps (TensorBoard)

Evaluation

Testing Data, Factors & Metrics

Testing Data

Testing data: 15 held-out Haitian Creole conference talks, excluded from all training splits. Talk IDs:
2012_10_Nourise_a
2021_10_Diyite_Pa_Vledi_Nou_Pafe
2021_4_Kisa_Sove_nou_an_te_fe_pou_nou
2021_10_Lanmou_Bondye_Sa_ki_bay_nanm_lan_plis_lajwa
2012_10_Bon_Dezi_pou_Angaje
2012_10_Eprev_lafwa_nou
2015_4_Prezidan_Dieter_F_Uchtdorf
2019_10_Bay_Espri_nou_Kontwol_Sou_Ko_nou
2021_4_Chemen_alyans_lan
2019_4_Jan_l_te_fe_a
2014_4_Fotifye_nou_e_pran_kouraj
2021_4_Bondye_Nan_Mitan_Nou
2018_10_Vin_tounen_Sen_Denye_Jou_Egzanple
2015_10_Elde_Quentin_L_Cook
2012_4_Rapo_Depatman_Odit_Legliz_la_2011

Metrics

WER (Word Error Rate) — primary metric; used for best-checkpoint selection during training
CER (Character Error Rate) — secondary metric
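Both metrics are edit distances normalized by reference length: WER counts word-level substitutions, insertions, and deletions, while CER does the same over characters. The dependency-free sketch below shows the definition. It is not the evaluation code used for this card, which more likely relied on a library, and it assumes a non-empty reference and applies no text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, wer("a b c d", "a x c d") is 0.25, since one of four reference words is wrong. Both rates can exceed 100% when the hypothesis contains many insertions.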

Results

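The fine-tuned model substantially outperforms the stock Whisper models on the held-out set: WER drops from 84.25% for the base openai/whisper-medium to 34.13%, about half the 69.13% WER of the larger openai/whisper-large-v3.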

Model	WER	CER
openai/whisper-large-v3	69.13%	31.30%
openai/whisper-medium	84.25%	40.89%
openai/whisper-small	97.28%	48.84%
clementzach/whisper-medium-ht	34.13%	20.30%
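The size of the improvement is easier to compare as a relative error reduction. The short sketch below computes it from the table values; the helper name is illustrative.

```python
# (WER %, CER %) copied from the results table above.
results = {
    "openai/whisper-large-v3": (69.13, 31.30),
    "openai/whisper-medium": (84.25, 40.89),
    "openai/whisper-small": (97.28, 48.84),
    "clementzach/whisper-medium-ht": (34.13, 20.30),
}


def relative_reduction(baseline, value):
    """Fraction of the baseline error removed by the new model."""
    return (baseline - value) / baseline


ft_wer, ft_cer = results["clementzach/whisper-medium-ht"]
for name, (base_wer, base_cer) in results.items():
    if name == "clementzach/whisper-medium-ht":
        continue
    print(
        f"vs {name}: WER -{relative_reduction(base_wer, ft_wer):.1%}, "
        f"CER -{relative_reduction(base_cer, ft_cer):.1%}"
    )
```

Against its own base model, the fine-tuned checkpoint removes roughly 59% of the word errors.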

