Model Card for whisper-medium-ht
Model Details
Model Description
This model is a version of Whisper medium fine-tuned on general conference talks from the Church of Jesus Christ of Latter-day Saints.
- Developed by: Zachary Clement
- Model type: Automatic Speech Recognition
- Language(s): Haitian Creole
- License: Apache 2.0
- Fine-tuned from model: openai/whisper-medium
Model Sources
- Repository: See the fine_tuning directory in my GitHub. Note that this model was the phonetic-only candidate.
Uses
This model should be used to convert Haitian Creole audio to text. I did not check for catastrophic forgetting and do not recommend using it on other languages.
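A minimal inference sketch, assuming the Hugging Face `transformers` library (plus an audio backend such as ffmpeg) is installed. The model ID is the one listed in the results table; the function name `transcribe` and any audio path passed to it are illustrative, and the language and task are forced explicitly rather than left to Whisper's language detection.

```python
MODEL_ID = "clementzach/whisper-medium-ht"

# Force Haitian Creole transcription instead of relying on language detection.
GENERATE_KWARGS = {"language": "ht", "task": "transcribe"}


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file; long files are processed in 30-second chunks."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=30,
    )
    result = asr(audio_path, generate_kwargs=GENERATE_KWARGS)
    return result["text"]


# Example call (hypothetical file name):
#   text = transcribe("talk_segment.wav")
```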
Bias, Risks, and Limitations
There is a risk of catastrophic forgetting for languages other than Haitian Creole; performance on other languages was not evaluated.
Training Details
Training Data
This model was trained using Haitian Creole transcriptions of General Conference meetings for the Church of Jesus Christ of Latter-day Saints.
Audio from general conference talks was broken into segments, ASR was used to produce a garbled transcription of each segment, and an LLM was used to match the garbled ASR outputs to the corresponding passages in the transcriptions.
In total, 9,838 training samples were used, comprising 41 hours of labeled data.
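As a quick sanity check on these figures, the average segment length can be derived from the sample count and total duration. The 41 hours is a rounded number, so the result is approximate.

```python
num_samples = 9_838
total_hours = 41  # rounded figure stated above

# Average duration of one training segment, in seconds.
avg_seconds = total_hours * 3600 / num_samples
print(f"average segment length: {avg_seconds:.1f} s")  # prints 15.0 s
```

An average of about 15 seconds sits comfortably inside Whisper's 30-second context window.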
Training Procedure
Preprocessing
Audio is resampled to 16 kHz mono. Each sample is transformed into an 80-bin log-mel spectrogram using Whisper's WhisperFeatureExtractor (30-second context window).
Transcriptions are tokenized with WhisperTokenizer configured for Haitian Creole (ht) transcription. Samples whose tokenized label exceeds 448 tokens are dropped before
training. No audio augmentation is applied to the phonetic candidate: it trains on original phonetic-alignment segments only.
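The label-length filter can be sketched as below. This is an illustration rather than the actual training script: the `text` and `labels` field names and the `drop_long_labels` helper are hypothetical, and the tokenizer is passed in as a plain callable. With the real tokenizer, that callable would presumably wrap a `WhisperTokenizer` loaded for `ht` transcription and return its `input_ids`.

```python
MAX_LABEL_TOKENS = 448  # limit stated in the preprocessing description


def drop_long_labels(samples, tokenize, max_tokens=MAX_LABEL_TOKENS):
    """Tokenize each transcription and keep only samples within the token limit.

    `samples` is a list of dicts with a "text" key (hypothetical schema);
    `tokenize` maps a string to a list of token IDs.
    """
    kept = []
    for sample in samples:
        label_ids = tokenize(sample["text"])
        # Labels that exceed the limit are dropped; exactly max_tokens is kept.
        if len(label_ids) <= max_tokens:
            kept.append({**sample, "labels": label_ids})
    return kept


# Toy demonstration with a whitespace "tokenizer" standing in for WhisperTokenizer.
toy_samples = [
    {"text": "Bonjou tout moun"},
    {"text": " ".join(["mo"] * 500)},  # 500 tokens, over the limit
]
filtered = drop_long_labels(toy_samples, tokenize=str.split)
```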
The dataset is split by talk ID: 15 specific talks are held out as a fixed evaluation set and fully excluded from all training splits (no segment from an eval talk appears in
any training candidate, including synthetic).
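A talk-level split like the one described can be sketched as follows. The `talk_id` field and the `split_by_talk` helper are assumptions for illustration; the point is that membership is decided per talk, never per segment, so no segment of a held-out talk can leak into training.

```python
def split_by_talk(segments, eval_talk_ids):
    """Split segments into (train, evaluation) lists by their talk ID."""
    eval_ids = set(eval_talk_ids)
    train = [s for s in segments if s["talk_id"] not in eval_ids]
    evaluation = [s for s in segments if s["talk_id"] in eval_ids]
    # Guard against leakage: no held-out talk may contribute a training segment.
    assert not {s["talk_id"] for s in train} & eval_ids
    return train, evaluation


# Toy demonstration: one real held-out talk ID and one made-up training talk ID.
toy_segments = [
    {"talk_id": "2012_10_Nourise_a", "segment": 0},
    {"talk_id": "2012_10_Nourise_a", "segment": 1},
    {"talk_id": "hypothetical_training_talk", "segment": 0},
]
train_set, eval_set = split_by_talk(toy_segments, ["2012_10_Nourise_a"])
```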
Training Hyperparameters
- Training regime: bf16 mixed precision
- Base model: openai/whisper-medium
- Task/language: transcribe / Haitian Creole (ht)
- Encoder: fully trainable (no freezing)
- Max steps: 3,000
- Per-device train batch size: 16
- Gradient accumulation steps: 2 (effective batch size: 32)
- Per-device eval batch size: 8
- Learning rate: 1e-5 with cosine decay
- Warmup steps: 500
- Gradient checkpointing: enabled
- Eval/save strategy: every 500 steps (steps // 6), best checkpoint retained by WER
- Save total limit: 3 checkpoints
- Generation max length: 225 tokens
- Optimizer: AdamW (HuggingFace Trainer default)
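The hyperparameters above map onto Hugging Face `Seq2SeqTrainingArguments` roughly as sketched below. This is a reconstruction, not the original script: `output_dir` is a placeholder, and `predict_with_generate`, `load_best_model_at_end`, and `greater_is_better` are assumptions inferred from "best checkpoint retained by WER". The derived values (effective batch size, eval interval, approximate passes over the data) assume a single GPU and the 9,838 training samples reported earlier.

```python
MAX_STEPS = 3_000
PER_DEVICE_TRAIN_BATCH = 16
GRAD_ACCUM_STEPS = 2
NUM_TRAIN_SAMPLES = 9_838

effective_batch = PER_DEVICE_TRAIN_BATCH * GRAD_ACCUM_STEPS  # single GPU
eval_every = MAX_STEPS // 6
approx_epochs = MAX_STEPS * effective_batch / NUM_TRAIN_SAMPLES

TRAINING_KWARGS = dict(
    output_dir="whisper-medium-ht",  # placeholder path
    max_steps=MAX_STEPS,
    per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    eval_steps=eval_every,
    save_strategy="steps",
    save_steps=eval_every,
    save_total_limit=3,
    predict_with_generate=True,      # assumption: needed to compute WER during eval
    generation_max_length=225,
    load_best_model_at_end=True,     # assumption
    metric_for_best_model="wer",
    greater_is_better=False,         # lower WER is better
    logging_steps=25,
    report_to=["tensorboard"],
)


def build_training_args():
    """Instantiate the arguments; requires transformers, torch, and bf16-capable hardware."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**TRAINING_KWARGS)
```

Under these assumptions, 3,000 steps at an effective batch size of 32 correspond to roughly 9.8 passes over the training set.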
Speeds, Sizes, Times
- Hardware: single NVIDIA L4 (24 GB VRAM), 16 GB RAM, 4 vCPUs
- Training duration: 3,000 steps
- Logging: every 25 steps (TensorBoard)
Evaluation
Testing Data, Factors & Metrics
Testing Data
Testing data: 15 held-out Haitian Creole conference talks, excluded from all training splits. Talk IDs: 2012_10_Nourise_a, 2021_10_Diyite_Pa_Vledi_Nou_Pafe,
2021_4_Kisa_Sove_nou_an_te_fe_pou_nou, 2021_10_Lanmou_Bondye_Sa_ki_bay_nanm_lan_plis_lajwa, 2012_10_Bon_Dezi_pou_Angaje, 2012_10_Eprev_lafwa_nou,
2015_4_Prezidan_Dieter_F_Uchtdorf, 2019_10_Bay_Espri_nou_Kontwol_Sou_Ko_nou, 2021_4_Chemen_alyans_lan, 2019_4_Jan_l_te_fe_a, 2014_4_Fotifye_nou_e_pran_kouraj,
2021_4_Bondye_Nan_Mitan_Nou, 2018_10_Vin_tounen_Sen_Denye_Jou_Egzanple, 2015_10_Elde_Quentin_L_Cook, 2012_4_Rapo_Depatman_Odit_Legliz_la_2011.
Metrics
- WER (Word Error Rate): primary metric; used for best-checkpoint selection during training
- CER (Character Error Rate): secondary metric
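For readers unfamiliar with these metrics, both are edit distances normalized by reference length: WER counts word-level substitutions, insertions, and deletions, and CER does the same over characters. The sketch below is a plain-Python illustration of the definitions, not the evaluation code used for this model (which tool and which text normalization were used is not stated here). It scores a single utterance and assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


example_wer = wer("mwen renmen ou", "mwen renmen w")  # 1 of 3 words wrong
example_cer = cer("mwen renmen ou", "mwen renmen w")  # 2 edits over 14 characters
```

Corpus-level scores are usually computed by summing edits and reference lengths over all utterances rather than averaging per-utterance rates.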
Results
The fine-tuned model substantially outperforms the stock Whisper models on the held-out set, reaching roughly half the WER of the strongest baseline, openai/whisper-large-v3 (34.13% vs. 69.13%).
| Model | WER | CER |
|---|---|---|
| openai/whisper-large-v3 | 69.13% | 31.30% |
| openai/whisper-medium | 84.25% | 40.89% |
| openai/whisper-small | 97.28% | 48.84% |
| clementzach/whisper-medium-ht | 34.13% | 20.30% |
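The table can also be read as relative error reduction, computed as (baseline - fine-tuned) / baseline. The short script below derives these figures directly from the table values; nothing beyond the table is assumed.

```python
# (WER %, CER %) copied from the results table.
results = {
    "openai/whisper-large-v3": (69.13, 31.30),
    "openai/whisper-medium": (84.25, 40.89),
    "openai/whisper-small": (97.28, 48.84),
    "clementzach/whisper-medium-ht": (34.13, 20.30),
}

FINE_TUNED = "clementzach/whisper-medium-ht"
ft_wer, ft_cer = results[FINE_TUNED]

# Relative reduction in percent for each baseline model.
relative_reduction = {
    name: (100 * (w - ft_wer) / w, 100 * (c - ft_cer) / c)
    for name, (w, c) in results.items()
    if name != FINE_TUNED
}

for name, (wer_red, cer_red) in relative_reduction.items():
    print(f"{name}: WER -{wer_red:.1f}%, CER -{cer_red:.1f}% (relative)")
```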