Whisper Large — Welsh & English (techiaith/whisper-large-ft-cy-en)

A fine-tuned openai/whisper-large-v2 model for Welsh and English automatic speech recognition, with Welsh-to-English speech translation capability.

Supported Tasks

| Task | Description |
|---|---|
| Welsh transcription | Welsh audio → Welsh text |
| English transcription | English audio (UK/Irish accents) → English text |
| Welsh→English translation | Welsh audio → English text |

Evaluation Results

Welsh Transcription

| Test Set | WER (%) | CER (%) |
|---|---|---|
| cymen-arfor/lleisiau-arfor (spontaneous) | 29.79 | 11.67 |
| techiaith/banc-trawsgrifiadau-bangor (mixed) | 27.65 | 9.81 |
| techiaith/commonvoice-23-0-cy (read) | 14.97 | 4.26 |

English Transcription

| Test Set | WER (%) | CER (%) |
|---|---|---|
| techiaith/commonvoice-23-0-en/GB-IE (read, UK/Irish) | 9.92 | 3.47 |
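For readers unfamiliar with the WER and CER figures reported above, the following is a minimal sketch of how the two metrics are defined: edit distance between reference and hypothesis, divided by reference length, over words or characters respectively. The card does not say which tool or text normalisation produced the reported scores, so the function names and the example sentence here are illustrative only. Real evaluations usually normalise text and aggregate errors over the whole test set rather than scoring one utterance.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a percentage of reference words."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a percentage of reference characters."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative example (not taken from any of the test sets).
reference = "mae hi'n braf heddiw"
hypothesis = "mae hi braf heddiw"
print(wer(reference, hypothesis))  # 25.0: one of four words is wrong
print(cer(reference, hypothesis))  # 10.0: two of twenty characters are missing
```

The example shows why CER is typically much lower than WER: a single wrong word counts as a full word error even when only a couple of characters differ.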

Welsh→English Translation

| Test Set | BLEU | chrF |
|---|---|---|
| techiaith/commonvoice-23-0-cy-en | 18.17 | 38.13 |

Training Data

Total training data: ~177 hours across 153,066 clips.

| Dataset | Language | Duration (hh:mm) | Clips | Description |
|---|---|---|---|---|
| techiaith/banc-trawsgrifiadau-bangor | Welsh | 52:45h | 48,569 | Mixed spontaneous & read speech |
| techiaith/corpws-clllc-wlga | Welsh | 32:59h | 26,216 | Local government meetings |
| cymen-arfor/lleisiau-arfor | Welsh | 33:54h | 33,614 | Spontaneous conversational speech |
| techiaith/commonvoice_23_0_cy | Welsh | 31:11h | 20,018 | Read speech (CommonVoice 23.0) |
| techiaith/commonvoice_vad_cy | Welsh | 3:27h | 8,209 | VAD-segmented clips |
| techiaith/commonvoice_23_0_en__GB_IE | English | 22:26h | 16,440 | Read speech, UK/Irish accents (10% sample) |
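The stated totals can be checked by summing the per-dataset rows. The sketch below does only that arithmetic; the dictionary and helper names are illustrative, and the durations are read as hours:minutes.

```python
# Per-dataset (duration "hh:mm", clip count), copied from the rows above.
TRAINING_SETS = {
    "techiaith/banc-trawsgrifiadau-bangor": ("52:45", 48_569),
    "techiaith/corpws-clllc-wlga": ("32:59", 26_216),
    "cymen-arfor/lleisiau-arfor": ("33:54", 33_614),
    "techiaith/commonvoice_23_0_cy": ("31:11", 20_018),
    "techiaith/commonvoice_vad_cy": ("3:27", 8_209),
    "techiaith/commonvoice_23_0_en__GB_IE": ("22:26", 16_440),
}


def to_minutes(hhmm):
    """Convert an 'hh:mm' duration string to whole minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


total_minutes = sum(to_minutes(d) for d, _ in TRAINING_SETS.values())
total_clips = sum(clips for _, clips in TRAINING_SETS.values())

print(f"{total_minutes // 60}:{total_minutes % 60:02d} h, {total_clips:,} clips")
# 176:42 h, 153,066 clips
```

The sum is 176 hours 42 minutes, which rounds to the ~177 hours quoted, and the clip counts add up to 153,066 exactly.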

Validation: techiaith/banc-trawsgrifiadau-bangor validation split (4:00h, 3,895 clips).

Training Configuration

| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v2 |
| Learning rate | 1e-5 |
| LR scheduler | cosine |
| Warmup steps | 500 |
| Max steps | 8,000 |
| Weight decay | 0.05 |
| Batch size | 16 × 2 accumulation × 2 GPUs = 64 effective |
| FP16 | True |
| SpecAugment | False |
| Early stopping patience | 5 |
| Best checkpoint step | 7,000 |
| Best eval WER | 28.0% |
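As a rough illustration, the settings above could map onto keyword arguments for Hugging Face `Seq2SeqTrainingArguments`. The card does not state which training script was used, so treat this as a sketch under that assumption rather than the authors' actual configuration. The last three dictionary entries are not in the card at all; they are assumptions about what early stopping on WER would typically require.

```python
PER_DEVICE_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 2
NUM_GPUS = 2
MAX_STEPS = 8_000

# 16 per device x 2 accumulation steps x 2 GPUs
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_GPUS

# Keyword names follow transformers.Seq2SeqTrainingArguments (assumed trainer).
training_kwargs = {
    "learning_rate": 1e-5,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 500,
    "max_steps": MAX_STEPS,
    "weight_decay": 0.05,
    "per_device_train_batch_size": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION_STEPS,
    "fp16": True,
    # Assumptions, not stated in the card: needed to keep the best-WER checkpoint.
    "load_best_model_at_end": True,
    "metric_for_best_model": "wer",
    "greater_is_better": False,
}

# Would be passed to transformers.EarlyStoppingCallback(early_stopping_patience=...).
early_stopping_patience = 5

# Clips processed if training ran to the step limit.
samples_at_max_steps = MAX_STEPS * effective_batch_size

print(effective_batch_size)   # 64
print(samples_at_max_steps)   # 512000
```

With 153,066 training clips, 512,000 processed samples correspond to a little over three passes through the data. "SpecAugment False" would correspond to leaving `apply_spec_augment` disabled in the Whisper model config, again assuming the Transformers implementation was used.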

Usage

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="techiaith/whisper-large-ft-cy-en",
)

# Welsh transcription
result = pipe("welsh_audio.wav", generate_kwargs={"language": "cy", "task": "transcribe"})

# English transcription
result = pipe("english_audio.wav", generate_kwargs={"language": "en", "task": "transcribe"})

# Welsh to English translation
result = pipe("welsh_audio.wav", generate_kwargs={"language": "cy", "task": "translate"})
```
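The three calls above differ only in their `generate_kwargs`, so a small wrapper can make the task choice explicit and reuse one loaded pipeline. This is a sketch: the task labels, helper names and the 30-second `chunk_length_s` value are illustrative choices, not part of the model card. Chunking matters because Whisper processes audio in 30-second windows, so longer recordings need to be split.

```python
from functools import lru_cache

MODEL_ID = "techiaith/whisper-large-ft-cy-en"

# Hypothetical task labels mapped to the generate_kwargs shown above.
TASKS = {
    "welsh-transcription": {"language": "cy", "task": "transcribe"},
    "english-transcription": {"language": "en", "task": "transcribe"},
    "welsh-to-english": {"language": "cy", "task": "translate"},
}


def generate_kwargs_for(task_name):
    """Return a copy of the generation settings for a supported task."""
    try:
        return dict(TASKS[task_name])
    except KeyError:
        raise ValueError(
            f"Unknown task {task_name!r}; expected one of {sorted(TASKS)}"
        ) from None


@lru_cache(maxsize=1)
def load_pipe():
    """Load the ASR pipeline once and reuse it across calls."""
    from transformers import pipeline  # imported lazily; requires transformers

    return pipeline("automatic-speech-recognition", model=MODEL_ID)


def run(audio_path, task_name, chunk_length_s=30):
    """Transcribe or translate one file, chunking audio longer than 30 s."""
    result = load_pipe()(
        audio_path,
        chunk_length_s=chunk_length_s,
        generate_kwargs=generate_kwargs_for(task_name),
    )
    return result["text"]
```

For example, `run("welsh_audio.wav", "welsh-to-english")` would return the English translation as a plain string.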

CTranslate2 Version

A CTranslate2 (int8 quantised) version is available at techiaith/whisper-large-ft-cy-en-ct2 for faster inference.
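One common way to run CTranslate2 Whisper models is the third-party `faster-whisper` package. The sketch below assumes the repository is laid out so that `faster_whisper.WhisperModel` can load it directly by name; the card does not confirm this, and the helper names, CPU device and demo segments are illustrative.

```python
from types import SimpleNamespace


def segments_to_text(segments):
    """Join the text of decoded segments into a single string."""
    return " ".join(seg.text.strip() for seg in segments)


def transcribe_ct2(audio_path, language="cy", task="transcribe"):
    """Transcribe (or translate) one file with the int8 CTranslate2 model."""
    from faster_whisper import WhisperModel  # imported lazily; third-party

    model = WhisperModel(
        "techiaith/whisper-large-ft-cy-en-ct2",
        device="cpu",         # assumption: use "cuda" if a GPU is available
        compute_type="int8",  # matches the int8 quantisation noted above
    )
    # transcribe() returns a lazy generator of segments plus audio metadata.
    segments, _info = model.transcribe(audio_path, language=language, task=task)
    return segments_to_text(segments)


# Stand-in segments to show the joining step without loading the model.
demo = [SimpleNamespace(text=" Bore da."), SimpleNamespace(text=" Sut wyt ti?")]
print(segments_to_text(demo))  # Bore da. Sut wyt ti?
```

Because the segments come back as a generator, decoding only happens while `segments_to_text` iterates over them.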

Acknowledgements

Developed by Uned Technolegau Iaith, Prifysgol Bangor / Language Technologies Unit, Bangor University.

Funded by the Welsh Government.
