---
language:
- cy
- en
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- welsh
- cymraeg
- speech-recognition
- translation
base_model: openai/whisper-large-v2
datasets:
- techiaith/banc-trawsgrifiadau-bangor
- techiaith/corpws-clllc-wlga
- cymen-arfor/lleisiau-arfor
- techiaith/commonvoice_23_0_cy
- techiaith/commonvoice_vad_cy
- techiaith/commonvoice_23_0_en__GB_IE
- techiaith/commonvoice-23-0-cy-en
metrics:
- wer
- cer
---
Whisper Large — Welsh & English (techiaith/whisper-large-ft-cy-en)
This model is openai/whisper-large-v2 fine-tuned for automatic speech recognition in Welsh and English. It can also translate Welsh speech into English text.
Supported Tasks
| Task | Description |
|---|---|
| Welsh transcription | Welsh audio → Welsh text |
| English transcription | English audio (UK/Irish accents) → English text |
| Welsh→English translation | Welsh audio → English text |
Evaluation Results
Welsh Transcription
| Test Set | WER (%) | CER (%) |
|---|---|---|
| cymen-arfor/lleisiau-arfor (spontaneous) | 29.79 | 11.67 |
| techiaith/banc-trawsgrifiadau-bangor (mixed) | 27.65 | 9.81 |
| techiaith/commonvoice-23-0-cy (read) | 14.97 | 4.26 |
English Transcription
| Test Set | WER (%) | CER (%) |
|---|---|---|
| techiaith/commonvoice-23-0-en/GB-IE (read, UK/Irish) | 9.92 | 3.47 |
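For readers unfamiliar with the WER and CER columns in the transcription tables above, the following is a minimal pure-Python sketch of how the two metrics are conventionally defined: edit distance over words (WER) or characters (CER), divided by the reference length and expressed as a percentage. It is an illustration only. The card does not describe the scoring pipeline behind the published figures, which may apply text normalisation and aggregate over a whole test set, so this per-utterance version will not reproduce them. The function names and the example sentence are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for a single utterance."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent for a single utterance."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example: one of four words is wrong, two of twenty characters are missing.
reference = "mae hi'n braf heddiw"
hypothesis = "mae hi braf heddiw"
print(f"WER: {wer(reference, hypothesis):.2f}")  # WER: 25.00
print(f"CER: {cer(reference, hypothesis):.2f}")  # CER: 10.00
```

CER is usually much lower than WER on the same test set, as in the tables above, because a single misspelt word counts as one full word error but only a few character errors.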
Welsh→English Translation
| Test Set | BLEU | chrF |
|---|---|---|
| techiaith/commonvoice-23-0-cy-en | 18.17 | 38.13 |
Training Data
Total training data: ~177 hours across 153,066 clips.
| Dataset | Language | Duration (hh:mm) | Clips | Description |
|---|---|---|---|---|
| techiaith/banc-trawsgrifiadau-bangor | Welsh | 52:45 | 48,569 | Mixed spontaneous & read speech |
| techiaith/corpws-clllc-wlga | Welsh | 32:59 | 26,216 | Local government meetings |
| cymen-arfor/lleisiau-arfor | Welsh | 33:54 | 33,614 | Spontaneous conversational speech |
| techiaith/commonvoice_23_0_cy | Welsh | 31:11 | 20,018 | Read speech (CommonVoice 23.0) |
| techiaith/commonvoice_vad_cy | Welsh | 3:27 | 8,209 | VAD-segmented clips |
| techiaith/commonvoice_23_0_en__GB_IE | English | 22:26 | 16,440 | Read speech, UK/Irish accents (10% sample) |
Validation: techiaith/banc-trawsgrifiadau-bangor validation split (4:00h, 3,895 clips).
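The stated totals can be checked against the per-dataset rows of the training data table. The sketch below does that arithmetic, assuming the duration column is in hours:minutes; the variable names are illustrative only.

```python
# (dataset, duration as "hh:mm", clip count), copied from the training data table.
# Assumption: durations are hours:minutes.
TRAINING_SETS = [
    ("techiaith/banc-trawsgrifiadau-bangor", "52:45", 48_569),
    ("techiaith/corpws-clllc-wlga", "32:59", 26_216),
    ("cymen-arfor/lleisiau-arfor", "33:54", 33_614),
    ("techiaith/commonvoice_23_0_cy", "31:11", 20_018),
    ("techiaith/commonvoice_vad_cy", "3:27", 8_209),
    ("techiaith/commonvoice_23_0_en__GB_IE", "22:26", 16_440),
]


def to_minutes(hhmm):
    """Convert an 'hh:mm' string to a whole number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


total_minutes = sum(to_minutes(duration) for _, duration, _ in TRAINING_SETS)
total_clips = sum(clips for _, _, clips in TRAINING_SETS)

print(f"{total_minutes // 60}:{total_minutes % 60:02d} h, {total_clips:,} clips")
# 176:42 h, 153,066 clips

# Mean clip length in seconds, a rough guide to the utterance lengths seen in training.
print(f"{total_minutes * 60 / total_clips:.1f} s per clip on average")
```

The sum comes to 176 hours 42 minutes, which rounds to the ~177 hours quoted, and the clip counts add up to exactly 153,066.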
Training Configuration
| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v2 |
| Learning rate | 1e-5 |
| LR scheduler | cosine |
| Warmup steps | 500 |
| Max steps | 8,000 |
| Weight decay | 0.05 |
| Batch size | 16 × 2 accumulation × 2 GPUs = 64 effective |
| FP16 | True |
| SpecAugment | False |
| Early stopping patience | 5 |
| Best checkpoint | step 7,000 |
| Best eval WER | 28.0% |
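To make the schedule in the table concrete, the sketch below computes the effective batch size and the learning rate at a few steps. The card only says "cosine", so the exact shape is an assumption: linear warmup over 500 steps to the peak rate, followed by a single half-cosine decay to zero at step 8,000, which is the usual form of a cosine schedule with warmup. The constant and function names are illustrative.

```python
import math

# Values copied from the training configuration table.
BASE_LR = 1e-5
WARMUP_STEPS = 500
MAX_STEPS = 8_000
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 2
NUM_GPUS = 2


def effective_batch_size():
    """Number of clips contributing to each optimizer step."""
    return PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS


def lr_at(step):
    """Learning rate at a given optimizer step.

    Assumption: linear warmup to BASE_LR, then half-cosine decay to zero
    at MAX_STEPS. The card does not state the decay floor or cycle count.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size())  # 64
for step in (0, 250, 500, 4_250, 7_000, 8_000):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under this assumption the learning rate at the best checkpoint (step 7,000) is already a small fraction of the peak value, so the final 1,000 steps change the weights only slightly.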
Usage
from transformers import pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="techiaith/whisper-large-ft-cy-en",
)
# Welsh transcription
result = pipe("welsh_audio.wav", generate_kwargs={"language": "cy", "task": "transcribe"})
# English transcription
result = pipe("english_audio.wav", generate_kwargs={"language": "en", "task": "transcribe"})
# Welsh to English translation
result = pipe("welsh_audio.wav", generate_kwargs={"language": "cy", "task": "translate"})
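The three calls above differ only in their `generate_kwargs`, so one possible way to organise them is a small lookup keyed by task. This is a sketch, not part of the model's API: the task labels, `generate_kwargs_for`, and `run` are hypothetical names introduced here. The pipeline is imported lazily and can be passed in, so it is loaded once and reused.

```python
# Generation settings for the three supported tasks.
# The dictionary keys are hypothetical labels chosen for this sketch.
TASK_SETTINGS = {
    "welsh_transcription": {"language": "cy", "task": "transcribe"},
    "english_transcription": {"language": "en", "task": "transcribe"},
    "welsh_to_english_translation": {"language": "cy", "task": "translate"},
}


def generate_kwargs_for(task_name):
    """Return a copy of the generation settings for a named task."""
    try:
        return dict(TASK_SETTINGS[task_name])
    except KeyError:
        valid = ", ".join(sorted(TASK_SETTINGS))
        raise ValueError(f"unknown task {task_name!r}; expected one of: {valid}") from None


def run(audio_path, task_name, pipe=None):
    """Run one of the supported tasks on an audio file and return the text."""
    if pipe is None:
        # Imported lazily; building the pipeline downloads the model on first use.
        from transformers import pipeline

        pipe = pipeline(
            "automatic-speech-recognition",
            model="techiaith/whisper-large-ft-cy-en",
        )
    result = pipe(audio_path, generate_kwargs=generate_kwargs_for(task_name))
    return result["text"]
```

For example, `run("welsh_audio.wav", "welsh_to_english_translation", pipe=pipe)` reuses an already constructed `pipe` and returns the English text.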
CTranslate2 Version
A CTranslate2 (int8 quantised) version is available at techiaith/whisper-large-ft-cy-en-ct2 for faster inference.
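A CTranslate2 Whisper model is commonly loaded with the `faster-whisper` package. The sketch below shows what that might look like; it assumes the `techiaith/whisper-large-ft-cy-en-ct2` repository uses the file layout `faster-whisper` expects, which this card does not confirm. `join_segments` and `transcribe_ct2` are hypothetical helper names.

```python
def join_segments(segments):
    """Concatenate the text of decoded segments into one transcript string."""
    return " ".join(segment.text.strip() for segment in segments)


def transcribe_ct2(audio_path, language="cy", task="transcribe"):
    """Transcribe (or translate) an audio file with the CTranslate2 model."""
    # Imported lazily so join_segments works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumption: the repository is directly loadable by faster-whisper.
    model = WhisperModel(
        "techiaith/whisper-large-ft-cy-en-ct2",
        device="cpu",
        compute_type="int8",
    )
    # transcribe() returns a lazy generator of segments plus an info object;
    # decoding happens as the generator is consumed.
    segments, _info = model.transcribe(audio_path, language=language, task=task)
    return join_segments(segments)
```

Calling `transcribe_ct2("welsh_audio.wav", task="translate")` would request Welsh-to-English translation in the same way as the `transformers` pipeline.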
Acknowledgements
Developed by Uned Technolegau Iaith, Prifysgol Bangor / Language Technologies Unit, Bangor University.
Funded by the Welsh Government.