# wav2vec2-librispeech-en-finetuned

Fine-tuned facebook/wav2vec2-large-xlsr-53 for English ASR using LibriSpeech (openslr/librispeech_asr, clean, train.360).
- Task: Automatic Speech Recognition (CTC)
- Language: English
- License: CC BY 4.0
- Base model: facebook/wav2vec2-large-xlsr-53
- Dataset: openslr/librispeech_asr
- Validation WER (best): 0.0691
- This model is uncased: outputs are lowercased and generally do not restore punctuation or capitalization.
## Usage

```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForCTC

repo_id = "takehika/wav2vec2-librispeech-en-finetuned"
processor = AutoProcessor.from_pretrained(repo_id)
model = AutoModelForCTC.from_pretrained(repo_id)

# Load audio and resample to the 16 kHz rate the model expects
speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token at each frame
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]
print(text)
```
## Input vs Output Example

- Spoken content: `HELLO, WORLD. I'M JOHN.`
- Model output style: `hello world im john`
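When scoring this model against cased, punctuated reference transcripts, the references are typically normalized to the same lowercase, punctuation-free style first. A minimal sketch (the `normalize` helper is illustrative and not part of this repository):

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation to match the model's output style."""
    text = text.lower()
    # Keep only letters and whitespace (drops punctuation, apostrophes, digits)
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse runs of whitespace into single spaces
    return " ".join(text.split())

print(normalize("HELLO, WORLD. I'M JOHN."))  # -> hello world im john
```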
## Data

- Dataset: openslr/librispeech_asr
- Config: clean
- Train: train.360 (104,014 samples)
- Validation: validation (2,703 samples)
- Test: test (2,620 samples)
## Training

- Batch size: 8
- Gradient accumulation: 2 (effective batch size 16)
- Learning rate: 2e-5
- Scheduler: constant_with_warmup
- Warmup steps: 500
- Eval/save steps: 500
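The hyperparameters above map roughly onto a `transformers` `TrainingArguments` configuration. A sketch under that assumption; `output_dir` is illustrative, and only the values listed in this card are taken from it:

```python
from transformers import TrainingArguments

# Sketch of the training setup listed above; not the exact script used.
training_args = TrainingArguments(
    output_dir="wav2vec2-librispeech-en-finetuned",  # illustrative
    per_device_train_batch_size=8,       # batch size 8
    gradient_accumulation_steps=2,       # effective batch size 8 * 2 = 16
    learning_rate=2e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=500,
    eval_strategy="steps",               # `evaluation_strategy` in older transformers
    eval_steps=500,
    save_steps=500,
)
```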
## Evaluation

- Validation WER: 0.0691
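WER here is the standard word error rate: the word-level edit distance between reference and hypothesis, divided by the number of reference words. In practice a library such as jiwer is typically used; a minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("hello world im john", "hello word im john"))  # 1 error / 4 words -> 0.25
```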
## Intended Use & Limitations
- Intended use: English ASR for 16 kHz speech close to read/clean speech conditions.
- Domain shift: performance may degrade on noisy audio, strong accents, telephone speech, overlapping speakers, or code-switching.
- Output style: because of the CTC + lowercase transcript setup, outputs lack punctuation and capitalization.
- Reliability: transcription errors are possible; human review is recommended for high-stakes use.
## Attribution & Licenses

- License: CC BY 4.0
- Base model facebook/wav2vec2-large-xlsr-53: Apache-2.0
- Dataset openslr/librispeech_asr: CC BY 4.0
References:
- Base model card: https://huggingface.co/facebook/wav2vec2-large-xlsr-53
- Dataset card: https://huggingface.co/datasets/openslr/librispeech_asr
- LibriSpeech source (OpenSLR SLR12): https://openslr.org/12/
## Dataset Citation

```bibtex
@inproceedings{panayotov2015librispeech,
  title={Librispeech: an ASR corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}
```