wav2vec2-librispeech-en-finetuned

A fine-tuned version of facebook/wav2vec2-large-xlsr-53 for English ASR, trained on the LibriSpeech clean train.360 split (openslr/librispeech_asr).

  • Task: Automatic Speech Recognition (CTC)
  • Language: English
  • License: CC BY 4.0
  • Base model: facebook/wav2vec2-large-xlsr-53
  • Dataset: openslr/librispeech_asr
  • Validation WER (best): 0.0691
  • This model is uncased: outputs are lowercased and generally do not restore punctuation or capitalization.

Usage

import torch
import librosa
from transformers import AutoProcessor, AutoModelForCTC

repo_id = "takehika/wav2vec2-librispeech-en-finetuned"
processor = AutoProcessor.from_pretrained(repo_id)
model = AutoModelForCTC.from_pretrained(repo_id)

# The model expects 16 kHz mono audio; librosa resamples on load.
speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: take the most likely token per frame;
# batch_decode collapses repeats and removes blanks.
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]
print(text)
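The batch_decode call above hides the CTC decoding rule: collapse consecutive repeated tokens, then drop blanks. A minimal pure-Python sketch of that rule on a toy frame sequence (illustrative only; the real vocabulary and logits come from the processor and model):

import_free = True  # no imports needed for this toy example

BLANK = "<pad>"  # wav2vec2 CTC vocabularies use <pad> as the blank token

def ctc_greedy_collapse(frame_tokens):
    """Collapse consecutive repeats, then remove blank tokens."""
    out = []
    prev = None
    for tok in frame_tokens:
        if tok != prev and tok != BLANK:  # keep first of each run, skip blanks
            out.append(tok)
        prev = tok
    return "".join(out).replace("|", " ")  # wav2vec2 uses "|" as word delimiter

frames = ["h", "h", BLANK, "i", "i", "|", BLANK, "h", "i"]
print(ctc_greedy_collapse(frames))  # hi hi

This is why CTC output has no inherent punctuation or casing: the model only emits characters that appear in its (lowercase) training vocabulary.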

Input vs Output Example

  • Spoken content: HELLO, WORLD. I'M JOHN.
  • Model output style: hello world im john

Data

  • Dataset: openslr/librispeech_asr
  • Config: clean
  • Train: train.360 (104,014 samples)
  • Validation: validation (2,703 samples)
  • Test: test (2,620 samples)

Training

  • Batch size: 8
  • Gradient accumulation: 2 (effective batch size 16)
  • Learning rate: 2e-5
  • Scheduler: constant_with_warmup
  • Warmup steps: 500
  • Eval/save steps: 500
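The settings above imply 8 × 2 = 16 samples per optimizer step, and a learning rate that ramps linearly to 2e-5 over the first 500 steps, then stays flat. A pure-Python sketch of that constant_with_warmup schedule (not the Trainer's exact implementation, but the same shape):

BATCH_SIZE = 8
GRAD_ACCUM = 2
PEAK_LR = 2e-5
WARMUP_STEPS = 500

effective_batch = BATCH_SIZE * GRAD_ACCUM  # 16 samples per optimizer step

def lr_at(step):
    """constant_with_warmup: linear ramp to PEAK_LR, then constant."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR

print(effective_batch)  # 16
print(lr_at(250))       # halfway through warmup: 1e-05
print(lr_at(10_000))    # after warmup: 2e-05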

Evaluation

  • Validation WER: 0.0691
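A WER of 0.0691 means roughly 7 word errors (substitutions, insertions, and deletions combined) per 100 reference words. Libraries such as jiwer or evaluate compute this for you; a self-contained sketch of the standard definition:

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("hello world im john", "hello word im john"))  # 0.25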

Intended Use & Limitations

  • Intended use: English ASR for 16 kHz speech close to read/clean speech conditions.
  • Domain shift: performance may degrade on noisy audio, strong accents, telephone speech, overlapping speakers, or code-switching.
  • Output style: the model is trained with CTC on lowercased, unpunctuated transcripts, so outputs contain no capitalization or punctuation.
  • Reliability: transcription errors are possible; human review is recommended for high-stakes use.

Attribution & Licenses

  • License: CC BY 4.0
  • Base model facebook/wav2vec2-large-xlsr-53: Apache-2.0
  • Dataset openslr/librispeech_asr: CC BY 4.0

Dataset Citation

@inproceedings{panayotov2015librispeech,
  title={Librispeech: an ASR corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}
Model size: ~0.3B parameters (F32, safetensors)