wav2vec2-commonvoice-ja-finetuned

A fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Japanese automatic speech recognition (ASR), trained on Common Voice Scripted Speech 24.0 - Japanese.

  • Task: Automatic Speech Recognition (CTC)
  • Language: Japanese
  • License: apache-2.0
  • Base model: facebook/wav2vec2-large-xlsr-53
  • Dataset: Common Voice Scripted Speech 24.0 - Japanese
  • Validation CER (best): 0.3107
  • Decoding: greedy CTC, without external language model

Usage

import torch
import librosa
from transformers import AutoModelForCTC, Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

repo_id = "takehika/wav2vec2-commonvoice-ja-finetuned"

# Load the tokenizer and feature extractor, then combine them into one processor.
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(repo_id)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(repo_id)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
model = AutoModelForCTC.from_pretrained(repo_id)

# Load the audio as mono and resample it to 16 kHz.
speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token per frame, then decode to text.
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]
print(text)
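
The argmax-then-decode step above is greedy CTC decoding: consecutive repeated labels are merged and blank labels are dropped. The following is a minimal sketch of that collapse rule in plain Python, not the tokenizer's actual implementation. The function name ctc_greedy_collapse and the blank ID of 0 are assumptions for illustration; the real blank ID depends on this model's vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax IDs into a label sequence (illustrative helper)."""
    # Merge consecutive repeats first, then drop blanks, so that a blank
    # between two identical labels keeps them as two separate outputs.
    merged = [label for label, _ in groupby(frame_ids)]
    return [label for label in merged if label != blank_id]


# Hypothetical frame-level predictions; 0 is assumed to be the blank label.
frames = [0, 7, 7, 0, 0, 7, 3, 3, 0]
print(ctc_greedy_collapse(frames))
```

Merging before dropping blanks matters: it is what lets the model emit the same character twice in a row by separating the two with a blank frame.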

Data

  • Dataset: Common Voice Scripted Speech 24.0 - Japanese

Note: Transcripts were text-normalized with NFKC normalization, removal of selected symbols, and whitespace cleanup.
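
The three normalization steps can be sketched as follows. This is an illustration under assumptions, not the preprocessing script used for this model: the card does not say which symbols were removed, so the SYMBOLS_TO_REMOVE set and the function name normalize_transcript are hypothetical.

```python
import re
import unicodedata

# Hypothetical symbol set; the card does not list the symbols actually removed.
# NFKC runs first, so full-width punctuation has already become ASCII here.
SYMBOLS_TO_REMOVE = re.compile(r"[、。,.!?「」『』()・]")


def normalize_transcript(text):
    """Apply NFKC, strip selected symbols, and tidy whitespace."""
    # NFKC folds compatibility forms, e.g. half-width katakana to full-width
    # and full-width Latin letters and spaces to their ASCII equivalents.
    text = unicodedata.normalize("NFKC", text)
    text = SYMBOLS_TO_REMOVE.sub("", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("ﾃｽﾄ　です。"))
```

If the same normalization is not applied to reference transcripts at evaluation time, CER will be inflated by symbol and width mismatches rather than by recognition errors.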

Training

  • Batch size: 4
  • Gradient accumulation: 2 (effective 8)
  • Learning rate: 2e-5
  • Scheduler: constant_with_warmup
  • Warmup steps: 500
  • Eval/save steps: 500
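
As a worked illustration of these settings, the sketch below computes the effective batch size and the learning rate implied by a constant schedule with linear warmup. It is plain Python rather than the training script; the helper names are hypothetical, a single device is assumed (consistent with the stated effective batch size of 8), and steps are assumed to count optimizer updates.

```python
PER_DEVICE_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 2
BASE_LR = 2e-5
WARMUP_STEPS = 500


def effective_batch_size(num_devices=1):
    """Samples contributing to one optimizer update (single device assumed)."""
    return PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS * num_devices


def lr_at(step):
    """Learning rate at a given update step: linear warmup, then constant."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    return BASE_LR


print(effective_batch_size())
for step in (0, 250, 500, 5000):
    print(step, lr_at(step))
```

Under these assumptions the learning rate ramps linearly during the first 500 updates and then stays at 2e-5 for the rest of training, with no decay.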

Evaluation

  • Validation CER (best): 0.3107 (greedy CTC decoding, no external language model)
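
For reference, character error rate (CER) is the character-level edit distance between hypothesis and reference divided by the number of reference characters. The sketch below shows one common way to compute it, aggregated over a corpus. The card does not state which tool or aggregation produced the reported figure, so this is an assumption, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


print(cer(["こんにちは"], ["こんにちわ"]))
```

By this definition, a CER of 0.3107 corresponds to roughly 31 character edits per 100 reference characters.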

Intended Use & Limitations

  • Intended use: Japanese ASR for 16 kHz speech.
  • Domain shift: performance may degrade for noisy audio, strong dialect/accent variation, telephone speech, or overlapping speakers.
  • Text style: punctuation and formatting may differ from natural written Japanese because of the text normalization applied to transcripts and CTC decoding.
  • Reliability: outputs may include transcription errors; human review is recommended for high-stakes use.

Attribution & Licenses

  • This model: Apache-2.0
  • Base model facebook/wav2vec2-large-xlsr-53: Apache-2.0
  • Dataset (Common Voice Scripted Speech 24.0 - Japanese): CC0-1.0
