wav2vec2-commonvoice-ja-finetuned
Fine-tuned facebook/wav2vec2-large-xlsr-53 for Japanese ASR using Common Voice Scripted Speech 24.0 - Japanese.
- Task: Automatic Speech Recognition (CTC)
- Language: Japanese
- License: apache-2.0
- Base model: facebook/wav2vec2-large-xlsr-53
- Dataset: Common Voice Scripted Speech 24.0 - Japanese
- Validation CER (best): 0.3107
- Decoding: greedy CTC, without external language model
Usage
import torch
import librosa
from transformers import AutoModelForCTC, Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

repo_id = "takehika/wav2vec2-commonvoice-ja-finetuned"

# Build the processor from the tokenizer and feature extractor stored in the repo.
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(repo_id)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(repo_id)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
model = AutoModelForCTC.from_pretrained(repo_id)

# Load the audio as mono and resample it to 16 kHz.
speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

# Greedy CTC decoding: pick the most likely token per frame, then decode.
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]
print(text)
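To make the greedy CTC decoding step concrete, here is a minimal pure-Python sketch of what happens after the per-frame argmax: consecutive repeated ids are collapsed, then blank tokens are dropped. The vocabulary, the `blank_id` value, and the function name `greedy_ctc_decode` are illustrative assumptions. The actual tokenizer in the repository also handles special tokens and word delimiters.

```python
from itertools import groupby

def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blanks."""
    # Step 1: merge runs of identical ids (e.g. [1, 1, 0, 2, 2] -> [1, 0, 2]).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank id and map the rest to characters.
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)

# Hypothetical toy vocabulary, not the vocabulary shipped with the model.
toy_vocab = {0: "<pad>", 1: "こ", 2: "ん", 3: "に", 4: "ち", 5: "は"}
frames = [1, 1, 0, 2, 2, 2, 0, 3, 4, 4, 0, 5]
decoded = greedy_ctc_decode(frames, toy_vocab)
print(decoded)
```

A blank between two identical ids keeps them as separate characters, which is how CTC can emit doubled characters.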
Data
- Dataset: Common Voice Scripted Speech 24.0 - Japanese
- Notes: Text normalization was applied (NFKC normalization, selected symbol removal, whitespace cleanup).
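The normalization steps named above could look roughly like the following sketch. The card does not say which symbols were removed or how whitespace was handled, so the `SYMBOLS_TO_REMOVE` set, the choice to collapse whitespace runs to a single space, and the function name `normalize_transcript` are assumptions for illustration only.

```python
import re
import unicodedata

# Hypothetical symbol set; the card does not list the symbols actually removed.
SYMBOLS_TO_REMOVE = re.compile(r"[「」『』、。・!?]")

def normalize_transcript(text):
    # NFKC folds full-width Latin letters, half-width katakana, and
    # ideographic spaces into their canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop the selected punctuation symbols.
    text = SYMBOLS_TO_REMOVE.sub("", text)
    # Collapse runs of whitespace and trim the ends (assumed cleanup rule).
    return re.sub(r"\s+", " ", text).strip()

example = normalize_transcript("「ＡＢＣ」　は　ﾃｽﾄです。")
print(example)
```

Applying the same normalization to reference transcripts at evaluation time keeps CER comparable with the reported value.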
Training
- Batch size: 4
- Gradient accumulation: 2 (effective batch size: 8)
- Learning rate: 2e-5
- Scheduler: constant_with_warmup
- Warmup steps: 500
- Eval/save steps: 500
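As a rough illustration of these settings, the sketch below computes the effective batch size and the learning rate that a constant-with-warmup schedule would produce at a given optimizer step. It assumes a single device and a linear ramp from 0 to the base rate during warmup. The function name `lr_at_step` is hypothetical and this is not the training script used for the model.

```python
PER_DEVICE_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 2

# Effective batch size on a single device (assumption: no data parallelism).
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

def lr_at_step(step, base_lr=2e-5, warmup_steps=500):
    """Linear warmup from 0 to base_lr, then constant at base_lr."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr

for step in (0, 250, 500, 5000):
    print(step, lr_at_step(step))
```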
Evaluation
- Validation CER: 0.3107
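For readers unfamiliar with the metric, character error rate is the total character-level edit distance between hypotheses and references, divided by the total number of reference characters. The sketch below implements that standard definition in plain Python. The card does not state which tool computed the reported value, so treat this as an illustration and not as the evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars

print(cer(["こんにちは"], ["こんばんは"]))
```

Under this definition, a CER of 0.3107 corresponds to roughly 31 character edits per 100 reference characters.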
Intended Use & Limitations
- Intended use: Japanese ASR for 16 kHz speech.
- Domain shift: performance may degrade for noisy audio, strong dialect/accent variation, telephone speech, or overlapping speakers.
- Text style: punctuation handling and formatting may differ from natural writing due to CTC-style decoding and preprocessing.
- Reliability: outputs may include transcription errors; human review is recommended for high-stakes use.
Attribution & Licenses
- License: Apache-2.0
- Base model (facebook/wav2vec2-large-xlsr-53): Apache-2.0
- Dataset (Common Voice Scripted Speech 24.0 - Japanese): CC0-1.0
References:
- Base model card: https://huggingface.co/facebook/wav2vec2-large-xlsr-53
- Common Voice datasets portal: https://datacollective.mozillafoundation.org/datasets