# whisper-tiny-ja-lora
A LoRA-finetuned version of openai/whisper-tiny for Japanese Automatic Speech Recognition (ASR), trained on the ReazonSpeech dataset using Parameter-Efficient Fine-Tuning (PEFT/LoRA).
## Model Details

### Model Description
This model applies Low-Rank Adaptation (LoRA) on top of Whisper Tiny to improve Japanese transcription quality while keeping the number of trainable parameters small. The weights are published as a PEFT adapter and are loaded on top of the base model at inference time (see usage below).
- Model type: Automatic Speech Recognition (ASR)
- Language: Japanese (ja)
- Base model: openai/whisper-tiny
- Fine-tuning method: LoRA (Low-Rank Adaptation) via PEFT
- License: Apache 2.0
- Developed by: dungca
### Model Sources
- Training repository: dungca1512/whisper-finetune-ja-train
- Base model: openai/whisper-tiny
- Demo: 🤗 Try it on Hugging Face Spaces
## Uses

### Direct Use
This model is designed for Japanese speech-to-text transcription tasks:
- Transcribing Japanese audio files
- Japanese voice assistants and conversational AI
- Japanese language learning applications (e.g., pronunciation feedback)
- Subtitle generation for Japanese audio/video content
### Out-of-Scope Use
- Non-Japanese speech (model is fine-tuned specifically for Japanese)
- Real-time streaming ASR in latency-critical production systems (the `whisper-tiny` architecture may not meet accuracy requirements at this checkpoint's error rate)
## How to Get Started with the Model

### Load LoRA Adapter (PEFT)

```python
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from peft import PeftModel

# Load base model and processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
model.eval()

# Transcribe a 16 kHz mono waveform
def transcribe(audio_array, sampling_rate=16000):
    inputs = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt",
    )
    with torch.no_grad():
        predicted_ids = model.generate(
            inputs["input_features"],
            language="japanese",
            task="transcribe",
        )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```
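Note that Whisper's feature extractor pads or truncates each call to a 30-second window, so longer recordings should be split into chunks before calling `transcribe`. A minimal sketch of computing chunk boundaries, assuming 16 kHz mono samples (the function name and fixed-window strategy are illustrative, not part of this model's API):

```python
def chunk_boundaries(n_samples: int, sr: int = 16000, window_s: int = 30):
    """Return (start, end) sample indices for consecutive 30-second chunks."""
    step = sr * window_s
    return [(start, min(start + step, n_samples)) for start in range(0, n_samples, step)]

# A 70-second recording at 16 kHz splits into 30 s + 30 s + 10 s chunks:
print(chunk_boundaries(70 * 16000))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each chunk can then be passed to `transcribe` and the results concatenated; for overlap-aware chunking, the `transformers` pipeline's `chunk_length_s` argument handles this automatically.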
### Quick Inference with Pipeline

```python
from transformers import AutoProcessor, WhisperForConditionalGeneration, pipeline
from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained("dungca/whisper-tiny-ja-lora")
base_model = WhisperForConditionalGeneration.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
processor = AutoProcessor.from_pretrained(config.base_model_name_or_path)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language": "japanese", "task": "transcribe"},
)

result = asr("your_audio.wav")
print(result["text"])
```
## Training Details

### Training Data

- Dataset: ReazonSpeech (`small` split)
- Language: Japanese (ja)
- ReazonSpeech is a large-scale Japanese speech corpus collected from broadcast TV, covering diverse speaking styles and topics.
### Training Procedure

#### LoRA Configuration

| Parameter | Value |
|---|---|
| `lora_r` | 16 |
| `lora_alpha` | 32 |
| `lora_dropout` | 0.05 |
| `target_modules` | `q_proj`, `v_proj` |
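With this configuration, a back-of-the-envelope count of trainable parameters is possible. The sketch below assumes whisper-tiny's architecture (`d_model=384`, 4 encoder and 4 decoder layers) and that the `q_proj`/`v_proj` name patterns match self-attention in both stacks plus decoder cross-attention, which is PEFT's default substring-matching behavior:

```python
# Rough count of trainable LoRA parameters for whisper-tiny
# with r=16 and target_modules=["q_proj", "v_proj"].
d_model = 384
r = 16

# Each adapted square matrix adds lora_A (r x d) + lora_B (d x r) parameters.
per_matrix = 2 * r * d_model

# Matched matrices: 4 encoder self-attn layers x (q, v)
# plus 4 decoder layers x (self-attn + cross-attn) x (q, v).
n_matrices = (4 * 2) + (4 * 2 * 2)

trainable = per_matrix * n_matrices
print(trainable)  # 294912 -- under 1% of the ~39M base parameters
```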
#### Training Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 1e-5 |
| Batch size | 32 |
| Epochs | ~1.55 (3000 steps) |
| Training regime | fp16 mixed precision |
| Optimizer | AdamW |
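The "~1.55 epochs in 3000 steps" figure lets us back-calculate the effective training-set size, assuming no gradient accumulation (an assumption, since accumulation steps are not stated above):

```python
steps = 3000
batch_size = 32
epochs = 1.547  # from train/epoch in the results table

examples_seen = steps * batch_size      # 96,000 examples processed in total
dataset_size = examples_seen / epochs   # examples per epoch, roughly 62,000
print(round(dataset_size))
```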
#### Infrastructure

| Component | Value |
|---|---|
| Hardware | Kaggle GPU (NVIDIA P100, 16 GB) |
| Cloud Provider | Kaggle (Google Cloud) |
| Compute Region | US |
| Framework | Transformers + PEFT + Datasets |
| PEFT version | 0.18.1 |
### MLOps Pipeline
Training is fully automated via GitHub Actions:
- CI: Syntax check + lightweight data validation on every push
- CT (Continuous Training): Triggers Kaggle kernel for LoRA fine-tuning on data/code changes
- CD: Quality gate checks CER before promoting model to HuggingFace Hub
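The CD quality gate reduces to a threshold check on the evaluated CER. A minimal sketch, where the 0.60 threshold, metric key, and function name are all hypothetical illustrations rather than the actual pipeline code:

```python
def passes_quality_gate(metrics: dict, max_cer: float = 0.60) -> bool:
    """Return True if the evaluated CER is at or below the promotion threshold."""
    return metrics.get("eval/cer", float("inf")) <= max_cer

# The current checkpoint (CER 0.52497) would pass a 0.60 gate
# but fail a stricter 0.50 gate:
print(passes_quality_gate({"eval/cer": 0.52497}))                # True
print(passes_quality_gate({"eval/cer": 0.52497}, max_cer=0.50))  # False
```

Missing metrics default to infinity, so a run that never produced an evaluation result is rejected rather than silently promoted.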
## Evaluation

### Testing Data

Evaluated on the ReazonSpeech validation split.

### Metrics

- CER (Character Error Rate): lower is better. CER is the standard metric for Japanese ASR because Japanese text lacks whitespace word boundaries, making character-level comparison more reliable than the word-level WER used for English.
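In practice CER is usually computed with libraries such as `evaluate` or `jiwer`; for clarity, here is a minimal pure-Python sketch via character-level Levenshtein distance (edits divided by reference length):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n] / m

# One substitution over five characters:
print(cer("こんにちは", "こんにちわ"))  # 0.2
```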
### Results
| Metric | Value |
|---|---|
| eval/cer | 0.52497 (~52.5%) |
| eval/loss | 1.17656 |
| eval/runtime | 162.422s |
| eval/samples_per_second | 12.314 |
| eval/steps_per_second | 0.770 |
| train/global_step | 3000 |
| train/epoch | 1.547 |
| train/grad_norm | 2.161 |
Note: The CER of ~52.5% reflects the constraints of `whisper-tiny` (39M parameters) trained on a small data subset. A follow-up experiment with `whisper-small` and extended training is in progress and is expected to significantly reduce CER.
## Bias, Risks, and Limitations
- Model size: Whisper Tiny is optimized for speed and efficiency, not peak accuracy. Expect higher error rates on noisy audio, accented speech, or domain-specific vocabulary.
- Training data scope: Trained on broadcast Japanese; may perform worse on conversational or dialectal Japanese.
- CER baseline: The current CER reflects an early training checkpoint. Further training epochs and a larger model (`whisper-small`) are expected to improve results.
### Recommendations
For production use cases requiring high accuracy, consider using openai/whisper-large-v3 or waiting for the upcoming whisper-small-ja-lora checkpoint.
## Citation
If you use this model, please cite the base Whisper model and the LoRA/PEFT method:
```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

@misc{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  year={2021},
  eprint={2106.09685},
  archivePrefix={arXiv}
}
```
## Framework Versions
- PEFT: 0.18.1
- Transformers: ≥4.36.0
- PyTorch: ≥2.0.0