# whisper-tiny-ja-lora
A LoRA-finetuned version of openai/whisper-tiny for Japanese Automatic Speech Recognition (ASR), trained on the ReazonSpeech dataset using Parameter-Efficient Fine-Tuning (PEFT/LoRA).
## Model Details

### Model Description
This model applies Low-Rank Adaptation (LoRA) on top of Whisper Tiny to improve Japanese transcription quality while keeping the number of trainable parameters small. The weights are published as a PEFT adapter and are loaded on top of the base model at inference time (see usage below).
- Model type: Automatic Speech Recognition (ASR)
- Language: Japanese (ja)
- Base model: openai/whisper-tiny
- Fine-tuning method: LoRA (Low-Rank Adaptation) via PEFT
- License: Apache 2.0
- Developed by: dungca
### Model Sources
- Training repository: dungca1512/whisper-finetune-ja-train
- Base model: openai/whisper-tiny
- Demo: 🤗 Try it on Hugging Face Spaces
## Uses

### Direct Use
This model is designed for Japanese speech-to-text transcription tasks:
- Transcribing Japanese audio files
- Japanese voice assistants and conversational AI
- Japanese language learning applications (e.g., pronunciation feedback)
- Subtitle generation for Japanese audio/video content
### Out-of-Scope Use
- Non-Japanese speech (model is fine-tuned specifically for Japanese)
- Real-time streaming ASR in latency-critical production systems (the `whisper-tiny` architecture may not meet accuracy requirements at this checkpoint's error rate)
## How to Get Started with the Model

### Load LoRA Adapter (PEFT)

```python
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from peft import PeftModel

# Load base model and processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
model.eval()

# Transcribe a 16 kHz mono waveform
def transcribe(audio_array, sampling_rate=16000):
    inputs = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt",
    )
    with torch.no_grad():
        predicted_ids = model.generate(
            inputs["input_features"],
            language="japanese",
            task="transcribe",
        )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```
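Note that Whisper's feature extractor pads or truncates each call to a 30-second window, so longer recordings should be split into chunks before calling `transcribe`. A minimal sketch of computing chunk boundaries, assuming 16 kHz mono samples (the function name and fixed-window strategy are illustrative, not part of this model's API):

```python
def chunk_boundaries(n_samples: int, sr: int = 16000, window_s: int = 30):
    """Return (start, end) sample indices for consecutive 30-second chunks."""
    step = sr * window_s
    return [(start, min(start + step, n_samples)) for start in range(0, n_samples, step)]

# A 70-second recording at 16 kHz splits into 30 s + 30 s + 10 s chunks:
print(chunk_boundaries(70 * 16000))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each chunk can then be passed to `transcribe` and the results concatenated; for overlap-aware chunking, the `transformers` pipeline's `chunk_length_s` argument handles this automatically.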
### Quick Inference with Pipeline

```python
from transformers import AutoProcessor, WhisperForConditionalGeneration, pipeline
from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained("dungca/whisper-tiny-ja-lora")
base_model = WhisperForConditionalGeneration.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
processor = AutoProcessor.from_pretrained(config.base_model_name_or_path)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language": "japanese", "task": "transcribe"},
)

result = asr("your_audio.wav")
print(result["text"])
```
## Training Details

### Training Data

- Dataset: ReazonSpeech (`small` split)
- Language: Japanese (ja)
- ReazonSpeech is a large-scale Japanese speech corpus collected from broadcast TV, covering diverse speaking styles and topics.
### Training Procedure

#### LoRA Configuration

| Parameter | Value |
|---|---|
| `lora_r` | 16 |
| `lora_alpha` | 32 |
| `lora_dropout` | 0.05 |
| `target_modules` | `q_proj`, `v_proj` |
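With this configuration, a back-of-the-envelope count of trainable parameters is possible. The sketch below assumes whisper-tiny's architecture (`d_model=384`, 4 encoder and 4 decoder layers) and that the `q_proj`/`v_proj` name patterns match self-attention in both stacks plus decoder cross-attention, which is PEFT's default substring-matching behavior:

```python
# Rough count of trainable LoRA parameters for whisper-tiny
# with r=16 and target_modules=["q_proj", "v_proj"].
d_model = 384
r = 16

# Each adapted square matrix adds lora_A (r x d) + lora_B (d x r) parameters.
per_matrix = 2 * r * d_model

# Matched matrices: 4 encoder self-attn layers x (q, v)
# plus 4 decoder layers x (self-attn + cross-attn) x (q, v).
n_matrices = (4 * 2) + (4 * 2 * 2)

trainable = per_matrix * n_matrices
print(trainable)  # 294912 -- under 1% of the ~39M base parameters
```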
#### Training Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 1e-5 |
| Batch size | 32 |
| Epochs | ~1.55 (3000 steps) |
| Training regime | fp16 mixed precision |
| Optimizer | AdamW |
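The "~1.55 epochs in 3000 steps" figure lets us back-calculate the effective training-set size, assuming no gradient accumulation (an assumption, since accumulation steps are not stated above):

```python
steps = 3000
batch_size = 32
epochs = 1.547  # from train/epoch in the results table

examples_seen = steps * batch_size      # 96,000 examples processed in total
dataset_size = examples_seen / epochs   # examples per epoch, roughly 62,000
print(round(dataset_size))
```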
#### Infrastructure

| Component | Value |
|---|---|
| Hardware | Kaggle GPU (NVIDIA P100, 16 GB) |
| Cloud Provider | Kaggle (Google Cloud) |
| Compute Region | US |
| Framework | Transformers + PEFT + Datasets |
| PEFT version | 0.18.1 |
### MLOps Pipeline
Training is fully automated via GitHub Actions:
- CI: Syntax check + lightweight data validation on every push
- CT (Continuous Training): Triggers Kaggle kernel for LoRA fine-tuning on data/code changes
- CD: Quality gate checks CER before promoting model to HuggingFace Hub
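The CD quality gate reduces to a threshold check on the evaluated CER. A minimal sketch, where the 0.60 threshold, metric key, and function name are all hypothetical illustrations rather than the actual pipeline code:

```python
def passes_quality_gate(metrics: dict, max_cer: float = 0.60) -> bool:
    """Return True if the evaluated CER is at or below the promotion threshold."""
    return metrics.get("eval/cer", float("inf")) <= max_cer

# The current checkpoint (CER 0.52497) would pass a 0.60 gate
# but fail a stricter 0.50 gate:
print(passes_quality_gate({"eval/cer": 0.52497}))                # True
print(passes_quality_gate({"eval/cer": 0.52497}, max_cer=0.50))  # False
```

Missing metrics default to infinity, so a run that never produced an evaluation result is rejected rather than silently promoted.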
## Evaluation

### Testing Data

Evaluated on the ReazonSpeech validation split.

### Metrics

- CER (Character Error Rate): lower is better. CER is the standard metric for Japanese ASR because Japanese text lacks whitespace word boundaries, making character-level comparison more reliable than the word-level WER used for English.
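In practice CER is usually computed with libraries such as `evaluate` or `jiwer`; for clarity, here is a minimal pure-Python sketch via character-level Levenshtein distance (edits divided by reference length):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n] / m

# One substitution over five characters:
print(cer("こんにちは", "こんにちわ"))  # 0.2
```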
### Results
| Metric | Value |
|---|---|
| eval/cer | 0.52497 (~52.5%) |
| eval/loss | 1.17656 |
| eval/runtime | 162.422s |
| eval/samples_per_second | 12.314 |
| eval/steps_per_second | 0.770 |
| train/global_step | 3000 |
| train/epoch | 1.547 |
| train/grad_norm | 2.161 |
Note: The CER of ~52.5% reflects the constraints of `whisper-tiny` (39M parameters) trained on a small data subset. A follow-up experiment with `whisper-small` and extended training is in progress and is expected to significantly reduce CER.
## Bias, Risks, and Limitations
- Model size: Whisper Tiny is optimized for speed and efficiency, not peak accuracy. Expect higher error rates on noisy audio, accented speech, or domain-specific vocabulary.
- Training data scope: Trained on broadcast Japanese; may perform worse on conversational or dialectal Japanese.
- CER baseline: The current CER reflects an early training checkpoint. Further training epochs and a larger model (`whisper-small`) are expected to improve results.
### Recommendations
For production use cases requiring high accuracy, consider using openai/whisper-large-v3 or waiting for the upcoming whisper-small-ja-lora checkpoint.
## Citation
If you use this model, please cite the base Whisper model and the LoRA/PEFT method:
```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

@misc{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  year={2021},
  eprint={2106.09685},
  archivePrefix={arXiv}
}
```
## Framework Versions
- PEFT: 0.18.1
- Transformers: ≥4.36.0
- PyTorch: ≥2.0.0