---
language:
- ja
license: apache-2.0
base_model: openai/whisper-tiny
tags:
- whisper
- japanese
- asr
- speech-recognition
- lora
- peft
- fine-tuned
library_name: transformers
metrics:
- cer
pipeline_tag: automatic-speech-recognition
datasets:
- reazon-research/reazonspeech
---

# whisper-tiny-ja-lora

A LoRA-finetuned version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) for **Japanese Automatic Speech Recognition (ASR)**, trained on the [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) dataset using Parameter-Efficient Fine-Tuning (PEFT/LoRA).

## Model Details

### Model Description

This model applies Low-Rank Adaptation (LoRA) on top of Whisper Tiny to improve Japanese transcription quality while keeping the number of trainable parameters small. LoRA adapters are merged post-training for easy deployment.

- **Model type:** Automatic Speech Recognition (ASR)
- **Language:** Japanese (ja)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) via PEFT
- **License:** Apache 2.0
- **Developed by:** [dungca](https://huggingface.co/dungca)

### Model Sources

- **Training repository:** [dungca1512/whisper-finetune-ja-train](https://github.com/dungca1512/whisper-finetune-ja-train)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Demo:** [🤗 Try it on Hugging Face Spaces](https://huggingface.co/spaces/dungca/whisper-tiny-ja-lora-demo)

## Uses

### Direct Use

This model is designed for Japanese speech-to-text transcription tasks:

- Transcribing Japanese audio files
- Japanese voice assistants and conversational AI
- Japanese language learning applications (e.g., pronunciation feedback)
- Subtitle generation for Japanese audio/video content

### Out-of-Scope Use

- Non-Japanese speech (the model is fine-tuned specifically for Japanese)
- Real-time streaming ASR in latency-critical production systems (whisper-tiny
architecture may not meet accuracy requirements)

## How to Get Started with the Model

### Load LoRA Adapter (PEFT)

```python
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from peft import PeftModel

# Load base model and processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
model.eval()

# Transcribe audio
def transcribe(audio_array, sampling_rate=16000):
    inputs = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    )
    with torch.no_grad():
        predicted_ids = model.generate(
            inputs["input_features"],
            language="japanese",
            task="transcribe"
        )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

### Quick Inference with Pipeline

```python
from transformers import pipeline, WhisperForConditionalGeneration, AutoProcessor
from peft import PeftModel, PeftConfig

config = PeftConfig.from_pretrained("dungca/whisper-tiny-ja-lora")
base_model = WhisperForConditionalGeneration.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
processor = AutoProcessor.from_pretrained(config.base_model_name_or_path)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language": "japanese", "task": "transcribe"},
)

result = asr("your_audio.wav")
print(result["text"])
```

## Training Details

### Training Data

- **Dataset:** [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) (`small` split)
- **Language:** Japanese (ja)
- ReazonSpeech is a large-scale Japanese speech corpus collected from broadcast TV, covering diverse speaking styles and topics.
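Whisper's feature extractor expects 16 kHz mono input, so broadcast-rate clips (commonly 44.1 or 48 kHz) must be resampled before feature extraction. As a rough sketch of that step, here is a dependency-free linear-interpolation resampler; the function name `resample_linear` is illustrative and not part of the training code, which would typically rely on a proper polyphase resampler (e.g. `torchaudio` or `datasets.Audio` casting) instead:

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a 1-D waveform to target_sr via linear interpolation.

    This is only a stand-in for a real resampler, kept dependency-free
    for illustration; linear interpolation does not band-limit the
    signal and will alias on real audio.
    """
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr              # clip length in seconds
    n_target = int(round(duration * target_sr))  # samples at the new rate
    src_t = np.arange(len(audio)) / orig_sr      # original sample times
    dst_t = np.arange(n_target) / target_sr      # target sample times
    return np.interp(dst_t, src_t, audio).astype(audio.dtype)

# Example: a 1-second 440 Hz tone at 48 kHz becomes 16,000 samples.
tone = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000).astype(np.float32)
resampled = resample_linear(tone, orig_sr=48000)
print(len(resampled))  # 16000
```

After resampling, each clip is passed through the processor shown above to produce the log-mel `input_features` that Whisper consumes.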
### Training Procedure

#### LoRA Configuration

| Parameter | Value |
|---|---|
| `lora_r` | 16 |
| `lora_alpha` | 32 |
| `lora_dropout` | 0.05 |
| `target_modules` | `q_proj`, `v_proj` |

#### Training Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | `1e-5` |
| Batch size | 32 |
| Epochs | ~1.55 (3000 steps) |
| Training regime | fp16 mixed precision |
| Optimizer | AdamW |

#### Infrastructure

| | |
|---|---|
| **Hardware** | Kaggle GPU (NVIDIA P100, 16 GB) |
| **Cloud Provider** | Kaggle (Google Cloud) |
| **Compute Region** | US |
| **Framework** | Transformers + PEFT + Datasets |
| **PEFT version** | 0.18.1 |

### MLOps Pipeline

Training is fully automated via GitHub Actions:

- **CI:** Syntax check and lightweight data validation on every push
- **CT (Continuous Training):** Triggers a Kaggle kernel for LoRA fine-tuning on data/code changes
- **CD:** A quality gate checks CER before promoting the model to the Hugging Face Hub

## Evaluation

### Testing Data

Evaluated on the ReazonSpeech validation split.

### Metrics

- **CER (Character Error Rate):** Lower is better. The standard metric for Japanese ASR, computed at the character level (unlike the word-level WER used for English).

### Results

| Metric | Value |
|---|---|
| **eval/cer** | **0.52497** (~52.5%) |
| eval/loss | 1.17656 |
| eval/runtime | 162.422 s |
| eval/samples_per_second | 12.314 |
| eval/steps_per_second | 0.770 |
| train/global_step | 3000 |
| train/epoch | 1.547 |
| train/grad_norm | 2.161 |

> **Note:** A CER of ~52.5% reflects the constraints of `whisper-tiny` (39M parameters) trained on a small data subset. A follow-up experiment with `whisper-small` and extended training is in progress and expected to significantly reduce CER.

## Bias, Risks, and Limitations

- **Model size:** Whisper Tiny is optimized for speed and efficiency, not peak accuracy. Expect higher error rates on noisy audio, accented speech, or domain-specific vocabulary.
- **Training data scope:** Trained on broadcast Japanese; may perform worse on conversational or dialectal Japanese.
- **CER baseline:** The current CER reflects an early training checkpoint. Further training epochs and a larger model size (`whisper-small`) are expected to improve results.

### Recommendations

For production use cases requiring high accuracy, consider using [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) or waiting for the upcoming `whisper-small-ja-lora` checkpoint.

## Citation

If you use this model, please cite the base Whisper model and the LoRA/PEFT method:

```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

@misc{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  year={2021},
  eprint={2106.09685},
  archivePrefix={arXiv}
}
```

### Framework Versions

- PEFT: 0.18.1
- Transformers: ≥4.36.0
- PyTorch: ≥2.0.0
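For reference, the CER figure reported in the Evaluation section is a character-level Levenshtein distance divided by the reference length, and can be reproduced at small scale without any library. The `cer` helper below is a minimal illustrative sketch; the evaluation pipeline's actual implementation may differ (e.g. a library metric):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance over reference length.

    Counts character-level insertions, deletions, and substitutions,
    normalized by the number of reference characters. An empty reference
    returns 0.0 here purely to avoid division by zero.
    """
    m, n = len(reference), len(hypothesis)
    # Row-by-row dynamic programming over the edit-distance matrix.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / m if m > 0 else 0.0

# One substituted character out of five: CER = 1/5.
print(cer("こんにちは", "こんにちわ"))  # 0.2
```

Because Japanese text is not whitespace-delimited, this character-level normalization is why CER, rather than WER, is the metric reported above.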