---
language:
- ja
license: apache-2.0
base_model: openai/whisper-tiny
tags:
- whisper
- japanese
- asr
- speech-recognition
- lora
- peft
- fine-tuned
library_name: transformers
metrics:
- cer
pipeline_tag: automatic-speech-recognition
datasets:
- reazon-research/reazonspeech
---

# whisper-tiny-ja-lora

A LoRA-finetuned version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) for **Japanese Automatic Speech Recognition (ASR)**, trained on the [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) dataset using Parameter-Efficient Fine-Tuning (PEFT/LoRA).

## Model Details

### Model Description

This model applies Low-Rank Adaptation (LoRA) on top of Whisper Tiny to improve Japanese transcription quality while keeping the number of trainable parameters small. The LoRA adapters can be merged into the base weights after training for easy deployment.

- **Model type:** Automatic Speech Recognition (ASR)
- **Language:** Japanese (ja)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) via PEFT
- **License:** Apache 2.0
- **Developed by:** [dungca](https://huggingface.co/dungca)

### Model Sources

- **Training repository:** [dungca1512/whisper-finetune-ja-train](https://github.com/dungca1512/whisper-finetune-ja-train)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Demo:** [🤗 Try it on Hugging Face Spaces](https://huggingface.co/spaces/dungca/whisper-tiny-ja-lora-demo)

## Uses

### Direct Use

This model is designed for Japanese speech-to-text transcription tasks:

- Transcribing Japanese audio files
- Japanese voice assistants and conversational AI
- Japanese language-learning applications (e.g., pronunciation feedback)
- Subtitle generation for Japanese audio/video content

### Out-of-Scope Use

- Non-Japanese speech (the model is fine-tuned specifically for Japanese)
- Real-time streaming ASR in latency-critical production systems (the whisper-tiny checkpoint may not meet accuracy requirements)

## How to Get Started with the Model

### Load LoRA Adapter (PEFT)

```python
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from peft import PeftModel

# Load the base model and processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
model.eval()

# Transcribe a 16 kHz mono audio array
def transcribe(audio_array, sampling_rate=16000):
    inputs = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt",
    )
    with torch.no_grad():
        predicted_ids = model.generate(
            inputs["input_features"],
            language="japanese",
            task="transcribe",
        )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

### Quick Inference with Pipeline

```python
from transformers import AutoProcessor, WhisperForConditionalGeneration, pipeline
from peft import PeftConfig, PeftModel

# Resolve the base model from the adapter's config
config = PeftConfig.from_pretrained("dungca/whisper-tiny-ja-lora")
base_model = WhisperForConditionalGeneration.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")

processor = AutoProcessor.from_pretrained(config.base_model_name_or_path)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language": "japanese", "task": "transcribe"},
)

result = asr("your_audio.wav")
print(result["text"])
```

## Training Details

### Training Data

- **Dataset:** [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) (`small` split)
- **Language:** Japanese (ja)
- ReazonSpeech is a large-scale Japanese speech corpus collected from broadcast TV, covering diverse speaking styles and topics.

### Training Procedure

#### LoRA Configuration

| Parameter | Value |
|---|---|
| `lora_r` | 16 |
| `lora_alpha` | 32 |
| `lora_dropout` | 0.05 |
| `target_modules` | `q_proj`, `v_proj` |
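
The configuration above maps onto the standard LoRA update: for each targeted projection weight `W`, two low-rank factors are trained and their product is added with scale `lora_alpha / lora_r = 32 / 16 = 2`. The sketch below is illustrative math, not the training code; the 384-dimensional projection size is an assumption based on whisper-tiny's hidden size.

```python
import numpy as np

# Illustrative LoRA update (not the training code). A (r x d_in) gets a small
# random init, B (d_out x r) starts at zero, so the adapter is a no-op at step 0.
rng = np.random.default_rng(0)
d_in = d_out = 384            # assumption: whisper-tiny attention projection size
r, alpha = 16, 32             # lora_r, lora_alpha from the table above
scale = alpha / r             # = 2.0

W = rng.normal(size=(d_out, d_in))     # frozen base projection weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # trainable, zero-initialized

x = rng.normal(size=(d_in,))

def lora_forward(x):
    # Base projection plus the scaled low-rank update
    return W @ x + scale * (B @ (A @ x))

# At initialization (B = 0) the adapter changes nothing:
assert np.allclose(lora_forward(x), W @ x)

# Merging for deployment folds the update into W once:
W_merged = W + scale * (B @ A)
assert np.allclose(W_merged @ x, lora_forward(x))
```

The merge step at the end is why the adapter can be folded into the base weights with no inference-time overhead.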

#### Training Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | `1e-5` |
| Batch size | 32 |
| Epochs | ~1.55 (3000 steps) |
| Training regime | fp16 mixed precision |
| Optimizer | AdamW |

#### Infrastructure

| | |
|---|---|
| **Hardware** | Kaggle GPU (NVIDIA P100, 16 GB) |
| **Cloud Provider** | Kaggle (Google Cloud) |
| **Compute Region** | US |
| **Framework** | Transformers + PEFT + Datasets |
| **PEFT version** | 0.18.1 |

### MLOps Pipeline

Training is fully automated via GitHub Actions:

- **CI:** Syntax checks plus lightweight data validation on every push
- **CT (Continuous Training):** Triggers a Kaggle kernel for LoRA fine-tuning on data/code changes
- **CD:** A quality gate checks CER before promoting the model to the Hugging Face Hub
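
A CD quality gate of this kind reduces to a CER comparison before pushing a checkpoint. The sketch below is a minimal illustration; the function name, threshold, and numbers are hypothetical, not taken from the training repository:

```python
# Minimal sketch of a CD quality gate (illustrative only): promote the new
# checkpoint only if its CER does not regress past the current baseline.
def passes_quality_gate(new_cer: float, baseline_cer: float, max_regression: float = 0.0) -> bool:
    """Return True if the new model's CER is no worse than baseline + max_regression."""
    return new_cer <= baseline_cer + max_regression

# Hypothetical example values:
assert passes_quality_gate(new_cer=0.525, baseline_cer=0.60)      # improvement: promote
assert not passes_quality_gate(new_cer=0.70, baseline_cer=0.60)   # regression: block
```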

## Evaluation

### Testing Data

Evaluated on the ReazonSpeech validation split.

### Metrics

- **CER (Character Error Rate):** Lower is better. CER is the standard metric for Japanese ASR: Japanese text has no whitespace word boundaries, so the word-level WER commonly used for English is ill-defined.
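
Concretely, CER is the character-level Levenshtein (edit) distance between the hypothesis and the reference, divided by the reference length. A self-contained sketch (libraries such as `jiwer` or `evaluate` provide the same metric):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming Levenshtein distance over characters
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m > 0 else float(n > 0)

print(cer("こんにちは", "こんにちは"))  # 0.0
print(cer("こんにちは", "こんにちわ"))  # 0.2 (1 substitution / 5 characters)
```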

### Results

| Metric | Value |
|---|---|
| **eval/cer** | **0.52497** (~52.5%) |
| eval/loss | 1.17656 |
| eval/runtime | 162.422 s |
| eval/samples_per_second | 12.314 |
| eval/steps_per_second | 0.770 |
| train/global_step | 3000 |
| train/epoch | 1.547 |
| train/grad_norm | 2.161 |

> **Note:** A CER of ~52.5% reflects the constraints of `whisper-tiny` (39M parameters) trained on a small data subset. A follow-up experiment with `whisper-small` and extended training is in progress and is expected to reduce CER significantly.

## Bias, Risks, and Limitations

- **Model size:** Whisper Tiny is optimized for speed and efficiency, not peak accuracy. Expect higher error rates on noisy audio, accented speech, and domain-specific vocabulary.
- **Training data scope:** Trained on broadcast Japanese; performance may degrade on conversational or dialectal Japanese.
- **CER baseline:** The current CER reflects an early training checkpoint. Additional training epochs and a larger base model (`whisper-small`) are expected to improve results.

### Recommendations

For production use cases requiring high accuracy, consider using [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) or waiting for the upcoming `whisper-small-ja-lora` checkpoint.

## Citation

If you use this model, please cite the base Whisper model and the LoRA/PEFT method:

```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

@misc{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  year={2021},
  eprint={2106.09685},
  archivePrefix={arXiv}
}
```

### Framework Versions

- PEFT: 0.18.1
- Transformers: ≥4.36.0
- PyTorch: ≥2.0.0