---
language:
- ja
license: apache-2.0
base_model: openai/whisper-tiny
tags:
- whisper
- japanese
- asr
- speech-recognition
- lora
- peft
- fine-tuned
library_name: transformers
metrics:
- cer
pipeline_tag: automatic-speech-recognition
datasets:
- reazon-research/reazonspeech
---
# whisper-tiny-ja-lora
A LoRA-finetuned version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) for **Japanese Automatic Speech Recognition (ASR)**, trained on the [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) dataset using Parameter-Efficient Fine-Tuning (PEFT/LoRA).
## Model Details
### Model Description
This model applies Low-Rank Adaptation (LoRA) on top of Whisper Tiny to improve Japanese transcription quality while keeping the number of trainable parameters small. The LoRA adapters are published separately and can be loaded on top of the base model at inference time, or merged into the base weights for deployment.
- **Model type:** Automatic Speech Recognition (ASR)
- **Language:** Japanese (ja)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) via PEFT
- **License:** Apache 2.0
- **Developed by:** [dungca](https://huggingface.co/dungca)
### Model Sources
- **Training repository:** [dungca1512/whisper-finetune-ja-train](https://github.com/dungca1512/whisper-finetune-ja-train)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Demo:** [🤗 Try it on Hugging Face Spaces](https://huggingface.co/spaces/dungca/whisper-tiny-ja-lora-demo)
## Uses
### Direct Use
This model is designed for Japanese speech-to-text transcription tasks:
- Transcribing Japanese audio files
- Japanese voice assistants and conversational AI
- Japanese language learning applications (e.g., pronunciation feedback)
- Subtitle generation for Japanese audio/video content
### Out-of-Scope Use
- Non-Japanese speech (model is fine-tuned specifically for Japanese)
- Real-time streaming ASR in latency-critical production systems (Whisper is not a streaming architecture, and the tiny variant may not meet accuracy requirements)
## How to Get Started with the Model
### Load LoRA Adapter (PEFT)
```python
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from peft import PeftModel

# Load base model and processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
model.eval()

# Transcribe audio
def transcribe(audio_array, sampling_rate=16000):
    inputs = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt",
    )
    with torch.no_grad():
        predicted_ids = model.generate(
            inputs["input_features"],
            language="japanese",
            task="transcribe",
        )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```
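The helper above assumes 16 kHz mono input, which is what Whisper's feature extractor expects. If your audio is at a different sampling rate, resample it first. A minimal linear-interpolation sketch with numpy is shown below (for real use, prefer `librosa.resample` or `torchaudio`, which apply proper anti-aliasing filters):

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr=16000):
    """Naive linear-interpolation resampler (illustration only)."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(len(audio)) / orig_sr
    new_t = np.arange(n_out) / target_sr
    return np.interp(new_t, old_t, audio).astype(np.float32)

# One second of 44.1 kHz audio becomes 16,000 samples at 16 kHz
x = np.zeros(44100, dtype=np.float32)
y = resample_linear(x, 44100)
assert len(y) == 16000
```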
### Quick Inference with Pipeline
```python
from transformers import pipeline, WhisperForConditionalGeneration, AutoProcessor
from peft import PeftModel, PeftConfig

# Resolve the base model from the adapter config
config = PeftConfig.from_pretrained("dungca/whisper-tiny-ja-lora")
base_model = WhisperForConditionalGeneration.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
processor = AutoProcessor.from_pretrained(config.base_model_name_or_path)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language": "japanese", "task": "transcribe"},
)

result = asr("your_audio.wav")
print(result["text"])
```
## Training Details
### Training Data
- **Dataset:** [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) (`small` split)
- **Language:** Japanese (ja)
- ReazonSpeech is a large-scale Japanese speech corpus collected from broadcast TV, covering diverse speaking styles and topics.
### Training Procedure
#### LoRA Configuration
| Parameter | Value |
|---|---|
| `lora_r` | 16 |
| `lora_alpha` | 32 |
| `lora_dropout` | 0.05 |
| `target_modules` | `q_proj`, `v_proj` |
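The effect of this configuration can be pictured as the weight update W ← W + (α/r)·BA applied to each targeted projection. A minimal numpy illustration (the dimensions here are hypothetical and much smaller than Whisper's; this card's actual values are r=16, α=32):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 8, 4, 8             # hypothetical dims; the card uses r=16, alpha=32
W = rng.standard_normal((d, d))   # frozen base weight (e.g. q_proj or v_proj)
A = rng.standard_normal((r, d))   # LoRA down-projection (trainable)
B = np.zeros((d, r))              # LoRA up-projection (zero-init, so the delta starts at 0)

# Merged weight: base plus scaled low-rank update
W_merged = W + (alpha / r) * (B @ A)

# With B zero-initialized, merging changes nothing until training updates B
assert np.allclose(W_merged, W)
```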
#### Training Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | `1e-5` |
| Batch size | 32 |
| Epochs | ~1.55 (3000 steps) |
| Training regime | fp16 mixed precision |
| Optimizer | AdamW |
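The epoch count is consistent with the step count and batch size: 3000 steps at 32 samples per step means roughly 96,000 examples seen, which at ~1.547 epochs implies a training set of about 62,000 utterances (an inferred figure, not stated elsewhere in this card):

```python
steps, batch_size, epochs = 3000, 32, 1.547

examples_seen = steps * batch_size       # 96,000 examples processed in total
dataset_size = examples_seen / epochs    # inferred training-set size

assert examples_seen == 96_000
assert 62_000 < dataset_size < 62_100    # ≈ 62,056 utterances
```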
#### Infrastructure
| | |
|---|---|
| **Hardware** | Kaggle GPU — NVIDIA P100 (16GB) |
| **Cloud Provider** | Kaggle (Google Cloud) |
| **Compute Region** | US |
| **Framework** | Transformers + PEFT + Datasets |
| **PEFT version** | 0.18.1 |
### MLOps Pipeline
Training is fully automated via GitHub Actions:
- **CI:** Syntax check + lightweight data validation on every push
- **CT (Continuous Training):** Triggers Kaggle kernel for LoRA fine-tuning on data/code changes
- **CD:** Quality gate checks CER before promoting model to HuggingFace Hub
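The CD quality gate can be sketched as a simple threshold check before pushing to the Hub (the baseline value here is hypothetical, not taken from the actual workflow):

```python
def passes_quality_gate(eval_cer: float, baseline_cer: float = 0.60) -> bool:
    """Promote the model only if its CER beats the configured baseline."""
    return eval_cer < baseline_cer

# The checkpoint in this card reports eval/cer = 0.52497
assert passes_quality_gate(0.52497)   # below baseline -> promote to the Hub
assert not passes_quality_gate(0.75)  # regression -> block promotion
```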
## Evaluation
### Testing Data
Evaluated on the ReazonSpeech validation split.
### Metrics
- **CER (Character Error Rate):** Lower is better. The standard metric for Japanese ASR, computed at the character level because Japanese text has no whitespace word boundaries (unlike the word-level WER used for English).
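Concretely, CER is the character-level Levenshtein (edit) distance divided by the reference length. A minimal sketch (in practice, libraries such as `jiwer` or `evaluate` are used):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Standard dynamic-programming edit distance over one rolling row
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (r[i - 1] != h[j - 1]))      # substitution
            prev = cur
    return dp[len(h)] / len(r)

assert cer("こんにちは", "こんにちは") == 0.0
assert cer("こんにちは", "こんにちわ") == 0.2  # 1 substitution / 5 characters
```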
### Results
| Metric | Value |
|---|---|
| **eval/cer** | **0.52497** (~52.5%) |
| eval/loss | 1.17656 |
| eval/runtime | 162.422s |
| eval/samples_per_second | 12.314 |
| eval/steps_per_second | 0.770 |
| train/global_step | 3000 |
| train/epoch | 1.547 |
| train/grad_norm | 2.161 |
> **Note:** CER of ~52.5% reflects the constraints of `whisper-tiny` (39M parameters) on a small training subset. A follow-up experiment with `whisper-small` and extended training is in progress and expected to significantly reduce CER.
## Bias, Risks, and Limitations
- **Model size:** Whisper Tiny is optimized for speed and efficiency, not peak accuracy. Expect higher error rates on noisy audio, accented speech, or domain-specific vocabulary.
- **Training data scope:** Trained on broadcast Japanese; may perform worse on conversational or dialectal Japanese.
- **CER baseline:** The current CER reflects an early training checkpoint. Further training epochs and a larger model size (`whisper-small`) are expected to improve results.
### Recommendations
For production use cases requiring high accuracy, consider using [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) or waiting for the upcoming `whisper-small-ja-lora` checkpoint.
## Citation
If you use this model, please cite the base Whisper model and the LoRA/PEFT method:
```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

@misc{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  year={2021},
  eprint={2106.09685},
  archivePrefix={arXiv}
}
```
### Framework Versions
- PEFT: 0.18.1
- Transformers: ≥4.36.0
- PyTorch: ≥2.0.0