---
language:
- ja
license: apache-2.0
base_model: openai/whisper-tiny
tags:
- whisper
- japanese
- asr
- speech-recognition
- lora
- peft
- fine-tuned
library_name: transformers
metrics:
- cer
pipeline_tag: automatic-speech-recognition
datasets:
- reazon-research/reazonspeech
---

# whisper-tiny-ja-lora

A LoRA-finetuned version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) for **Japanese Automatic Speech Recognition (ASR)**, trained on the [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) dataset using Parameter-Efficient Fine-Tuning (PEFT/LoRA).

## Model Details

### Model Description

This model applies Low-Rank Adaptation (LoRA) on top of Whisper Tiny to improve Japanese transcription quality while keeping the number of trainable parameters small. LoRA adapters are merged post-training for easy deployment.

- **Model type:** Automatic Speech Recognition (ASR)
- **Language:** Japanese (ja)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) via PEFT
- **License:** Apache 2.0
- **Developed by:** [dungca](https://huggingface.co/dungca)

### Model Sources

- **Training repository:** [dungca1512/whisper-finetune-ja-train](https://github.com/dungca1512/whisper-finetune-ja-train)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Demo:** [🤗 Try it on Hugging Face Spaces](https://huggingface.co/spaces/dungca/whisper-tiny-ja-lora-demo)

## Uses

### Direct Use

This model is designed for Japanese speech-to-text transcription tasks:

- Transcribing Japanese audio files
- Japanese voice assistants and conversational AI
- Japanese language learning applications (e.g., pronunciation feedback)
- Subtitle generation for Japanese audio/video content

### Out-of-Scope Use

- Non-Japanese speech (the model is fine-tuned specifically for Japanese)
- Real-time streaming ASR in latency-critical production systems (whisper-tiny
architecture may not meet accuracy requirements)

## How to Get Started with the Model

### Load LoRA Adapter (PEFT)

```python
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from peft import PeftModel

# Load base model and processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
model.eval()

# Transcribe audio
def transcribe(audio_array, sampling_rate=16000):
    inputs = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    )
    with torch.no_grad():
        predicted_ids = model.generate(
            inputs["input_features"],
            language="japanese",
            task="transcribe"
        )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

### Quick Inference with Pipeline

```python
from transformers import pipeline, WhisperForConditionalGeneration, AutoProcessor
from peft import PeftModel, PeftConfig

config = PeftConfig.from_pretrained("dungca/whisper-tiny-ja-lora")
base_model = WhisperForConditionalGeneration.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
processor = AutoProcessor.from_pretrained(config.base_model_name_or_path)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language": "japanese", "task": "transcribe"},
)

result = asr("your_audio.wav")
print(result["text"])
```

## Training Details

### Training Data

- **Dataset:** [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) (`small` split)
- **Language:** Japanese (ja)
- ReazonSpeech is a large-scale Japanese speech corpus collected from broadcast TV, covering diverse speaking styles and topics.
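Whisper's feature extractor expects 16 kHz mono input, so broadcast-rate clips (commonly 44.1 or 48 kHz) must be resampled before feature extraction. As a rough sketch of that step, here is a dependency-free linear-interpolation resampler; the function name `resample_linear` is illustrative and not part of the training code, which would typically rely on a proper polyphase resampler (e.g. `torchaudio` or `datasets.Audio` casting) instead:

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a 1-D waveform to target_sr via linear interpolation.

    This is only a stand-in for a real resampler, kept dependency-free
    for illustration; linear interpolation does not band-limit the
    signal and will alias on real audio.
    """
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr              # clip length in seconds
    n_target = int(round(duration * target_sr))  # samples at the new rate
    src_t = np.arange(len(audio)) / orig_sr      # original sample times
    dst_t = np.arange(n_target) / target_sr      # target sample times
    return np.interp(dst_t, src_t, audio).astype(audio.dtype)

# Example: a 1-second 440 Hz tone at 48 kHz becomes 16,000 samples.
tone = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000).astype(np.float32)
resampled = resample_linear(tone, orig_sr=48000)
print(len(resampled))  # 16000
```

After resampling, each clip is passed through the processor shown above to produce the log-mel `input_features` that Whisper consumes.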
### Training Procedure

#### LoRA Configuration

| Parameter | Value |
|---|---|
| `lora_r` | 16 |
| `lora_alpha` | 32 |
| `lora_dropout` | 0.05 |
| `target_modules` | `q_proj`, `v_proj` |

#### Training Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | `1e-5` |
| Batch size | 32 |
| Epochs | ~1.55 (3000 steps) |
| Training regime | fp16 mixed precision |
| Optimizer | AdamW |

#### Infrastructure

| | |
|---|---|
| **Hardware** | Kaggle GPU (NVIDIA P100, 16 GB) |
| **Cloud Provider** | Kaggle (Google Cloud) |
| **Compute Region** | US |
| **Framework** | Transformers + PEFT + Datasets |
| **PEFT version** | 0.18.1 |

### MLOps Pipeline

Training is fully automated via GitHub Actions:

- **CI:** Syntax check and lightweight data validation on every push
- **CT (Continuous Training):** Triggers a Kaggle kernel for LoRA fine-tuning on data/code changes
- **CD:** A quality gate checks CER before promoting the model to the Hugging Face Hub

## Evaluation

### Testing Data

Evaluated on the ReazonSpeech validation split.

### Metrics

- **CER (Character Error Rate):** Lower is better. The standard metric for Japanese ASR, computed at the character level (unlike the word-level WER used for English).

### Results

| Metric | Value |
|---|---|
| **eval/cer** | **0.52497** (~52.5%) |
| eval/loss | 1.17656 |
| eval/runtime | 162.422 s |
| eval/samples_per_second | 12.314 |
| eval/steps_per_second | 0.770 |
| train/global_step | 3000 |
| train/epoch | 1.547 |
| train/grad_norm | 2.161 |

> **Note:** A CER of ~52.5% reflects the constraints of `whisper-tiny` (39M parameters) trained on a small data subset. A follow-up experiment with `whisper-small` and extended training is in progress and expected to significantly reduce CER.

## Bias, Risks, and Limitations

- **Model size:** Whisper Tiny is optimized for speed and efficiency, not peak accuracy. Expect higher error rates on noisy audio, accented speech, or domain-specific vocabulary.
- **Training data scope:** Trained on broadcast Japanese; may perform worse on conversational or dialectal Japanese.
- **CER baseline:** The current CER reflects an early training checkpoint. Further training epochs and a larger model size (`whisper-small`) are expected to improve results.

### Recommendations

For production use cases requiring high accuracy, consider using [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) or waiting for the upcoming `whisper-small-ja-lora` checkpoint.

## Citation

If you use this model, please cite the base Whisper model and the LoRA/PEFT method:

```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

@misc{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  year={2021},
  eprint={2106.09685},
  archivePrefix={arXiv}
}
```

### Framework Versions

- PEFT: 0.18.1
- Transformers: ≥4.36.0
- PyTorch: ≥2.0.0
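For reference, the CER figure reported in the Evaluation section is a character-level Levenshtein distance divided by the reference length, and can be reproduced at small scale without any library. The `cer` helper below is a minimal illustrative sketch; the evaluation pipeline's actual implementation may differ (e.g. a library metric):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance over reference length.

    Counts character-level insertions, deletions, and substitutions,
    normalized by the number of reference characters. An empty reference
    returns 0.0 here purely to avoid division by zero.
    """
    m, n = len(reference), len(hypothesis)
    # Row-by-row dynamic programming over the edit-distance matrix.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / m if m > 0 else 0.0

# One substituted character out of five: CER = 1/5.
print(cer("こんにちは", "こんにちわ"))  # 0.2
```

Because Japanese text is not whitespace-delimited, this character-level normalization is why CER, rather than WER, is the metric reported above.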