---
language:
- ja
license: apache-2.0
base_model: openai/whisper-tiny
tags:
- whisper
- japanese
- asr
- speech-recognition
- lora
- peft
- fine-tuned
library_name: transformers
metrics:
- cer
pipeline_tag: automatic-speech-recognition
datasets:
- reazon-research/reazonspeech
---

# whisper-tiny-ja-lora

A LoRA-finetuned version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) for **Japanese Automatic Speech Recognition (ASR)**, trained on the [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) dataset using Parameter-Efficient Fine-Tuning (PEFT/LoRA).

## Model Details

### Model Description

This model applies Low-Rank Adaptation (LoRA) on top of Whisper Tiny to improve Japanese transcription quality while keeping the number of trainable parameters small. The LoRA adapters can be merged into the base weights after training for easy deployment.

- **Model type:** Automatic Speech Recognition (ASR)
- **Language:** Japanese (ja)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Fine-tuning method:** LoRA (Low-Rank Adaptation) via PEFT
- **License:** Apache 2.0
- **Developed by:** [dungca](https://huggingface.co/dungca)

### Model Sources

- **Training repository:** [dungca1512/whisper-finetune-ja-train](https://github.com/dungca1512/whisper-finetune-ja-train)
- **Base model:** [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Demo:** [🤗 Try it on Hugging Face Spaces](https://huggingface.co/spaces/dungca/whisper-tiny-ja-lora-demo)

## Uses

### Direct Use

This model is designed for Japanese speech-to-text transcription tasks:

- Transcribing Japanese audio files
- Japanese voice assistants and conversational AI
- Japanese language-learning applications (e.g., pronunciation feedback)
- Subtitle generation for Japanese audio/video content

### Out-of-Scope Use

- Non-Japanese speech (the model is fine-tuned specifically for Japanese)
- Real-time streaming ASR in latency-critical production systems (the whisper-tiny checkpoint may not meet accuracy requirements)

## How to Get Started with the Model

### Load LoRA Adapter (PEFT)

```python
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from peft import PeftModel

# Load the base model and processor
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")
model.eval()

# Transcribe a 16 kHz mono audio array
def transcribe(audio_array, sampling_rate=16000):
    inputs = processor(
        audio_array,
        sampling_rate=sampling_rate,
        return_tensors="pt",
    )
    with torch.no_grad():
        predicted_ids = model.generate(
            inputs["input_features"],
            language="japanese",
            task="transcribe",
        )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

### Quick Inference with Pipeline

```python
from transformers import AutoProcessor, WhisperForConditionalGeneration, pipeline
from peft import PeftConfig, PeftModel

# Resolve the base model from the adapter's config
config = PeftConfig.from_pretrained("dungca/whisper-tiny-ja-lora")
base_model = WhisperForConditionalGeneration.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "dungca/whisper-tiny-ja-lora")

processor = AutoProcessor.from_pretrained(config.base_model_name_or_path)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language": "japanese", "task": "transcribe"},
)

result = asr("your_audio.wav")
print(result["text"])
```

## Training Details

### Training Data

- **Dataset:** [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech) (`small` split)
- **Language:** Japanese (ja)
- ReazonSpeech is a large-scale Japanese speech corpus collected from broadcast TV, covering diverse speaking styles and topics.

### Training Procedure

#### LoRA Configuration

| Parameter | Value |
|---|---|
| `lora_r` | 16 |
| `lora_alpha` | 32 |
| `lora_dropout` | 0.05 |
| `target_modules` | `q_proj`, `v_proj` |
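
The configuration above maps onto the standard LoRA update: for each targeted projection weight `W`, two low-rank factors are trained and their product is added with scale `lora_alpha / lora_r = 32 / 16 = 2`. The sketch below is illustrative math, not the training code; the 384-dimensional projection size is an assumption based on whisper-tiny's hidden size.

```python
import numpy as np

# Illustrative LoRA update (not the training code). A (r x d_in) gets a small
# random init, B (d_out x r) starts at zero, so the adapter is a no-op at step 0.
rng = np.random.default_rng(0)
d_in = d_out = 384            # assumption: whisper-tiny attention projection size
r, alpha = 16, 32             # lora_r, lora_alpha from the table above
scale = alpha / r             # = 2.0

W = rng.normal(size=(d_out, d_in))     # frozen base projection weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # trainable, zero-initialized

x = rng.normal(size=(d_in,))

def lora_forward(x):
    # Base projection plus the scaled low-rank update
    return W @ x + scale * (B @ (A @ x))

# At initialization (B = 0) the adapter changes nothing:
assert np.allclose(lora_forward(x), W @ x)

# Merging for deployment folds the update into W once:
W_merged = W + scale * (B @ A)
assert np.allclose(W_merged @ x, lora_forward(x))
```

The merge step at the end is why the adapter can be folded into the base weights with no inference-time overhead.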

#### Training Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | `1e-5` |
| Batch size | 32 |
| Epochs | ~1.55 (3000 steps) |
| Training regime | fp16 mixed precision |
| Optimizer | AdamW |

#### Infrastructure

| | |
|---|---|
| **Hardware** | Kaggle GPU (NVIDIA P100, 16 GB) |
| **Cloud Provider** | Kaggle (Google Cloud) |
| **Compute Region** | US |
| **Framework** | Transformers + PEFT + Datasets |
| **PEFT version** | 0.18.1 |

### MLOps Pipeline

Training is fully automated via GitHub Actions:

- **CI:** Syntax checks plus lightweight data validation on every push
- **CT (Continuous Training):** Triggers a Kaggle kernel for LoRA fine-tuning on data/code changes
- **CD:** A quality gate checks CER before promoting the model to the Hugging Face Hub
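
A CD quality gate of this kind reduces to a CER comparison before pushing a checkpoint. The sketch below is a minimal illustration; the function name, threshold, and numbers are hypothetical, not taken from the training repository:

```python
# Minimal sketch of a CD quality gate (illustrative only): promote the new
# checkpoint only if its CER does not regress past the current baseline.
def passes_quality_gate(new_cer: float, baseline_cer: float, max_regression: float = 0.0) -> bool:
    """Return True if the new model's CER is no worse than baseline + max_regression."""
    return new_cer <= baseline_cer + max_regression

# Hypothetical example values:
assert passes_quality_gate(new_cer=0.525, baseline_cer=0.60)      # improvement: promote
assert not passes_quality_gate(new_cer=0.70, baseline_cer=0.60)   # regression: block
```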

## Evaluation

### Testing Data

Evaluated on the ReazonSpeech validation split.

### Metrics

- **CER (Character Error Rate):** Lower is better. CER is the standard metric for Japanese ASR: Japanese text has no whitespace word boundaries, so the word-level WER commonly used for English is ill-defined.
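
Concretely, CER is the character-level Levenshtein (edit) distance between the hypothesis and the reference, divided by the reference length. A self-contained sketch (libraries such as `jiwer` or `evaluate` provide the same metric):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming Levenshtein distance over characters
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m > 0 else float(n > 0)

print(cer("こんにちは", "こんにちは"))  # 0.0
print(cer("こんにちは", "こんにちわ"))  # 0.2 (1 substitution / 5 characters)
```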

### Results

| Metric | Value |
|---|---|
| **eval/cer** | **0.52497** (~52.5%) |
| eval/loss | 1.17656 |
| eval/runtime | 162.422 s |
| eval/samples_per_second | 12.314 |
| eval/steps_per_second | 0.770 |
| train/global_step | 3000 |
| train/epoch | 1.547 |
| train/grad_norm | 2.161 |

> **Note:** A CER of ~52.5% reflects the constraints of `whisper-tiny` (39M parameters) trained on a small data subset. A follow-up experiment with `whisper-small` and extended training is in progress and is expected to reduce CER significantly.

## Bias, Risks, and Limitations

- **Model size:** Whisper Tiny is optimized for speed and efficiency, not peak accuracy. Expect higher error rates on noisy audio, accented speech, and domain-specific vocabulary.
- **Training data scope:** Trained on broadcast Japanese; performance may degrade on conversational or dialectal Japanese.
- **CER baseline:** The current CER reflects an early training checkpoint. Additional training epochs and a larger base model (`whisper-small`) are expected to improve results.

### Recommendations

For production use cases requiring high accuracy, consider using [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) or waiting for the upcoming `whisper-small-ja-lora` checkpoint.

## Citation

If you use this model, please cite the base Whisper model and the LoRA/PEFT method:

```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

@misc{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  year={2021},
  eprint={2106.09685},
  archivePrefix={arXiv}
}
```

### Framework Versions

- PEFT: 0.18.1
- Transformers: ≥4.36.0
- PyTorch: ≥2.0.0