
ASR-1

ASR-1 is a Vietnamese automatic speech recognition model developed by UnderTheSea NLP.

Model Description

  • Model Type: Fine-tuned Whisper for Vietnamese ASR
  • Base Model: openai/whisper-large-v3
  • Language: Vietnamese
  • License: Apache 2.0
  • Task: Automatic Speech Recognition (Speech-to-Text)

Installation

pip install transformers torch torchaudio datasets

Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from huggingface_hub import snapshot_download
import torchaudio

# Download model
model_path = snapshot_download('undertheseanlp/asr-1')

# Load model and processor
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

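# Transcribe audio (Whisper expects 16 kHz mono input)
waveform, sample_rate = torchaudio.load("audio.wav")  # shape: (channels, samples)
waveform = waveform.mean(dim=0)  # downmix to mono; squeeze() alone leaves stereo audio 2-D
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)

input_features = processor(
    waveform.numpy(),
    sampling_rate=16000,
    return_tensors="pt"
).input_features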

predicted_ids = model.generate(input_features, language="vi", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
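
Whisper's feature extractor pads or truncates its input to a 30-second window, so the snippet above transcribes at most the first 30 seconds of a file. For longer recordings, one simple option is to split the waveform into fixed windows and transcribe each in turn. The helper below is a minimal sketch of that idea; `chunk_bounds` is a hypothetical name, not part of this repository or of transformers.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30.0):
    """Return (start, end) sample indices that cover the audio in fixed windows.

    Hypothetical helper for illustration. The last window may be shorter
    than chunk_seconds; the processor pads it to 30 seconds.
    """
    if num_samples < 0 or sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and length must be > 0")
    chunk_size = int(sample_rate * chunk_seconds)
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# A 70-second clip at 16 kHz splits into two full windows and one 10-second tail.
for start, end in chunk_bounds(70 * 16000):
    print(start, end)
```

Each `(start, end)` pair can be used to slice the mono waveform, and each slice passed through the same `processor` and `model.generate` calls as above, joining the resulting strings with spaces. Hard cuts can split a word at a window boundary, so overlapping windows or silence-based splitting may give cleaner results.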

API (compatible with underthesea)

from asr import AsrTranscriber, transcribe

# Quick transcription
text = transcribe("audio.wav")
print(text)

# With model instance
transcriber = AsrTranscriber.load("models/asr-1")
result = transcriber.transcribe("audio.wav")
print(result.text)
print(result.confidence)

Training

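# Train with the default configuration
uv run src/train.py
# Set the base model and training dataset explicitly
uv run src/train.py --base-model openai/whisper-large-v3 --dataset common_voice
# Log the run to Weights & Biases under the asr-1 project
uv run src/train.py --wandb --wandb-project asr-1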

Evaluation

# Evaluate a trained model with the default dataset
uv run src/evaluate.py --model models/asr-1
# Evaluate on VIVOS
uv run src/evaluate.py --model models/asr-1 --dataset vivos

Datasets

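| Dataset                | Split | Hours | Samples |
|------------------------|-------|-------|---------|
| Common Voice 17.0 (vi) | train | ~30h  | ~25,000 |
| Common Voice 17.0 (vi) | test  | ~5h   | ~5,000  |
| VIVOS                  | train | 15h   | 11,660  |
| VIVOS                  | test  | 0.6h  | 760     |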

Metrics

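  • WER (Word Error Rate): word-level substitutions, deletions, and insertions divided by the number of reference words. Lower is better.
  • CER (Character Error Rate): the same ratio computed over characters instead of words. Lower is better.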
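
For reference, both metrics reduce to an edit distance divided by the reference length. The following is a minimal pure-Python sketch of the standard definitions. It is not necessarily how src/evaluate.py computes them, and the NFC normalization step is an assumption added here so that Vietnamese diacritics count as single characters.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (sub, del, ins each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _nfc(text):
    # Assumption: compose diacritics so each accented letter is one code point.
    return unicodedata.normalize("NFC", text)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = _nfc(reference).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, _nfc(hypothesis).split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    ref_chars = list(_nfc(reference))
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, list(_nfc(hypothesis))) / len(ref_chars)


# One wrong word out of four reference words.
print(wer("a b c d", "a b x d"))  # 0.25
print(cer("abcd", "abxd"))        # 0.25
```

Both values can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.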

References

Citation

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Technical Report

See TECHNICAL_REPORT.md for detailed methodology and evaluation.
