reazon-research/japanese-avhubert-base

This is a Japanese AVHuBERT (Audio-Visual Hidden Unit BERT) model, pretrained through the final iteration of AV-HuBERT's iterative self-supervised training procedure. AVHuBERT is a self-supervised model for AVSR (Audio-Visual Speech Recognition) that remains robust in noisy environments by leveraging both audio and visual inputs.

This model was pretrained on approximately 2,250 hours of Japanese audio-visual data.

Usage

First, install the dependencies:

$ pip install git+https://github.com/reazon-research/ReazonSpeech.git#subdirectory=pkg/avsr
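You can verify the installation with a quick import check (the module path matches the import used below):

$ python -c "import reazonspeech.avsr; print('ok')"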

Using transformers directly

from transformers import AutoFeatureExtractor, AutoModel

extractor = AutoFeatureExtractor.from_pretrained("reazon-research/japanese-avhubert-base", trust_remote_code=True)
model = AutoModel.from_pretrained("reazon-research/japanese-avhubert-base", trust_remote_code=True)

inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video")
# If the mouth region has not already been extracted from the video, pass `extract_mouth=True`
inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

outputs = model(**inputs)
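The model outputs frame-level audio-visual features. A minimal sketch for inspecting them, assuming the model follows the usual transformers convention of returning an object with a `last_hidden_state` tensor (check the remote code for the exact output structure):

import torch

# Run inference without tracking gradients and inspect the fused
# audio-visual features: expected shape (batch, frames, hidden_dim).
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)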

Using the reazonspeech.avsr package

If you prefer not to set trust_remote_code, use the reazonspeech.avsr package installed above:

from reazonspeech.avsr import AVHubertFeatureExtractor, AVHubertModel

extractor = AVHubertFeatureExtractor.from_pretrained("reazon-research/japanese-avhubert-base")
model = AVHubertModel.from_pretrained("reazon-research/japanese-avhubert-base")

inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video")
# If the mouth region has not already been extracted from the video, pass `extract_mouth=True`
inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

outputs = model(**inputs)
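For downstream tasks, one common pattern is to mean-pool the frame-level features into a single utterance-level embedding. A minimal sketch, under the same assumption that the output exposes a `last_hidden_state` tensor:

import torch

with torch.no_grad():
    outputs = model(**inputs)

# Average over the time axis to get one embedding per utterance.
features = outputs.last_hidden_state   # (batch, frames, hidden_dim)
embedding = features.mean(dim=1)       # (batch, hidden_dim)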

Citation

@misc{reazon-research/japanese-avhubert-base,
  title  = {japanese-avhubert-base},
  author = {Sasaki, Yuta},
  url    = {https://huggingface.co/reazon-research/japanese-avhubert-base},
  year   = {2025}
}

@article{shi2022avhubert,
  author  = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
  title   = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
  journal = {arXiv preprint arXiv:2201.02184},
  year    = {2022}
}

@article{shi2022avsr,
  author  = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
  title   = {Robust Self-Supervised Audio-Visual Speech Recognition},
  journal = {arXiv preprint arXiv:2201.01763},
  year    = {2022}
}

License

Apache License 2.0
