AVista 🐦🔥 Collection
Toward a new vista of Human–Robot Interaction. AVista stands for Audio‑VIsual Speech Transcription & Alignment.
reazon-research/japanese-avhubert-base
This is a Japanese AVHuBERT (Audio-Visual Hidden Unit BERT) model, pretrained through the final pretraining iteration. AVHuBERT is a self-supervised model for AVSR (Audio-Visual Speech Recognition) that stays robust in noisy environments by leveraging both audio and visual inputs.
This model was pretrained on approximately 2,250 hours of Japanese audio-visual data.
First, install the dependencies:

```sh
$ pip install git+https://github.com/reazon-research/ReazonSpeech.git#subdirectory=pkg/avsr
```
Using transformers directly

```python
from transformers import AutoFeatureExtractor, AutoModel

extractor = AutoFeatureExtractor.from_pretrained("reazon-research/japanese-avhubert-base", trust_remote_code=True)
model = AutoModel.from_pretrained("reazon-research/japanese-avhubert-base", trust_remote_code=True)

inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video")

# If the video has not already been cropped to the mouth region, pass `extract_mouth=True`:
inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

outputs = model(**inputs)
```
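For downstream use, the frame-level hidden states are often pooled into a single utterance embedding. A minimal sketch with NumPy on a dummy array standing in for the model's output — the `(batch, frames, hidden)` shape and the mean-pooling choice are illustrative assumptions, not part of this model's documented API:

```python
import numpy as np

# Dummy stand-in for the model's frame-level hidden states:
# (batch, frames, hidden) — here 1 utterance, 250 frames, 768-dim features.
hidden_states = np.random.default_rng(0).standard_normal((1, 250, 768))

# Mean-pool over the time axis to get one fixed-size utterance embedding.
utterance_embedding = hidden_states.mean(axis=1)
print(utterance_embedding.shape)  # (1, 768)
```

With the real model, the same pooling would be applied to the tensor returned by `model(**inputs)`.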
Using the reazonspeech.avsr package

If you do not want to use trust_remote_code, install and use reazonspeech.avsr instead:

```python
from reazonspeech.avsr import AVHubertFeatureExtractor, AVHubertModel

extractor = AVHubertFeatureExtractor.from_pretrained("reazon-research/japanese-avhubert-base")
model = AVHubertModel.from_pretrained("reazon-research/japanese-avhubert-base")

inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video")

# If the video has not already been cropped to the mouth region, pass `extract_mouth=True`:
inputs = extractor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

outputs = model(**inputs)
```
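To probe the noise robustness mentioned above, one common approach is to mix noise into the waveform at a controlled SNR before feature extraction. A self-contained sketch — the `mix_at_snr` helper is illustrative and not part of the reazonspeech package:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so the speech-to-noise power ratio equals snr_db."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise to hit the target SNR, then add it to the speech.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of dummy 16 kHz audio
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=0.0)  # equal speech and noise power
```

The degraded waveform would then be passed to the extractor in place of the clean audio to compare clean versus noisy features.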
```bibtex
@misc{reazon-research/japanese-avhubert-base,
  title = {japanese-avhubert-base},
  author = {Sasaki, Yuta},
  url = {https://huggingface.co/reazon-research/japanese-avhubert-base},
  year = {2025}
}

@article{shi2022avhubert,
  author = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
  title = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
  journal = {arXiv preprint arXiv:2201.02184},
  year = {2022}
}

@article{shi2022avsr,
  author = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
  title = {Robust Self-Supervised Audio-Visual Speech Recognition},
  journal = {arXiv preprint arXiv:2201.01763},
  year = {2022}
}
```