---
library_name: transformers
tags:
- AVSR
- AVHuBERT
language:
- ja
pipeline_tag: automatic-speech-recognition
base_model:
- enactic/japanese-avhubert-large_noise_pt
metrics:
- cer
license: cc-by-nc-4.0
---

<div align="center">
<video width="80%" controls>
<source src="https://huggingface.co/datasets/enactic/assets/resolve/main/AVista%20demo.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>

# AVista Large+ v2 🐦🔥

This is an AVHuBERT (Audio-Visual Hidden Unit BERT) Large model for the AVSR (Audio-Visual Speech Recognition) task, derived from [`enactic/japanese-avhubert-large_noise_pt`](https://huggingface.co/enactic/japanese-avhubert-large_noise_pt).

The model is fine-tuned on approximately 1,300 hours of Japanese audio-visual data combined with 35,000 hours of Japanese audio-only data from [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech).

## Usage

Install the dependencies first.

```bash
$ pip install git+https://github.com/reazon-research/ReazonSpeech.git#subdirectory=pkg/avsr
```

### Using `transformers` directly

You can load the model directly with Hugging Face `transformers` if you trust our remote code.

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("enactic/avista-large-plus-v2", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained("enactic/avista-large-plus-v2", trust_remote_code=True)

# Use this call when mouth extraction has already been performed on the video.
inputs = processor(raw_audio="path/to/audio", raw_video="path/to/video")
# If mouth extraction has not been performed, pass `extract_mouth=True` instead.
inputs = processor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

# Decode with beam search and convert the best hypothesis to text.
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=256)
transcription = processor.decode(outputs[0], skip_special_tokens=True)
```

### Using `reazonspeech.avsr` package

You can also load the model with the `reazonspeech.avsr` package. Use the following code if, for example, you prefer not to run remote code for security reasons.

```python
from reazonspeech.avsr import AVHubertProcessor, AVHubertForConditionalGeneration

processor = AVHubertProcessor.from_pretrained("enactic/avista-large-plus-v2")
model = AVHubertForConditionalGeneration.from_pretrained("enactic/avista-large-plus-v2")

# Use this call when mouth extraction has already been performed on the video.
inputs = processor(raw_audio="path/to/audio", raw_video="path/to/video")
# If mouth extraction has not been performed, pass `extract_mouth=True` instead.
inputs = processor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

# Decode with beam search and convert the best hypothesis to text.
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=256)
transcription = processor.decode(outputs[0], skip_special_tokens=True)
```

## Test Results

We report the Character Error Rate (CER) on an out-of-domain evaluation dataset that was collected internally for AVSR benchmarking.

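For readers unfamiliar with the metric, here is a minimal sketch of how CER is commonly computed: the character-level edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are illustrative only, and this card does not describe any text normalization (for example punctuation handling) applied in the benchmark, so this sketch should not be read as the exact scoring script.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("こんにちは", "こんばんは"))    # two substitutions over five characters
print(cer("はい", "はいはいはい"))        # four insertions over two characters
```

Because insertions count as errors, a hypothesis much longer than the reference can push CER above 100%, which is why some entries in the tables below exceed that value.
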
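The following tables present the benchmark results of this model and Japanese ASR models under different noise types and noise levels. Each bold label names the noise type added to the audio. The N/A column reports CER without added noise, and the remaining columns report CER at the given signal-to-noise ratio (SNR).
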
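As an illustration of what a noise level means here, the sketch below mixes a noise signal into speech at a target SNR. It assumes the SNR values are in decibels and defined on average signal power, which is the usual convention; this card does not describe the benchmark's actual mixing procedure, and `mix_at_snr` is a hypothetical helper, not part of any package mentioned above.

```python
import math


def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Scale `noise` so speech power / noise power matches `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # SNR(dB) = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, snr_db=0.0)  # noise as loud as the speech
```

Under this definition, SNR=0 means the noise carries the same power as the speech, and SNR=-5 means the noise is louder than the speech.
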
Details of the dataset and the complete benchmark results are available at [`enactic/avsr-leaderboard`](https://huggingface.co/datasets/enactic/avsr-leaderboard).

**+ ReazonSpeech Speech**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |   SNR=0 |  SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | ------: | ------: |
| AVista Large+ v2    |    459M |  9.46% | 11.77% | 14.84% |  21.41% |  34.31% |
| reazonspeech k2     |    159M |  7.42% |  9.13% | 19.47% |  71.61% | 104.15% |
| reazonspeech nemo   |    619M |  8.50% | 11.74% | 25.38% |  77.65% | 103.42% |
| reazonspeech espnet |    118M |  7.44% |  9.20% | 16.58% |  69.34% | 103.22% |
| whisper large-v3    |  1,550M |  7.75% |  8.70% | 12.81% |  49.34% | 100.53% |
| whisper medium      |    769M | 10.07% | 13.23% | 19.21% |  50.56% |  99.27% |
| whisper small       |    244M | 10.82% | 19.82% | 28.98% |  69.69% | 108.56% |

**+ JSUT Speech**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |   SNR=0 |  SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | ------: | ------: |
| AVista Large+ v2    |    459M |  9.46% | 11.25% | 13.02% |  15.60% |  21.10% |
| reazonspeech k2     |    159M |  7.42% |  8.49% | 21.94% |  70.81% |  93.04% |
| reazonspeech nemo   |    619M |  8.50% | 10.93% | 29.06% |  83.77% |  98.76% |
| reazonspeech espnet |    118M |  7.44% |  8.30% | 14.45% |  66.15% |  69.34% |
| whisper large-v3    |  1,550M |  7.75% |  8.69% | 13.03% |  60.24% |  98.67% |
| whisper medium      |    769M | 10.07% | 12.27% | 18.80% |  58.00% |  97.35% |
| whisper small       |    244M | 10.82% | 19.44% | 26.75% |  71.33% | 101.84% |

**+ Babble**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |   SNR=0 |  SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | ------: | ------: |
| AVista Large+ v2    |    459M |  9.46% | 11.09% | 15.05% |  26.94% |  61.66% |
| reazonspeech k2     |    159M |  7.42% |  8.24% | 10.17% |  21.65% |  61.57% |
| reazonspeech nemo   |    619M |  8.50% | 10.40% | 14.83% |  31.74% |  77.29% |
| reazonspeech espnet |    118M |  7.44% |  8.85% | 11.75% |  24.59% |  67.27% |
| whisper large-v3    |  1,550M |  7.75% |  8.95% | 12.50% |  30.09% |  81.60% |
| whisper medium      |    769M | 10.07% | 12.52% | 18.18% |  42.27% |  95.43% |
| whisper small       |    244M | 10.82% | 19.72% | 28.24% |  56.72% | 109.61% |

**+ Music**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |  SNR=0 |  SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | -----: | ------: |
| AVista Large+ v2    |    459M |  9.46% | 10.33% | 11.38% | 15.63% |  25.53% |
| reazonspeech k2     |    159M |  7.42% |  7.69% |  8.33% |  9.49% |  16.90% |
| reazonspeech nemo   |    619M |  8.50% |  9.28% |  9.97% | 13.65% |  24.61% |
| reazonspeech espnet |    118M |  7.44% |  7.86% |  8.57% | 10.41% |  16.62% |
| whisper large-v3    |  1,550M |  7.75% |  8.16% |  9.01% | 11.23% |  21.26% |
| whisper medium      |    769M | 10.07% | 11.13% | 12.97% | 16.45% |  31.62% |
| whisper small       |    244M | 10.82% | 18.02% | 19.86% | 26.82% |  47.69% |

**+ Environmental Noise**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |  SNR=0 | SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | -----: | -----: |
| AVista Large+ v2    |    459M |  9.46% | 10.04% | 11.56% | 14.75% | 21.68% |
| reazonspeech k2     |    159M |  7.42% |  8.07% |  8.68% | 10.32% | 15.53% |
| reazonspeech nemo   |    619M |  8.50% |  9.31% | 10.16% | 12.71% | 18.32% |
| reazonspeech espnet |    118M |  7.44% |  8.00% |  8.63% | 10.06% | 14.54% |
| whisper large-v3    |  1,550M |  7.75% |  8.46% |  9.17% | 11.98% | 19.36% |
| whisper medium      |    769M | 10.07% | 11.77% | 13.06% | 17.04% | 24.83% |
| whisper small       |    244M | 10.82% | 17.62% | 19.84% | 25.55% | 33.77% |

## Citation

```
@misc{enactic/avista-large-plus-v2,
  title = {avista-large-plus-v2},
  author = {Sasaki, Yuta},
  url = {https://huggingface.co/enactic/avista-large-plus-v2},
  year = {2025}
}

@article{shi2022avhubert,
  author = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
  title = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
  journal = {arXiv preprint arXiv:2201.02184},
  year = {2022}
}

@article{shi2022avsr,
  author = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
  title = {Robust Self-Supervised Audio-Visual Speech Recognition},
  journal = {arXiv preprint arXiv:2201.01763},
  year = {2022}
}
```

## License

[CC-BY-NC-4.0](https://spdx.org/licenses/CC-BY-NC-4.0)