---
library_name: transformers
tags:
- AVSR
- AVHuBERT
language:
- ja
pipeline_tag: automatic-speech-recognition
base_model:
- enactic/japanese-avhubert-base_noise_pt
metrics:
- cer
---
<div align="center">
<video width="80%" controls>
<source src="https://huggingface.co/datasets/enactic/assets/resolve/main/AVista%20demo.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>

# AVista Base+ 🐦🔥

This is an AVHuBERT (Audio-Visual Hidden Unit BERT) Base model for the AVSR (Audio-Visual Speech Recognition) task, derived from [`enactic/japanese-avhubert-base_noise_pt`](https://huggingface.co/enactic/japanese-avhubert-base_noise_pt).

This model was fine-tuned on approximately 1,300 hours of Japanese audio-visual data.

## Usage

Please install dependencies first.

```bash
$ pip install git+https://github.com/reazon-research/ReazonSpeech.git#subdirectory=pkg/avsr
```

### Using `transformers` directly

If you trust our remote code, you can load the AVSR model directly with Hugging Face `transformers`.

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# `trust_remote_code=True` allows the custom code shipped with this repository to run.
processor = AutoProcessor.from_pretrained("enactic/avista-base-plus", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained("enactic/avista-base-plus", trust_remote_code=True)

# Use this call if mouth extraction has already been performed on the video.
inputs = processor(raw_audio="path/to/audio", raw_video="path/to/video")
# If mouth extraction has not been performed, pass `extract_mouth=True` instead.
inputs = processor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

# Generate token IDs with beam search, then decode them to text.
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=256)
transcription = processor.decode(outputs[0], skip_special_tokens=True)
```

### Using `reazonspeech.avsr` package

You can also load the AVSR model with the `reazonspeech.avsr` package. Use this route if you prefer not to run remote code, for example for security reasons.

```python
from reazonspeech.avsr import AVHubertProcessor, AVHubertForConditionalGeneration

processor = AVHubertProcessor.from_pretrained("enactic/avista-base-plus")
model = AVHubertForConditionalGeneration.from_pretrained("enactic/avista-base-plus")

# Use this call if mouth extraction has already been performed on the video.
inputs = processor(raw_audio="path/to/audio", raw_video="path/to/video")
# If mouth extraction has not been performed, pass `extract_mouth=True` instead.
inputs = processor(raw_audio="path/to/audio", raw_video="path/to/video", extract_mouth=True)

# Generate token IDs with beam search, then decode them to text.
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=256)
transcription = processor.decode(outputs[0], skip_special_tokens=True)
```

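
Both snippets above follow the same three steps: preprocess the audio and video, generate token IDs with beam search, and decode them to text. As a minimal sketch, these steps could be wrapped in a small helper. The `transcribe` function and its parameter names are hypothetical and are not part of `transformers` or `reazonspeech.avsr`; the helper only reuses the calls shown above and expects an already loaded `processor` and `model`.

```python
def transcribe(processor, model, audio_path, video_path,
               extract_mouth=False, num_beams=5, max_new_tokens=256):
    """Hypothetical helper: run the preprocess -> generate -> decode steps once.

    `processor` and `model` are assumed to be loaded as in the snippets above.
    """
    # Build the processor arguments; `extract_mouth` is only passed when requested,
    # mirroring the two processor calls shown in the usage examples.
    processor_kwargs = {"raw_audio": audio_path, "raw_video": video_path}
    if extract_mouth:
        processor_kwargs["extract_mouth"] = True
    inputs = processor(**processor_kwargs)

    # Beam-search decoding with the same defaults as the usage examples.
    outputs = model.generate(**inputs, num_beams=num_beams, max_new_tokens=max_new_tokens)

    # Convert the first (best) hypothesis from token IDs to text.
    return processor.decode(outputs[0], skip_special_tokens=True)
```

With `processor` and `model` loaded as in either snippet, `transcribe(processor, model, "path/to/audio", "path/to/video", extract_mouth=True)` would return the transcription as a string.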
## Test Results

We report the Character Error Rate (CER) on an out-of-domain evaluation dataset that was collected internally for AVSR benchmarking.

The following tables present benchmark results for this model and several Japanese ASR models under different noise types and noise levels. In each table, the `N/A` column reports CER without added noise, and the `SNR` columns report CER with noise mixed in at the given signal-to-noise ratio.

Details of the dataset and the complete benchmark results can be found [here](https://huggingface.co/datasets/enactic/avsr-leaderboard).

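
Several entries in the tables below exceed 100% (for example 104.15%). This is possible because CER is conventionally defined as the character-level edit distance between the reference and the hypothesis, divided by the reference length, so a hypothesis with many inserted characters can contain more errors than the reference has characters. The following is a minimal sketch of that standard definition. The function names are illustrative, and any text normalization applied in the actual benchmark is not described here and is not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters here)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of five.
print(f"{cer('こんにちは', 'こんにちわ'):.0%}")  # prints "20%"
# Four inserted characters against a two-character reference.
print(f"{cer('はい', 'はいはいはい'):.0%}")  # prints "200%"
```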
**+ ReazonSpeech Speech**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |   SNR=0 |  SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | ------: | ------: |
| AVista Base+        |    156M | 26.88% | 33.24% | 38.13% |  47.64% |  63.60% |
| reazonspeech k2     |    159M |  7.42% |  9.13% | 19.47% |  71.61% | 104.15% |
| reazonspeech nemo   |    619M |  8.50% | 11.74% | 25.38% |  77.65% | 103.42% |
| reazonspeech espnet |    118M |  7.44% |  9.20% | 16.58% |  69.34% | 103.22% |
| whisper large-v3    |  1,550M |  7.75% |  8.70% | 12.81% |  49.34% | 100.53% |
| whisper medium      |    769M | 10.07% | 13.23% | 19.21% |  50.56% |  99.27% |
| whisper small       |    244M | 10.82% | 19.82% | 28.98% |  69.69% | 108.56% |

**+ JSUT Speech**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |   SNR=0 |  SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | ------: | ------: |
| AVista Base+        |    156M | 26.88% | 31.56% | 34.03% |  38.85% |  47.72% |
| reazonspeech k2     |    159M |  7.42% |  8.49% | 21.94% |  70.81% |  93.04% |
| reazonspeech nemo   |    619M |  8.50% | 10.93% | 29.06% |  83.77% |  98.76% |
| reazonspeech espnet |    118M |  7.44% |  8.30% | 14.45% |  66.15% |  69.34% |
| whisper large-v3    |  1,550M |  7.75% |  8.69% | 13.03% |  60.24% |  98.67% |
| whisper medium      |    769M | 10.07% | 12.27% | 18.80% |  58.00% |  97.35% |
| whisper small       |    244M | 10.82% | 19.44% | 26.75% |  71.33% | 101.84% |

**+ Babble**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |   SNR=0 |  SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | ------: | ------: |
| AVista Base+        |    156M | 26.88% | 30.02% | 36.83% |  54.02% |  82.00% |
| reazonspeech k2     |    159M |  7.42% |  8.24% | 10.17% |  21.65% |  61.57% |
| reazonspeech nemo   |    619M |  8.50% | 10.40% | 14.83% |  31.74% |  77.29% |
| reazonspeech espnet |    118M |  7.44% |  8.85% | 11.75% |  24.59% |  67.27% |
| whisper large-v3    |  1,550M |  7.75% |  8.95% | 12.50% |  30.09% |  81.60% |
| whisper medium      |    769M | 10.07% | 12.52% | 18.18% |  42.27% |  95.43% |
| whisper small       |    244M | 10.82% | 19.72% | 28.24% |  56.72% | 109.61% |

**+ Music**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |  SNR=0 | SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | -----: | -----: |
| AVista Base+        |    156M | 26.88% | 27.91% | 31.29% | 41.01% | 56.38% |
| reazonspeech k2     |    159M |  7.42% |  7.69% |  8.33% |  9.49% | 16.90% |
| reazonspeech nemo   |    619M |  8.50% |  9.28% |  9.97% | 13.65% | 24.61% |
| reazonspeech espnet |    118M |  7.44% |  7.86% |  8.57% | 10.41% | 16.62% |
| whisper large-v3    |  1,550M |  7.75% |  8.16% |  9.01% | 11.23% | 21.26% |
| whisper medium      |    769M | 10.07% | 11.13% | 12.97% | 16.45% | 31.62% |
| whisper small       |    244M | 10.82% | 18.02% | 19.86% | 26.82% | 47.69% |

**+ Environmental Noise**

| Model               | #Params |    N/A | SNR=10 |  SNR=5 |  SNR=0 | SNR=-5 |
| :------------------ | ------: | -----: | -----: | -----: | -----: | -----: |
| AVista Base+        |    156M | 26.88% | 28.65% | 30.60% | 35.43% | 44.16% |
| reazonspeech k2     |    159M |  7.42% |  8.07% |  8.68% | 10.32% | 15.53% |
| reazonspeech nemo   |    619M |  8.50% |  9.31% | 10.16% | 12.71% | 18.32% |
| reazonspeech espnet |    118M |  7.44% |  8.00% |  8.63% | 10.06% | 14.54% |
| whisper large-v3    |  1,550M |  7.75% |  8.46% |  9.17% | 11.98% | 19.36% |
| whisper medium      |    769M | 10.07% | 11.77% | 13.06% | 17.04% | 24.83% |
| whisper small       |    244M | 10.82% | 17.62% | 19.84% | 25.55% | 33.77% |

## Citation

```
@misc{enactic/avista-base-plus,
  title = {avista-base-plus},
  author = {Sasaki, Yuta},
  url = {https://huggingface.co/enactic/avista-base-plus},
  year = {2025}
}

@article{shi2022avhubert,
  author = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
  title = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
  journal = {arXiv preprint arXiv:2201.02184},
  year = {2022}
}

@article{shi2022avsr,
  author = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
  title = {Robust Self-Supervised Audio-Visual Speech Recognition},
  journal = {arXiv preprint arXiv:2201.01763},
  year = {2022}
}
```

## License

[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)