Instructions to use nguyenvulebinh/AV-HuBERT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use nguyenvulebinh/AV-HuBERT with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="nguyenvulebinh/AV-HuBERT")# Load model directly from transformers import AutoProcessor, AutoModelForSeq2SeqLM processor = AutoProcessor.from_pretrained("nguyenvulebinh/AV-HuBERT") model = AutoModelForSeq2SeqLM.from_pretrained("nguyenvulebinh/AV-HuBERT") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use nguyenvulebinh/AV-HuBERT with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "nguyenvulebinh/AV-HuBERT" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nguyenvulebinh/AV-HuBERT", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/nguyenvulebinh/AV-HuBERT
- SGLang
How to use nguyenvulebinh/AV-HuBERT with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "nguyenvulebinh/AV-HuBERT" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nguyenvulebinh/AV-HuBERT", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "nguyenvulebinh/AV-HuBERT" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "nguyenvulebinh/AV-HuBERT", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use nguyenvulebinh/AV-HuBERT with Docker Model Runner:
docker model run hf.co/nguyenvulebinh/AV-HuBERT
Huggingface Implementation of AV-HuBERT on the MuAViC Dataset
This repository contains a Huggingface implementation of the AV-HuBERT (Audio-Visual Hidden Unit BERT) model, specifically trained and tested on the MuAViC (Multilingual Audio-Visual Corpus) dataset. AV-HuBERT is a self-supervised model designed for audio-visual speech recognition, leveraging both audio and visual modalities to achieve robust performance, especially in noisy environments.
Key features of this repository include:
Pre-trained Models: Access pre-trained AV-HuBERT models fine-tuned on the MuAViC dataset. The pre-trained model been exported from MuAViC repository.
Inference scripts: Easily pipelines using Huggingface’s interface.
Data preprocessing scripts: Including normalize frame rate, extract lips and audio.
Inference code
git clone https://github.com/nguyenvulebinh/AV-HuBERT-S2S.git
cd AV-HuBERT-S2S
conda create -n avhuberts2s python=3.9
conda activate avhuberts2s
pip install -r requirements.txt
python run_example.py
from src.model.avhubert2text import AV2TextForConditionalGeneration
from src.dataset.load_data import load_feature
from transformers import Speech2TextTokenizer
import torch
if __name__ == "__main__":
# Choose language to run example
AVAILABEL_LANGUAGES = ["ar", "de", "el", "en", "es", "fr", "it", "pt", "ru", "multilingual"]
language = "ru"
assert language in AVAILABEL_LANGUAGES, f"Language {language} is not available, please choose one of {AVAILABEL_LANGUAGES}"
# Load model and tokenizer
model_name_or_path = f"nguyenvulebinh/AV-HuBERT-MuAViC-{language}"
model = AV2TextForConditionalGeneration.from_pretrained(model_name_or_path, cache_dir='./model-bin')
tokenizer = Speech2TextTokenizer.from_pretrained(model_name_or_path, cache_dir='./model-bin')
model = model.cuda().eval()
# Load example video and audio
video_example = f"./example/video_processed/{language}_lip_movement.mp4"
audio_example = f"./example/video_processed/{language}_audio.wav"
if not os.path.exists(video_example) or not os.path.exists(audio_example):
print(f"WARNING: Example video and audio for {language} is not available english will be used instead")
video_example = f"./example/video_processed/en_lip_movement.mp4"
audio_example = f"./example/video_processed/en_audio.wav"
# Load and process example
sample = load_feature(
video_example,
audio_example
)
audio_feats = sample['audio_source'].cuda()
video_feats = sample['video_source'].cuda()
attention_mask = torch.BoolTensor(audio_feats.size(0), audio_feats.size(-1)).fill_(False).cuda()
# Generate text
output = model.generate(
audio_feats,
attention_mask=attention_mask,
video=video_feats,
max_length=1024,
)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
Data preprocessing scripts
mkdir model-bin
cd model-bin
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/20words_mean_face.npy .
wget https://huggingface.co/nguyenvulebinh/AV-HuBERT/resolve/main/shape_predictor_68_face_landmarks.dat .
# raw video only support 4:3 ratio now
cp raw_video.mp4 ./example/
python src/dataset/video_to_audio_lips.py
Pretrained AVSR model
| Languages | Huggingface |
|---|---|
| Arabic | Checkpoint-AR |
| German | Checkpoint-DE |
| Greek | Checkpoint-EL |
| English | Checkpoint-EN |
| Spanish | Checkpoint-ES |
| French | Checkpoint-FR |
| Italian | Checkpoint-IT |
| Portuguese | Checkpoint-PT |
| Russian | Checkpoint-RU |
| Multilingual | Checkpoint-ar_de_el_es_fr_it_pt_ru |
Acknowledgments
AV-HuBERT: A significant portion of the codebase in this repository has been adapted from the original AV-HuBERT implementation.
MuAViC Repository: We also gratefully acknowledge the creators of the MuAViC dataset and repository for providing the pre-trained models used in this project
License
CC-BY-NC 4.0
Citation
@article{anwar2023muavic,
title={MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation},
author={Anwar, Mohamed and Shi, Bowen and Goswami, Vedanuj and Hsu, Wei-Ning and Pino, Juan and Wang, Changhan},
journal={arXiv preprint arXiv:2303.00628},
year={2023}
}
- Downloads last month
- 138