Transformers VibeVoice-ASR-HF adoption
Hello, the original VibeVoice is not available for inference through the transformers library, but it was recently adapted: https://huggingface.co/microsoft/VibeVoice-ASR-HF. Do you plan to adapt this model to that format?
Hello, I wonder which version of VibeVoice is inference-friendly? So far I have found that this version is slow when served via vLLM. Is the ASR-HF version fast at inference?
I took the example from the model's page: processing their 41-second audio took 4.98 seconds on my NVIDIA GeForce RTX 4090.
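To make that measurement comparable across setups, it can be expressed as a real-time factor (processing time divided by audio length). The snippet below is only a sketch of that arithmetic using the two numbers reported above; the variable names are illustrative.

```shell
# Numbers reported in the post above.
audio_seconds=41      # length of the example clip
wall_seconds=4.98     # processing time on the RTX 4090

# Real-time factor: seconds of compute per second of audio (lower is faster).
rtf=$(awk -v a="$audio_seconds" -v w="$wall_seconds" 'BEGIN { printf "%.3f", w / a }')
# Inverse view: how many seconds of audio are processed per second of compute.
speedup=$(awk -v a="$audio_seconds" -v w="$wall_seconds" 'BEGIN { printf "%.1f", a / w }')

echo "real-time factor: $rtf"
echo "speed vs. real time: ${speedup}x"
```

That works out to a real-time factor of about 0.12, or roughly 8 times faster than real time, for this single run.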
Can you test this model and benchmark its inference speed? If the HF version is really quicker, we will look into it.
By "this" you mean your Kazakh finetune? If you tell me the proper way to run inference with it, I will))
And if I have enough VRAM. When I tried just changing the model ID in Microsoft's example code, the model showed a lot of warnings and ran out of memory.
Yes, by "this" I mean the Kazakh model. Check https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md; I guess you will have enough VRAM if you follow all the instructions. If vLLM hits OOM, try lowering the max-model-len parameter.
Hm, well, the main advantage of the HF transformers integration is the ease of use, because I am rather lost with this tutorial, haha.
I don't understand how to lower the GPU usage parameters; CUDA always runs out of memory. And the worst part is that it fails silently: you have to manually check the docker logs to see it.
❯ docker run -d --gpus all --name vibevoice-vllm \
--ipc=host \
-p 8001:8000 \
-e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-v $(pwd):/app \
-w /app \
--entrypoint bash \
vllm/vllm-openai:v0.14.1 \
-c "python3 /app/vllm_plugin/scripts/start_server.py" \
--gpu-memory-utilization 0.9 \
--max-model-len 2048
The tutorial says to do the following on "CUDA out of memory":
- Reduce --gpu-memory-utilization
- Reduce --max-num-seqs
- Use smaller --max-model-len
But I have no idea where I should do that or what values to set them to.
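One thing that can be checked without the container: in the command above, the two flags come after the quoted `-c` string. With `bash -c`, anything after the command string becomes `$0`, `$1`, ... of the shell and is not appended to the command itself. The sketch below illustrates this with `echo received:` as a stand-in for the server script; the stand-in is purely illustrative and does not run start_server.py.

```shell
# Case 1: flags placed AFTER the quoted command string (as in the docker command above).
# bash assigns them to $0, $1, ... so the inner command never sees them.
after=$(bash -c "echo received:" --gpu-memory-utilization 0.9 --max-model-len 2048)

# Case 2: flags placed INSIDE the quoted command string.
# They are now part of the command line that bash executes.
inside=$(bash -c "echo received: --gpu-memory-utilization 0.9 --max-model-len 2048")

echo "after : $after"    # prints "after : received:"
echo "inside: $inside"   # prints "inside: received: --gpu-memory-utilization 0.9 --max-model-len 2048"
```

By that logic, the flags would probably need to go inside the quotes, right after `/app/vllm_plugin/scripts/start_server.py`. Whether start_server.py accepts these flags or forwards them to vLLM is an assumption that this thread does not confirm, and suitable values depend on the GPU.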