Transformers VibeVoice-ASR-HF adoption

by ZovutVanya - opened May 22

May 22

Hello, the original VibeVoice is not available for inference through the transformers library, but a transformers-compatible version was recently released: https://huggingface.co/microsoft/VibeVoice-ASR-HF. Do you plan to adapt this model to that format?

diaslmb

Inflexion Lab org May 22

Hello, which version of VibeVoice is inference-friendly? So far I have found that this version is slow when served via vLLM. Is the ASR-HF version fast at inference?

ZovutVanya

May 22

I took the example from the model's page. Processing their 41-second audio clip took 4.98 seconds on my NVIDIA GeForce RTX 4090.
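To put those two numbers in the usual ASR terms, here is a minimal sketch that computes the real-time factor (processing time divided by audio length) and its inverse, the speedup over real time. It uses only the figures reported above; it is a single measurement, and whether the 4.98 seconds includes model loading is not stated.

```shell
# Figures reported in the post above (single run, RTX 4090).
audio_seconds=41
processing_seconds=4.98

# Real-time factor: processing time / audio duration (lower is faster).
rtf=$(awk -v p="$processing_seconds" -v a="$audio_seconds" 'BEGIN { printf "%.3f", p / a }')

# Speedup: seconds of audio transcribed per second of compute.
speedup=$(awk -v p="$processing_seconds" -v a="$audio_seconds" 'BEGIN { printf "%.1f", a / p }')

echo "RTF: $rtf  (about ${speedup}x faster than real time)"
```

That works out to an RTF of roughly 0.12, or about 8x faster than real time, for this one clip.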

diaslmb

Inflexion Lab org May 22

Can you test this model and benchmark its inference speed? If the HF version really is quicker, we will look into it.

ZovutVanya

May 22

By "this" you mean your Kazakh finetune? If you tell me the proper way to run inference with it, I will))
And if I have enough VRAM. When I just changed the model ID in Microsoft's example code, the model showed a lot of warnings and ran out of memory.

diaslmb

Inflexion Lab org May 22

By "this", yes, I mean the Kazakh model. Check https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md; I guess you have enough VRAM if you follow all the instructions. If vLLM hits OOM, try lowering the max-model-len parameter.

ZovutVanya

May 22

Hm, well, the main advantage of the HF transformers integration is ease of use, because I am lost with this tutorial, haha.
I don't understand how to lower the GPU usage parameters; CUDA always runs out of memory. And the worst part is that it fails silently: you have to manually check the docker logs to see it.

❯ docker run -d --gpus all --name vibevoice-vllm \
  --ipc=host \
  -p 8001:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py" \
  --gpu-memory-utilization 0.9 \
  --max-model-len 2048

The tutorial says to do the following on "CUDA out of memory":
Reduce --gpu-memory-utilization
Reduce --max-num-seqs
Use smaller --max-model-len

But I have no idea where I should set these or what values to use.
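On the "where" question, one detail of the command above is worth checking. Everything after the image name is passed to the `bash` entrypoint, and `bash -c "..."` assigns any arguments that follow the quoted string to `$0`, `$1`, and so on instead of appending them to the command. So as written, `--gpu-memory-utilization 0.9` and `--max-model-len 2048` probably never reach `start_server.py`. The sketch below shows that behaviour with `echo` standing in for the Python script. Whether `start_server.py` accepts or forwards these flags is an assumption that should be checked against the tutorial.

```shell
# 'echo inner-args:' stands in for 'python3 /app/vllm_plugin/scripts/start_server.py'.

# Flags after the closing quote (as in the command above):
# bash stores them in $0, $1, ... and the inner command never sees them.
outside=$(bash -c "echo inner-args:" --gpu-memory-utilization 0.9 --max-model-len 2048)

# Flags inside the quotes: they are part of the command line that bash runs.
inside=$(bash -c "echo inner-args: --gpu-memory-utilization 0.9 --max-model-len 2048")

echo "outside the quotes -> $outside"
echo "inside the quotes  -> $inside"
```

If the script does forward extra arguments to vLLM, inside the quotes would be the place to lower them.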

diaslmb

Inflexion Lab org May 24

Hi, I checked with 24 GB of VRAM, and it seems impossible to run the vLLM server with this model on that much memory. As for the HF version of the model, I think we will not adapt it, because we specifically need a vLLM-friendly version.
