Transformers VibeVoice-ASR-HF adoption

#1
by ZovutVanya - opened

Hello! The original VibeVoice cannot be run through the transformers library, but it was recently adapted for it: https://huggingface.co/microsoft/VibeVoice-ASR-HF. Do you plan to port this model to that format?

Inflexion Lab org

Hello, which version of VibeVoice is inference-friendly? So far I have found that this version is slow when served through vLLM. Is the ASR-HF version fast at inference?

I took the example from the model's page: processing their 41-second audio clip took 4.98 seconds on my NVIDIA GeForce RTX 4090.
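For reference, the two numbers reported above can be turned into the usual speed metrics. This is only arithmetic on that single measurement, not a benchmark; the variable names are illustrative.

```python
# Speed figures implied by the single measurement reported in the thread.
audio_seconds = 41.0        # length of the example clip
processing_seconds = 4.98   # wall-clock time reported on the RTX 4090

# Real-time factor: processing time per second of audio (lower is faster).
rtf = processing_seconds / audio_seconds
# Speed-up over real time: seconds of audio handled per second of compute.
speedup = audio_seconds / processing_seconds

print(f"RTF = {rtf:.3f}, about {speedup:.1f}x faster than real time")
```

One run on one clip says little about throughput on long recordings or batched requests, so the figure is best read as a rough lower bound on what to expect.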

Inflexion Lab org

Can you test this model and benchmark its inference speed? If the HF version really is faster, we will look into it.
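A minimal sketch of how such a benchmark could be timed, assuming the model is wrapped in some function that takes an audio path and returns text. The `transcribe` callable and the `benchmark` helper are hypothetical names, not part of VibeVoice, transformers, or vLLM.

```python
import time
from statistics import mean
from typing import Callable


def benchmark(
    transcribe: Callable[[str], str],  # hypothetical wrapper around the model
    audio_path: str,
    audio_seconds: float,
    runs: int = 3,
    warmup: int = 1,
) -> dict:
    """Time repeated transcriptions of one clip and report the mean and the RTF."""
    # Warm-up runs are not timed: the first call often pays one-off costs
    # such as weight loading or kernel compilation.
    for _ in range(warmup):
        transcribe(audio_path)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio_path)
        timings.append(time.perf_counter() - start)

    avg = mean(timings)
    return {"mean_seconds": avg, "rtf": avg / audio_seconds}
```

Running the same helper against both the vLLM endpoint and the transformers version, on the same clip, would make the two numbers comparable.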

By "this", do you mean your Kazakh fine-tune? If you tell me the proper way to run inference with it, I will))
That is, if I have enough VRAM. When I tried simply changing the model ID in Microsoft's example code, the model printed a lot of warnings and ran out of memory.

Inflexion Lab org

Yes, by "this" I mean the Kazakh model. See https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-vllm-asr.md; I guess you have enough VRAM if you follow all the instructions. If vLLM hits OOM, try lowering the max-model-len parameter.

Hm, well, the main advantage of the HF transformers integration is ease of use, because I am lost with this tutorial, haha.
I don't understand how to lower the GPU usage parameters; CUDA always runs out of memory. And the worst part is that it fails silently: you have to manually check the docker logs to see it.
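The manual log check can be scripted. A rough sketch, assuming the container is named `vibevoice-vllm` as in the command in this post and that the failure shows up as a line containing "out of memory"; the helper names and the sample log text are made up for illustration.

```python
import subprocess

OOM_MARKER = "out of memory"


def find_oom_lines(log_text: str) -> list[str]:
    """Return the log lines that mention an out-of-memory error."""
    return [line for line in log_text.splitlines() if OOM_MARKER in line.lower()]


def container_logs(name: str = "vibevoice-vllm") -> str:
    """Fetch a container's stdout and stderr via `docker logs`."""
    result = subprocess.run(
        ["docker", "logs", name], capture_output=True, text=True, check=True
    )
    return result.stdout + result.stderr


# Illustrative sample only, not real output from the container.
sample = "INFO loading model\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate ...\n"
print(find_oom_lines(sample))
```

Against a live container the call would be `find_oom_lines(container_logs())`, which needs the docker CLI on the host.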

❯ docker run -d --gpus all --name vibevoice-vllm \
  --ipc=host \
  -p 8001:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py" \
  --gpu-memory-utilization 0.9 \
  --max-model-len 2048

The tutorial says to do the following in case of "CUDA out of memory":
Reduce --gpu-memory-utilization
Reduce --max-num-seqs
Use smaller --max-model-len
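As a rough guide to the first of these, `--gpu-memory-utilization` in vLLM is the fraction of the GPU's memory that the engine may claim, so lowering it leaves more room for anything running outside vLLM's own pool. The sketch below only does that multiplication, assuming the nominal 24 GB of an RTX 4090; `memory_budget_gib` is an illustrative helper, and the numbers say nothing about how much this particular model needs.

```python
def memory_budget_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """Memory vLLM may claim for a given --gpu-memory-utilization fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_gib * gpu_memory_utilization


TOTAL_GIB = 24.0  # assumption: nominal RTX 4090 capacity, other processes ignored

for fraction in (0.9, 0.8, 0.7):
    budget = memory_budget_gib(TOTAL_GIB, fraction)
    print(f"--gpu-memory-utilization {fraction}: {budget:.1f} GiB for vLLM, "
          f"{TOTAL_GIB - budget:.1f} GiB left over")
```

The other two flags shrink what has to fit inside that budget: `--max-model-len` caps the context length per request and `--max-num-seqs` caps how many sequences are processed together.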

But I have no idea where I should set them or what values to use.
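One thing worth checking about "where": in the docker command above, the two flags come after the quoted `-c "..."` string. With `bash -c`, anything after the command string becomes the shell's own `$0`, `$1`, and so on, so the Python script never sees it. The sketch below demonstrates that behaviour with a stand-in program that prints its arguments; it uses `sh`, which treats `-c` the same way. Whether `start_server.py` accepts these flags at all is an assumption, since the thread only names them.

```python
import shlex
import subprocess
import sys

# Stand-in for start_server.py: a tiny program that prints the arguments it receives.
SHOW_ARGS = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]
FLAGS = ["--gpu-memory-utilization", "0.9", "--max-model-len", "2048"]


def run_via_shell(command_string: str, *trailing: str) -> str:
    """Run `sh -c command_string trailing...` and return what the inner program printed."""
    result = subprocess.run(
        ["sh", "-c", command_string, *trailing],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Flags after the quoted string, as in the docker command above:
# the shell keeps them as $0, $1, ... and the inner program gets nothing.
outside = run_via_shell(shlex.join(SHOW_ARGS), *FLAGS)

# Flags inside the quoted string: the inner program receives them.
inside = run_via_shell(shlex.join(SHOW_ARGS + FLAGS))

print("outside:", outside)
print("inside: ", inside)
```

If the script does take these options, moving them inside the quotes, right after `start_server.py`, would be the place to set them. Suitable values depend on the model and the GPU, and the thread does not give any.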
