32GiB VRAM inference using Docker: my working setup
Production-Ready, Self-Hosted VibeVoice-ASR Server (Open Source, MIT)
I built a complete production server around VibeVoice-ASR that turns Microsoft's proof-of-concept into a robust, deployable inference pipeline. It's fully open source and could serve as a foundation for inference providers looking to host this model.
GitHub: https://github.com/BigBIueWhale/vibe-voice-vendor-1
What it does
End-to-end streaming ASR server: audio in, transcribed text out, token-by-token with zero buffering at any layer.
Client (HTTPS) → Rust TLS Proxy → FastAPI Queue Server → vLLM (OpenAI-compatible API) → GPU
Problems solved beyond Microsoft's proof of concept
1. Deterministic inference (VIBEVOICE_USE_MEAN=1)
The default acoustic tokenizer uses doubly stochastic Gaussian sampling: the noise magnitude itself is random (a single scalar per recording drawn from N(0, 0.625)). Some requests get near-zero noise, others get noise >1, causing unpredictable quality variance for identical audio. We disable this by using the VAE's mean directly, giving deterministic results every time.
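The difference can be illustrated with a minimal sketch (the function and variable names here are illustrative, not VibeVoice's actual API): the default path scales the VAE's noise by a scalar drawn once per recording, while the `VIBEVOICE_USE_MEAN=1` path returns the mean directly.

```python
import numpy as np

def encode_latent(mean: np.ndarray, std: np.ndarray,
                  use_mean: bool, rng: np.random.Generator) -> np.ndarray:
    """Illustrative sketch of the doubly stochastic sampling described above.

    `mean` and `std` stand in for the VAE encoder's output distribution.
    """
    if use_mean:
        # VIBEVOICE_USE_MEAN=1 path: deterministic, identical latent every run.
        return mean
    # Default path: the noise *magnitude* itself is random -- one scalar
    # per recording from N(0, 0.625), then per-element Gaussian noise.
    noise_scale = rng.normal(0.0, 0.625)
    return mean + noise_scale * std * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
mean, std = np.zeros(4), np.ones(4)
a = encode_latent(mean, std, use_mean=True, rng=rng)
b = encode_latent(mean, std, use_mean=True, rng=rng)
assert np.array_equal(a, b)  # deterministic: identical latents across runs
```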
2. Tuned vLLM memory parameters
Microsoft's defaults (gpu_memory_utilization=0.8, max_model_len=65536) don't account for the audio encoder's VRAM needs during the forward pass (~700 MiB peak). We use 0.90 utilization with max_model_len=48000, leaving ~3 GiB free for the encoder, preventing OOM on files longer than ~1 minute while still supporting up to ~48 minutes of audio.
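The arithmetic behind those numbers, as a quick sanity check (the 32 GiB total and ~700 MiB encoder peak come from this post; everything else is derived):

```python
TOTAL_VRAM_GIB = 32.0          # the 32 GiB card from the title
GPU_MEMORY_UTILIZATION = 0.90  # fraction vLLM is allowed to claim
ENCODER_PEAK_GIB = 0.7         # ~700 MiB audio-encoder peak during forward pass

# VRAM left outside vLLM's budget for the audio encoder
headroom_gib = TOTAL_VRAM_GIB * (1.0 - GPU_MEMORY_UTILIZATION)
print(f"Free for encoder: {headroom_gib:.1f} GiB")  # -> 3.2 GiB
assert headroom_gib > ENCODER_PEAK_GIB

# max_model_len=48000 supporting ~48 minutes implies roughly this token rate
MAX_MODEL_LEN = 48_000
MAX_AUDIO_MINUTES = 48
tokens_per_minute = MAX_MODEL_LEN / MAX_AUDIO_MINUTES  # ~1000 tokens/min of audio
```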
3. Production infrastructure
- True end-to-end SSE streaming (vLLM deltas → httpx → asyncio.Queue → FastAPI StreamingResponse, no buffering anywhere)
- JWT bearer auth (ES256) with token revocation
- Rust-based TLS proxy with self-signed cert auto-renewal
- FIFO job queue with ETA estimation
- systemd services with auto-restart
- Zero data storage: audio cleared immediately after processing
- Docker image with everything baked in (zero network requests at runtime, ~85s cold start)
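The zero-buffering relay at the heart of the SSE pipeline can be sketched with plain asyncio (names are illustrative; in the real server the upstream iterator is httpx consuming vLLM's SSE stream, and the consumer side is FastAPI's StreamingResponse):

```python
import asyncio

async def relay(upstream, queue: asyncio.Queue) -> None:
    """Forward each upstream delta into the queue the moment it arrives."""
    async for delta in upstream:
        await queue.put(delta)   # no batching, no buffering
    await queue.put(None)        # sentinel: stream finished

async def consume(queue: asyncio.Queue):
    """What the response layer would iterate, token by token."""
    while (delta := await queue.get()) is not None:
        yield delta

async def fake_vllm_deltas():
    # Stand-in for vLLM's SSE deltas arriving over the network
    for tok in ["Hello", ", ", "world"]:
        await asyncio.sleep(0)
        yield tok

async def main() -> str:
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(relay(fake_vllm_deltas(), queue))
    return "".join([d async for d in consume(queue)])

print(asyncio.run(main()))  # -> Hello, world
```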
4. Verified configuration parity
Every inference parameter was audited against Microsoft's reference code: prompt template, temperature (0.0), top_p (1.0), audio preprocessing (24kHz, -25 dBFS normalization), dtype (bfloat16), token mapping, and chat template all match. Full investigation documented in the repo.
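The -25 dBFS normalization step can be sketched in a few lines of numpy. One caveat: the post doesn't restate whether the reference normalizes by RMS or by peak, so this sketch assumes RMS level, the common choice for speech loudness:

```python
import numpy as np

TARGET_DBFS = -25.0   # target level from the audited config
TARGET_SR = 24_000    # 24 kHz sample rate from the audited config

def normalize_dbfs(audio: np.ndarray, target_dbfs: float = TARGET_DBFS) -> np.ndarray:
    """Scale audio so its RMS level sits at `target_dbfs` (full scale = 1.0)."""
    rms = np.sqrt(np.mean(audio ** 2))
    if rms == 0:
        return audio  # silence: nothing to normalize
    gain = 10 ** (target_dbfs / 20) / rms
    return audio * gain

# A 1-second 440 Hz tone at 24 kHz, deliberately too loud
t = np.arange(TARGET_SR) / TARGET_SR
audio = 0.9 * np.sin(2 * np.pi * 440 * t)
out = normalize_dbfs(audio)
level_dbfs = 20 * np.log10(np.sqrt(np.mean(out ** 2)))
print(round(level_dbfs, 2))  # -> -25.0
```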
Pinned versions
- VibeVoice: commit `1807b858`
- vLLM: `v0.14.1` (required for compatible multimodal APIs)
- Base image: `vllm/vllm-openai:v0.14.1`
Why this matters for inference providers
There's currently no hosted inference API for VibeVoice-ASR anywhere: not on Azure, OpenRouter, Together, Replicate, or HuggingFace Inference. The TTS variant is hosted on some platforms, but not ASR. This project provides a working, tested, production-hardened foundation that any provider could build on:
- The Dockerfile is self-contained and ready to deploy
- The `VIBEVOICE_USE_MEAN=1` fix alone is essential for consistent production quality
- Memory tuning parameters are documented and battle-tested
- The vLLM OpenAI-compatible API means standard tooling works out of the box
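Since the endpoint speaks the OpenAI chat-completions protocol, standard clients can drive it. A minimal sketch of building a streaming request (the model name, the `input_audio` content-part shape, and the audio format here are assumptions based on vLLM's multimodal chat conventions, not verified against this repo):

```python
import base64

def build_request(audio_bytes: bytes, model: str = "vibevoice-asr") -> dict:
    """Build an OpenAI-style chat payload carrying base64-encoded audio.

    The "input_audio" content-part type follows vLLM's multimodal chat
    format; the model name is a placeholder.
    """
    return {
        "model": model,
        "stream": True,          # request token-by-token SSE deltas
        "temperature": 0.0,      # matches the audited reference config
        "top_p": 1.0,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {
                    "data": base64.b64encode(audio_bytes).decode(),
                    "format": "wav",
                },
            }],
        }],
    }

payload = build_request(b"\x00\x01")
print(payload["stream"], payload["temperature"])  # -> True 0.0
```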
If you're an inference provider considering adding VibeVoice-ASR to your platform, feel free to use this as a starting point. MIT licensed.
Out of curiosity: this is a 7B model, why does this require 32 GB of VRAM?
I'm running the Gradio demo locally on a 4080 without issues.
Microsoft's template defaults to claiming 80% of available VRAM. I had to raise that to 90% of my 32 GiB to get past the initial "test" stage, where the Microsoft code stress-tests the system on startup at the full ~1-hour context length.
In my implementation I disabled that slow startup stress test, but I kept the 90% VRAM utilization because of the context-length requirement.
Interestingly, when I set the utilization even higher it failed: gpu_memory_utilization only budgets vLLM's own allocation (weights plus KV cache for the context length), and the audio encoder still needs free VRAM outside that budget during the forward pass.
So my suggestion: try it on your 16 GB VRAM system with an hour-long recording and see if it survives.