32GiB VRAM inference using Docker: my working setup

#21
by BigBlueWhale - opened

Production-Ready, Self-Hosted VibeVoice-ASR Server (Open Source, MIT)

I built a complete production server around VibeVoice-ASR that turns Microsoft's proof-of-concept into a robust, deployable inference pipeline. It's fully open source and could serve as a foundation for inference providers looking to host this model.

GitHub: https://github.com/BigBIueWhale/vibe-voice-vendor-1

What it does

End-to-end streaming ASR server: audio in, transcribed text out, token-by-token with zero buffering at any layer.

Client (HTTPS) → Rust TLS Proxy → FastAPI Queue Server → vLLM (OpenAI-compatible API) → GPU

Problems solved beyond Microsoft's proof of concept

1. Deterministic inference (VIBEVOICE_USE_MEAN=1)

The default acoustic tokenizer uses doubly stochastic Gaussian sampling: the noise magnitude itself is random (a single scalar per recording drawn from N(0, 0.625)). Some requests get near-zero noise while others get noise greater than 1, causing unpredictable quality variance for identical audio. We disable this by using the VAE's mean directly, giving deterministic results every time.
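The difference can be sketched as a toy version of the sampling step. All names and the exact role of the 0.625 parameter here are illustrative, not the actual VibeVoice code:

```python
import os
import random

def sample_acoustic_latent(mean, std, rng=random):
    """Toy sketch of the tokenizer's latent sampling (names illustrative).

    Stochastic mode draws the noise *scale* itself from a Gaussian
    (one scalar per recording), so identical audio can land anywhere
    from near-zero noise to noise greater than 1. VIBEVOICE_USE_MEAN=1
    bypasses sampling and returns the VAE mean, giving deterministic
    latents for the same input every time.
    """
    if os.environ.get("VIBEVOICE_USE_MEAN") == "1":
        return list(mean)  # deterministic: same audio -> same latents
    scale = rng.gauss(0.0, 0.625)  # per-recording noise magnitude (assumed std)
    return [m + s * rng.gauss(0.0, 1.0) * scale for m, s in zip(mean, std)]

os.environ["VIBEVOICE_USE_MEAN"] = "1"
a = sample_acoustic_latent([0.1, 0.2], [1.0, 1.0])
b = sample_acoustic_latent([0.1, 0.2], [1.0, 1.0])
assert a == b == [0.1, 0.2]  # identical across calls
```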

2. Tuned vLLM memory parameters

Microsoft's defaults (gpu_memory_utilization=0.8, max_model_len=65536) don't account for the audio encoder's VRAM needs during the forward pass (~700 MiB peak). We use 0.90 utilization with max_model_len=48000, leaving ~3 GiB free for the encoder, which prevents OOM on files longer than ~1 minute while still supporting up to ~48 minutes of audio.
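These two knobs map directly onto vLLM's standard serve flags; the invocation below is an illustrative sketch (the model path and any other flags the repo actually passes are not shown here):

```shell
# Illustrative `vllm serve` invocation; the model path is a placeholder.
# 0.90 utilization covers weights + KV cache; max-model-len 48000 caps the
# context so ~3 GiB stays free for the audio encoder's forward pass.
vllm serve microsoft/VibeVoice-ASR \
  --gpu-memory-utilization 0.90 \
  --max-model-len 48000
```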

3. Production infrastructure

  • True end-to-end SSE streaming (vLLM deltas β†’ httpx β†’ asyncio.Queue β†’ FastAPI StreamingResponse, no buffering anywhere)
  • JWT bearer auth (ES256) with token revocation
  • Rust-based TLS proxy with self-signed cert auto-renewal
  • FIFO job queue with ETA estimation
  • systemd services with auto-restart
  • Zero data storage β€” audio cleared immediately after processing
  • Docker image with everything baked in (zero network requests at runtime, ~85s cold start)

4. Verified configuration parity

Every inference parameter was audited against Microsoft's reference code: prompt template, temperature (0.0), top_p (1.0), audio preprocessing (24kHz, -25 dBFS normalization), dtype (bfloat16), token mapping, and chat template all match. The full investigation is documented in the repo.
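For reference, the -25 dBFS normalization step reduces to a single RMS gain computation. This is a generic sketch of the idea, not VibeVoice's actual preprocessing code:

```python
import math

def normalize_dbfs(samples, target_dbfs=-25.0):
    """Scale a float waveform (full scale = 1.0) to a target RMS level."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # silence: nothing to normalize
    gain = 10 ** (target_dbfs / 20.0) / rms
    return [s * gain for s in samples]

# A full-scale square wave has RMS 1.0 (0 dBFS); after normalization
# its RMS should sit at -25 dBFS, i.e. about 0.0562.
wave = [1.0, -1.0] * 100
out = normalize_dbfs(wave)
rms = math.sqrt(sum(s * s for s in out) / len(out))
print(round(20 * math.log10(rms), 6))  # -> -25.0
```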

Pinned versions

  • VibeVoice: commit 1807b858
  • vLLM: v0.14.1 (required for compatible multimodal APIs)
  • Base image: vllm/vllm-openai:v0.14.1

Why this matters for inference providers

There's currently no hosted inference API for VibeVoice-ASR anywhere: not on Azure, OpenRouter, Together, Replicate, or HuggingFace Inference. The TTS variant is hosted on some platforms, but not ASR. This project provides a working, tested, production-hardened foundation that any provider could build on:

  • The Dockerfile is self-contained and ready to deploy
  • The VIBEVOICE_USE_MEAN=1 fix alone is essential for consistent production quality
  • Memory tuning parameters are documented and battle-tested
  • The vLLM OpenAI-compatible API means standard tooling works out of the box

If you're an inference provider considering adding VibeVoice-ASR to your platform, feel free to use this as a starting point. MIT licensed.


Out of curiosity: this is a 7B model, so why does it require 32 GB of VRAM?

I'm running the Gradio demo locally on a 4080 without issues.

Microsoft's template tells vLLM to use 80% of available VRAM. I had to raise that to 90% of my 32 GiB to get past the initial "test" stage, where the Microsoft code stress-tests the system on startup with the full ~1-hour context length.

In my implementation I disabled that slow startup stress test, but I kept the 90% VRAM utilization because of the context-length requirement.

Interestingly, setting the utilization even higher made it fail: gpu_memory_utilization only covers what vLLM itself reserves (weights and KV cache), and the audio encoder still needs free VRAM outside that budget during its forward pass.
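A back-of-the-envelope estimate makes the budget concrete. The architecture numbers below (28 layers, 4 KV heads of dimension 128, a Qwen2.5-7B-style backbone) are my assumptions, not taken from the model card:

```python
GIB = 2**30

# 7B parameters in bf16 (2 bytes each).
weights_gib = 7e9 * 2 / GIB

# KV cache per token: K and V, per layer, per KV head, per head dim, bf16.
# Architecture numbers below are assumed (Qwen2.5-7B-style), not confirmed.
layers, kv_heads, head_dim, bytes_per = 28, 4, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per  # bytes/token

# KV cache at the full max_model_len=48000 context:
kv_gib = 48000 * kv_per_token / GIB

print(round(weights_gib, 1))  # -> 13.0 (GiB of weights)
print(round(kv_gib, 1))       # -> 2.6 (GiB of KV cache at full context)
```

Under these assumptions the weights alone take ~13 GiB, vLLM pre-allocates KV-cache space up to gpu_memory_utilization on top of that, and the audio encoder needs its own headroom; that would explain why short clips fit on a 16 GB card while the full ~1-hour context gets tight.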

So my suggestion: try it on your 16 GB card with an hour-long recording and see whether it holds up.
