32GiB VRAM inference using Docker: my working setup

#21
by BigBlueWhale - opened

Production-Ready, Self-Hosted VibeVoice-ASR Server (Open Source, MIT)

I built a complete production server around VibeVoice-ASR that turns Microsoft's proof-of-concept into a robust, deployable inference pipeline. It's fully open source and could serve as a foundation for inference providers looking to host this model.

GitHub: https://github.com/BigBIueWhale/vibe-voice-vendor-1

What it does

End-to-end streaming ASR server: audio in, transcribed text out, token-by-token with zero buffering at any layer.

Client (HTTPS) → Rust TLS Proxy → FastAPI Queue Server → vLLM (OpenAI-compatible API) → GPU

Problems solved beyond Microsoft's proof of concept

1. Deterministic inference (VIBEVOICE_USE_MEAN=1)

The default acoustic tokenizer uses doubly stochastic Gaussian sampling: the noise magnitude itself is random (a single scalar per recording drawn from N(0, 0.625)). Some requests get near-zero noise while others get noise greater than 1, causing unpredictable quality variance for identical audio. We disable this by using the VAE's mean directly, giving deterministic results every time.
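The difference can be sketched as a toy version of the sampling step. All names and the exact role of the 0.625 parameter here are illustrative, not the actual VibeVoice code:

```python
import os
import random

def sample_acoustic_latent(mean, std, rng=random):
    """Toy sketch of the tokenizer's latent sampling (names illustrative).

    Stochastic mode draws the noise *scale* itself from a Gaussian
    (one scalar per recording), so identical audio can land anywhere
    from near-zero noise to noise greater than 1. VIBEVOICE_USE_MEAN=1
    bypasses sampling and returns the VAE mean, giving deterministic
    latents for the same input every time.
    """
    if os.environ.get("VIBEVOICE_USE_MEAN") == "1":
        return list(mean)  # deterministic: same audio -> same latents
    scale = rng.gauss(0.0, 0.625)  # per-recording noise magnitude (assumed std)
    return [m + s * rng.gauss(0.0, 1.0) * scale for m, s in zip(mean, std)]

os.environ["VIBEVOICE_USE_MEAN"] = "1"
a = sample_acoustic_latent([0.1, 0.2], [1.0, 1.0])
b = sample_acoustic_latent([0.1, 0.2], [1.0, 1.0])
assert a == b == [0.1, 0.2]  # identical across calls
```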

2. Tuned vLLM memory parameters

Microsoft's defaults (gpu_memory_utilization=0.8, max_model_len=65536) don't account for the audio encoder's VRAM needs during the forward pass (~700 MiB peak). We use 0.90 utilization with max_model_len=48000, leaving ~3 GiB free for the encoder, which prevents OOM on files longer than ~1 minute while still supporting up to ~48 minutes of audio.
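These two knobs map directly onto vLLM's standard serve flags; the invocation below is an illustrative sketch (the model path and any other flags the repo actually passes are not shown here):

```shell
# Illustrative `vllm serve` invocation; the model path is a placeholder.
# 0.90 utilization covers weights + KV cache; max-model-len 48000 caps the
# context so ~3 GiB stays free for the audio encoder's forward pass.
vllm serve microsoft/VibeVoice-ASR \
  --gpu-memory-utilization 0.90 \
  --max-model-len 48000
```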

3. Production infrastructure

  • True end-to-end SSE streaming (vLLM deltas β†’ httpx β†’ asyncio.Queue β†’ FastAPI StreamingResponse, no buffering anywhere)
  • JWT bearer auth (ES256) with token revocation
  • Rust-based TLS proxy with self-signed cert auto-renewal
  • FIFO job queue with ETA estimation
  • systemd services with auto-restart
  • Zero data storage β€” audio cleared immediately after processing
  • Docker image with everything baked in (zero network requests at runtime, ~85s cold start)

4. Verified configuration parity

Every inference parameter was audited against Microsoft's reference code: prompt template, temperature (0.0), top_p (1.0), audio preprocessing (24kHz, -25 dBFS normalization), dtype (bfloat16), token mapping, and chat template all match. The full investigation is documented in the repo.
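For reference, the -25 dBFS normalization step reduces to a single RMS gain computation. This is a generic sketch of the idea, not VibeVoice's actual preprocessing code:

```python
import math

def normalize_dbfs(samples, target_dbfs=-25.0):
    """Scale a float waveform (full scale = 1.0) to a target RMS level."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # silence: nothing to normalize
    gain = 10 ** (target_dbfs / 20.0) / rms
    return [s * gain for s in samples]

# A full-scale square wave has RMS 1.0 (0 dBFS); after normalization
# its RMS should sit at -25 dBFS, i.e. about 0.0562.
wave = [1.0, -1.0] * 100
out = normalize_dbfs(wave)
rms = math.sqrt(sum(s * s for s in out) / len(out))
print(round(20 * math.log10(rms), 6))  # -> -25.0
```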

Pinned versions

  • VibeVoice: commit 1807b858
  • vLLM: v0.14.1 (required for compatible multimodal APIs)
  • Base image: vllm/vllm-openai:v0.14.1

Why this matters for inference providers

There's currently no hosted inference API for VibeVoice-ASR anywhere: not on Azure, OpenRouter, Together, Replicate, or HuggingFace Inference. The TTS variant is hosted on some platforms, but not ASR. This project provides a working, tested, production-hardened foundation that any provider could build on:

  • The Dockerfile is self-contained and ready to deploy
  • The VIBEVOICE_USE_MEAN=1 fix alone is essential for consistent production quality
  • Memory tuning parameters are documented and battle-tested
  • The vLLM OpenAI-compatible API means standard tooling works out of the box

If you're an inference provider considering adding VibeVoice-ASR to your platform, feel free to use this as a starting point. MIT licensed.


Out of curiosity: this is a 7B model, so why does it require 32 GB of VRAM?

I'm running the Gradio demo locally on a 4080 without issues.

Microsoft's template tells vLLM to use 80% of available VRAM. I had to raise that to 90% of my 32 GiB to get past the initial "test" stage, where the Microsoft code stress-tests the system on startup with the full ~1-hour context length.

In my implementation I disabled that slow startup stress test, but I kept the 90% VRAM utilization because of the context-length requirement.

Interestingly, setting the utilization even higher made it fail: gpu_memory_utilization only covers what vLLM itself reserves (weights and KV cache), and the audio encoder still needs free VRAM outside that budget during its forward pass.
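A back-of-the-envelope estimate makes the budget concrete. The architecture numbers below (28 layers, 4 KV heads of dimension 128, a Qwen2.5-7B-style backbone) are my assumptions, not taken from the model card:

```python
GIB = 2**30

# 7B parameters in bf16 (2 bytes each).
weights_gib = 7e9 * 2 / GIB

# KV cache per token: K and V, per layer, per KV head, per head dim, bf16.
# Architecture numbers below are assumed (Qwen2.5-7B-style), not confirmed.
layers, kv_heads, head_dim, bytes_per = 28, 4, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per  # bytes/token

# KV cache at the full max_model_len=48000 context:
kv_gib = 48000 * kv_per_token / GIB

print(round(weights_gib, 1))  # -> 13.0 (GiB of weights)
print(round(kv_gib, 1))       # -> 2.6 (GiB of KV cache at full context)
```

Under these assumptions the weights alone take ~13 GiB, vLLM pre-allocates KV-cache space up to gpu_memory_utilization on top of that, and the audio encoder needs its own headroom; that would explain why short clips fit on a 16 GB card while the full ~1-hour context gets tight.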

So my suggestion: try it on your 16 GB card with an hour-long recording and see whether it holds up.
