Voxtral Streaming ASR
Installable Python package, realtime clients, benchmark utilities, and test scaffolding for multilingual streaming speech recognition using Voxtral Realtime served through vLLM.
Author: Patrick Lumbantobing, VertoX AI
License: Apache License 2.0
Repository target: Hugging Face org vertox-ai
Overview
This repository packages a production-oriented client-side workflow for experimenting with Voxtral Mini 4B Realtime 2602 on vLLM. It includes:
- a reusable internal package under voxtral_asr/,
- a realtime file playback client,
- a realtime microphone client,
- benchmark sweep tooling for preset comparison,
- JSON preset configurations,
- unit tests for the audio frontend and configuration loading,
- packaging metadata for editable and standard installs.
The design goal is simple: make it easy to reproduce and tune end-to-end realtime ASR behavior, especially around time to first partial, partial cadence, finalization latency, voice activity detection, and mic signal quality.
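To make the latency terms above concrete, here is a minimal sketch of how they could be derived from event timestamps. The `LatencySummary` type and `summarize_latency` helper are hypothetical names chosen for illustration and are not taken from voxtral_asr/metrics.py; timestamps are assumed to be seconds on a common monotonic clock.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class LatencySummary:
    time_to_first_partial_s: Optional[float]
    mean_partial_interval_s: Optional[float]
    finalization_latency_s: Optional[float]


def summarize_latency(
    speech_start_s: float,
    partial_times_s: Sequence[float],
    audio_end_s: float,
    final_time_s: Optional[float],
) -> LatencySummary:
    """Hypothetical helper: derive latency metrics from event timestamps."""
    partials = sorted(partial_times_s)
    # Time to first partial: speech onset until the first partial arrives.
    ttfp = partials[0] - speech_start_s if partials else None
    # Partial cadence: mean gap between consecutive partial transcripts.
    gaps = [b - a for a, b in zip(partials, partials[1:])]
    cadence = mean(gaps) if gaps else None
    # Finalization latency: end of audio until the final transcript arrives.
    finalization = final_time_s - audio_end_s if final_time_s is not None else None
    return LatencySummary(ttfp, cadence, finalization)


# Illustrative timestamps only, not measured results.
summary = summarize_latency(
    speech_start_s=0.0,
    partial_times_s=[0.5, 1.0, 1.5],
    audio_end_s=2.0,
    final_time_s=2.25,
)
print(summary)
```

Keeping the three quantities separate matters when tuning, because a change that improves time to first partial can leave cadence or finalization latency unchanged.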
Features
- Installable package with pyproject.toml
- Realtime file streaming client for deterministic benchmarking
- Realtime microphone streaming client for live testing
- Benchmark sweep runner that iterates presets and writes CSV output
- Voice activity detection (VAD) and audio frontend utilities
- Mic diagnostics for RMS levels, clipping, and speech/silence statistics
- Unit tests with pytest
- Linting with Ruff
- Apache-2.0 licensing and citation metadata
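As a rough illustration of what the mic diagnostics feature measures, the sketch below computes RMS level and clipping ratio for a block of 16-bit PCM samples. It uses only the standard library; `mic_diagnostics` and its return keys are hypothetical and do not describe the package's actual interface.

```python
import math
from array import array


def mic_diagnostics(samples: array, clip_level: int = 32767) -> dict:
    """Hypothetical helper: RMS (linear and dBFS) and clipping ratio for int16 PCM."""
    if len(samples) == 0:
        return {"rms": 0.0, "rms_dbfs": float("-inf"), "clip_ratio": 0.0}
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # dBFS is relative to int16 full scale (32768).
    rms_dbfs = 20.0 * math.log10(rms / 32768.0) if rms > 0 else float("-inf")
    # Count samples at or beyond the clip level in either polarity.
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    return {"rms": rms, "rms_dbfs": rms_dbfs, "clip_ratio": clipped / len(samples)}


# A constant half-scale signal sits at roughly -6 dBFS with no clipping.
half_scale = array("h", [16384] * 4)
print(mic_diagnostics(half_scale))
```

A persistently low RMS suggests raising input gain, while a non-zero clipping ratio suggests lowering it.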
Repository layout
.
├── LICENSE
├── NOTICE
├── CITATION.cff
├── README.md
├── pyproject.toml
├── .gitignore
├── Makefile
├── realtime_test.py
├── realtime_test_mic.py
├── configs/
│ ├── preset_low_latency.json
│ ├── preset_balanced.json
│ ├── preset_high_accuracy.json
│ └── preset_low_latency_plus.json
├── benchmarks/
│ ├── __init__.py
│ └── run_latency_sweep.py
├── samples/
│ └── sample.wav
├── tests/
│ ├── test_audio.py
│ └── test_config.py
└── voxtral_asr/
  ├── __init__.py
  ├── audio.py
  ├── config.py
  ├── metrics.py
  └── protocol.py
Prerequisites
Server-side
You need a running vLLM server exposing Voxtral Realtime on the /v1/realtime endpoint.
An NVIDIA L4 or better is recommended, e.g., an AWS g6.xlarge instance. On an L4, expect an RTFx of about 2, i.e., audio is processed roughly twice as fast as real time.
Example:
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
--tokenizer-mode mistral \
--config-format mistral \
--load-format mistral \
--trust-remote-code \
--compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
--tensor-parallel-size 1 \
--max-model-len 16384 \
--max-num-batched-tokens 512 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 --port 8766
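Before starting a client, it can help to confirm that the server port accepts connections. The sketch below is a plain TCP reachability check using the standard library; `server_reachable` is a hypothetical helper, and a successful result only shows that something is listening on the port, not that the /v1/realtime endpoint is ready.

```python
import socket


def server_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Hypothetical helper: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


# Host and port match the example server command above.
print(server_reachable("127.0.0.1", 8766))
```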
Client-side
- Python 3.10+
- Linux recommended for microphone streaming
- A functional microphone device for realtime_test_mic.py
Installation
Standard install
pip install .
Editable install for development
pip install -e ".[dev]"
Optional file resampling support
pip install -e ".[dev,file]"
Quick start
1. Realtime file test
python realtime_test.py sample.wav 127.0.0.1 8766 --config configs/preset_low_latency_plus.json
Or through the installed console script:
voxtral-file sample.wav --host 127.0.0.1 --port 8766 --config configs/preset_low_latency_plus.json
2. Realtime microphone test
python realtime_test_mic.py \
--config configs/preset_low_latency_plus.json \
--host 127.0.0.1 \
--port 8766 \
--wav-out mic_capture.wav \
--log-level INFO
Or through the installed console script:
voxtral-mic \
--config configs/preset_low_latency_plus.json \
--host 127.0.0.1 \
--port 8766 \
--wav-out mic_capture.wav \
--log-level INFO
3. Benchmark preset sweep
python benchmarks/run_latency_sweep.py \
--audio sample.wav \
--host 127.0.0.1 \
--port 8766 \
--output results/latency_sweep.csv \
--repeat 3 \
--final-wait-s 5.0
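Once a sweep has written its CSV, a per-preset summary is usually the first thing to look at. The sketch below groups rows by preset and takes the median of one column. The column names `preset` and `ttfp_ms` are assumptions made for illustration; check the header of your own CSV and adjust them.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def median_by_preset(csv_file, value_column: str = "ttfp_ms") -> dict:
    """Group rows by the (assumed) `preset` column and take the median value."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["preset"]].append(float(row[value_column]))
    return {name: median(values) for name, values in groups.items()}


# Made-up rows for illustration only; these are not measured results.
example = io.StringIO(
    "preset,ttfp_ms\n"
    "a,100\n"
    "a,300\n"
    "a,200\n"
    "b,50\n"
)
print(median_by_preset(example))
```

For a real run, pass an open file instead, for example `open("results/latency_sweep.csv", newline="")`. The median is used rather than the mean so that a single slow repeat does not dominate a preset's result.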
Presets
preset_low_latency.json
Fastest partials, useful when responsiveness matters more than transcript stability.
preset_balanced.json
A conservative baseline for general testing.
preset_high_accuracy.json
Longer commits and more conservative endpointing for better stability.
preset_low_latency_plus.json
Recommended default for live demos based on tuning results: low first-partial latency with improved mic levels and slightly more conservative segmentation.
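For orientation, the sketch below shows one way a preset could be parsed and validated. It covers only the three tuning keys this README names (commit_interval_ms, vad_rms_threshold, vad_hangover_ms); the shipped presets may contain other keys, and the `PresetSketch` class, `load_preset` function, and all values shown are hypothetical rather than the behavior of voxtral_asr/config.py.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PresetSketch:
    commit_interval_ms: int
    vad_rms_threshold: float
    vad_hangover_ms: int


def load_preset(text: str) -> PresetSketch:
    """Hypothetical loader: parse JSON and fail early on missing keys."""
    raw = json.loads(text)
    known = {f.name for f in fields(PresetSketch)}
    missing = known - set(raw)
    if missing:
        raise ValueError(f"preset is missing keys: {sorted(missing)}")
    # Keys this sketch does not know about are ignored.
    return PresetSketch(**{k: raw[k] for k in known})


# Placeholder values, not the contents of any shipped preset.
preset = load_preset(
    '{"commit_interval_ms": 200, "vad_rms_threshold": 0.01, "vad_hangover_ms": 300}'
)
print(preset)
```

Failing early on a missing key avoids silently benchmarking with defaults you did not intend.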
Tuning guidance
Time to first partial
The most important client-side levers are:
- commit_interval_ms
- vad_rms_threshold
- vad_hangover_ms
- microphone input level
If first partial is too slow:
- reduce commit_interval_ms,
- lower vad_rms_threshold cautiously,
- ensure raw mic RMS is high enough,
- avoid over-aggressive silence suppression that delays speech onset.
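To show how the RMS threshold and hangover interact, here is a minimal energy-based VAD sketch. It assumes float samples in [-1, 1] and fixed-length frames, and it borrows the parameter names vad_rms_threshold and vad_hangover_ms from the presets; the actual logic in voxtral_asr/audio.py may differ.

```python
import math
from typing import Iterable, List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame; 0.0 for an empty frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0


def vad_with_hangover(
    frames: Iterable[Sequence[float]],
    frame_ms: int,
    vad_rms_threshold: float,
    vad_hangover_ms: int,
) -> List[bool]:
    """Illustrative VAD: a frame is speech if loud, or within the hangover window."""
    hangover_frames = math.ceil(vad_hangover_ms / frame_ms)
    remaining = 0
    decisions = []
    for frame in frames:
        if frame_rms(frame) >= vad_rms_threshold:
            remaining = hangover_frames  # re-arm the hangover on every loud frame
            decisions.append(True)
        elif remaining > 0:
            remaining -= 1  # quiet frame, but still inside the hangover window
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions


loud, quiet = [0.5] * 4, [0.0] * 4
flags = vad_with_hangover(
    [quiet, loud, quiet, quiet, quiet],
    frame_ms=20,
    vad_rms_threshold=0.1,
    vad_hangover_ms=40,
)
print(flags)  # [False, True, True, True, False]
```

In this sketch the threshold alone decides when speech starts, so a threshold set too high for a quiet microphone delays onset and therefore the first partial. The hangover only affects how long speech is held after the level drops.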
Accuracy vs latency
In practice, better mic gain often matters more than minor frontend tweaks. If the transcript misses rare words or proper nouns, improve the acoustic signal first, then adjust commit interval and VAD conservatively.
Punctuation
Realtime punctuation is partly model-intrinsic. Minor punctuation smoothing can be added as a post-processing layer if you need a cleaner demo output without changing the streaming behavior.
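A post-processing layer of this kind can be small. The sketch below is one possible set of regex rules, not something shipped in this repository: it normalizes whitespace, removes spaces before punctuation, collapses repeated punctuation marks, and capitalizes sentence starts. The `smooth_punctuation` name is hypothetical, and the capitalization rule only handles ASCII lowercase letters, so multilingual output would need a broader rule.

```python
import re


def smooth_punctuation(text: str) -> str:
    """Illustrative cleanup for display; does not alter the streaming pipeline."""
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove whitespace that precedes a punctuation mark.
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)
    # Collapse repeated identical marks (note: this also shortens "..." to ".").
    text = re.sub(r"([,.!?;:])\1+", r"\1", text)
    # Capitalize the first letter of the text and of each new sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text


print(smooth_punctuation("hello  world ,, how are you ?  fine .. thanks"))
# Hello world, how are you? Fine. Thanks
```

Applying this only to text that is displayed, and not to what is logged for benchmarking, keeps latency and accuracy measurements comparable across runs.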
Development workflow
Run tests
pytest
Run linter
ruff check .
Auto-fix common lint issues
ruff check . --fix
ruff format .
Makefile targets
If you keep the provided Makefile, typical commands are:
make install-dev
make test
make lint
make test-file AUDIO=sample.wav
make test-mic WAV_OUT=mic_capture.wav
make latency-sweep AUDIO=sample.wav CSV_OUT=results/latency_sweep.csv
Citation
BibTeX
@software{lumbantobing2026voxtralstreamingasr,
author = {Patrick Lumbantobing},
title = {Voxtral Streaming ASR},
year = {2026},
organization = {VertoX AI},
url = {https://huggingface.co/vertox-ai/voxtral-streaming-asr},
license = {Apache-2.0}
}
Author
Patrick Lumbantobing
VertoX AI
License
This repository is licensed under the Apache License 2.0. See LICENSE and NOTICE.