Voxtral Streaming ASR

Installable Python package, realtime clients, benchmark utilities, and test scaffolding for multilingual streaming speech recognition using Voxtral Realtime served through vLLM.

Author: Patrick Lumbantobing, VertoX AI
License: Apache License 2.0
Repository target: Hugging Face org vertox-ai

Overview

This repository packages a production-oriented client-side workflow for experimenting with Voxtral Mini 4B Realtime 2602 on vLLM. It includes:

a reusable internal package under voxtral_asr/,
a realtime file playback client,
a realtime microphone client,
benchmark sweep tooling for preset comparison,
JSON preset configurations,
unit tests for the audio frontend and configuration loading,
packaging metadata for editable and standard installs.

The design goal is simple: make it easy to reproduce and tune end-to-end realtime ASR behavior, especially around time to first partial, partial cadence, finalization latency, voice activity detection, and mic signal quality.
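As a rough illustration, the three latency figures named above could be derived from client-side timestamps along the following lines. This is a minimal sketch: the function name `summarize_latency` and its arguments are hypothetical and are not taken from voxtral_asr/metrics.py.

```python
from statistics import mean


def summarize_latency(speech_start_s, partial_times_s, speech_end_s, final_time_s):
    """Derive three latency figures from timestamps (seconds, one shared clock).

    Hypothetical helper for illustration; not the package's metrics API.
    """
    partials = sorted(partial_times_s)
    # Time to first partial: delay between speech onset and the first partial.
    ttfp = partials[0] - speech_start_s if partials else None
    # Partial cadence: mean gap between consecutive partials.
    gaps = [b - a for a, b in zip(partials, partials[1:])]
    cadence = mean(gaps) if gaps else None
    # Finalization latency: delay between end of speech and the final transcript.
    finalization = final_time_s - speech_end_s
    return {
        "ttfp_s": ttfp,
        "partial_cadence_s": cadence,
        "finalization_s": finalization,
    }
```

All timestamps are assumed to come from the same monotonic clock on the client, so network and server time are both included in each figure.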

Features

Installable package with pyproject.toml
Realtime file streaming client for deterministic benchmarking
Realtime microphone streaming client for live testing
Benchmark sweep runner that iterates presets and writes CSV output
Voice activity detection (VAD) and audio frontend utilities
Mic diagnostics for RMS levels, clipping, and speech/silence statistics
Unit tests with pytest
Linting with Ruff
Apache-2.0 licensing and citation metadata

Repository layout

.
├── LICENSE
├── NOTICE
├── CITATION.cff
├── README.md
├── pyproject.toml
├── .gitignore
├── Makefile
├── realtime_test.py
├── realtime_test_mic.py
├── configs/
│   ├── preset_low_latency.json
│   ├── preset_balanced.json
│   ├── preset_high_accuracy.json
│   └── preset_low_latency_plus.json
├── benchmarks/
│   ├── __init__.py
│   └── run_latency_sweep.py
├── samples/
│   └── sample.wav
├── tests/
│   ├── test_audio.py
│   └── test_config.py
└── voxtral_asr/
    ├── __init__.py
    ├── audio.py
    ├── config.py
    ├── metrics.py
    └── protocol.py

Prerequisites

Server-side

You need a running vLLM server exposing Voxtral Realtime on the /v1/realtime endpoint. An NVIDIA L4 or better is recommended (for example, an AWS g6.xlarge instance). On an L4, expect roughly 2x RTFx, i.e. audio is processed about twice as fast as real time.

Example:

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --trust-remote-code \
  --compilation-config '{"cudagraph_mode":"PIECEWISE"}' \
  --tensor-parallel-size 1 \
  --max-model-len 16384 \
  --max-num-batched-tokens 512 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8766
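
For reference, RTFx (inverse real-time factor) is commonly defined as audio duration divided by processing time, so values above 1 mean faster than real time. A minimal sketch of that arithmetic follows; the helper name `rtfx` is illustrative and not part of this repository.

```python
def rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of wall time."""
    if processing_time_s <= 0:
        raise ValueError("processing_time_s must be positive")
    return audio_duration_s / processing_time_s
```

For example, 60 seconds of audio processed in 30 seconds gives `rtfx(60.0, 30.0) == 2.0`.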

Client-side

Python 3.10+
Linux recommended for microphone streaming
A functional microphone device for realtime_test_mic.py

Installation

Standard install

pip install .

Editable install for development

pip install -e ".[dev]"

Optional file resampling support

pip install -e ".[dev,file]"

Quick start

1. Realtime file test

python realtime_test.py sample.wav 127.0.0.1 8766 --config configs/preset_low_latency_plus.json

Or through the installed console script:

voxtral-file sample.wav --host 127.0.0.1 --port 8766 --config configs/preset_low_latency_plus.json

2. Realtime microphone test

python realtime_test_mic.py \
  --config configs/preset_low_latency_plus.json \
  --host 127.0.0.1 \
  --port 8766 \
  --wav-out mic_capture.wav \
  --log-level INFO

Or through the installed console script:

voxtral-mic \
  --config configs/preset_low_latency_plus.json \
  --host 127.0.0.1 \
  --port 8766 \
  --wav-out mic_capture.wav \
  --log-level INFO

3. Benchmark preset sweep

python benchmarks/run_latency_sweep.py \
  --audio sample.wav \
  --host 127.0.0.1 \
  --port 8766 \
  --output results/latency_sweep.csv \
  --repeat 3 \
  --final-wait-s 5.0
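
Once the sweep has written its CSV, a per-preset summary can be computed with the standard library. The sketch below assumes hypothetical column names `preset` and `ttfp_ms`; the header that run_latency_sweep.py actually writes may differ, so adjust the names accordingly.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def median_by_preset(csv_file, value_column="ttfp_ms"):
    """Return {preset: median value} from a sweep CSV.

    Assumes hypothetical columns "preset" and `value_column`; adjust them to
    match the real header of the sweep output.
    """
    values = defaultdict(list)
    for row in csv.DictReader(csv_file):
        values[row["preset"]].append(float(row[value_column]))
    return {preset: median(v) for preset, v in values.items()}


# Usage with in-memory data; for a real file, pass open(path, newline="").
example = io.StringIO("preset,ttfp_ms\na,100\na,300\na,200\nb,50\n")
summary = median_by_preset(example)
```

The median is used rather than the mean so that a single slow run among the repeats does not dominate the comparison.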

Presets

`preset_low_latency.json`

Fastest partials, useful when responsiveness matters more than transcript stability.

`preset_balanced.json`

A conservative baseline for general testing.

`preset_high_accuracy.json`

Longer commit intervals and more conservative endpointing for better transcript stability.

`preset_low_latency_plus.json`

Recommended default for live demos based on tuning results: low first-partial latency with improved mic levels and slightly more conservative segmentation.
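
To make the preset format more concrete, here is a minimal sketch of parsing a preset and validating the three tuning fields discussed under Tuning guidance. The class `PresetSketch` and function `parse_preset` are hypothetical; the real schema lives in voxtral_asr/config.py and may contain more fields or different types.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetSketch:
    """Illustrative subset of a preset; not the package's config class."""

    commit_interval_ms: int
    vad_rms_threshold: float
    vad_hangover_ms: int


def parse_preset(text: str) -> PresetSketch:
    """Parse preset JSON text, keeping only the three fields named in this README."""
    raw = json.loads(text)
    preset = PresetSketch(
        commit_interval_ms=int(raw["commit_interval_ms"]),
        vad_rms_threshold=float(raw["vad_rms_threshold"]),
        vad_hangover_ms=int(raw["vad_hangover_ms"]),
    )
    # Basic sanity checks; the bounds are assumptions, not documented limits.
    if preset.commit_interval_ms <= 0:
        raise ValueError("commit_interval_ms must be positive")
    if preset.vad_hangover_ms < 0:
        raise ValueError("vad_hangover_ms must be non-negative")
    return preset
```

Unknown keys are ignored in this sketch, which keeps it tolerant of presets that carry additional settings.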

Tuning guidance

Time to first partial

The most important client-side levers are:

commit_interval_ms
vad_rms_threshold
vad_hangover_ms
microphone input level

If the first partial arrives too slowly:

reduce commit_interval_ms,
lower vad_rms_threshold cautiously,
ensure raw mic RMS is high enough,
avoid over-aggressive silence suppression that delays speech onset.
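
The interaction between the RMS threshold and the hangover can be sketched as follows. This is an illustrative energy-based VAD, assuming float samples in [-1.0, 1.0] and fixed-size frames; it is not the implementation in voxtral_asr/audio.py, and the function names are hypothetical.

```python
import math


def frame_rms(samples):
    """RMS of a frame of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def vad_with_hangover(frames, frame_ms, vad_rms_threshold, vad_hangover_ms):
    """Return one bool per frame.

    A frame counts as speech if its RMS reaches the threshold, or if the last
    above-threshold frame was within the hangover window.
    """
    hangover_frames = math.ceil(vad_hangover_ms / frame_ms)
    remaining = 0
    decisions = []
    for frame in frames:
        if frame_rms(frame) >= vad_rms_threshold:
            remaining = hangover_frames  # re-arm the hangover on every speech frame
            decisions.append(True)
        elif remaining > 0:
            remaining -= 1  # still inside the hangover window
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions
```

In this sketch, lowering the threshold makes quiet speech onsets pass sooner, while a longer hangover keeps short pauses from being cut, at the cost of sending more silence.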

Accuracy vs latency

In practice, better mic gain often matters more than minor frontend tweaks. If the transcript misses rare words or proper nouns, improve the acoustic signal first, then adjust commit interval and VAD conservatively.
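
A quick way to check whether the acoustic signal is the bottleneck is to measure level and clipping on a captured buffer. The sketch below assumes 16-bit signed PCM samples given as Python integers; the function name `mic_level_stats` and the `clip_level` default are illustrative choices, not values from this repository.

```python
import math


def mic_level_stats(pcm16, clip_level=32000):
    """RMS in dBFS and clipped-sample fraction for 16-bit PCM samples."""
    if not pcm16:
        return {"rms_dbfs": float("-inf"), "clip_fraction": 0.0}
    rms = math.sqrt(sum(s * s for s in pcm16) / len(pcm16))
    # 0 dBFS corresponds to full scale (32768 for 16-bit audio).
    rms_dbfs = 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")
    # Samples at or near full scale are counted as clipped.
    clipped = sum(1 for s in pcm16 if abs(s) >= clip_level)
    return {"rms_dbfs": rms_dbfs, "clip_fraction": clipped / len(pcm16)}
```

A very low RMS suggests raising the input gain, whereas a non-trivial clip fraction suggests lowering it.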

Punctuation

Realtime punctuation is partly model-intrinsic. Minor punctuation smoothing can be added as a post-processing layer if you need a cleaner demo output without changing the streaming behavior.
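
A post-processing layer of this kind could look like the following minimal sketch. The heuristics and the function name `smooth_punctuation` are illustrative assumptions, not part of the package; note that the repeated-punctuation rule also collapses an ellipsis into a single period.

```python
import re


def smooth_punctuation(text: str) -> str:
    """Light cleanup of a transcript string; heuristics are illustrative only."""
    # Normalize whitespace runs to single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove spaces before punctuation marks.
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)
    # Collapse repeated identical punctuation ("..", ",,").
    text = re.sub(r"([,.!?;:])\1+", r"\1", text)
    # Capitalize the first letter of the text and of each new sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text
```

Because it operates only on the displayed string, a function like this can be applied to partials and finals alike without touching the streaming path.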

Development workflow

Run tests

pytest

Run linter

ruff check .

Auto-fix common lint issues

ruff check . --fix
ruff format .

Makefile targets

If you keep the provided Makefile, typical commands are:

make install-dev
make test
make lint
make test-file AUDIO=sample.wav
make test-mic WAV_OUT=mic_capture.wav
make latency-sweep AUDIO=sample.wav CSV_OUT=results/latency_sweep.csv

Citation

BibTeX

@software{lumbantobing2026voxtralstreamingasr,
  author       = {Patrick Lumbantobing},
  title        = {Voxtral Streaming ASR},
  year         = {2026},
  organization = {VertoX AI},
  url          = {https://huggingface.co/vertox-ai/voxtral-streaming-asr},
  license      = {Apache-2.0}
}

Author

Patrick Lumbantobing
VertoX AI

License

This repository is licensed under the Apache License 2.0. See LICENSE and NOTICE.