ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic
FP8-quantized version of mistralai/Voxtral-Mini-4B-Realtime-2602 for faster inference and reduced memory usage.
Overview
| Property | Value |
|---|---|
| Base Model | mistralai/Voxtral-Mini-4B-Realtime-2602 |
| Quantization | FP8 Dynamic (FP8_DYNAMIC) |
| Weight Quantization | Symmetric, static, per-channel → FP8 (E4M3) |
| Activation Quantization | Symmetric, dynamic, per-token → FP8 (E4M3) |
| Format | compressed-tensors (vLLM-native) |
| Quantized Size | ~5.43 GB |
| Tool | llm-compressor |
| Date | 2026-05-22 |
What is this?
This is an FP8-quantized version of Mistral AI's Voxtral Mini 4B Realtime — a multilingual, streaming speech-to-text model. The quantization reduces:
- Memory footprint by ~50% (from ~8 GB to ~4 GB)
- Inference latency through hardware-accelerated FP8 tensor operations
- Time to first token with smaller weight transfers
Transcription quality is expected to stay close to that of the original BF16 model.
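The ~50% figure follows from the number of bytes stored per parameter. A rough weights-only estimate, assuming about 4 billion parameters (taken from the "4B" in the model name, not a measured count) and ignoring activations, caches, and any tensors left unquantized:

```python
# Assumption: ~4e9 parameters, inferred from the "4B" in the model name.
NUM_PARAMS = 4_000_000_000


def weight_gigabytes(num_params, bytes_per_param):
    """Weights-only size in GB (10**9 bytes) for a given storage width."""
    return num_params * bytes_per_param / 1e9


bf16_gb = weight_gigabytes(NUM_PARAMS, 2)  # BF16 stores 2 bytes per weight
fp8_gb = weight_gigabytes(NUM_PARAMS, 1)   # FP8 stores 1 byte per weight
reduction = 1 - fp8_gb / bf16_gb

print(f"BF16: {bf16_gb:.1f} GB, FP8: {fp8_gb:.1f} GB, reduction: {reduction:.0%}")
```

The checkpoint size listed in the Overview (~5.43 GB) is above this idealized estimate, which would be consistent with some tensors, such as lm_head, staying in their original precision.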
Supported Languages
English, French, German, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Finnish, Norwegian (Bokmål), Hindi
Quantization Details
The quantization was performed using llm-compressor with the FP8_DYNAMIC scheme:
- Weights: Quantized with symmetric, static, per-channel scaling to FP8 (E4M3)
- Activations: Quantized with symmetric, dynamic, per-token scaling to FP8 (E4M3)
- Ignored layers: lm_head (kept in original precision to preserve output quality)
- No calibration data required: the dynamic activation scheme computes scales at inference time
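The scaling scheme described above can be illustrated in plain Python. This is a simplified simulation, not the llm-compressor or vLLM implementation: `round_to_e4m3`, `quantize_rows`, and `fp8_linear` are hypothetical helper names, and FP8 rounding is emulated with ordinary floats. It shows one scale per weight row (per-channel, computed once) and one scale per activation row (per-token, computed for each input), both mapping the largest magnitude onto the E4M3 maximum of 448.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def round_to_e4m3(x):
    """Simulate saturating round-to-nearest onto the FP8 E4M3 grid."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    # frexp returns |x| = m * 2**e with 0.5 <= m < 1, so floor(log2|x|) = e - 1.
    # Exponents below -6 fall in the subnormal range and share its step size.
    exponent = max(math.frexp(abs(x))[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return round(x / step) * step


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def fp8_linear(x_q, x_scales, w_q, w_scales):
    """Multiply quantized operands, then rescale by token and channel scales."""
    return [
        [sum(a * b for a, b in zip(x_row, w_row)) * sx * sw
         for w_row, sw in zip(w_q, w_scales)]
        for x_row, sx in zip(x_q, x_scales)
    ]


# Static, per-channel: one scale per output channel, computed once offline.
weights = [[0.5, 1.0, 0.25], [2.0, 0.1, 0.3]]
w_q, w_scales = quantize_rows(weights)

# Dynamic, per-token: one scale per token, recomputed for every input.
activations = [[1.0, 2.0, 3.0], [10.0, 0.5, 4.0]]
x_q, x_scales = quantize_rows(activations)

outputs = fp8_linear(x_q, x_scales, w_q, w_scales)
print(outputs)
```

Because the activation scales depend only on the values seen at inference time, nothing in this scheme needs a calibration dataset, which matches the last bullet above.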
How to Use
With vLLM (Recommended)
This model is designed for deployment with vLLM, which natively supports the compressed-tensors format.
Serve
```bash
vllm serve ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic \
  --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
```
Docker
```bash
docker run --runtime nvidia --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic
```
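Before connecting a client, it can help to confirm that the server has finished loading the model. A minimal sketch, assuming the server listens on localhost:8000 (as in the Docker port mapping above) and exposes the OpenAI-compatible `/v1/models` route; `parse_model_ids` and `list_served_models` are hypothetical helper names:

```python
import json
import urllib.request


def parse_model_ids(body):
    """Extract model ids from an OpenAI-style /v1/models JSON response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url="http://localhost:8000", timeout=5.0):
    """Return the ids of the models the server reports as loaded."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return parse_model_ids(resp.read())
```

Calling `list_served_models()` raises a `urllib.error.URLError` while the server is still starting; once it is ready, the returned list should include the model name passed to `--model`.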
Realtime Streaming API
The model supports vLLM's Realtime WebSocket API for live audio streaming:
```python
import asyncio
import base64
import json

import soundfile as sf
import websockets


async def stream_audio(audio_path):
    uri = "ws://localhost:8000/v1/realtime"
    async with websockets.connect(uri) as ws:
        # Read the file as 16-bit PCM. Assumption: the server expects PCM16;
        # soundfile's default float64 samples would encode to the wrong bytes.
        audio, sr = sf.read(audio_path, dtype="int16")
        audio_b64 = base64.b64encode(audio.tobytes()).decode()

        # Send the whole clip as a single append event
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_b64,
        }))

        # Print transcript deltas as they arrive; the loop ends when the
        # server closes the connection
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "response.audio_transcript.delta":
                print(data["delta"], end="", flush=True)


asyncio.run(stream_audio("your_audio.wav"))
```
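The example above sends the whole file in one message. For live input, audio usually arrives in small pieces. The following is a minimal sketch of how samples could be split into several `input_audio_buffer.append` messages, assuming the server accepts repeated append events carrying base64-encoded 16-bit little-endian PCM (an assumption, not confirmed by this card). The helper names and the 4096-byte chunk size are hypothetical.

```python
import base64
import json
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def append_events(pcm_bytes, chunk_bytes=4096):
    """Yield one JSON-encoded append message per chunk of PCM bytes.

    chunk_bytes should be even so a 16-bit sample is never split in two.
    """
    for start in range(0, len(pcm_bytes), chunk_bytes):
        chunk = pcm_bytes[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# Example: 4096 samples become 8192 bytes, i.e. two 4096-byte messages.
samples = [0.0, 1.0, -1.0, 0.5] * 1024
messages = list(append_events(float_to_pcm16(samples)))
print(len(messages))
```

Inside `stream_audio`, each message could then be sent with `await ws.send(message)` in place of the single large send.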
Hardware Requirements
| Precision | Min VRAM | Recommended GPU |
|---|---|---|
| FP8 (this model) | ~4 GB | NVIDIA H100, L40S, Blackwell (GB10+), Ada Lovelace |
| BF16 (original) | ~8 GB | Any CUDA GPU with ≥16 GB |
Note: FP8 hardware acceleration requires NVIDIA GPUs with Compute Capability ≥ 8.9 (Ada Lovelace, Hopper, Blackwell).
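The threshold in the note can be checked programmatically. A small sketch, where `supports_fp8` is a hypothetical helper that compares a (major, minor) compute capability pair against 8.9:

```python
def supports_fp8(major, minor):
    """True if the compute capability meets the 8.9 threshold from the note."""
    return (major, minor) >= (8, 9)


# Compute capabilities of a few well-known GPUs
for name, capability in [
    ("A100", (8, 0)),
    ("L40S", (8, 9)),
    ("H100", (9, 0)),
]:
    print(name, supports_fp8(*capability))
```

With PyTorch installed, the pair for the current device can be read with `torch.cuda.get_device_capability()` and passed in as `supports_fp8(*torch.cuda.get_device_capability())`.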
Evaluation
FP8 dynamic quantization typically preserves >99% of the original model's accuracy. For benchmark results of the original BF16 Voxtral Mini 4B Realtime model, see the base model card.
License
This model inherits the Apache 2.0 License from the base model.
Acknowledgments
- Mistral AI for the original Voxtral Mini 4B Realtime model
- vLLM team for llm-compressor and FP8 inference support
- RedHatAI for pioneering the FP8 quantization approach for Voxtral models