ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic

FP8-quantized version of mistralai/Voxtral-Mini-4B-Realtime-2602 for faster inference and reduced memory usage.

Overview

Property	Value
Base Model	`mistralai/Voxtral-Mini-4B-Realtime-2602`
Quantization	FP8 Dynamic (`FP8_DYNAMIC`)
Weight Quantization	Symmetric, static, per-channel → FP8 (E4M3)
Activation Quantization	Symmetric, dynamic, per-token → FP8 (E4M3)
Format	`compressed-tensors` (vLLM-native)
Quantized Size	~5.43 GB
Tool	`llm-compressor`
Date	2026-05-22

What is this?

This is an FP8-quantized version of Mistral AI's Voxtral Mini 4B Realtime, a multilingual, streaming speech-to-text model. The quantization reduces:

Memory footprint: the checkpoint is ~5.43 GB, versus roughly 8 GB for 4B parameters in BF16. Quantized layers shrink by half; tensors kept in BF16 (such as lm_head) do not.
Inference latency, through hardware-accelerated FP8 tensor operations
Time to first token, through smaller weight transfers

Transcription quality is expected to stay close to that of the original BF16 model (see Evaluation).
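As a rough sanity check on the memory figures, weight storage can be estimated from the parameter count and the share of parameters stored in FP8. The sketch below is only back-of-the-envelope arithmetic: `weight_size_gb` is a hypothetical helper, the parameter count takes "4B" literally, and the 0.65 FP8 share is an illustrative value, not a measured property of this checkpoint.

```python
def weight_size_gb(n_params: float, fp8_fraction: float) -> float:
    """Approximate weight storage in GB (10**9 bytes).

    FP8 parameters take 1 byte each, BF16 parameters take 2 bytes each.
    Quantization scales and file metadata are ignored (assumed negligible).
    """
    if not 0.0 <= fp8_fraction <= 1.0:
        raise ValueError("fp8_fraction must be between 0 and 1")
    fp8_bytes = n_params * fp8_fraction * 1
    bf16_bytes = n_params * (1.0 - fp8_fraction) * 2
    return (fp8_bytes + bf16_bytes) / 1e9


N_PARAMS = 4e9  # assumption: "4B" taken literally

print(weight_size_gb(N_PARAMS, 0.0))   # everything in BF16
print(weight_size_gb(N_PARAMS, 1.0))   # everything in FP8
print(weight_size_gb(N_PARAMS, 0.65))  # hypothetical mix of FP8 and BF16
```

Under these assumptions the three cases come out at about 8 GB, 4 GB and 5.4 GB, which is why a checkpoint with some tensors left in BF16 lands between the two extremes.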

Supported Languages

English, French, German, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Finnish, Norwegian (Bokmål), Hindi

Quantization Details

The quantization was performed using llm-compressor with the FP8_DYNAMIC scheme:

Weights: Quantized with symmetric, static, per-channel scaling to FP8 (E4M3)
Activations: Quantized with symmetric, dynamic, per-token scaling to FP8 (E4M3)
Ignored layers: lm_head (kept in original precision to preserve output quality)
No calibration data required — the dynamic activation scheme computes scales at inference time
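The scaling rules above can be illustrated in plain Python. This is a minimal sketch of symmetric per-row scaling into the FP8 E4M3 range, not the llm-compressor or vLLM implementation; all function names are hypothetical, and E4M3 rounding is simulated with ordinary floats. For weights a row stands for one output channel (scale computed once and stored); for activations a row stands for one token (scale recomputed on every input).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest FP8 E4M3 value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    mag = min(abs(v), FP8_E4M3_MAX)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    # Exponents below -6 fall in the subnormal range, which shares one step size.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits: 8 steps per power of two
    return math.copysign(round(mag / step) * step, v)


def symmetric_scale(values):
    """Scale that maps the largest magnitude in `values` onto +/-448."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


def quantize_rows(matrix):
    """Quantize each row with its own symmetric scale (no zero point)."""
    scales = [symmetric_scale(row) for row in matrix]
    codes = [[round_to_e4m3(v / s) for v in row] for row, s in zip(matrix, scales)]
    return codes, scales


def dequantize_rows(codes, scales):
    """Map FP8 codes back to real values using the per-row scales."""
    return [[c * s for c in row] for row, s in zip(codes, scales)]


# Weights: one row per output channel; scales are computed once and stored.
weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -4.0]]
w_codes, w_scales = quantize_rows(weights)

# Activations: one row per token; scales are recomputed for every input.
activations = [[0.03, -0.9, 0.4], [12.0, -3.0, 7.5]]
a_codes, a_scales = quantize_rows(activations)

print(w_scales, a_scales)
print(dequantize_rows(w_codes, w_scales))
```

Because the activation scales depend only on the values present at inference time, nothing has to be collected beforehand, which is the reason no calibration set is needed.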

How to Use

With vLLM (Recommended)

This model is designed for deployment with vLLM, which natively supports the compressed-tensors format.

Serve

vllm serve ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic \
    --compilation_config '{"cudagraph_mode": "PIECEWISE"}'

Docker

docker run --runtime nvidia --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=your_token_here \
    vllm/vllm-openai:latest \
    --model ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic
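Once either command is running, a quick way to confirm the model loaded is to query the OpenAI-compatible `/v1/models` endpoint. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes the server is reachable on `localhost:8000` as in the Docker port mapping above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server reachable on port 8000


def extract_model_ids(payload: dict) -> list:
    """Return the model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list:
    """Query the running server and return the IDs of the models it serves."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

Calling `list_served_models()` against a live server should return a list that includes `ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic`; a connection error usually means the server is still loading weights or is bound to a different port.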

Realtime Streaming API

The model supports vLLM's Realtime WebSocket API for live audio streaming:

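import asyncio
import base64
import json

import soundfile as sf
import websockets


async def stream_audio(audio_path):
    uri = "ws://localhost:8000/v1/realtime"
    async with websockets.connect(uri) as ws:
        # Read the file as 16-bit PCM. The soundfile default is float64, and
        # sending those raw bytes would not be a PCM16 stream (assumed here
        # to be the format the server expects).
        # sr is not used; resample beforehand if the server needs a fixed rate.
        audio, sr = sf.read(audio_path, dtype="int16")
        if audio.ndim > 1:
            audio = audio[:, 0]  # keep only the first channel
        audio_b64 = base64.b64encode(audio.tobytes()).decode()

        # Send the whole file as one base64-encoded message
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_b64,
        }))

        # Print transcription deltas as they arrive. Message type names can
        # differ between vLLM versions; check the Realtime API docs for yours.
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "response.audio_transcript.delta":
                print(data["delta"], end="", flush=True)


asyncio.run(stream_audio("your_audio.wav"))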
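For live audio, samples arrive in pieces rather than as one file, so the single message above would typically become a series of smaller `input_audio_buffer.append` messages. The following is a minimal sketch of that preparation step using only the standard library. `floats_to_pcm16`, `append_messages` and the 4096-byte chunk size are hypothetical choices, and it assumes the server accepts several append messages in one session.

```python
import base64
import json
import struct


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def append_messages(pcm_bytes, chunk_bytes=4096):
    """Yield one JSON "input_audio_buffer.append" message per chunk."""
    if chunk_bytes <= 0 or chunk_bytes % 2:
        # An even size keeps every 2-byte sample inside a single chunk.
        raise ValueError("chunk_bytes must be a positive even number")
    for start in range(0, len(pcm_bytes), chunk_bytes):
        chunk = pcm_bytes[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


pcm = floats_to_pcm16([0.0, 0.5, -0.5, 1.0] * 1024)  # 4096 samples = 8192 bytes
messages = list(append_messages(pcm))
print(len(messages))  # 8192 bytes / 4096 bytes per chunk = 2 messages
```

Each yielded string can be passed directly to `ws.send(...)` inside the connection block shown earlier.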

Hardware Requirements

Precision	Min VRAM	Recommended GPU
FP8 (this model)	≥5.43 GB (weights alone)	NVIDIA H100, L40S, Blackwell (GB10+), Ada Lovelace
BF16 (original)	~8 GB	Any CUDA GPU with ≥16 GB

Note: FP8 hardware acceleration requires NVIDIA GPUs with Compute Capability ≥ 8.9 (Ada Lovelace, Hopper, Blackwell).
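The compute capability rule in the note can be written as a simple tuple comparison. This is a small sketch with a hypothetical helper name; the capabilities listed for the example GPUs are the published values for those parts.

```python
FP8_MIN_CAPABILITY = (8, 9)  # Ada Lovelace; Hopper is (9, 0)


def has_fp8_tensor_cores(capability):
    """True if a (major, minor) compute capability is at least 8.9."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


for name, cc in [("A100", (8, 0)), ("L40S", (8, 9)), ("H100", (9, 0))]:
    print(name, has_fp8_tensor_cores(cc))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns such a `(major, minor)` tuple for the current device and can be passed to the helper directly.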

Evaluation

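FP8 dynamic quantization typically preserves >99% of the original model's accuracy; this is a general expectation for the scheme, and no benchmark results for the quantized checkpoint are reported here. For benchmark results of the original BF16 Voxtral Mini 4B Realtime, see the base model card.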

License

This model inherits the Apache 2.0 License from the base model.

Acknowledgments

Mistral AI for the original Voxtral Mini 4B Realtime model
vLLM team for llm-compressor and FP8 inference support
RedHatAI for pioneering the FP8 quantization approach for Voxtral models
