ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic

FP8-quantized version of mistralai/Voxtral-Mini-4B-Realtime-2602 for faster inference and reduced memory usage.

Overview

| Property | Value |
|---|---|
| Base Model | mistralai/Voxtral-Mini-4B-Realtime-2602 |
| Quantization | FP8 Dynamic (FP8_DYNAMIC) |
| Weight Quantization | Symmetric, static, per-channel → FP8 (E4M3) |
| Activation Quantization | Symmetric, dynamic, per-token → FP8 (E4M3) |
| Format | compressed-tensors (vLLM-native) |
| Quantized Size | ~5.43 GB |
| Tool | llm-compressor |
| Date | 2026-05-22 |

What is this?

This is an FP8-quantized version of Mistral AI's Voxtral Mini 4B Realtime — a multilingual, streaming speech-to-text model. The quantization reduces:

  • Memory footprint by roughly a third (weights go from ~8 GB in BF16 to ~5.43 GB; ignored layers such as lm_head stay in their original precision)
  • Inference latency through hardware-accelerated FP8 tensor operations
  • Time to first token with smaller weight transfers

Transcription quality is expected to remain close to that of the original BF16 model.
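As a rough sanity check on the size figures, the sketch below estimates weight size from a parameter count and the fraction of parameters stored in FP8. The helper name `estimate_checkpoint_gb` and the round 4-billion parameter count are illustrative assumptions, and quantization scales and file metadata are ignored.

```python
def estimate_checkpoint_gb(n_params, fp8_fraction):
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    FP8 parameters take 1 byte each, BF16 parameters take 2 bytes each.
    Quantization scales and file metadata are ignored.
    """
    fp8_bytes = n_params * fp8_fraction * 1
    bf16_bytes = n_params * (1.0 - fp8_fraction) * 2
    return (fp8_bytes + bf16_bytes) / 1e9


N_PARAMS = 4e9  # assumption: round figure for a "4B" model

print(estimate_checkpoint_gb(N_PARAMS, 0.0))  # every parameter in BF16
print(estimate_checkpoint_gb(N_PARAMS, 1.0))  # every parameter in FP8
```

Under these assumptions the all-BF16 estimate is 8 GB and the all-FP8 estimate is 4 GB. The ~5.43 GB quantized size listed in the overview falls between the two, which is what one would expect when some layers (for example lm_head) are left unquantized.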

Supported Languages

English, French, German, Spanish, Italian, Portuguese, Dutch, Polish, Swedish, Danish, Finnish, Norwegian (Bokmål), Hindi

Quantization Details

The quantization was performed using llm-compressor with the FP8_DYNAMIC scheme:

  • Weights: Quantized with symmetric, static, per-channel scaling to FP8 (E4M3)
  • Activations: Quantized with symmetric, dynamic, per-token scaling to FP8 (E4M3)
  • Ignored layers: lm_head (kept in original precision to preserve output quality)
  • No calibration data required — the dynamic activation scheme computes scales at inference time
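To make the two scaling schemes concrete, here is a minimal pure-Python simulation of them. It is a sketch for intuition only, not the llm-compressor or vLLM implementation: the function names are hypothetical, the E4M3 rounding is emulated in software, and the toy weight and token values are made up.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def round_to_e4m3(x):
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binary exponent is e - 1.
    # Exponents below -6 fall in the subnormal range, which shares one step size.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits: 8 steps per power of two
    return math.copysign(min(round(mag / step) * step, FP8_E4M3_MAX), x)


def quantize_rows(rows):
    """Symmetric quantization with one scale per row (max |value| maps to 448)."""
    q_rows, scales = [], []
    for row in rows:
        scale = max(abs(v) for v in row) / FP8_E4M3_MAX or 1.0  # all-zero row guard
        scales.append(scale)
        q_rows.append([round_to_e4m3(v / scale) for v in row])
    return q_rows, scales


def fp8_linear(x, q_weight, w_scales):
    """Compute y = x @ W.T with per-token activation scales found on the fly."""
    q_x, x_scales = quantize_rows(x)  # dynamic: depends on the current input
    return [
        [
            sum(a * b for a, b in zip(q_tok, q_out)) * x_s * w_s
            for q_out, w_s in zip(q_weight, w_scales)
        ]
        for q_tok, x_s in zip(q_x, x_scales)
    ]


# Weight rows are output channels; their scales are computed once, offline (static).
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
q_weight, w_scales = quantize_rows(weight)

# Each token (row) gets its own scale at inference time, so no calibration set is needed.
tokens = [[1.0, 2.0, 3.0], [0.01, -0.02, 0.005]]
print(fp8_linear(tokens, q_weight, w_scales))
```

Note how the second token, whose values are about 100 times smaller than the first, still uses the full FP8 range because its scale is computed from its own maximum.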

How to Use

With vLLM (Recommended)

This model is designed for deployment with vLLM, which natively supports the compressed-tensors format.

Serve

vllm serve ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic \
    --compilation_config '{"cudagraph_mode": "PIECEWISE"}'

Docker

docker run --runtime nvidia --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=your_token_here \
    vllm/vllm-openai:latest \
    --model ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic
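
Once the server is running (started either way), a quick check is to ask it which models it serves. The sketch below uses only the Python standard library and assumes the server listens on localhost:8000 and exposes the OpenAI-compatible `/v1/models` route; the helper names are hypothetical.

```python
import json
import urllib.request


def parse_model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url="http://localhost:8000"):
    """Query a running server and return the model IDs it reports."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `list_served_models()` against a server started with the commands above should return a list that includes `ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic`.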

Realtime Streaming API

The model supports vLLM's Realtime WebSocket API for live audio streaming:

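import asyncio
import base64
import json

import soundfile as sf
import websockets

MODEL = "ghecko78/Voxtral-Mini-4B-Realtime-2602-FP8-Dynamic"


async def stream_audio(audio_path):
    uri = "ws://localhost:8000/v1/realtime"
    async with websockets.connect(uri) as ws:
        # Read the file as 16-bit PCM. soundfile defaults to float64, whose raw
        # bytes are not a valid audio payload. Assumes a 16 kHz mono file.
        audio, sr = sf.read(audio_path, dtype="int16")
        audio_b64 = base64.b64encode(audio.tobytes()).decode()

        # The session/commit messages and event names below follow vLLM's
        # example realtime client as best understood; they may differ between
        # vLLM versions, so check the Realtime API docs for your release.
        await ws.send(json.dumps({"type": "session.update", "model": MODEL}))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        # Send audio
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": audio_b64,
        }))

        # Signal that no more audio will follow
        await ws.send(json.dumps({"type": "input_audio_buffer.commit", "final": True}))

        # Receive transcription until the server reports completion or an error
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "transcription.delta":
                print(data["delta"], end="", flush=True)
            elif data.get("type") in ("transcription.done", "error"):
                break

asyncio.run(stream_audio("your_audio.wav"))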
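
The example above sends the whole file in a single message. For live audio one would normally send small chunks as they arrive. The sketch below splits a PCM buffer into fixed-duration `input_audio_buffer.append` messages; the helper name, the 16 kHz mono 16-bit format, and the 100 ms chunk length are assumptions for illustration, not requirements stated by this card.

```python
import base64
import json


def pcm16_append_messages(pcm_bytes, sample_rate=16000, chunk_ms=100):
    """Yield JSON `input_audio_buffer.append` messages for successive chunks.

    pcm_bytes is assumed to be mono 16-bit PCM (2 bytes per sample).
    """
    # Whole samples per chunk, times 2 bytes per sample.
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        chunk = pcm_bytes[start:start + bytes_per_chunk]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Inside `stream_audio`, the single send could then become a loop such as `for msg in pcm16_append_messages(pcm_bytes): await ws.send(msg)`, where `pcm_bytes` holds 16-bit PCM samples.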

Hardware Requirements

| Precision | Min VRAM (weights only) | Recommended GPU |
|---|---|---|
| FP8 (this model) | ~5.5 GB | NVIDIA H100, L40S, Blackwell (GB10+), Ada Lovelace |
| BF16 (original) | ~8 GB | Any CUDA GPU with ≥16 GB |

Note: FP8 hardware acceleration requires NVIDIA GPUs with Compute Capability ≥ 8.9 (Ada Lovelace, Hopper, Blackwell).
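
A quick way to check this threshold on a given machine is sketched below. The helper names are hypothetical; the second function assumes PyTorch is installed and uses `torch.cuda.get_device_capability`, returning None when no CUDA device can be queried.

```python
def has_native_fp8(major, minor):
    """True if compute capability (major, minor) is at least 8.9."""
    return (major, minor) >= (8, 9)


def current_gpu_has_native_fp8():
    """Check the first CUDA device via PyTorch; returns None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return has_native_fp8(*torch.cuda.get_device_capability(0))
```

For example, `has_native_fp8(8, 6)` is False (Ampere consumer cards), while `has_native_fp8(8, 9)` and `has_native_fp8(9, 0)` are True.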

Evaluation

FP8 dynamic quantization typically preserves >99% of the original model's accuracy, but this card reports no benchmark results for the quantized checkpoint itself. For benchmark results of the original BF16 Voxtral Mini 4B Realtime model, see the base model card.
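
One way to check quality on your own audio is to transcribe the same clips with both the BF16 and FP8 models and compare word error rate (WER) against reference transcripts. The function below is a minimal, generic WER sketch (word-level edit distance divided by reference length); it is not the evaluation pipeline used for the base model, and it does no text normalization such as lowercasing or punctuation stripping.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Computing `word_error_rate(reference, bf16_transcript)` and `word_error_rate(reference, fp8_transcript)` over the same set of clips gives a direct, if coarse, view of how much the quantization changes transcription quality.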

License

This model inherits the Apache 2.0 License from the base model.

Acknowledgments

  • Mistral AI for the original Voxtral Mini 4B Realtime model
  • vLLM team for llm-compressor and FP8 inference support
  • RedHatAI for pioneering the FP8 quantization approach for Voxtral models