# Voxtral-Mini-4B-Realtime GPTQ-Int4
This is a 4-bit GPTQ quantized version of Mistral's Voxtral-Mini-4B-Realtime, specifically optimized for high-throughput, low-latency deployment on single GPUs like the NVIDIA L4 (24GB).
Using the llmcompressor library, the language model weights have been compressed to W4A16 (4-bit weights, 16-bit activations), reducing the memory footprint to ~2.1 GB while preserving near-FP16 accuracy.
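As a back-of-envelope check of that footprint (the ~3.5B quantized-parameter count and the per-group overhead below are assumptions for illustration, not values read from the checkpoint), 4-bit weights plus an FP16 scale and zero-point per 128-weight group work out to roughly 4.25 effective bits per weight:

```python
# Rough estimate of W4A16 weight storage. All figures are assumptions,
# not read from the checkpoint.
params = 3.5e9            # assumed quantized LLM parameter count
group_size = 128          # GPTQ group size used for this model
bits_per_weight = 4       # INT4 weights
# Each group stores ~32 extra bits of FP16 scale/zero-point metadata.
effective_bits = bits_per_weight + 32 / group_size   # ~4.25 bits/weight

total_gb = params * effective_bits / 8 / 1e9
print(f"~{total_gb:.1f} GB of quantized weights")    # ~1.9 GB
```

The remaining gap to the observed ~2.1 GB is plausibly the unquantized embedding and output layers, which stay in 16-bit.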
## 🚀 Quick Start with vLLM (Recommended)
This model is designed to be served using vLLM (v0.19.1+). Because the heavy audio tower is kept in FP16 and the LLM is in INT4, it fits perfectly on a single 24GB L4 GPU with massive context headroom.
### 1. Create `vllm_config.yaml`

```yaml
model: "./voxtral-mini-4b-gptq"
tensor_parallel_size: 1
gpu_memory_utilization: 0.92
dtype: "bfloat16"
max_model_len: 32768
max_num_seqs: 64
enforce_eager: false
```
### 2. Launch the Server

```bash
vllm serve ./voxtral-mini-4b-gptq --config vllm_config.yaml --port 8000
```
### 3. Send an Inference Request (Python)

```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Convert your audio file to base64
with open("audio_sample.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="./voxtral-mini-4b-gptq",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "mp3"},
                },
                {
                    "type": "text",
                    "text": "Transcribe the following audio.",
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
## 🖥️ Hardware & Performance

### NVIDIA L4 (24GB VRAM) Profile
By offloading the audio tower processing and utilizing 4-bit LLM weights, the VRAM allocation on an L4 is highly efficient:
| Component | VRAM Usage |
|---|---|
| LLM Weights (W4A16) | ~2.1 GB |
| Audio Tower (BF16) | ~0.3 GB |
| vLLM / CUDA Overhead | ~1.5 GB |
| KV Cache (Available) | ~18.2 GB |
| Total Allocation | ~22.1 GB (92% of 24 GB, `gpu_memory_utilization: 0.92`) |
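Note that vLLM only allocates up to `gpu_memory_utilization × total VRAM`, so the usable KV-cache budget is what remains of that allocation after weights and overhead, not a naive "24 GB minus components" subtraction. A quick sanity check, using the component sizes listed above:

```python
# KV-cache budget on a 24 GB L4 under gpu_memory_utilization=0.92.
total_vram_gb = 24.0
gpu_memory_utilization = 0.92
weights_gb = 2.1       # W4A16 LLM weights
audio_tower_gb = 0.3   # BF16 audio tower
overhead_gb = 1.5      # vLLM / CUDA overhead

allocatable = total_vram_gb * gpu_memory_utilization   # 22.08 GB
kv_cache_gb = allocatable - weights_gb - audio_tower_gb - overhead_gb
print(f"KV cache budget: ~{kv_cache_gb:.1f} GB")       # ~18.2 GB
```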
### Context Window

- Recommended (`max_model_len: 32768`): ~40 minutes of continuous audio transcription with ultra-fast Time-To-First-Token (TTFT).
- Maximum (`max_model_len: 131072`): Supports multi-hour audio streams (if permitted by the base model config) at the cost of slightly increased TTFT.
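The ~40-minute figure implies an audio token rate of roughly 13–14 tokens per second (a rate inferred from the numbers above, not a documented constant). Under that assumption, converting a context budget into audio duration is simple arithmetic:

```python
def audio_minutes(max_model_len, tokens_per_sec=13.5, reserved_tokens=1024):
    """Approximate minutes of audio that fit in the context window.

    tokens_per_sec is an ASSUMED audio-encoder rate inferred from the
    "32768 tokens ~= 40 minutes" figure above; reserved_tokens leaves
    room for the text prompt and the transcription output.
    """
    return (max_model_len - reserved_tokens) / tokens_per_sec / 60

print(f"{audio_minutes(32768):.0f} min")   # roughly 39 min
print(f"{audio_minutes(131072):.0f} min")  # roughly 160 min
```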
## ⚙️ Quantization Details
This model was quantized using Neural Magic's llmcompressor to ensure maximum accuracy retention.
- Algorithm: GPTQ (Group-wise 128)
- Scheme: W4A16 (4-bit weights, 16-bit activations)
- Calibration Dataset: WikiText-2 (512 samples, seq_len 2048)
- Block Size: 128
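For reference, these parameters can be expressed in llmcompressor's YAML recipe format. The sketch below is a reconstruction from the settings listed above, not the exact recipe used for this checkpoint; field names follow llmcompressor's `GPTQModifier` schema and should be checked against your installed version, and the `ignore` patterns are illustrative:

```yaml
quant_stage:
  quant_modifiers:
    GPTQModifier:
      # Illustrative exclusion patterns: audio tower, output head, and the
      # small AdaRMSNorm linear1/linear2 layers are kept in native precision.
      ignore: ["lm_head", "re:.*audio_tower.*", "re:.*linear1", "re:.*linear2"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
```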
## 🛡️ Quality-Preserving Exclusions
To prevent accuracy degradation and shape-mismatch errors inherent to small-layer quantization, the following modules were intentionally kept in their native BF16 precision and excluded from the GPTQ pass:
- Audio Tower: Completely bypassed during calibration to preserve acoustic feature extraction fidelity.
- AdaRMSNorm Layers (`linear1`, `linear2`): These layers contain dimensions smaller than the GPTQ block size (e.g., `in_features=32`). Quantizing them causes severe rounding errors; keeping them in BF16 preserves model stability with negligible VRAM impact.
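One way to find such layers ahead of a quantization run is to scan the module shapes and flag any linear layer whose input dimension cannot fill a single GPTQ group. A minimal, framework-free sketch (the layer names and shapes below are illustrative, not taken from the real checkpoint):

```python
GROUP_SIZE = 128

# name -> (in_features, out_features); illustrative values only
layer_shapes = {
    "layers.0.self_attn.q_proj": (3072, 3072),
    "layers.0.mlp.gate_proj": (3072, 8192),
    "layers.0.ada_rms_norm.linear1": (32, 3072),
    "layers.0.ada_rms_norm.linear2": (32, 3072),
}

def too_small_for_gptq(shapes, group_size=GROUP_SIZE):
    """Return layer names whose in_features < group_size.

    GPTQ groups weights along the input dimension, so a layer with
    in_features=32 cannot fill a 128-wide group and suffers severe
    rounding error if forced through the same scheme.
    """
    return [name for name, (in_f, _) in shapes.items() if in_f < group_size]

print(too_small_for_gptq(layer_shapes))
# -> ['layers.0.ada_rms_norm.linear1', 'layers.0.ada_rms_norm.linear2']
```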
## ⚠️ Limitations
- Text-Only Calibration: Because text datasets (WikiText) were used for GPTQ calibration to avoid OOM issues during audio-forward passes, there may be a minor (<0.5%) degradation specifically on highly complex audio-overlapping tasks compared to the FP16 base model.
- Not for Fine-Tuning: This repository contains compressed weights. It is intended strictly for inference/vLLM deployment. If you wish to fine-tune, use the original FP16 Mistral checkpoint.
## 📄 License
This model inherits the Apache 2.0 License from the original Mistral AI Voxtral-Mini-4B-Realtime model.
---

Model tree for `amir22010/voxtral-mini-4b-gptq` (base model: `mistralai/Ministral-3-3B-Base-2512`).