---
license: apache-2.0
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
base_model_relation: quantized
tags:
- speech-to-text
- voxtral
- mistral
- int4
- quantized
- marlin
- jetson
- edge
- realtime
- streaming
language:
- en
- fr
- es
- de
- ru
- zh
- ja
- it
- pt
- nl
- ar
- hi
- ko
---

# Voxtral Mini 4B INT4 — Jetson Orin Nano

INT4 quantized [Voxtral Mini 4B Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) for edge deployment on NVIDIA Jetson Orin Nano (8 GB). **4.4 GB** — fits in 8 GB unified memory with room for KV cache and runtime.

## What's in this repo

| File | Size | Description |
|------|------|-------------|
| `consolidated.safetensors` | 4.4 GB | Marlin-packed INT4 decoder + BF16 encoder/norms/embeddings |
| `params.json` | 1.6 KB | Model architecture config (Mistral native format) |
| `tekken.json` | 15 MB | Mistral tekken tokenizer |
| `requirements.txt` | — | Pinned Python dependencies for Jetson |
| `scripts/jetson_serve_sdpa.py` | ~50 KB | Self-contained inference server (no HF/vLLM deps) |
| `scripts/quantize_marlin.py` | ~10 KB | Quantization script to reproduce this model |
| `kernels/fused_ops.cu` | 8.5 KB | Fused CUDA kernels (JIT compiled, SM87) |

## Quantization details

- **Method**: RTN (Round-To-Nearest) quantized directly into Marlin-packed format
- **Bits**: 4-bit (decoder linear layers), BF16 (audio encoder, norms, embeddings)
- **Group size**: 128
- **Encoding**: uint4b8 (value + 8 bias), Marlin tiled INT4 layout
- **Why RTN over GPTQ**: GPTQ's Hessian optimization destroys the critical SPAD-to-text transition boundary in Voxtral's streaming architecture. RTN preserves it perfectly. See [below](#why-rtn-not-gptq).

### Reproducing the quantization

```bash
pip install torch safetensors numpy

# From the original HuggingFace model:
python scripts/quantize_marlin.py \
    --model-dir path/to/Voxtral-Mini-4B-Realtime-2602 \
    --output-dir ./output
```

## Architecture

| Component | Params | Precision | Size |
|-----------|--------|-----------|------|
| Audio encoder (Whisper-style, 32 layers) | ~600M | BF16 | 1.86 GB |
| Projector (5120 → 3072 → 3072) | ~25M | BF16 | 0.05 GB |
| LM decoder (26 layers, 3072 hidden, GQA 32/8 heads) | ~3B | Marlin INT4 | ~1.70 GB |
| Token embeddings (131072 × 3072) | ~400M | BF16 | 0.77 GB |
| ada_rms_norm_t_cond + norms | ~1M | BF16 | 0.01 GB |
| **Total** | **~4B** | | **4.4 GB** |

## Transcription quality

Tested on Fleurs en_us samples — near-perfect output matching the fp16 baseline:

| Sample | Quality | Notes |
|--------|---------|-------|
| 0 — communication channels | Excellent | Punctuation added, matches reference |
| 1 — capital letters | Good | "sie" → "say" (phonetic) |
| 2 — town of Sintra | Excellent | Full match |
| 3 — cabbage juice | Excellent | Full match |
| 4 — dinosaurs with feathers | Perfect | Exact match |

## Usage

### Self-contained server (recommended for Jetson)

No HuggingFace or vLLM dependencies needed. Tested on JetPack 6.x (R36.5.0), Python 3.10, CUDA 12.6.

```bash
pip install -r requirements.txt

# Test with an audio file
python scripts/jetson_serve_sdpa.py --test audio.wav

# Start WebSocket server on port 8000
python scripts/jetson_serve_sdpa.py
```

The server exposes `ws://localhost:8000/v1/realtime` for streaming transcription.
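Once the server is up, a quick way to confirm the endpoint responds (a minimal sketch, assuming the `session.created` event shown in the client example below) is:

```python
# Minimal connectivity check for the realtime endpoint (illustrative sketch).
import asyncio, json, websockets

async def check():
    async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
        msg = json.loads(await ws.recv())       # first event should be session.created
        print("server event:", msg.get("type"))

asyncio.run(check())
```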
**Key optimizations in the server:**

- Marlin fused INT4 dequant+matmul (~50x faster than on-the-fly dequant)
- F.scaled_dot_product_attention (fused attention kernel)
- Pre-allocated KV cache (eliminates per-token torch.cat)
- Fused CUDA kernels for RMSNorm, RoPE, SiLU·Mul (~500 kernel launches/token → ~80)

### WebSocket client example

```python
import asyncio, base64, json, numpy as np, soundfile as sf, websockets

async def transcribe(audio_path):
    audio, sr = sf.read(audio_path, dtype="float32")
    pcm16 = (audio * 32768.0).clip(-32768, 32767).astype(np.int16)

    async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
        await ws.recv()  # session.created
        await ws.send(json.dumps({"type": "session.update"}))

        # Send audio in 500ms chunks
        for i in range(0, len(pcm16), 8000):
            chunk = base64.b64encode(pcm16[i:i+8000].tobytes()).decode()
            await ws.send(json.dumps({"type": "input_audio_buffer.append", "audio": chunk}))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

        text = ""
        while True:
            msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=60))
            if msg["type"] == "transcription.delta":
                text += msg["delta"]
            elif msg["type"] == "transcription.done":
                break
        return text
```

## Memory budget (Jetson Orin Nano 8 GB)

| Component | Size |
|-----------|------|
| Model weights | 4.4 GB |
| Runtime + KV cache | ~1.5 GB |
| OS + system | ~2 GB |
| **Total** | **~7.9 GB** |

## Why RTN, not GPTQ?

GPTQ quantization fails on this model at every bit precision (4-bit and 8-bit) with every calibration strategy tested. The root cause:

1. **Architecture mismatch during calibration**: GPTQ processes layers through the standard `MistralDecoderLayer`, which lacks `ada_rms_norm_t_cond`. The MLP therefore sees wrong activations during Hessian estimation.
2. **Critical decision boundary**: Voxtral's streaming protocol requires the model to transition from STREAMING_PAD tokens to text tokens at precise positions. This transition margin is only ~5-10 logit points. GPTQ's optimization noise is enough to prevent the transition entirely.
3. **RTN preserves the boundary**: Simple round-to-nearest quantization at 4-bit with group_size=128 preserves the SPAD→text transition perfectly, producing output identical to the fp16 baseline. A minimal sketch of this scheme is included in the appendix at the end of this card.

## Credits

- Base model: [Voxtral Mini 4B Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) by Mistral AI
- Marlin INT4 kernel: [IST-DASLab/marlin](https://github.com/IST-DASLab/marlin) (Apache 2.0)
- Quantization and Jetson optimization by [Teaspoon AI](https://huggingface.co/Teaspoon-AI)
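## Appendix: RTN group quantization sketch

For illustration, a minimal sketch of the per-group round-to-nearest uint4b8 scheme described under [Quantization details](#quantization-details). This is not the repo's `scripts/quantize_marlin.py` (which additionally packs the result into the Marlin tiled layout); the function name and return values here are illustrative only.

```python
import torch

def rtn_uint4b8(w: torch.Tensor, group_size: int = 128):
    """Round-to-nearest 4-bit quantization with one scale per group of inputs.

    Stored codes are 0..15 (signed value + 8 bias, i.e. uint4b8); a kernel
    reconstructs (q - 8) * scale at matmul time.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    wg = w.float().reshape(out_features, in_features // group_size, group_size)

    # Symmetric per-group scale: the largest magnitude in each group maps to +/-7.
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(wg / scale) + 8, 0, 15).to(torch.uint8)

    # Dequantized reference, handy for inspecting per-layer quantization error.
    deq = (q.float() - 8) * scale
    return (q.reshape(out_features, in_features),
            scale.squeeze(-1),
            deq.reshape(out_features, in_features))
```

Each group of 128 weights shares one scale, which is what keeps the 4-bit rounding error small enough for the decision-boundary argument in the section above.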