---

tags:
  - audio
  - speech-to-text
  - streaming
  - voxtral
  - mistral
language:
  - en
library_name: custom
pipeline_tag: automatic-speech-recognition
license: apache-2.0
---


# Voxtral Realtime 4B

Streaming speech-to-text model with ~4 billion parameters. Weights in BF16 safetensors format, extracted from [mistralai/Voxtral-Mini-4B-Realtime-2602](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602).

## Architecture

**Pipeline:**
```
WAV → 16kHz → Mel Spectrogram → Conv Stem → Encoder → Downsample 4x → Adapter → Decoder → Tokens
```

- **Audio Encoder** (~0.6B params): causal transformer, 32 layers
- **Audio-Language Adapter**: 2-layer MLP with 4x downsample
- **LLM Decoder** (~3.4B params): Ministral-3 based, 26 layers with GQA

### Audio Preprocessing

| Parameter | Value |
|-----------|-------|
| Sample rate | 16,000 Hz |
| Frame rate | 12.5 Hz |
| Mel bins | 128 |
| Hop length | 160 samples (10ms) |
| Window size | 400 samples (25ms) |
| 1 text token | 80ms of audio |
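
The numbers in this table compose into the 12.5 Hz token rate. A quick arithmetic check (plain Python, values copied from the table and the stem/adapter strides described below):

```python
# Frame-rate arithmetic from the preprocessing table: how 16 kHz audio
# ends up at 12.5 tokens/s (80 ms per token). Pure arithmetic, no model code.
SAMPLE_RATE = 16_000       # Hz
HOP_LENGTH = 160           # samples -> 10 ms per mel frame
CONV_STRIDE = 2            # second conv-stem layer halves the frame count
DOWNSAMPLE = 4             # adapter-side 4x temporal downsample

mel_fps = SAMPLE_RATE / HOP_LENGTH        # 100.0 mel frames/s
enc_fps = mel_fps / CONV_STRIDE           # 50.0 encoder frames/s
tokens_per_sec = enc_fps / DOWNSAMPLE     # 12.5 tokens/s
ms_per_token = 1000 / tokens_per_sec      # 80.0 ms
print(tokens_per_sec, ms_per_token)       # 12.5 80.0
```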

### Encoder (Causal Transformer)

| Parameter | Value |
|-----------|-------|
| dim | 1280 |
| layers | 32 |
| heads | 32 (MHA) |
| head_dim | 64 |
| hidden_dim | 5120 |
| FFN | SwiGLU |
| Norm | RMSNorm (eps=1e-5) |
| Position | RoPE (theta=1e6, interleaved) |
| Attention | causal, sliding window=750 |

Conv stem: `conv1d(128→1280, k=3, s=1)` → GELU → `conv1d(1280→1280, k=3, s=2)` → GELU
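
The stem's stride-2 second conv is what halves the mel frame rate from 100 Hz to 50 Hz. A length check with the standard conv output formula; note the card does not state the padding, so `padding=1` ("same"-style for k=3) is an assumption here:

```python
# Output-length check for the conv stem. padding=1 is assumed, not stated.
def conv1d_out_len(n: int, kernel: int, stride: int, padding: int) -> int:
    # Standard 1D convolution output-length formula.
    return (n + 2 * padding - kernel) // stride + 1

frames = 100                                        # 1 s of mel frames (10 ms hop)
after_conv1 = conv1d_out_len(frames, 3, 1, 1)       # stride 1 keeps the length
after_conv2 = conv1d_out_len(after_conv1, 3, 2, 1)  # stride 2 halves it
print(after_conv1, after_conv2)                     # 100 50
```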

### Adapter

```
[seq/4, 5120] → Linear(5120→3072) → GELU → Linear(3072→3072) → [seq/4, 3072]
```
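
Since 5120 = 4 × 1280, the diagram suggests the 4x downsample concatenates four consecutive encoder frames before the MLP. A shape sketch under that reading; the concatenation interpretation and the bias-free Linear layers are assumptions, not checkpoint facts:

```python
# Shape sketch of the adapter: 4x downsample by frame concatenation
# (assumed from 5120 = 4 x 1280), then a 2-layer GELU MLP to decoder dim.
import numpy as np

rng = np.random.default_rng(0)
seq = 48                                   # encoder frames (divisible by 4)
x = rng.normal(size=(seq, 1280))           # toy encoder output

x = x.reshape(seq // 4, 4 * 1280)          # -> (12, 5120)

def gelu(v):
    # tanh-approximation GELU
    return 0.5 * v * (1 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3)))

W1 = rng.normal(size=(5120, 3072)) * 0.01  # toy weights, not the checkpoint's
W2 = rng.normal(size=(3072, 3072)) * 0.01
out = gelu(x @ W1) @ W2                    # -> (12, 3072), the decoder dim
print(out.shape)
```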

### Decoder (LLM)

| Parameter | Value |
|-----------|-------|
| dim | 3072 |
| layers | 26 |
| heads | 32 |
| KV heads | 8 (GQA 4:1) |
| head_dim | 128 |
| hidden_dim | 9216 |
| Norm | RMSNorm (eps=1e-5) |
| Position | RoPE (theta=1e6) |
| Attention | causal, sliding window=8192 |
| Vocab size | 131,072 |
| Tied embeddings | yes |

The decoder uses adaptive RMS normalization conditioned on transcription delay (6 delay tokens = 480ms).
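
One way such conditioning can work is to RMS-normalize and then apply a gain computed from an embedding of the delay. The exact mechanism is not described in this card, so the projection below is purely illustrative:

```python
# Hedged sketch of adaptive RMSNorm: normalize, then scale by a gain derived
# from a condition embedding. Illustrative only; not Voxtral's actual layer.
import numpy as np

def ada_rmsnorm(x, cond, w_gain, eps=1e-5):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    gain = 1.0 + cond @ w_gain         # condition-dependent per-channel scale
    return (x / rms) * gain

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3072))             # toy decoder hidden states
delay_emb = rng.normal(size=(16,))         # toy embedding of the 6-token delay
y = ada_rmsnorm(x, delay_emb, rng.normal(size=(16, 3072)) * 0.01)
print(y.shape)
```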

## Weight Format

- **`consolidated.safetensors`** (8.3 GB): 711 tensors, all BF16
- **`params.json`**: model config
- **`tekken.json`** (14.9 MB): Tekken tokenizer

## Tokenizer (Tekken)

| Token | ID |
|-------|----|
| BOS | 1 |
| EOS | 2 |
| STREAMING_PAD | 32 |

Token IDs 0–999 are special tokens. IDs 1000+ index into the vocabulary (base64-encoded byte sequences in `tekken.json`).
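
The ID layout can be illustrated with a toy vocabulary (the two entries below are invented; the real table lives in `tekken.json`):

```python
# Toy illustration of the Tekken ID layout: IDs below 1000 are special
# tokens; IDs >= 1000 index a base64-encoded byte vocabulary.
import base64

NUM_SPECIAL = 1000
toy_vocab_b64 = [base64.b64encode(w).decode() for w in (b"hello", b" world")]

def id_to_bytes(token_id: int) -> bytes:
    if token_id < NUM_SPECIAL:
        raise ValueError(f"id {token_id} is a special token")
    return base64.b64decode(toy_vocab_b64[token_id - NUM_SPECIAL])

text = b"".join(id_to_bytes(i) for i in (1000, 1001)).decode()
print(text)  # hello world
```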

### Audio Streaming Config

| Parameter | Value |
|-----------|-------|
| sampling_rate | 16,000 |
| frame_rate | 12.5 (80ms per token) |
| transcription_delay_ms | 480 (6 delay tokens) |
| left_pad_tokens | 32 |
| right_pad_tokens (offline) | 17 |

## Decode Schedule (Offline)

1. **Prompt**: `[BOS] + [STREAMING_PAD] × 38` (1 BOS token plus 38 pads: 32 left-pad + 6 delay, 39 tokens total)
2. **Prefill**: Feed `audio_embed[i] + tok_embed(prompt[i])` for positions 0..L-2
3. **First token**: Greedy argmax from position L-1
4. **Autoregressive decode**: For each remaining audio position, feed `audio_embed[pos] + tok_embed(prev_token)`, greedy argmax
5. **Stop**: On EOS or end of audio span
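
The steps above can be sketched as a runnable control-flow skeleton. The toy stand-in model (logits = embedded input projected against the embedding table) exists only so the loop executes end to end; `step()`, `tok_embed()`, and the embedding sum are placeholders for the real forward pass, not Voxtral's API:

```python
# Runnable sketch of the offline greedy decode schedule with toy stubs.
import numpy as np

BOS, EOS, STREAMING_PAD = 1, 2, 32
LEFT_PAD, DELAY = 32, 6
VOCAB = 64                                       # toy vocabulary size

rng = np.random.default_rng(0)
emb_table = rng.normal(size=(VOCAB, 8))          # toy token embeddings
audio_embed = rng.normal(size=(60, 8))           # 60 toy audio positions

class ToyDecoder:
    def step(self, x):
        return x @ emb_table.T                   # toy "logits" per step

def tok_embed(t):
    return emb_table[t]

def greedy_decode(model, audio_embed):
    prompt = [BOS] + [STREAMING_PAD] * (LEFT_PAD + DELAY)   # 39 tokens
    L = len(prompt)
    # Prefill: positions 0..L-2, summed audio + token embeddings.
    for i in range(L - 1):
        model.step(audio_embed[i] + tok_embed(prompt[i]))
    # First token: greedy argmax at position L-1.
    tok = int(model.step(audio_embed[L - 1] + tok_embed(prompt[L - 1])).argmax())
    out = [tok]
    # Autoregressive: feed previous token plus the next audio embedding.
    for pos in range(L, len(audio_embed)):
        tok = int(model.step(audio_embed[pos] + tok_embed(tok)).argmax())
        if tok == EOS:                           # stop on EOS
            break
        out.append(tok)
    return out                                   # stops at end of audio span

tokens = greedy_decode(ToyDecoder(), audio_embed)
print(len(tokens))
```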

## C Implementation

A pure C implementation of this model is available at [voxtral.c](https://github.com/tantk/mistralhack/tree/master/voxtral.c). It runs on Apple Silicon (Metal) and on CPU (BLAS), with streaming microphone input.

## Credits

Original model by [Mistral AI](https://mistral.ai/): [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602)

Built for the [Mistral Hackathon 2026](https://huggingface.co/mistral-hackaton-2026).