---
tags:
- audio
- speech-to-text
- streaming
- voxtral
- mistral
language:
- en
library_name: custom
pipeline_tag: automatic-speech-recognition
license: apache-2.0
---

# Voxtral Realtime 4B

Streaming speech-to-text model with ~4 billion parameters. Weights in BF16 safetensors format, extracted from [mistralai/Voxtral-Mini-4B-Realtime-2602](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602).

## Architecture

**Pipeline:**
```
WAV → 16 kHz → Mel Spectrogram → Conv Stem → Encoder → Downsample 4x → Adapter → Decoder → Tokens
```

- **Audio Encoder**: ~0.6B params, causal transformer, 32 layers
- **Audio-Language Adapter**: 2-layer MLP with 4x downsample
- **LLM Decoder**: ~3.4B params, Ministral-3 based, 26 layers with GQA

### Audio Preprocessing

| Parameter | Value |
|-----------|-------|
| Sample rate | 16,000 Hz |
| Frame rate | 12.5 Hz |
| Mel bins | 128 |
| Hop length | 160 samples (10 ms) |
| Window size | 400 samples (25 ms) |
| 1 text token | 80 ms of audio |

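The 12.5 Hz token rate follows directly from the parameters in the table; a quick sanity check in plain Python (constant names are illustrative, not from the reference code):

```python
# Sanity check: deriving the 12.5 Hz token rate from the
# preprocessing parameters above.
SAMPLE_RATE = 16_000   # Hz
HOP_LENGTH = 160       # samples -> 10 ms per mel frame
CONV_STRIDE = 2        # second conv in the stem halves the frame count
DOWNSAMPLE = 4         # adapter merges 4 encoder frames per token

mel_rate = SAMPLE_RATE / HOP_LENGTH    # 100 mel frames per second
enc_rate = mel_rate / CONV_STRIDE      # 50 encoder frames per second
token_rate = enc_rate / DOWNSAMPLE     # 12.5 tokens per second
ms_per_token = 1000 / token_rate       # 80 ms of audio per token
print(token_rate, ms_per_token)        # 12.5 80.0
```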
### Encoder (Causal Transformer)

| Parameter | Value |
|-----------|-------|
| dim | 1280 |
| layers | 32 |
| heads | 32 (MHA) |
| head_dim | 64 |
| hidden_dim | 5120 |
| FFN | SwiGLU |
| Norm | RMSNorm (eps=1e-5) |
| Position | RoPE (theta=1e6, interleaved) |
| Attention | causal, sliding window=750 |

Conv stem: `conv1d(128→1280, k=3, s=1)` → GELU → `conv1d(1280→1280, k=3, s=2)` → GELU

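The conv stem can be sketched in PyTorch to check the shapes; symmetric `padding=1` is an assumption here (the reference implementation may pad causally for streaming):

```python
import torch
import torch.nn as nn

# Minimal shape sketch of the conv stem described above.
stem = nn.Sequential(
    nn.Conv1d(128, 1280, kernel_size=3, stride=1, padding=1),
    nn.GELU(),
    nn.Conv1d(1280, 1280, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
)

mel = torch.randn(1, 128, 200)  # [batch, mel_bins, frames] = 2 s of audio
out = stem(mel)
print(out.shape)                # stride-2 conv halves the frame count
```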
### Adapter

```
[seq/4, 5120] → Linear(5120→3072) → GELU → Linear(3072→3072) → [seq/4, 3072]
```

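A minimal PyTorch sketch of the adapter, assuming the 5120-dim input comes from concatenating 4 consecutive 1280-dim encoder frames (the exact stacking order is an assumption):

```python
import torch
import torch.nn as nn

# Adapter sketch: stack 4 encoder frames (4 x 1280 = 5120), then
# project into the 3072-dim decoder space via a 2-layer MLP.
class Adapter(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5120, 3072)
        self.fc2 = nn.Linear(3072, 3072)

    def forward(self, x):                         # x: [seq, 1280]
        x = x.reshape(x.shape[0] // 4, 4 * 1280)  # -> [seq/4, 5120]
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

enc = torch.randn(100, 1280)
print(Adapter()(enc).shape)  # torch.Size([25, 3072])
```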
### Decoder (LLM)

| Parameter | Value |
|-----------|-------|
| dim | 3072 |
| layers | 26 |
| heads | 32 |
| KV heads | 8 (GQA 4:1) |
| head_dim | 128 |
| hidden_dim | 9216 |
| Norm | RMSNorm (eps=1e-5) |
| Position | RoPE (theta=1e6) |
| Attention | causal, sliding window=8192 |
| Vocab size | 131,072 |
| Tied embeddings | yes |

The decoder uses adaptive RMS normalization conditioned on the transcription delay (6 delay tokens = 480 ms).

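The GQA 4:1 ratio in the table means the K/V projections are 4x narrower than the query projection; a quick check of the head arithmetic:

```python
# Head-layout arithmetic for the decoder's grouped-query attention:
# 32 query heads share 8 KV heads, i.e. 4 query heads per group.
N_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

q_width = N_HEADS * HEAD_DIM        # 4096-dim query projection
kv_width = N_KV_HEADS * HEAD_DIM    # 1024-dim each for K and V
group_size = N_HEADS // N_KV_HEADS  # 4 query heads per KV head
print(q_width, kv_width, group_size)  # 4096 1024 4
```

The smaller KV width also shrinks the KV cache by the same 4:1 factor, which matters for long streaming sessions.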
## Weight Format

- **`consolidated.safetensors`** (8.3 GB): 711 tensors, all BF16
- **`params.json`**: model config
- **`tekken.json`** (14.9 MB): Tekken tokenizer

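The file size is consistent with the parameter count: BF16 stores 2 bytes per parameter, so 8.3 GB corresponds to roughly 4.15B parameters (~0.6B encoder + ~3.4B decoder, plus adapter weights). A one-line check:

```python
# BF16 = 2 bytes per parameter, so checkpoint size implies param count.
BYTES_PER_PARAM = 2
checkpoint_bytes = 8.3e9
approx_params = checkpoint_bytes / BYTES_PER_PARAM
print(approx_params / 1e9)  # 4.15 (billions of parameters)
```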
## Tokenizer (Tekken)

| Token | ID |
|-------|----|
| BOS | 1 |
| EOS | 2 |
| STREAMING_PAD | 32 |

Token IDs 0–999 are special tokens. IDs 1000+ index into the vocabulary (base64-encoded byte sequences in `tekken.json`).

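A minimal sketch of that lookup scheme, assuming the layout described above (IDs below 1000 are special, ID 1000+k maps to entry k of the base64 list); the two-entry vocab here is illustrative, not the real `tekken.json` contents:

```python
import base64

SPECIAL_OFFSET = 1000
vocab_b64 = ["aGVsbG8=", "IHdvcmxk"]  # base64 for b"hello", b" world"

def decode(token_ids):
    out = bytearray()
    for t in token_ids:
        if t < SPECIAL_OFFSET:
            continue  # special token (BOS=1, EOS=2, STREAMING_PAD=32, ...)
        out += base64.b64decode(vocab_b64[t - SPECIAL_OFFSET])
    return out.decode("utf-8", errors="replace")

print(decode([1, 1000, 1001, 2]))  # hello world
```

Decoding at the byte level (then UTF-8) is why multi-byte characters can span several tokens.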
### Audio Streaming Config

| Parameter | Value |
|-----------|-------|
| sampling_rate | 16,000 |
| frame_rate | 12.5 (80 ms per token) |
| transcription_delay_ms | 480 (6 delay tokens) |
| left_pad_tokens | 32 |
| right_pad_tokens (offline) | 17 |

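At 12.5 Hz each token spans 80 ms, so the config values convert directly to wall-clock time; the left-pad figure below assumes `left_pad_tokens` are also 80 ms frames, which is my reading rather than something the config states:

```python
# Converting the streaming config to milliseconds at 80 ms per token.
MS_PER_TOKEN = 80                  # 1000 / 12.5
delay_ms = 6 * MS_PER_TOKEN        # 480 ms transcription delay
left_pad_ms = 32 * MS_PER_TOKEN    # 2560 ms of left-pad context (assumed)
print(delay_ms, left_pad_ms)       # 480 2560
```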
## Decode Schedule (Offline)

1. **Prompt**: `[BOS] + [STREAMING_PAD] × 38` (1 + 32 left-pad + 6 delay)
2. **Prefill**: feed `audio_embed[i] + tok_embed(prompt[i])` for positions 0..L-2
3. **First token**: greedy argmax from position L-1
4. **Autoregressive decode**: for each remaining audio position, feed `audio_embed[pos] + tok_embed(prev_token)`, then greedy argmax
5. **Stop**: on EOS or at the end of the audio span

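The steps above can be sketched as a runnable loop. Here `step` is a hypothetical stand-in for the real decoder forward pass (embedding add + greedy argmax); only the scheduling logic (prompt layout, prefill, greedy loop) mirrors this section:

```python
BOS, EOS, STREAMING_PAD = 1, 2, 32
PROMPT = [BOS] + [STREAMING_PAD] * 38  # 1 + 32 left-pad + 6 delay tokens

def transcribe(audio_embeds, step):
    """step(audio_embed, prev_token) -> greedy next token id."""
    L = len(PROMPT)
    # Prefill: positions 0..L-2 consume audio + prompt embeddings.
    for i in range(L - 1):
        step(audio_embeds[i], PROMPT[i])
    # First real token comes from position L-1; then decode greedily.
    tokens, prev = [], PROMPT[L - 1]
    for pos in range(L - 1, len(audio_embeds)):
        prev = step(audio_embeds[pos], prev)
        if prev == EOS:
            break
        tokens.append(prev)
    return tokens

# Stub "model": always emits token 1000, then EOS at the last position.
audio = [0.0] * 45
calls = []
def stub(a, prev):
    calls.append(prev)
    return 1000 if len(calls) < len(audio) else EOS

result = transcribe(audio, stub)
print(result)  # [1000, 1000, 1000, 1000, 1000, 1000]
```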
## C Implementation

A pure C implementation of this model is available at [voxtral.c](https://github.com/tantk/mistralhack/tree/master/voxtral.c). It runs on Apple Silicon (Metal) and on CPU (BLAS), with streaming microphone input.

## Credits

Original model by [Mistral AI](https://mistral.ai/): [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602)

Built for the [Mistral Hackathon 2026](https://huggingface.co/mistral-hackaton-2026).