warshanks committed on
Commit 1da1107 · verified · 1 Parent(s): 81c5254

Upload folder using huggingface_hub

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +242 -0
  3. consolidated.safetensors +3 -0
  4. params.json +67 -0
  5. tekken.json +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tekken.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,242 @@
---
language:
- en
- fr
- de
- es
- it
- pt
- nl
- hi
license: apache-2.0
library_name: vllm
inference: false
extra_gated_description: >-
  If you want to learn more about how we process your personal data, please read
  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
pipeline_tag: audio-text-to-text
base_model: mistralai/Voxtral-Mini-3B-2507
---

FP8 quantization of Voxtral Mini. The Whisper encoder layers were excluded from quantization ("ignored") and remain unquantized.

# Voxtral Mini 1.0 (3B) - 2507

Voxtral Mini is an enhancement of [Ministral 3B](https://mistral.ai/news/ministraux), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral).

## Key Features

Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
- **Dedicated transcription mode**: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
- **Long-form context**: With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding
- **Built-in Q&A and summarization**: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
- **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
- **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
- **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Ministral-3B

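The 30-minute audio budget is consistent with the frame rate implied by `params.json`: 16 kHz audio with a hop length of 160 gives 100 mel frames per second, and the 4x downsampler reduces the encoder output further. A rough back-of-the-envelope check (the stride-2 convolution is an assumption based on the standard Whisper encoder, not stated in this repo):

```python
# Rough audio-token budget estimate (numeric values taken from params.json).
sampling_rate = 16_000   # Hz ("sampling_rate")
hop_length = 160         # samples per mel frame ("hop_length")
conv_stride = 2          # ASSUMPTION: standard Whisper encoder conv downsampling
downsample_factor = 4    # "downsample_args" -> "downsample_factor"

mel_frames_per_s = sampling_rate / hop_length                       # 100.0
tokens_per_s = mel_frames_per_s / conv_stride / downsample_factor   # 12.5

audio_tokens = tokens_per_s * 30 * 60  # 30 minutes of audio
print(int(audio_tokens))  # 22500 tokens, comfortably inside the 32k context
```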
## Benchmark Results

### Audio

Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)

### Text

![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/iH9V8JVtMoaGlqJd6FIri.png)

## Usage

The model can be used with the following frameworks:
- [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)

**Notes**:

- Use `temperature=0.2` and `top_p=0.95` for chat completion (*e.g. Audio Understanding*) and `temperature=0.0` for transcription
- Multiple audios per message and multiple user turns with audio are supported
- System prompts are not yet supported

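The recommended sampling settings from the notes above can be kept as small request presets (the preset names are illustrative; the values come from this model card):

```python
# Recommended sampling parameters per mode (values from the model card notes).
CHAT_PARAMS = {"temperature": 0.2, "top_p": 0.95}   # chat / audio understanding
TRANSCRIBE_PARAMS = {"temperature": 0.0}            # deterministic transcription

print(CHAT_PARAMS["top_p"])  # 0.95
```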
### vLLM (recommended)

We recommend using this model with [vLLM](https://github.com/vllm-project/vllm).

#### Installation

Make sure to install vLLM from "main" (nightly wheels); we recommend using `uv`:

```sh
uv pip install -U "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
```

Doing so should automatically install [`mistral_common >= 1.8.1`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.1).

To check:

```sh
python -c "import mistral_common; print(mistral_common.__version__)"
```

#### Offline

You can test that your vLLM setup works as expected by cloning the vLLM repo:

```sh
git clone https://github.com/vllm-project/vllm && cd vllm
```

and then running:

```sh
python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
```

#### Serve

We recommend that you use Voxtral-Mini-3B-2507 in a server/client setting.

1. Spin up a server:

```sh
vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral
```

**Note:** Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.

2. To query the server, you can use a simple Python snippet; see the following examples.

#### Audio Instruct

Leverage the audio capabilities of Voxtral-Mini-3B-2507 to chat.

Make sure that your client has `mistral-common` with audio installed:

```sh
pip install --upgrade mistral_common\[audio\]
```

<details>
<summary>Python snippet</summary>

```py
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The speaker who is more inspiring is the one who delivered the farewell address, as they express
# gratitude, optimism, and a strong commitment to the nation and its citizens. They emphasize the importance of
# self-government and active citizenship, encouraging everyone to participate in the democratic process. In contrast,
# the other speaker provides a factual update on the weather in Barcelona, which is less inspiring as it
# lacks the emotional and motivational content of the farewell address.

# **Differences:**
# - The farewell address speaker focuses on the values and responsibilities of citizenship, encouraging active participation in democracy.
# - The weather update speaker provides factual information about the temperature in Barcelona, without any emotional or motivational content.

messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai(),
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
```
</details>

#### Transcription

Voxtral-Mini-3B-2507 has powerful transcription capabilities!

Make sure that your client has `mistral-common` with audio installed:

```sh
pip install --upgrade mistral_common\[audio\]
```

<details>
<summary>Python snippet</summary>

```python
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)

audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))

response = client.audio.transcriptions.create(**req)
print(response)
```
</details>
consolidated.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1f4af8d6c21edbc95a47bd1e1c92e1d371d98a088540ffb03e64a9138cca0243
size 6140184152
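Both `consolidated.safetensors` and `tekken.json` are stored as Git LFS pointer files like the one above (key-value lines for `version`, `oid`, and `size`). A minimal sketch of reading such a pointer; the `parse_lfs_pointer` helper is illustrative, not part of Git or git-lfs:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a {key: value} dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")  # split on the first space only
        fields[key] = value
    return fields

# The pointer content shown above for consolidated.safetensors.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:1f4af8d6c21edbc95a47bd1e1c92e1d371d98a088540ffb03e64a9138cca0243
size 6140184152"""

fields = parse_lfs_pointer(pointer)
print(fields["size"])  # 6140184152 (bytes, ~6.1 GB for the FP8 checkpoint)
```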
params.json ADDED
@@ -0,0 +1,67 @@
{
    "dim": 3072,
    "n_layers": 30,
    "head_dim": 128,
    "hidden_dim": 8192,
    "n_heads": 32,
    "n_kv_heads": 8,
    "rope_theta": 100000000.0,
    "norm_eps": 1e-05,
    "vocab_size": 131072,
    "max_position_embeddings": 32768,
    "multimodal": {
        "whisper_model_args": {
            "encoder_args": {
                "dim": 1280,
                "n_layers": 32,
                "head_dim": 64,
                "hidden_dim": 5120,
                "n_heads": 20,
                "vocab_size": 51866,
                "max_source_positions": 1500,
                "audio_encoding_args": {
                    "sampling_rate": 16000,
                    "num_mel_bins": 128,
                    "hop_length": 160,
                    "window_size": 400
                }
            },
            "downsample_args": {
                "downsample_factor": 4
            }
        }
    },
    "quantization": {
        "config_groups": {
            "group_0": {
                "input_activations": {
                    "dynamic": true,
                    "num_bits": 8,
                    "observer": null,
                    "strategy": "token",
                    "symmetric": true,
                    "type": "float"
                },
                "targets": [
                    "Linear"
                ],
                "weights": {
                    "dynamic": false,
                    "num_bits": 8,
                    "observer": "minmax",
                    "strategy": "tensor",
                    "symmetric": true,
                    "type": "float"
                }
            }
        },
        "format": "float-quantized",
        "ignore": [
            "lm_head",
            "output",
            "*whisper*"
        ],
        "quant_method": "compressed-tensors",
        "quantization_status": "compressed"
    }
}
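The `quantization` block above describes dynamic per-token FP8 activations and static per-tensor FP8 weights on `Linear` layers, with `lm_head`, `output`, and every Whisper module excluded via glob patterns. A quick sanity check of the ignore logic, using stdlib `fnmatch` (the example module names are illustrative, not taken from the checkpoint):

```python
import fnmatch

# "ignore" patterns copied from the quantization section of params.json.
IGNORE_PATTERNS = ["lm_head", "output", "*whisper*"]

def is_quantized(module_name: str) -> bool:
    """True if no ignore pattern matches, i.e. the module gets FP8 weights."""
    return not any(fnmatch.fnmatch(module_name, pat) for pat in IGNORE_PATTERNS)

print(is_quantized("layers.0.attention.wq"))                  # True
print(is_quantized("mm_whisper_embeddings.whisper_encoder"))  # False: matches "*whisper*"
print(is_quantized("lm_head"))                                # False: exact match
```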
tekken.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4aaf3836c2a5332f029ce85a7a62255c966f47b6797ef81dedd0ade9c862e4a8
size 14894206