---
language:
- en
- fr
- es
- pt
- hi
- de
- nl
- it
base_model:
- mistralai/Voxtral-Small-24B-2507
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- fp8
- quantized
- multimodal
- conversational
- text-generation-inference
- automatic-speech-recognition
- automatic-speech-translation
- audio-text-to-text
- video-text-to-text
- compressed-tensors
license: apache-2.0
license_name: apache-2.0
name: RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic
description: A quantized version of the Voxtral-Small-24B-2507 model, optimized for speech transcription, translation, and audio understanding with FP8 data type quantization.
readme: https://huggingface.co/RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic/resolve/main/README.md
tasks:
- automatic-speech-recognition
- automatic-speech-translation
- audio-to-text
- text-to-text
provider: RedHatAI
license_link: https://www.apache.org/licenses/LICENSE-2.0
---

# Voxtral-Small-24B-2507-FP8-dynamic

## Model Overview
- **Model Architecture:** VoxtralForConditionalGeneration
- **Input:** Audio-Text
- **Output:** Text
- **Model Optimizations:**
- **Weight quantization:** FP8
- **Activation quantization:** FP8
- **Intended Use Cases:** Voxtral Small builds upon Mistral Small 3.1 with powerful audio understanding capabilities.
- **Dedicated transcription mode:** Voxtral can operate in a pure speech-transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.
- **Long-form context:** With a 32k-token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding.
- **Built-in Q&A and summarization:** Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models.
- **Natively multilingual:** Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
- **Function-calling straight from voice:** Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents.
- **Highly capable at text:** Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1.
- **Release Date:** 08/21/2025
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [Voxtral-Small-24B-2507](https://huggingface.co/mistralai/Voxtral-Small-24B-2507).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Voxtral-Small-24B-2507](https://huggingface.co/mistralai/Voxtral-Small-24B-2507) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the MLP operators within the transformer blocks of the language model are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
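
The two scaling schemes can be sketched numerically. The toy code below is illustrative only (it is not the llm-compressor implementation, and the helper names are ours); it assumes the FP8 E4M3 format, whose largest finite magnitude is 448. Static per-channel weight scales are computed once at quantization time, while dynamic per-token activation scales are recomputed at runtime from each token's absolute maximum:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_channel_weight_scales(w: np.ndarray) -> np.ndarray:
    # Static, symmetric, per-output-channel: one scale per weight row,
    # computed once at quantization time and stored with the checkpoint.
    return np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX

def per_token_activation_scales(x: np.ndarray) -> np.ndarray:
    # Dynamic, symmetric, per-token: one scale per row (token),
    # computed on the fly at inference time.
    return np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))  # toy weights: 4 output channels, 8 inputs
x = rng.standard_normal((3, 8))  # toy activations: 3 tokens

# Dividing by the scale maps each row into the FP8-representable range;
# the actual cast to float8 is then done by the inference kernel.
w_scaled = w / per_channel_weight_scales(w)
x_scaled = x / per_token_activation_scales(x)
```

Because activations are scaled per token at runtime, no calibration data is needed, which is what the "dynamic" in FP8-dynamic refers to.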

## Deployment

### Use with vLLM

1. Initialize the vLLM server:
```
vllm serve RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic --tokenizer_mode mistral --config_format mistral --load_format mistral
```

2. Send requests to the server according to the use case. See the following examples.

<details>
<summary>Audio Instruct</summary>

```python
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The speaker who is more inspiring is the one who delivered the farewell address, as they express
# gratitude, optimism, and a strong commitment to the nation and its citizens. They emphasize the importance of
# self-government and active citizenship, encouraging everyone to participate in the democratic process. In contrast,
# the other speaker provides a factual update on the weather in Barcelona, which is less inspiring as it
# lacks the emotional and motivational content of the farewell address.

# **Differences:**
# - The farewell address speaker focuses on the values and responsibilities of citizenship, encouraging active participation in democracy.
# - The weather update speaker provides factual information about the temperature in Barcelona, without any emotional or motivational content.


messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
```
</details>

<details>
<summary>Transcription</summary>

```python
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)

audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))

response = client.audio.transcriptions.create(**req)
print(response)
```
</details>

## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library, as shown below.

<details>
<summary>Creation details</summary>

```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
MODEL_ID = "mistralai/Voxtral-Small-24B-2507"

model = VoxtralForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Recipe: quantize all Linear layers to FP8 with dynamic per-token activations,
# skipping the lm_head, the audio tower, the multimodal projector, and self-attention.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["language_model.lm_head", "re:audio_tower.*", "re:multi_modal_projector.*", "re:.*self_attn"],
)

# Apply algorithms.
oneshot(
    model=model,
    recipe=recipe,
    processor=processor,
)

SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```

After quantization, the model can be converted back into the mistralai format using the `convert_voxtral_hf_to_mistral.py` script included with the model.
</details>

## Evaluation

The model was evaluated on the Fleurs transcription task.
Recovery is computed with respect to the complement of the word error rate (WER), i.e. recovery = (100% − quantized WER) / (100% − baseline WER).

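The recovery metric can be reproduced in a few lines of Python (an illustrative helper; the function name is ours), using the English and Italian rows of the table below:

```python
def recovery(base_wer: float, quant_wer: float) -> float:
    """Recovery: complement of the quantized WER over the complement
    of the baseline WER, expressed as a percentage."""
    return 100.0 * (100.0 - quant_wer) / (100.0 - base_wer)

# English: 3.45% baseline WER vs. 3.43% quantized WER.
print(round(recovery(3.45, 3.43), 1))  # → 100.0
# Italian: 2.27% baseline WER vs. 2.54% quantized WER.
print(round(recovery(2.27, 2.54), 1))  # → 99.7
```

Values above 100% indicate the quantized model slightly outperformed the baseline on that language.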
<table border="1" cellspacing="0" cellpadding="6">
  <tr>
    <th>Benchmark</th>
    <th>Language</th>
    <th>Voxtral-Small-24B-2507</th>
    <th>Voxtral-Small-24B-2507-FP8-dynamic<br>(this model)</th>
    <th>Recovery</th>
  </tr>
  <tr>
    <td rowspan="7"><strong>Fleurs<br>WER</strong></td>
    <td>English</td>
    <td>3.45%</td>
    <td>3.43%</td>
    <td>100.0%</td>
  </tr>
  <tr>
    <td>French</td>
    <td>3.91%</td>
    <td>3.96%</td>
    <td>100.0%</td>
  </tr>
  <tr>
    <td>Spanish</td>
    <td>2.91%</td>
    <td>2.84%</td>
    <td>100.1%</td>
  </tr>
  <tr>
    <td>German</td>
    <td>3.41%</td>
    <td>3.36%</td>
    <td>100.1%</td>
  </tr>
  <tr>
    <td>Italian</td>
    <td>2.27%</td>
    <td>2.54%</td>
    <td>99.7%</td>
  </tr>
  <tr>
    <td>Portuguese</td>
    <td>3.59%</td>
    <td>3.57%</td>
    <td>100.0%</td>
  </tr>
  <tr>
    <td>Dutch</td>
    <td>5.35%</td>
    <td>5.29%</td>
    <td>100.1%</td>
  </tr>
</table>