---
language:
- en
- fr
- es
- pt
- hi
- de
- nl
- it
base_model:
- mistralai/Voxtral-Small-24B-2507
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- fp8
- quantized
- multimodal
- conversational
- text-generation-inference
- automatic-speech-recognition
- automatic-speech-translation
- audio-text-to-text
- video-text-to-text
- compressed-tensors
license: apache-2.0
license_name: apache-2.0
name: RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic
description: A quantized version of the Voxtral-Small-24B-2507 model, optimized for speech transcription, translation, and audio understanding with FP8 data type quantization.
readme: https://huggingface.co/RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic/resolve/main/README.md
tasks:
- automatic-speech-recognition
- automatic-speech-translation
- audio-to-text
- text-to-text
provider: RedHatAI
license_link: https://www.apache.org/licenses/LICENSE-2.0
---

# Voxtral-Small-24B-2507-FP8-dynamic

## Model Overview
- **Model Architecture:** VoxtralForConditionalGeneration
- **Input:** Audio-Text
- **Output:** Text
- **Model Optimizations:**
- **Weight quantization:** FP8
- **Activation quantization:** FP8
- **Intended Use Cases:** Voxtral Small builds upon Mistral Small 3.1 with powerful audio understanding capabilities.
- **Dedicated transcription mode:** Voxtral can operate in a pure speech-transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.
- **Long-form context:** With a 32k-token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding.
- **Built-in Q&A and summarization:** Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models.
- **Natively multilingual:** Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
- **Function-calling straight from voice:** Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents.
- **Highly capable at text:** Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1.
- **Release Date:** 08/21/2025
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [Voxtral-Small-24B-2507](https://huggingface.co/mistralai/Voxtral-Small-24B-2507).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Voxtral-Small-24B-2507](https://huggingface.co/mistralai/Voxtral-Small-24B-2507) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the MLP operators within the transformer blocks of the language model are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
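
The two scaling schemes can be sketched numerically. The toy code below is illustrative only (it is not the llm-compressor implementation, and the helper names are ours); it assumes the FP8 E4M3 format, whose largest finite magnitude is 448. Static per-channel weight scales are computed once at quantization time, while dynamic per-token activation scales are recomputed at runtime from each token's absolute maximum:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_channel_weight_scales(w: np.ndarray) -> np.ndarray:
    # Static, symmetric, per-output-channel: one scale per weight row,
    # computed once at quantization time and stored with the checkpoint.
    return np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX

def per_token_activation_scales(x: np.ndarray) -> np.ndarray:
    # Dynamic, symmetric, per-token: one scale per row (token),
    # computed on the fly at inference time.
    return np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))  # toy weights: 4 output channels, 8 inputs
x = rng.standard_normal((3, 8))  # toy activations: 3 tokens

# Dividing by the scale maps each row into the FP8-representable range;
# the actual cast to float8 is then done by the inference kernel.
w_scaled = w / per_channel_weight_scales(w)
x_scaled = x / per_token_activation_scales(x)
```

Because activations are scaled per token at runtime, no calibration data is needed, which is what the "dynamic" in FP8-dynamic refers to.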

## Deployment

### Use with vLLM

1. Initialize the vLLM server:
```
vllm serve RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic --tokenizer_mode mistral --config_format mistral --load_format mistral
```

2. Send requests to the server according to the use case. See the following examples.

<details>
<summary>Audio Instruct</summary>

```python
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The speaker who is more inspiring is the one who delivered the farewell address, as they express
# gratitude, optimism, and a strong commitment to the nation and its citizens. They emphasize the importance of
# self-government and active citizenship, encouraging everyone to participate in the democratic process. In contrast,
# the other speaker provides a factual update on the weather in Barcelona, which is less inspiring as it
# lacks the emotional and motivational content of the farewell address.

# **Differences:**
# - The farewell address speaker focuses on the values and responsibilities of citizenship, encouraging active participation in democracy.
# - The weather update speaker provides factual information about the temperature in Barcelona, without any emotional or motivational content.


messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
```
</details>

<details>
<summary>Transcription</summary>

```python
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)

audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))

response = client.audio.transcriptions.create(**req)
print(response)
```
</details>

## Creation

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library, as shown below.

<details>
<summary>Creation details</summary>

```python
import torch
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
MODEL_ID = "mistralai/Voxtral-Small-24B-2507"

model = VoxtralForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Recipe: quantize all Linear layers to FP8 with dynamic per-token activations,
# skipping the lm_head, the audio tower, the multimodal projector, and self-attention.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["language_model.lm_head", "re:audio_tower.*", "re:multi_modal_projector.*", "re:.*self_attn"],
)

# Apply algorithms.
oneshot(
    model=model,
    recipe=recipe,
    processor=processor,
)

SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```

After quantization, the model can be converted back into the mistralai format using the `convert_voxtral_hf_to_mistral.py` script included with the model.
</details>

## Evaluation

The model was evaluated on the Fleurs transcription task.
Recovery is computed with respect to the complement of the word error rate (WER), i.e. recovery = (100% − quantized WER) / (100% − baseline WER).

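The recovery metric can be reproduced in a few lines of Python (an illustrative helper; the function name is ours), using the English and Italian rows of the table below:

```python
def recovery(base_wer: float, quant_wer: float) -> float:
    """Recovery: complement of the quantized WER over the complement
    of the baseline WER, expressed as a percentage."""
    return 100.0 * (100.0 - quant_wer) / (100.0 - base_wer)

# English: 3.45% baseline WER vs. 3.43% quantized WER.
print(round(recovery(3.45, 3.43), 1))  # → 100.0
# Italian: 2.27% baseline WER vs. 2.54% quantized WER.
print(round(recovery(2.27, 2.54), 1))  # → 99.7
```

Values above 100% indicate the quantized model slightly outperformed the baseline on that language.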
<table border="1" cellspacing="0" cellpadding="6">
  <tr>
    <th>Benchmark</th>
    <th>Language</th>
    <th>Voxtral-Small-24B-2507</th>
    <th>Voxtral-Small-24B-2507-FP8-dynamic<br>(this model)</th>
    <th>Recovery</th>
  </tr>
  <tr>
    <td rowspan="7"><strong>Fleurs<br>WER</strong></td>
    <td>English</td>
    <td>3.45%</td>
    <td>3.43%</td>
    <td>100.0%</td>
  </tr>
  <tr>
    <td>French</td>
    <td>3.91%</td>
    <td>3.96%</td>
    <td>100.0%</td>
  </tr>
  <tr>
    <td>Spanish</td>
    <td>2.91%</td>
    <td>2.84%</td>
    <td>100.1%</td>
  </tr>
  <tr>
    <td>German</td>
    <td>3.41%</td>
    <td>3.36%</td>
    <td>100.1%</td>
  </tr>
  <tr>
    <td>Italian</td>
    <td>2.27%</td>
    <td>2.54%</td>
    <td>99.7%</td>
  </tr>
  <tr>
    <td>Portuguese</td>
    <td>3.59%</td>
    <td>3.57%</td>
    <td>100.0%</td>
  </tr>
  <tr>
    <td>Dutch</td>
    <td>5.35%</td>
    <td>5.29%</td>
    <td>100.1%</td>
  </tr>
</table>