---
datasets:
- Rombo-Org/Optimized_Reasoning
base_model:
- stepfun-ai/Step-Audio-R1
license: apache-2.0
---
# Step-Audio-R1-nvfp4
**Format:** NVFP4 — weights & activations quantized to FP4 with dual scaling.
**Base model:** `stepfun-ai/Step-Audio-R1`
**How it was made:** One-shot quantization with LLM Compressor (NVFP4 recipe), calibrated on long sequences from Rombo-Org/Optimized_Reasoning.
> Notes: Keep `lm_head` in high precision; calibrate on long, domain-relevant sequences.
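The quantization step can be sketched with LLM Compressor's one-shot API. This is an illustrative reconstruction, not the exact script that produced this checkpoint; the sequence length, sample count, and output path are assumptions.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# NVFP4 recipe: quantize Linear weights and activations to FP4,
# keeping the output head (lm_head) in high precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],
)

oneshot(
    model="stepfun-ai/Step-Audio-R1",
    dataset="Rombo-Org/Optimized_Reasoning",  # long-sequence calibration data
    recipe=recipe,
    max_seq_length=8192,          # assumption: actual value not documented
    num_calibration_samples=512,  # assumption
    output_dir="Step-Audio-R1-nvfp4",  # hypothetical output path
)
```

This is a configuration sketch; running it requires a GPU with enough memory to load the BF16 base model.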
# 📘 About This Model
This is a quantized NVFP4 (W4A4) version of Step-Audio-R1, an open-weights audio-based multimodal model for audio understanding and reasoning.
The original BF16 model requires ~67 GB of VRAM.
Step-Audio-R1 combines:
- A high-capacity audio encoder
- A projection layer that maps audio features into the transformer
- A language backbone for reasoning and text generation

The model is designed for:
- Speech transcription and interpretation
- Emotional / tonal analysis
- Speaker characteristics
- Music and sound-scene understanding
- High-quality step-by-step reasoning about audio inputs
It does not generate audio; it produces text based on audio input.
# 📦 What This Quantized Version Enables
This NVFP4 quantized version reduces memory requirements significantly:
- Size: ~22 GB (down from ~67 GB)
- Should fit comfortably on a single RTX 5090
- Preserves most of the reasoning performance of the BF16 release
Because of this, anyone with a high-end consumer GPU can experiment with advanced audio reasoning locally.
See the original model card for more information about the base model.
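The size reduction follows from the format itself. A back-of-envelope estimate, assuming a parameter count inferred from the ~67 GB BF16 checkpoint (2 bytes per parameter), lands close to the published figure:

```python
# Parameter count is an assumption derived from the BF16 checkpoint size.
params = 67e9 / 2  # ~33.5B parameters at 2 bytes each

bf16_gb = params * 2 / 1e9
# NVFP4 stores 4-bit values in blocks of 16, each block carrying an FP8
# scale, i.e. roughly 0.5 + 1/16 bytes per quantized parameter.
nvfp4_gb = params * (0.5 + 1 / 16) / 1e9

print(f"BF16:  ~{bf16_gb:.0f} GB")
print(f"NVFP4: ~{nvfp4_gb:.0f} GB quantized weights alone")
```

The remaining gap up to ~22 GB comes from the parts kept in high precision (`lm_head`, norms, and other unquantized tensors).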
# Running the model with vLLM in Docker
Serving requires a specific vLLM container released by the model authors.
```sh
docker run --rm -ti --gpus all \
-v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
-p 9999:9999 \
stepfun2025/vllm:step-audio-2-v20250909 \
vllm serve /Step-Audio-R1 \
--served-model-name Step-Audio-R1 \
--port 9999 \
--max-model-len 16384 \
--max-num-seqs 32 \
--chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
--enable-log-requests \
--interleave-mm-strings \
--trust-remote-code
```
The following example script sends a WAV file to the server and streams back the model's response to the prompt.
```py
import base64

import requests

# Read the audio file and base64-encode it for the JSON payload.
with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "Step-Audio-R1",
    "stream": True,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "audio_data": audio_b64, "mime_type": "audio/wav"},
                {"type": "text", "text": "Transcribe this and describe the speaker."},
            ],
        }
    ],
}

# Stream the chat completion and print each server-sent event line.
with requests.post(
    "http://localhost:9999/v1/chat/completions",
    json=payload,
    stream=True,
) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
```
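The raw lines printed above are server-sent events (`data: {...}` chunks). A small helper, not part of the original example, can assemble the streamed deltas into plain text, assuming the standard OpenAI-style streaming chunk format:

```python
import json

def collect_stream_text(lines):
    """Join the assistant text deltas from OpenAI-style SSE lines."""
    pieces = []
    for raw in lines:
        if not raw.startswith("data: "):
            continue  # skip blank lines and keep-alive comments
        data = raw[len("data: "):]
        if data.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(data)["choices"][0].get("delta", {})
        if delta.get("content"):
            pieces.append(delta["content"])
    return "".join(pieces)

# Example with synthetic chunks:
chunks = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world."}}]}',
    'data: [DONE]',
]
print(collect_stream_text(chunks))  # → Hello, world.
```

In the script above you would pass `(line.decode() for line in r.iter_lines() if line)` instead of the synthetic list.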
This was tested on an RTX Pro 6000 Blackwell cloud instance.
If there are other models you'd like to see quantized to NVFP4 for use on the DGX Spark or other Blackwell (or newer) cards, let me know. I'm trying to make more NVFP4 models available so more people can try them out.