---
datasets:
- Rombo-Org/Optimized_Reasoning
base_model:
- stepfun-ai/Step-Audio-R1
license: apache-2.0
---
# Step-Audio-R1-nvfp4

**Format:** NVFP4 — weights & activations quantized to FP4 with dual scaling.  
**Base model:** `stepfun-ai/Step-Audio-R1`  
**How it was made:** One-shot quantization with LLM Compressor (NVFP4 recipe), calibrated on long sequences from Rombo-Org/Optimized_Reasoning.  

> Notes: Keep `lm_head` in high precision; calibrate on long, domain-relevant sequences.
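For reference, an LLM Compressor recipe for this kind of run is roughly the following shape. This is an illustrative sketch, not the exact recipe used; the stage and modifier names follow LLM Compressor's YAML recipe conventions.

```yaml
# Sketch of an NVFP4 one-shot recipe (illustrative, not the exact one used).
# Linear weights and activations are quantized to FP4 with per-block scaling;
# lm_head is ignored so the output projection stays in high precision.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      scheme: "NVFP4"
      ignore: ["lm_head"]
```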

# 📘 About This Model

This is an NVFP4-quantized (W4A4) version of Step-Audio-R1, an open-weights audio-based multimodal model for audio understanding and reasoning.
The original BF16 model requires ~67 GB of VRAM.

Step-Audio-R1 combines:
- A high-capacity audio encoder
- A projection layer that maps audio features into the transformer
- A language backbone for reasoning and text generation

The model is designed for:
- Speech transcription and interpretation
- Emotional / tonal analysis
- Speaker characteristics
- Music and sound-scene understanding
- High-quality step-by-step reasoning about audio inputs

It does not generate audio; it produces text based on audio input.

# 📦 What This Quantized Version Enables

This NVFP4-quantized version significantly reduces memory requirements:
- Size: ~22 GB (down from ~67 GB)
- Should fit comfortably on a single RTX 5090
- Preserves most of the reasoning performance of the BF16 release

Because of this, anyone with a high-end consumer GPU can experiment with advanced audio reasoning locally.
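As a rough sanity check on those numbers: NVFP4 stores 4-bit weight values plus one FP8 scale per 16-element block (about 0.5 extra bits per weight), so the quantized weight footprint can be estimated with back-of-envelope arithmetic. This is illustrative only; the real checkpoint keeps some layers (e.g. `lm_head`) in higher precision.

```python
# Back-of-envelope NVFP4 size estimate (illustrative, not measured).
bf16_gb = 67.0               # reported BF16 checkpoint size
params_b = bf16_gb / 2       # BF16 is 2 bytes/param -> ~33.5B params implied
bits_per_param = 4 + 8 / 16  # 4-bit values + one FP8 scale per 16-value block
nvfp4_gb = params_b * bits_per_param / 8
print(round(nvfp4_gb, 1))    # ~18.8 GB for the quantized weights alone
```

The gap up to the observed ~22 GB plausibly comes from the parts kept in higher precision and other per-tensor metadata.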


Check the original model card for more information about this model.

# Running the model with vLLM in Docker
The model requires a specific vLLM container released by the model authors.
```sh
docker run --rm -ti --gpus all \
    -v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
    -p 9999:9999 \
    stepfun2025/vllm:step-audio-2-v20250909 \
    vllm serve /Step-Audio-R1 \
    --served-model-name Step-Audio-R1 \
    --port 9999 \
    --max-model-len 16384 \
    --max-num-seqs 32 \
    --chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("<audio_patch>\n", "<audio_patch>") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}<audio_patch>{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'<|BOT|>system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{{- '"'"'<|BOT|>tool_json_schemas\n'"'"' + tools|tojson + '"'"'<|EOT|>'"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'<|BOT|>system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'<|BOT|>human\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'<|BOT|>assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"'<|BOT|>'"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"'<|EOT|>'"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'<|BOT|>assistant\n<think>\n'"'"' -}}{%- endif -%}' \
    --enable-log-requests \
    --interleave-mm-strings \
    --trust-remote-code
```

This example script sends a WAV file to the model and streams back a response based on the prompt.
```py
import requests
import base64

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "Step-Audio-R1",
    "stream": True,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "audio_data": audio_b64, "mime_type": "audio/wav"},
                {"type": "text", "text": "Transcribe this and describe the speaker."}
            ]
        }
    ]
}

with requests.post(
    "http://localhost:9999/v1/chat/completions",
    json=payload,
    stream=True,
) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
```
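The raw stream above prints server-sent-event lines as-is. If the endpoint follows vLLM's usual OpenAI-compatible streaming format, each line looks like `data: {...}` and the text lives under `choices[0].delta.content`. A small helper (the `parse_sse_line` name is mine, for illustration) can pull the text out:

```python
import json

def parse_sse_line(line: str):
    """Extract the content delta from one OpenAI-style SSE line, or None."""
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):]
    if data == "[DONE]":          # stream terminator
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content")

# Example with a chunk shaped like an OpenAI-compatible stream:
sample = 'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(parse_sse_line(sample))  # Hello
```

In the loop above you would call `parse_sse_line(line.decode())` and print only non-`None` results to get clean text output.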

This was tested on an RTX Pro 6000 Blackwell cloud instance.

If there are other models you'd like to see quantized to NVFP4 for use on the DGX Spark or other modern Blackwell (or newer) cards, let me know. I'm trying to make more NVFP4 models available so more people can try them out.