> Notes: Keep `lm_head` in high precision; calibrate on long, domain-relevant sequences.
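
The exact recipe used to produce this checkpoint isn't reproduced here, but as a rough illustration, an NVFP4 one-shot pass with `llmcompressor` (my assumption about the tooling; the model ID, dataset, and sequence length below are placeholders, not the actual calibration setup) looks roughly like this:

```py
# Hypothetical sketch of an NVFP4 (W4A4) one-shot quantization pass with
# llm-compressor; the real recipe for this checkpoint may differ.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",      # 4-bit weights and activations
    ignore=["lm_head"],  # keep lm_head in high precision, per the note above
)

oneshot(
    model="stepfun-ai/Step-Audio-R1",  # placeholder model ID
    dataset="open_platypus",           # placeholder; prefer long, domain-relevant sequences
    recipe=recipe,
    max_seq_length=8192,               # calibrate on long sequences, per the note above
    num_calibration_samples=512,
    output_dir="Step-Audio-R1-NVFP4",
)
```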
# 📘 About This Model

This is a quantized NVFP4 (W4A4) version of Step-Audio-R1, an open-weights, Qwen2-Audio–based multimodal model for audio understanding and reasoning. The original BF16 model requires ~67 GB of VRAM.

Step-Audio-R1 combines:

- A high-capacity audio encoder
- A projection layer that maps audio features into the transformer
- A Qwen2 language backbone for reasoning and text generation

The model is designed for:

- Speech transcription and interpretation
- Emotional / tonal analysis
- Speaker characteristics
- Music and sound-scene understanding
- High-quality step-by-step reasoning about audio inputs

It does not generate audio; it produces text based on audio input.

# 📦 What This Quantized Version Enables

This NVFP4 quantized version reduces memory requirements significantly:

- Size: ~22 GB (down from ~67 GB)
- Runs comfortably on a single RTX 5090
- Fully compatible with vLLM (including streaming text output)
- Preserves most of the reasoning performance of the BF16 release

Because of this, anyone with a high-end consumer GPU can experiment with advanced audio reasoning locally.
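
As a back-of-the-envelope check on those numbers (my own arithmetic, not from the original release): ~67 GB at 16 bits per weight implies roughly 33B parameters, and NVFP4 stores weights at about 4.5 bits each once the per-block FP8 scales are counted:

```py
# Rough size arithmetic for the NVFP4 checkpoint (illustrative, not official).
bf16_gb = 67                          # reported BF16 checkpoint size
params_b = bf16_gb * 8 / 16           # ~33.5B parameters at 16 bits/weight
nvfp4_bits = 4 + 8 / 16               # 4-bit weights + one FP8 scale per 16-element block
nvfp4_gb = params_b * nvfp4_bits / 8  # ~18.8 GB for the quantized linear weights

print(f"~{params_b:.1f}B params -> ~{nvfp4_gb:.1f} GB quantized")
# The remaining few GB up to ~22 GB come from the parts kept in higher
# precision (lm_head, audio encoder, norms) outside the 4-bit scheme.
```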

# 🖥 Supported Audio Behavior

The model supports:

- ✔ Streaming text output through vLLM
- ✔ Reading uploaded audio files (WAV, MP3, etc.) via ffmpeg
- ✘ It does not synthesize audio
- ✘ It does not require specially prepared waveforms; any user-provided audio file works

Check the original model card for more information.
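
If the server rejects a particular format, one option is to normalize the input to mono 16 kHz WAV first. A minimal sketch calling ffmpeg from Python (the 16 kHz mono choice is my assumption, a common speech-model default, not a documented requirement):

```py
import subprocess

def to_wav(src: str, dst: str = "audio.wav", rate: int = 16000) -> str:
    """Convert any ffmpeg-readable audio file to mono 16 kHz WAV."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(rate), dst],
        check=True,
    )
    return dst

to_wav("input.mp3")  # writes audio.wav, ready to base64-encode for the API
```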

# Running the model with vLLM in Docker

Running the model requires the specific vLLM container released by the model authors.
```sh
docker run --rm -ti --gpus all \
  -v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
  ...
  --interleave-mm-strings \
  --trust-remote-code
```
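
The server can take a while to load ~22 GB of weights. One way to check readiness is the standard OpenAI-compatible `/v1/models` endpoint (port 9999 matches the example below; adjust to however you mapped the container's port):

```py
import requests

# A 200 from /v1/models means the vLLM server has finished loading the weights.
r = requests.get("http://localhost:9999/v1/models", timeout=5)
r.raise_for_status()
print([m["id"] for m in r.json()["data"]])
```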

This example script streams an audio WAV file to the model and prints the response to the prompt as it arrives.
```py
import base64

import requests

# Read the audio file and base64-encode it for the request payload.
with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "Step-Audio-R1",
    "stream": True,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "audio_data": audio_b64, "mime_type": "audio/wav"},
                {"type": "text", "text": "Transcribe this and describe the speaker."},
            ],
        }
    ],
}

# Stream the chat completion and print each server-sent event line.
with requests.post(
    "http://localhost:9999/v1/chat/completions",
    json=payload,
    stream=True,
) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
```
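
The raw stream above is OpenAI-style server-sent events, one `data: {...}` line per chunk. If you only want the generated text, a small helper (assuming the usual OpenAI-compatible delta format, which vLLM follows) can do the parsing:

```py
import json

import requests

def stream_text(payload: dict, url: str = "http://localhost:9999/v1/chat/completions") -> None:
    """Print only the generated text from an OpenAI-style SSE stream."""
    with requests.post(url, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            # Chunks look like: data: {"choices":[{"delta":{"content":"..."}}]}
            if not line or not line.startswith(b"data: "):
                continue
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            delta = json.loads(chunk)["choices"][0]["delta"]
            print(delta.get("content") or "", end="", flush=True)
    print()

# Reuses the `payload` dict from the script above:
# stream_text(payload)
```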
This was tested on an RTX Pro 6000 Blackwell cloud instance.

If there are other models you'd like to see quantized to NVFP4 for use on the DGX Spark or other modern Blackwell (or newer) cards, let me know. I'm trying to make more NVFP4 models available so that more people can try them out.