Firworks committed (verified) · Commit 93eac58 · 1 Parent(s): d28b051

Update README.md

Files changed (1): README.md +85 -1
@@ -12,9 +12,60 @@ base_model:
 
 > Notes: Keep `lm_head` in high precision; calibrate on long, domain-relevant sequences.
 
- Check the original model card for information about this model.
 # Running the model with VLLM in Docker
 ```sh
 docker run --rm -ti --gpus all \
 -v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
@@ -30,6 +81,39 @@ docker run --rm -ti --gpus all \
 --interleave-mm-strings \
 --trust-remote-code
 ```
 This was tested on an RTX Pro 6000 Blackwell cloud instance.
 
 If there are other models you're interested in seeing quantized to NVFP4 for use on the DGX Spark, or other modern Blackwell (or newer) cards, let me know. I'm trying to make more NVFP4 models available to allow more people to try them out.
 
 
 > Notes: Keep `lm_head` in high precision; calibrate on long, domain-relevant sequences.
 
+ # 📘 About This Model
+ 
+ This is a quantized NVFP4 (W4A4) version of Step-Audio-R1, an open-weights Qwen2-Audio–based multimodal model for audio understanding and reasoning.
+ The original BF16 model requires ~67 GB VRAM.
+ 
+ Step-Audio-R1 combines:
+ 
+ - A high-capacity audio encoder
+ - A projection layer that maps audio features into the transformer
+ - A Qwen2 language backbone for reasoning and text generation
+ 
+ The model is designed for:
+ 
+ - Speech transcription and interpretation
+ - Emotional / tonal analysis
+ - Speaker characteristics
+ - Music and sound-scene understanding
+ - High-quality step-by-step reasoning about audio inputs
+ 
+ It does not generate audio; it produces text based on audio input.
+ 
+ # 📦 What This Quantized Version Enables
+ 
+ This NVFP4 quantized version significantly reduces memory requirements:
+ 
+ - Size: ~22 GB (down from ~67 GB)
+ - Runs comfortably on a single RTX 5090
+ - Fully compatible with vLLM (including streaming text output)
+ - Preserves most of the reasoning performance of the BF16 release
+ 
+ Because of this, anyone with a high-end consumer GPU can experiment with advanced audio reasoning locally.
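As a rough back-of-the-envelope sanity check on the ~22 GB figure, here is a hypothetical sizing sketch; the parameter count below is inferred from the ~67 GB BF16 size, not taken from the model card:

```python
# Hypothetical sizing sketch; the parameter count is inferred, not official.
bf16_gb = 67                       # reported BF16 checkpoint size
bytes_per_param_bf16 = 2           # BF16 stores 2 bytes per parameter
params_b = bf16_gb / bytes_per_param_bf16  # ~33.5 billion parameters

# NVFP4 stores weights in 4 bits (0.5 bytes per parameter); per-block
# scales and high-precision modules (e.g. lm_head) add overhead on top.
nvfp4_weights_gb = params_b * 0.5
print(f"~{params_b:.1f}B params -> ~{nvfp4_weights_gb:.0f} GB of 4-bit weights")
```

The remaining few gigabytes up to ~22 GB are consistent with scale metadata, unquantized modules, and runtime buffers.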
+ 
+ # 🖥 Supported Audio Behavior
+ 
+ The model supports:
+ 
+ ✔ Streaming text output through vLLM
+ ✔ Reading uploaded audio files (WAV/MP3/etc.) via ffmpeg
+ ✘ It does not synthesize audio
+ ✘ It does not require pre-burned waveforms — any user-provided audio file works
+ 
+ Check the original model card for more information about this model.
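Since decoding goes through ffmpeg server-side, most containers and codecs should work as-is. If you want to normalize audio client-side first, a minimal sketch just builds the ffmpeg command (the 16 kHz mono target is an assumption, not a documented requirement):

```python
import shlex

def ffmpeg_to_wav_cmd(src: str, dst: str = "audio.wav", rate: int = 16000) -> list[str]:
    """Build an ffmpeg command converting `src` to mono WAV at `rate` Hz."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), "-ac", "1", dst]

# Printed rather than executed so this runs without ffmpeg installed:
print(shlex.join(ffmpeg_to_wav_cmd("input.mp3")))
# -> ffmpeg -y -i input.mp3 -ar 16000 -ac 1 audio.wav
```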
 
 # Running the model with VLLM in Docker
+ Running this model requires a specific vLLM container released by the model authors.
 ```sh
 docker run --rm -ti --gpus all \
 -v $(pwd)/Step-Audio-R1:/Step-Audio-R1 \
 
 --interleave-mm-strings \
 --trust-remote-code
 ```
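Once the container is up, you can wait for the OpenAI-compatible endpoint to answer before sending requests. A small sketch, assuming `localhost:9999` matches your `docker -p` port mapping:

```python
import time
import requests  # same client library the README's example uses

def wait_for_server(base: str = "http://localhost:9999", timeout: float = 120) -> list[str]:
    """Poll vLLM's /v1/models endpoint until the server answers, then
    return the served model IDs. Raises TimeoutError if it never does."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(f"{base}/v1/models", timeout=5)
            if r.ok:
                return [m["id"] for m in r.json().get("data", [])]
        except requests.exceptions.ConnectionError:
            pass  # server not listening yet
        time.sleep(2)
    raise TimeoutError("vLLM server did not come up in time")
```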
+ 
+ This example script streams an audio file to the model and prints the streamed response based on the prompt.
+ ```py
+ import requests
+ import base64
+ 
+ with open("audio.wav", "rb") as f:
+     audio_b64 = base64.b64encode(f.read()).decode()
+ 
+ payload = {
+     "model": "Step-Audio-R1",
+     "stream": True,
+     "messages": [
+         {
+             "role": "user",
+             "content": [
+                 {"type": "input_audio", "audio_data": audio_b64, "mime_type": "audio/wav"},
+                 {"type": "text", "text": "Transcribe this and describe the speaker."}
+             ]
+         }
+     ]
+ }
+ 
+ with requests.post(
+     "http://localhost:9999/v1/chat/completions",
+     json=payload,
+     stream=True,
+ ) as r:
+     for line in r.iter_lines():
+         if line:
+             print(line.decode())
+ ```
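The loop above prints raw server-sent-event lines. To turn the stream into plain text, here is a minimal parsing sketch; the chunk shape assumed below is the OpenAI-style streaming format that vLLM's chat endpoint emits:

```python
import json

def extract_deltas(sse_lines):
    """Collect the streamed text chunks from OpenAI-style SSE lines.

    Each data line looks like: data: {"choices":[{"delta":{"content":"..."}}]}
    The stream ends with: data: [DONE]
    """
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):]
        if body.strip() == "[DONE]":
            break
        delta = json.loads(body)["choices"][0].get("delta", {})
        if delta.get("content"):
            text.append(delta["content"])
    return "".join(text)

# Fabricated chunks, in the shape described above:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    'data: [DONE]',
]
print(extract_deltas(sample))  # -> Hello world
```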
+ 
 This was tested on an RTX Pro 6000 Blackwell cloud instance.
 
 If there are other models you're interested in seeing quantized to NVFP4 for use on the DGX Spark, or other modern Blackwell (or newer) cards, let me know. I'm trying to make more NVFP4 models available to allow more people to try them out.