# TTS Bug Fix: `depth_slices_in` missing in vocoder feeds
## Error

```
input 'depth_slices_in' is missing in 'feeds'
```

Thrown when switching to TTS mode and sending a text message. The `sampleAudioCodes` function in `audio-model.js` calls `vocoderSession.run(feeds)` with only 5 of the 8 required inputs.
## Root Cause Analysis
### What the model actually expects
I fetched and inspected `vocoder_depthformer_q4.onnx` directly. It declares 8 inputs, but the JS code provides only 5.
| Input | Type | Shape | Status |
|---|---|---|---|
| `hidden_states` | float32 | `[batch, 2048]` | provided |
| `depth_slices_in` | float32 | `[batch, 8, 1024]` | **MISSING** |
| `step_idx` | int64 | `[]` (scalar) | provided |
| `prev_token` | int64 | `[batch]` | provided |
| `past_keys` | float32 | `[6, batch, 8, past_len, 32]` | provided (wrong initial shape) |
| `past_values` | float32 | `[6, batch, 8, past_len, 32]` | provided (wrong initial shape) |
| `seqlens_k` | int32 | `[batch]` | **MISSING** |
| `total_seq_len` | int32 | `[]` (scalar) | **MISSING** |
It also produces 4 outputs, but the JS code reads only 3:

| Output | Status |
|---|---|
| `logits` | read |
| `depth_slices` | **NOT captured** |
| `new_keys` | read |
| `new_values` | read |
### What each missing input is
#### `depth_slices_in` / `depth_slices` (the main bug)
Inspecting the model graph reveals an `If` node that produces `depth_slices`:

```
is_first = (step_idx == 0)
if is_first:
    depth_slices = depth_linear(hidden_states)  // MatMul + bias + Reshape -> [1, 8, 1024]
else:
    depth_slices = depth_slices_in              // pass-through
```
This is the `depth_linear` projection from the Python model (`self.depth_linear(hidden_state)`), which maps `[hidden_size]` to `[8 codebooks, 1024 depthformer_dim]`. The ONNX model computes this projection internally at step 0 (ignoring `depth_slices_in`), outputs the result as `depth_slices`, and at steps 1-7 simply returns `depth_slices_in` unchanged. The caller feeds the output `depth_slices` back as `depth_slices_in` for subsequent steps, avoiding 7 redundant linear projections.
- At step 0: pass zeros `[1, 8, 1024]` for `depth_slices_in` (the model ignores it).
- At steps 1-7: pass back `outputs.depth_slices` from the previous step.
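The feedback loop can be simulated in plain JS. This is a hedged sketch: `depthLinear` and `vocoderStep` below are dummy stand-ins for the projection subgraph and the `If` node inside the model, not the real weights or session.

```javascript
// Dummy stand-in for the MatMul + bias + Reshape subgraph: [2048] -> [8, 1024].
// Fills with 1 so we can observe the pass-through behaviour; real values differ.
function depthLinear(hiddenStates) {
  return Array.from({ length: 8 }, () => new Float32Array(1024).fill(1));
}

// Mimics the If node: compute the projection at step 0, pass through afterwards.
function vocoderStep(stepIdx, hiddenStates, depthSlicesIn) {
  const depthSlices = stepIdx === 0 ? depthLinear(hiddenStates) : depthSlicesIn;
  return { depth_slices: depthSlices };
}

// Caller: feed zeros at step 0, then feed the output back for steps 1-7.
let depthSlicesIn = Array.from({ length: 8 }, () => new Float32Array(1024)); // zeros
const hidden = new Float32Array(2048);
for (let i = 0; i < 8; i++) {
  const out = vocoderStep(i, hidden, depthSlicesIn);
  depthSlicesIn = out.depth_slices; // recurrent state for the next step
}
```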
#### `seqlens_k` and `total_seq_len`
Required by the 6 GroupQueryAttention (GQA) operators inside the depthformer. They
track the attention window boundaries:
- `seqlens_k[0]` = number of past key tokens = current step index `i`
- `total_seq_len` = past tokens + 1 current token = `i + 1`
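As a sanity check, the per-step values work out as follows. `gqaWindow` is an illustrative helper for this sketch, not a function in the codebase:

```javascript
// GQA bookkeeping for one codebook step: past tokens seen, total including current.
function gqaWindow(stepIdx) {
  return { seqlensK: stepIdx, totalSeqLen: stepIdx + 1 };
}

// The 8 codebook steps of one audio frame.
const windows = [];
for (let i = 0; i < 8; i++) windows.push(gqaWindow(i));
```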
### Why ASR works but TTS breaks
ASR (`transcribe()`) never calls `sampleAudioCodes`. Its path is: audio input → mel spectrogram → audio encoder → main decoder → `sampleToken()` for text tokens. The vocoder session is never touched.
TTS (`generateSpeech()`) calls `sampleAudioCodes` for every audio frame, which is exactly where `vocoderSession.run(feeds)` is called with the incomplete feeds object. The bug is therefore isolated entirely to the vocoder path, which is why one direction works and the other does not.
### Initial `past_keys`/`past_values` shape mismatch
The model expects `[6, batch, numKvHeads, past_len, headDim]` = `[6, 1, 8, 0, 32]` initially, but the code creates `[6, 1, 0, 8, 32]` (dims 2 and 3 are swapped). With 0 elements this does not crash, but it is incorrect.
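A quick illustration of why the swap goes unnoticed: both layouts contain zero elements, so no buffer-size check fires, even though the axis order differs.

```javascript
// KV-cache layout the GQA ops expect: [layers, batch, kvHeads, pastLen, headDim].
const wrongShape   = [6, 1, 0, 8, 32]; // kvHeads and pastLen swapped
const correctShape = [6, 1, 8, 0, 32];

// Element count: a zero anywhere makes the whole buffer empty.
const numel = (shape) => shape.reduce((a, b) => a * b, 1);
```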
## Changes in `audio-model.js`
All changes are contained within sampleAudioCodes and _initVocoderCache. No other functions need modification.
### Step 1: Update `_initVocoderCache`
Pre-allocate the new reusable typed arrays and tensors. The typed arrays must be created before the cache object literal, since the `ort.Tensor` wrappers share the same backing buffers:

```js
const seqlensKData = new Int32Array(1);
const totalSeqLenData = new Int32Array(1);
// ...added to the returned cache object:
seqlensKData,
seqlensKTensor: new ort.Tensor('int32', seqlensKData, [1]),
totalSeqLenData,
totalSeqLenTensor: new ort.Tensor('int32', totalSeqLenData, []),
emptyDepthSlicesData: new Float32Array(8 * 1024), // zeros for step 0
```
### Step 2: Fix initial `past_keys`/`past_values` shape
Change `[numLayers, 1, 0, numKvHeads, headDim]` to `[numLayers, 1, numKvHeads, 0, headDim]`.
### Step 3: Add `depth_slices_in` recurrent state to the codebook loop
Before the loop:

```js
let depthSlicesIn = new ort.Tensor('float32', cache.emptyDepthSlicesData, [1, 8, 1024]);
```

Inside `feeds`: `depth_slices_in: depthSlicesIn`. After `vocoderSession.run(feeds)`: `depthSlicesIn = outputs.depth_slices`.
### Step 4: Add `seqlens_k` and `total_seq_len` to feeds

```js
cache.seqlensKData[0] = i;        // past tokens = current step index
cache.totalSeqLenData[0] = i + 1; // total = past + 1 current
```
### Step 5: Add `depth_slices` to WebGPU `preferredOutputLocation`
Avoids a GPU→CPU→GPU roundtrip for `depth_slices`, which is fed straight back in the next codebook step:

```js
{ preferredOutputLocation: { new_keys: 'gpu-buffer', new_values: 'gpu-buffer', depth_slices: 'gpu-buffer' } }
```
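Putting steps 1-4 together, the fixed codebook loop looks roughly like this. `fakeSession` mocks the vocoder's If-node behaviour so the sketch runs without the ONNX model; only the names mirror the real code.

```javascript
// Mock of vocoderSession: projection at step 0, pass-through at steps 1-7.
const fakeSession = {
  async run(feeds) {
    const step = Number(feeds.step_idx.data[0]);
    const depth_slices = step === 0
      ? { data: new Float32Array(8 * 1024).fill(1), dims: [1, 8, 1024] } // "projection"
      : feeds.depth_slices_in;                                           // pass-through
    return { logits: new Float32Array(1024), depth_slices };
  },
};

async function sampleAudioCodesSketch() {
  // Step 1: reusable typed arrays (in the real code these live in the cache).
  const seqlensKData = new Int32Array(1);
  const totalSeqLenData = new Int32Array(1);
  // Step 3: recurrent depth_slices state, zeros before the first step.
  let depthSlicesIn = { data: new Float32Array(8 * 1024), dims: [1, 8, 1024] };

  for (let i = 0; i < 8; i++) {
    seqlensKData[0] = i;        // Step 4: past tokens = current step index
    totalSeqLenData[0] = i + 1; // Step 4: total = past + 1 current
    const outputs = await fakeSession.run({
      step_idx: { data: [BigInt(i)] },
      depth_slices_in: depthSlicesIn,
      seqlens_k: { data: seqlensKData },
      total_seq_len: { data: totalSeqLenData },
    });
    depthSlicesIn = outputs.depth_slices; // feed back for the next step
  }
  return depthSlicesIn;
}
```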