Fix TTS bug - Error: input 'depth_slices_in' is missing in 'feeds'.

#3
Liquid AI org
edited Mar 10

TTS Bug Fix: depth_slices_in missing in vocoder feeds

Error

input 'depth_slices_in' is missing in 'feeds'

Thrown when switching to TTS mode and sending a text message. The sampleAudioCodes
function in audio-model.js calls vocoderSession.run(feeds) with only 5 of the 8
required inputs.

Root Cause Analysis

What the model actually expects

I fetched and inspected vocoder_depthformer_q4.onnx directly. It needs 8 inputs,
but the JS code only provides 5.

| Input | Type | Shape | Status |
| --- | --- | --- | --- |
| hidden_states | float32 | [batch, 2048] | provided |
| depth_slices_in | float32 | [batch, 8, 1024] | MISSING |
| step_idx | int64 | [] (scalar) | provided |
| prev_token | int64 | [batch] | provided |
| past_keys | float32 | [6, batch, 8, past_len, 32] | provided (wrong initial shape) |
| past_values | float32 | [6, batch, 8, past_len, 32] | provided (wrong initial shape) |
| seqlens_k | int32 | [batch] | MISSING |
| total_seq_len | int32 | [] (scalar) | MISSING |

It also produces 4 outputs, but the JS code only reads 3:

| Output | Status |
| --- | --- |
| logits | read |
| depth_slices | NOT captured |
| new_keys | read |
| new_values | read |

What each missing input is

depth_slices_in / depth_slices (the main bug)

Inspecting the model graph reveals an If node that produces depth_slices:

is_first = (step_idx == 0)
if is_first:
    depth_slices = depth_linear(hidden_states)   // MatMul + bias + Reshape -> [1, 8, 1024]
else:
    depth_slices = depth_slices_in               // pass-through

This is the depth_linear projection from the Python model (self.depth_linear(hidden_state)),
which maps [hidden_size] to [8 codebooks, 1024 depthformer_dim]. The ONNX model
computes this projection internally at step 0 (ignoring depth_slices_in), outputs the
result as depth_slices, and then at steps 1-7 it just returns depth_slices_in unchanged.
The caller feeds the output depth_slices back as depth_slices_in for subsequent steps,
avoiding 7 redundant linear projections.

At step 0: pass zeros [1, 8, 1024] for depth_slices_in (the model ignores it).
At steps 1-7: pass back outputs.depth_slices from the previous step.
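The recurrence can be sketched in plain JavaScript (an emulation of the If-node; `depthLinear` is a stand-in for the model's internal MatMul + bias + Reshape, and the function name is mine):

```javascript
// Emulates the vocoder's If-node: the depth_linear projection runs exactly
// once, at step 0; steps 1-7 just pass depth_slices_in through unchanged.
function runDepthSlices(hiddenStates, depthLinear, numSteps = 8) {
  let depthSlicesIn = new Float32Array(8 * 1024); // zeros at step 0 (ignored)
  const perStep = [];
  for (let step = 0; step < numSteps; step++) {
    const depthSlices = step === 0
      ? depthLinear(hiddenStates) // projection computed exactly once
      : depthSlicesIn;            // pass-through at steps 1-7
    perStep.push(depthSlices);
    depthSlicesIn = depthSlices;  // caller feeds the output back in
  }
  return perStep;
}
```

This is why feeding zeros at step 0 is safe: the branch that would read `depth_slices_in` is never taken on the first step.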

seqlens_k and total_seq_len

Required by the 6 GroupQueryAttention (GQA) operators inside the depthformer. They
track the attention window boundaries:

  • seqlens_k[0] = number of past key tokens = current step index i
  • total_seq_len = past tokens + 1 current token = i + 1
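As a sanity check, the per-step bookkeeping is just (helper name is illustrative, not from the codebase):

```javascript
// GQA attention-window bookkeeping for codebook step i.
function gqaWindow(i) {
  return {
    seqlensK: i,        // past key tokens already in the cache
    totalSeqLen: i + 1, // past tokens + the 1 current token
  };
}
```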

Why ASR works but TTS breaks

ASR (transcribe()) never calls sampleAudioCodes. Its path is: audio input, mel
spectrogram, audio encoder, main decoder, then sampleToken() for text tokens. The
vocoder session is never touched.

TTS (generateSpeech()) calls sampleAudioCodes for every audio frame, which is
exactly where vocoderSession.run(feeds) is called with the incomplete feeds object.
So the bug is 100% isolated to the vocoder path, which is why one direction works
and the other does not.

Initial past_keys/past_values shape mismatch

The model expects [6, batch, numKvHeads, past_len, headDim] = [6, 1, 8, 0, 32]
initially, but the code creates [6, 1, 0, 8, 32] (dims 2 and 3 are swapped). With
0 elements this does not crash, but it is incorrect.

Changes in audio-model.js

All changes are contained within sampleAudioCodes and _initVocoderCache. No other functions need modification.

Step 1: Update _initVocoderCache

Pre-allocate the new reusable typed arrays and tensors. The typed arrays must be created before the cache object, since an object literal cannot reference its own sibling properties:

const seqlensKData = new Int32Array(1);
const totalSeqLenData = new Int32Array(1);

// ... then in the cache object:
seqlensKData,
seqlensKTensor: new ort.Tensor('int32', seqlensKData, [1]),
totalSeqLenData,
totalSeqLenTensor: new ort.Tensor('int32', totalSeqLenData, []),
emptyDepthSlicesData: new Float32Array(8 * 1024),  // zeros for step 0

Step 2: Fix initial past_keys/past_values shape

Change [numLayers, 1, 0, numKvHeads, headDim] to [numLayers, 1, numKvHeads, 0, headDim].

Step 3: Add depth_slices_in recurrent state to the codebook loop

Before the loop:

let depthSlicesIn = new ort.Tensor('float32', cache.emptyDepthSlicesData, [1, 8, 1024]);

Inside feeds: depth_slices_in: depthSlicesIn

After vocoderSession.run(feeds): depthSlicesIn = outputs.depth_slices

Step 4: Add seqlens_k and total_seq_len to feeds

cache.seqlensKData[0] = i;        // past tokens = current step index
cache.totalSeqLenData[0] = i + 1; // total = past + 1 current
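Putting Steps 3 and 4 together, one iteration of the codebook loop could assemble the full 8-input feeds object like this (a sketch; the helper name is mine, and the tensor arguments are whatever ort.Tensor values the surrounding loop already holds):

```javascript
// Sketch: build the complete feeds object for codebook step i.
// The two Int32Array writes plus the three new keys are the actual fix;
// the other five inputs were already being passed before.
function buildVocoderFeeds(cache, i, t) {
  cache.seqlensKData[0] = i;        // past tokens = current step index
  cache.totalSeqLenData[0] = i + 1; // total = past + 1 current
  return {
    hidden_states: t.hiddenStates,
    depth_slices_in: t.depthSlicesIn, // zeros at step 0, fed back afterwards
    step_idx: t.stepIdx,
    prev_token: t.prevToken,
    past_keys: t.pastKeys,
    past_values: t.pastValues,
    seqlens_k: cache.seqlensKTensor,
    total_seq_len: cache.totalSeqLenTensor,
  };
}
```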

Step 5: Add depth_slices to WebGPU preferredOutputLocation

Avoids a GPU-CPU-GPU roundtrip for depth_slices, which is fed back immediately in
the next codebook step.

{ preferredOutputLocation: { new_keys: 'gpu-buffer', new_values: 'gpu-buffer', depth_slices: 'gpu-buffer' } }
Paulescu changed pull request status to open
ykhrustalev changed pull request status to merged
