Fix TTS bug - Error: input 'depth_slices_in' is missing in 'feeds'.

#3
Liquid AI org
edited Mar 10

TTS Bug Fix: depth_slices_in missing in vocoder feeds

Error

input 'depth_slices_in' is missing in 'feeds'

Thrown when switching to TTS mode and sending a text message. The sampleAudioCodes
function in audio-model.js calls vocoderSession.run(feeds) with only 5 of the 8
required inputs.

Root Cause Analysis

What the model actually expects

I fetched and inspected vocoder_depthformer_q4.onnx directly. It needs 8 inputs,
but the JS code only provides 5.

| Input | Type | Shape | Status |
| --- | --- | --- | --- |
| hidden_states | float32 | [batch, 2048] | provided |
| depth_slices_in | float32 | [batch, 8, 1024] | MISSING |
| step_idx | int64 | [] (scalar) | provided |
| prev_token | int64 | [batch] | provided |
| past_keys | float32 | [6, batch, 8, past_len, 32] | provided (wrong initial shape) |
| past_values | float32 | [6, batch, 8, past_len, 32] | provided (wrong initial shape) |
| seqlens_k | int32 | [batch] | MISSING |
| total_seq_len | int32 | [] (scalar) | MISSING |

It also produces 4 outputs, but the JS code only reads 3:

| Output | Status |
| --- | --- |
| logits | read |
| depth_slices | NOT captured |
| new_keys | read |
| new_values | read |

What each missing input is

depth_slices_in / depth_slices (the main bug)

Inspecting the model graph reveals an If node that produces depth_slices:

is_first = (step_idx == 0)
if is_first:
    depth_slices = depth_linear(hidden_states)   // MatMul + bias + Reshape -> [1, 8, 1024]
else:
    depth_slices = depth_slices_in               // pass-through

This is the depth_linear projection from the Python model (self.depth_linear(hidden_state)),
which maps [hidden_size] to [8 codebooks, 1024 depthformer_dim]. The ONNX model
computes this projection internally at step 0 (ignoring depth_slices_in), outputs the
result as depth_slices, and then at steps 1-7 it just returns depth_slices_in unchanged.
The caller feeds the output depth_slices back as depth_slices_in for subsequent steps,
avoiding 7 redundant linear projections.

At step 0: pass zeros [1, 8, 1024] for depth_slices_in (the model ignores it).
At steps 1-7: pass back outputs.depth_slices from the previous step.
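The recurrence can be sketched in plain JavaScript (an emulation of the If-node; `depthLinear` is a stand-in for the model's internal MatMul + bias + Reshape, and the function name is mine):

```javascript
// Emulates the vocoder's If-node: the depth_linear projection runs exactly
// once, at step 0; steps 1-7 just pass depth_slices_in through unchanged.
function runDepthSlices(hiddenStates, depthLinear, numSteps = 8) {
  let depthSlicesIn = new Float32Array(8 * 1024); // zeros at step 0 (ignored)
  const perStep = [];
  for (let step = 0; step < numSteps; step++) {
    const depthSlices = step === 0
      ? depthLinear(hiddenStates) // projection computed exactly once
      : depthSlicesIn;            // pass-through at steps 1-7
    perStep.push(depthSlices);
    depthSlicesIn = depthSlices;  // caller feeds the output back in
  }
  return perStep;
}
```

This is why feeding zeros at step 0 is safe: the branch that would read `depth_slices_in` is never taken on the first step.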

seqlens_k and total_seq_len

Required by the 6 GroupQueryAttention (GQA) operators inside the depthformer. They
track the attention window boundaries:

  • seqlens_k[0] = number of past key tokens = current step index i
  • total_seq_len = past tokens + 1 current token = i + 1
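As a sanity check, the per-step bookkeeping is just (helper name is illustrative, not from the codebase):

```javascript
// GQA attention-window bookkeeping for codebook step i.
function gqaWindow(i) {
  return {
    seqlensK: i,        // past key tokens already in the cache
    totalSeqLen: i + 1, // past tokens + the 1 current token
  };
}
```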

Why ASR works but TTS breaks

ASR (transcribe()) never calls sampleAudioCodes. Its path is: audio input, mel
spectrogram, audio encoder, main decoder, then sampleToken() for text tokens. The
vocoder session is never touched.

TTS (generateSpeech()) calls sampleAudioCodes for every audio frame, which is
exactly where vocoderSession.run(feeds) is called with the incomplete feeds object.
So the bug is 100% isolated to the vocoder path, which is why one direction works
and the other does not.

Initial past_keys/past_values shape mismatch

The model expects [6, batch, numKvHeads, past_len, headDim] = [6, 1, 8, 0, 32]
initially, but the code creates [6, 1, 0, 8, 32] (dims 2 and 3 are swapped). With
0 elements this does not crash, but it is incorrect.

Changes in audio-model.js

All changes are contained within sampleAudioCodes and _initVocoderCache. No other functions need modification.

Step 1: Update _initVocoderCache

Pre-allocate the new reusable typed arrays and tensors. The typed arrays must be created before the cache object, since an object literal cannot reference its own sibling properties:

const seqlensKData = new Int32Array(1);
const totalSeqLenData = new Int32Array(1);

// ... then in the cache object:
seqlensKData,
seqlensKTensor: new ort.Tensor('int32', seqlensKData, [1]),
totalSeqLenData,
totalSeqLenTensor: new ort.Tensor('int32', totalSeqLenData, []),
emptyDepthSlicesData: new Float32Array(8 * 1024),  // zeros for step 0

Step 2: Fix initial past_keys/past_values shape

Change [numLayers, 1, 0, numKvHeads, headDim] to [numLayers, 1, numKvHeads, 0, headDim].

Step 3: Add depth_slices_in recurrent state to the codebook loop

Before the loop:

let depthSlicesIn = new ort.Tensor('float32', cache.emptyDepthSlicesData, [1, 8, 1024]);

Inside feeds: depth_slices_in: depthSlicesIn

After vocoderSession.run(feeds): depthSlicesIn = outputs.depth_slices

Step 4: Add seqlens_k and total_seq_len to feeds

cache.seqlensKData[0] = i;        // past tokens = current step index
cache.totalSeqLenData[0] = i + 1; // total = past + 1 current
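Putting Steps 3 and 4 together, one iteration of the codebook loop could assemble the full 8-input feeds object like this (a sketch; the helper name is mine, and the tensor arguments are whatever ort.Tensor values the surrounding loop already holds):

```javascript
// Sketch: build the complete feeds object for codebook step i.
// The two Int32Array writes plus the three new keys are the actual fix;
// the other five inputs were already being passed before.
function buildVocoderFeeds(cache, i, t) {
  cache.seqlensKData[0] = i;        // past tokens = current step index
  cache.totalSeqLenData[0] = i + 1; // total = past + 1 current
  return {
    hidden_states: t.hiddenStates,
    depth_slices_in: t.depthSlicesIn, // zeros at step 0, fed back afterwards
    step_idx: t.stepIdx,
    prev_token: t.prevToken,
    past_keys: t.pastKeys,
    past_values: t.pastValues,
    seqlens_k: cache.seqlensKTensor,
    total_seq_len: cache.totalSeqLenTensor,
  };
}
```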

Step 5: Add depth_slices to WebGPU preferredOutputLocation

Avoids a GPU-CPU-GPU roundtrip for depth_slices, which is fed back immediately in
the next codebook step.

{ preferredOutputLocation: { new_keys: 'gpu-buffer', new_values: 'gpu-buffer', depth_slices: 'gpu-buffer' } }
Paulescu changed pull request status to open
ykhrustalev changed pull request status to merged
