Can this model be called a "VAE" (Variational Autoencoder)?

#10
by oayk - opened

Hello, I'm investigating codec architectures for TTS models. When I tried to train an LLM-based TTS model with Meta's EnCodec, I ran into problems injecting the EnCodec encoder outputs into the LLM. While researching this, I learned that EnCodec doesn't have a VAE-style latent space...

So I'm wondering: can Mimi be called a VAE? Does it have a regularized latent space, or is it unregularized like EnCodec?

Thanks in advance.

Kyutai org

Mimi uses RVQ like EnCodec, so there is indeed no KL regularization loss. Look up VQ-VAE for more info.
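To make the distinction concrete, here is a minimal sketch of residual vector quantization (RVQ), the scheme Mimi and EnCodec use: each stage quantizes the residual left over by the previous stage against its own codebook, and no KL term appears anywhere. The codebooks below are random illustrative values, not the real Mimi/EnCodec weights.

```python
# Toy RVQ sketch (illustrative, not Mimi/EnCodec code).
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rvq_encode(x, codebooks):
    """Return one code index per quantizer stage for vector x."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        # Pick the nearest codeword in this stage's codebook.
        idx = min(range(len(cb)), key=lambda i: sq_dist(cb[i], residual))
        indices.append(idx)
        # The next stage quantizes what this stage left over.
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords to reconstruct the latent."""
    vec = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, indices):
        vec = [v + c for v, c in zip(vec, cb[i])]
    return vec

random.seed(0)
dim, n_stages, cb_size = 8, 4, 16
codebooks = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(cb_size)]
             for _ in range(n_stages)]
x = [random.gauss(0, 1) for _ in range(dim)]
codes = rvq_encode(x, codebooks)      # one discrete token per stage
x_hat = rvq_decode(codes, codebooks)  # approximate reconstruction
```

This is why the latent "space" of such codecs is a set of discrete token sequences rather than a continuous distribution you could regularize with a KL term.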

ManuKyutai changed discussion status to closed

As I understand it, Pocket-TTS uses Mimi for audio tokenization and conditioning. Regarding this, I have two technical follow-ups:

  1. Is the Mimi architecture used in Pocket-TTS identical to the original Kyutai-Mimi, or are there custom architectural modifications (e.g., changes in codebook size or residual depth) to better suit the TTS task?

  2. Since these architectures lack KL regularization—unlike standard VAEs—how did you handle the latent space stability during training? Did the discrete nature of the codebook tokens make the conditioning process harder for the LLM to converge, or did you find that the lack of regularization actually helped in maintaining the speaker identity more sharply during the cloning process?

I'm trying to understand whether the quantization noise or the "hard" decision boundaries of the codebook created any bottleneck in your speaker cloning performance.

Kyutai org

The "mimi" used for Pocket TTS is indeed a VAE, and its latent space is regularized.
