Can this model be called a "VAE" (Variational Autoencoder)?

#10
by oayk - opened

Hello, I'm investigating codec architectures for TTS models. When I tried to train an LLM-based TTS model with Meta's EnCodec, I ran into problems injecting the EnCodec encoder outputs into the LLM. While researching this, I learned that EnCodec doesn't have a VAE-style latent space...

So I'm wondering: can Mimi be called a VAE? Does it have a regularized latent space, or is it unregularized like EnCodec?

Thanks in advance.

Kyutai org

Mimi uses RVQ like EnCodec, so there is indeed no KL regularization loss. Look up VQ-VAE for more info.
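To make the distinction concrete, here is a minimal sketch of residual vector quantization (RVQ), the scheme Mimi and EnCodec use: each stage quantizes the residual left over by the previous stage against its own codebook, and no KL term appears anywhere. The codebooks below are random illustrative values, not the real Mimi/EnCodec weights.

```python
# Toy RVQ sketch (illustrative, not Mimi/EnCodec code).
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rvq_encode(x, codebooks):
    """Return one code index per quantizer stage for vector x."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        # Pick the nearest codeword in this stage's codebook.
        idx = min(range(len(cb)), key=lambda i: sq_dist(cb[i], residual))
        indices.append(idx)
        # The next stage quantizes what this stage left over.
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords to reconstruct the latent."""
    vec = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, indices):
        vec = [v + c for v, c in zip(vec, cb[i])]
    return vec

random.seed(0)
dim, n_stages, cb_size = 8, 4, 16
codebooks = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(cb_size)]
             for _ in range(n_stages)]
x = [random.gauss(0, 1) for _ in range(dim)]
codes = rvq_encode(x, codebooks)      # one discrete token per stage
x_hat = rvq_decode(codes, codebooks)  # approximate reconstruction
```

This is why the latent "space" of such codecs is a set of discrete token sequences rather than a continuous distribution you could regularize with a KL term.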

ManuKyutai changed discussion status to closed

As I understand it, Pocket-TTS uses Mimi for audio tokenization and conditioning. Regarding this, I have two technical follow-ups:

  1. Is the Mimi architecture used in Pocket-TTS identical to the original Kyutai-Mimi, or are there custom architectural modifications (e.g., changes in codebook size or residual depth) to better suit the TTS task?

  2. Since these architectures lack KL regularization—unlike standard VAEs—how did you handle the latent space stability during training? Did the discrete nature of the codebook tokens make the conditioning process harder for the LLM to converge, or did you find that the lack of regularization actually helped in maintaining the speaker identity more sharply during the cloning process?

I'm trying to understand whether the quantization noise or the "hard" decision boundaries of the codebook created any bottleneck in your speaker cloning performance.

Kyutai org

The "mimi" used for Pocket TTS is indeed a VAE, and its latent space is regularized.
