---
license: mit
tags:
- audio
- speaker-embedding
- voice-cloning
- moshi
- tts
language:
- en
- fr
library_name: transformers
base_model: kyutai/tts-1.6b-en_fr
---

# Unmute Encoder

A speaker embedding encoder trained to replicate Kyutai's unreleased "unmute encoder".
This model extracts speaker embeddings from audio for use with Kyutai's Moshi TTS system.

## Model Description

The encoder is built on top of Kyutai's Mimi neural audio codec:

1. **Mimi Encoder**: a frozen Mimi encoder extracts latent audio representations
2. **MLP Projector**: a trainable MLP head projects Mimi's latents to the target embedding space
3. **Output**: speaker embeddings of shape `[512, 125]` (512 channels, 125 time steps for 10 s of audio)

```
Audio (24kHz, 10s) -> Mimi Encoder -> Latent [512, T] -> MLP Projector -> Embedding [512, 125]
```

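The released checkpoint defines the actual projector; the sketch below only illustrates the frozen-encoder-plus-MLP layout described above. The `UnmuteStyleEncoder` wrapper name, the hidden width, and the activation are assumptions, not the shipped implementation.

```python
import torch
import torch.nn as nn


class UnmuteStyleEncoder(nn.Module):
    """Illustrative frozen-Mimi-encoder + MLP-projector layout (not the released code).

    `mimi_encoder` is any module mapping 24 kHz audio [B, 1, T_samples] to
    latents [B, 512, T_frames]; at Mimi's ~12.5 Hz frame rate, 10 s of audio
    gives T_frames = 125, matching the [512, 125] shape above.
    """

    def __init__(self, mimi_encoder: nn.Module, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.mimi_encoder = mimi_encoder.eval()
        for p in self.mimi_encoder.parameters():
            p.requires_grad_(False)  # keep the Mimi encoder frozen
        # Trainable MLP head applied to each latent frame (width/activation are guesses)
        self.projector = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            latents = self.mimi_encoder(audio)           # [B, 512, T_frames]
        x = self.projector(latents.transpose(1, 2))      # project frame-wise
        return x.transpose(1, 2)                         # [B, 512, T_frames]
```
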
## Usage

```python
from src.models.mimi import MimiEncoder

# Load the encoder
encoder = MimiEncoder.from_pretrained(
    model_name="jspaulsen/unmute-encoder",
    device="cuda",
    num_codebooks=32,
)

# Create an embedding from an audio tensor [1, 1, T] at 24 kHz
output = encoder(audio_tensor)
embedding = output.embedding  # [1, 512, 125]
```

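The snippet above assumes `audio_tensor` is already a mono, 24 kHz, 10-second clip of shape `[1, 1, T]`. One way to prepare such a tensor with `torchaudio` is sketched below; the exact preprocessing used during training is not documented here, and the file name is a placeholder.

```python
import torch
import torchaudio

TARGET_SR = 24_000
TARGET_SAMPLES = TARGET_SR * 10  # 10 seconds

# Load the reference clip and convert to mono
waveform, sr = torchaudio.load("reference_voice.wav")  # [channels, T]
waveform = waveform.mean(dim=0, keepdim=True)          # [1, T]

# Resample to 24 kHz if needed
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)

# Trim or zero-pad to exactly 10 seconds
if waveform.shape[-1] > TARGET_SAMPLES:
    waveform = waveform[..., :TARGET_SAMPLES]
else:
    waveform = torch.nn.functional.pad(
        waveform, (0, TARGET_SAMPLES - waveform.shape[-1])
    )

audio_tensor = waveform.unsqueeze(0)  # [1, 1, T], as expected by the encoder
```

Fixing the clip to exactly 10 s keeps the output time axis at the expected 125 frames.
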
## Training

Trained using supervised learning with a hybrid loss (L1 + cosine similarity) against
speaker embeddings from [kyutai/tts-voices](https://huggingface.co/kyutai/tts-voices).

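The relative weighting of the two terms and any normalization applied to the embeddings are not specified here; a minimal sketch of such a hybrid objective, with an assumed weighting factor, might look like:

```python
import torch
import torch.nn.functional as F


def hybrid_loss(pred: torch.Tensor, target: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """L1 + cosine-similarity loss between predicted and reference speaker
    embeddings of shape [B, 512, 125]. `alpha` is an assumed weighting,
    not the value used for this checkpoint."""
    l1 = F.l1_loss(pred, target)
    # Cosine similarity per time step across the 512 channels, turned into a loss
    cos = F.cosine_similarity(pred, target, dim=1)  # [B, 125]
    cos_loss = (1.0 - cos).mean()
    return l1 + alpha * cos_loss
```
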
### Training Details

- **Global step**: 950
- **Epoch**: ~158.3
- **Best metric**: 0.4027

## Acknowledgments

- [Kyutai](https://kyutai.org/) for releasing the Moshi TTS models and speaker embeddings