---
license: cc-by-4.0
tags:
- audio
- text-sync
- mimi
- codec
---
# TextSyncMimi-v1
**TextSyncMimi** provides a *text‑synchronous* speech representation designed to plug into LLM‑based speech generation. Instead of operating at a fixed frame rate (time‑synchronous), it represents speech **per text token** and reconstructs high‑fidelity audio through a [Mimi](https://huggingface.co/kyutai/mimi)‑compatible neural audio decoder.
> TL;DR: We turn **time‑synchronous** Mimi latents into **text‑synchronous** token latents \([tᵢ, sᵢ]\), then expand them back to Mimi latents and decode to waveform. This makes token‑level control and alignment with LLM text outputs straightforward.
## Model overview
- **TextSyncMimi components:**
- **Cross‑attention encoder** — aligns Mimi’s time‑synchronous sequence (length *T*) to the text sequence (length *N*), producing one continuous speech latent per text token.
  - **Causal decoder** — expands token‑level latents back to a Mimi‑rate latent sequence suitable for a Mimi decoder; the decoder operates in a streaming fashion.
- [Mimi backbone](https://huggingface.co/kyutai/mimi)
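The cross‑attention alignment above can be sketched as follows. This is a minimal, illustrative NumPy example of attention pooling from a length‑*T* time‑synchronous sequence onto *N* text tokens; the shapes, random inputs, and softmax pooling are assumptions for illustration, not the released architecture or weights.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D = 100, 12, 16            # Mimi frames, text tokens, latent dim

mimi_latents = rng.normal(size=(T, D))   # time-synchronous speech latents
text_queries = rng.normal(size=(N, D))   # one query vector per text token

# Scaled dot-product cross-attention: each text token attends over all frames.
scores = text_queries @ mimi_latents.T / np.sqrt(D)        # (N, T)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)              # softmax over T

# One pooled, continuous speech latent per text token (text-synchronous).
token_latents = weights @ mimi_latents                     # (N, D)
print(token_latents.shape)  # (12, 16)
```

The key property is that the output length is *N* (the text length), regardless of the audio duration *T*, which is what makes the representation align one‑to‑one with LLM text tokens.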
## Training / Evaluation
- **Losses**: (i) **L2** distance between predicted and ground‑truth continuous Mimi latents, and (ii) **BCE** for the stop token during expansion.
- **Training Data**: LibriSpeech (960 hours) + LibriTTS (585 hours), around 1.5K hours in total
- **Results**: ASR WER on audio reconstructed by different methods (note: the non‑zero WER for ground‑truth audio reflects ASR errors):
| Method | Train data | WER ↓ |
|------------------|------------------------------------------|------:|
| Ground‑truth | – | 2.12 |
| Mimi | – | 2.29 |
| TASTE | Emilia + LibriTTS | 4.40 |
| **TextSyncMimi v1** | **LibriTTS‑R + LibriSpeech** | **3.06** |
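The training objective above combines the two losses. Below is a hedged NumPy sketch of that combination, assuming the decoder predicts continuous Mimi latents plus a per‑frame stop logit; all tensors and names here are illustrative, not the released training code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 50, 16                               # Mimi-rate frames, latent dim

pred_latents = rng.normal(size=(T, D))      # decoder predictions
gt_latents = rng.normal(size=(T, D))        # ground-truth Mimi latents

# (i) L2 loss between predicted and ground-truth continuous latents
l2_loss = np.mean((pred_latents - gt_latents) ** 2)

# (ii) BCE on the stop token: label 1 on the final frame, 0 elsewhere,
# so the streaming decoder learns when to stop expanding.
stop_logits = rng.normal(size=T)
stop_labels = np.zeros(T)
stop_labels[-1] = 1.0
p = 1.0 / (1.0 + np.exp(-stop_logits))      # sigmoid
eps = 1e-9
bce_loss = -np.mean(stop_labels * np.log(p + eps)
                    + (1 - stop_labels) * np.log(1 - p + eps))

total_loss = l2_loss + bce_loss
```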
## Usage
### Loading the Model
```python
from transformers import AutoModel
model = AutoModel.from_pretrained("potsawee/TextSyncMimi-v1", trust_remote_code=True)
```
See the code of the [Speech Editing with TextSyncMimi](https://huggingface.co/spaces/potsawee/TextSyncMimi-SpeechEditing/blob/main/app.py) Space for example use cases of the model (e.g., encoding, decoding, and token swapping).
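The speech‑editing use case rests on a simple property: since latents are text‑synchronous (one per text token), replacing a word amounts to replacing its latent row(s) before decoding. A minimal sketch with random arrays standing in for learned latents (shapes and the single‑token swap are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 12, 16
utterance = rng.normal(size=(N, D))     # token-level latents for an utterance
replacement = rng.normal(size=(1, D))   # latent for the replacement word

edit_idx = 5                            # text-token position to swap
edited = np.concatenate(
    [utterance[:edit_idx], replacement, utterance[edit_idx + 1:]]
)
print(edited.shape)  # still (12, 16): one latent per text token
```

The edited sequence keeps its one‑latent‑per‑token structure, so it can be expanded back to Mimi‑rate latents and decoded to a waveform in the usual way.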
## Acknowledgements
- Built on top of [Kyutai's Mimi](https://huggingface.co/kyutai/mimi) audio codec