---
license: cc-by-4.0
tags:
- audio
- text-sync
- mimi
- codec
---

# TextSyncMimi-v1

**TextSyncMimi** provides a *text‑synchronous* speech representation designed to plug into LLM‑based speech generation. Instead of operating at a fixed frame rate (time‑synchronous), it represents speech **per text token** and reconstructs high‑fidelity audio through a [Mimi](https://huggingface.co/kyutai/mimi)‑compatible neural audio decoder.

> TL;DR: We turn **time‑synchronous** Mimi latents into **text‑synchronous** token latents \([tᵢ, sᵢ]\) (one speech latent sᵢ per text token tᵢ), then expand them back to Mimi latents and decode to waveform. This makes token‑level control and alignment with LLM text outputs straightforward.

## Model overview

<div align="center">
<img src="https://i.postimg.cc/V6D84Sxs/Screenshot-2568-08-12-at-16-07-13.png" alt="TextSyncMimi" width="75%" style="margin-left:auto; margin-right:auto; display:block;"/>
</div>

  
- **TextSyncMimi components:**
  - **Cross‑attention encoder** — aligns Mimi’s time‑synchronous latent sequence (length *T*) to the text sequence (length *N*), producing one continuous speech latent per text token (see the sketch below).
  - **Causal decoder** — expands the token‑level latents back to a Mimi‑rate latent sequence that a Mimi decoder can consume; being causal, it supports streaming.
  - **[Mimi backbone](https://huggingface.co/kyutai/mimi)** — the pretrained codec that maps between waveform and time‑synchronous latents.
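
A toy sketch of the cross‑attention pooling idea (shapes only; the class name, dimensions, and single attention layer are illustrative assumptions, not the actual architecture):

```python
import torch
import torch.nn as nn

class CrossAttentionPooler(nn.Module):
    """Toy sketch: produce one speech latent per text token via cross-attention."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_emb: torch.Tensor, mimi_latents: torch.Tensor) -> torch.Tensor:
        # text_emb:     (B, N, d) -- queries, one per text token
        # mimi_latents: (B, T, d) -- keys/values at Mimi's fixed frame rate
        pooled, _ = self.attn(query=text_emb, key=mimi_latents, value=mimi_latents)
        return pooled  # (B, N, d): one text-synchronous speech latent per token

pooler = CrossAttentionPooler()
z = pooler(torch.randn(2, 20, 512), torch.randn(2, 150, 512))  # N=20 tokens, T=150 frames
assert z.shape == (2, 20, 512)
```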
  
## Training / Evaluation
- **Losses**: (i) **L2** distance between predicted and ground‑truth continuous Mimi latents, and (ii) **BCE** on the stop token during expansion (a minimal sketch follows the results table below).
- **Training data**: LibriSpeech (960 hours) + LibriTTS (585 hours), around 1.5K hours in total
- **Results**: ASR WER on audio reconstructed by different methods (NB: the non‑zero WER of the ground‑truth audio comes from ASR errors):

  | Method              | Train data                   | WER ↓ |
  |---------------------|------------------------------|------:|
  | Ground‑truth        | –                            |  2.12 |
  | Mimi                | –                            |  2.29 |
  | TASTE               | Emilia + LibriTTS            |  4.40 |
  | **TextSyncMimi v1** | **LibriTTS‑R + LibriSpeech** | **3.06** |
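
As a rough illustration of the training objective above, here is a minimal sketch assuming PyTorch; the per‑frame stop‑target convention and the equal weighting of the two terms are assumptions, not confirmed details:

```python
import torch
import torch.nn.functional as F

def training_loss(pred_latents: torch.Tensor,  # (B, T, d) predicted Mimi latents
                  gt_latents: torch.Tensor,    # (B, T, d) ground-truth Mimi latents
                  stop_logits: torch.Tensor,   # (B, T) per-frame stop logits
                  stop_target: torch.Tensor,   # (B, T) 1 at assumed stop frames, else 0
                  stop_weight: float = 1.0):   # assumed weighting between the two terms
    # (i) L2 distance between predicted and ground-truth continuous latents
    l2 = F.mse_loss(pred_latents, gt_latents)
    # (ii) BCE for the stop token that ends each token's expansion
    bce = F.binary_cross_entropy_with_logits(stop_logits, stop_target.float())
    return l2 + stop_weight * bce
```
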
## Usage

### Loading the Model

```python
from transformers import AutoModel

# trust_remote_code loads the model's custom architecture code from the repo
model = AutoModel.from_pretrained("potsawee/TextSyncMimi-v1", trust_remote_code=True)
```

See the code of [Speech Editing with TextSyncMimi](https://huggingface.co/spaces/potsawee/TextSyncMimi-SpeechEditing/blob/main/app.py) for example uses of the model (e.g., encoding, decoding, and token swapping).
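
As orientation only, here is a hypothetical sketch of the encode/decode round trip. The `encode`/`decode` method names, their signatures, and the placeholder inputs below are assumptions for illustration; the actual interface is defined by the model's remote code and demonstrated in the Space linked above:

```python
import torch

waveform = torch.randn(1, 1, 24000)          # 1 s of 24 kHz audio (Mimi's sample rate)
text_ids = torch.tensor([[101, 2023, 102]])  # tokenized transcript (placeholder IDs)

with torch.no_grad():
    # hypothetical call: align speech to text, one latent per text token
    token_latents = model.encode(waveform, text_ids)
    # hypothetical call: expand token latents to Mimi-rate latents and decode
    reconstructed = model.decode(token_latents)
```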


## Acknowledgements

- Built on top of [Kyutai's Mimi](https://huggingface.co/kyutai/mimi) audio codec