
	---
	license: cc-by-4.0
	tags:
	- audio
	- text-sync
	- mimi
	- codec
	---

	# TextSyncMimi-v1

	TextSyncMimi provides a text‑synchronous speech representation designed to plug into LLM‑based speech generation. Instead of operating at a fixed frame rate (time‑synchronous), it represents speech per text token and reconstructs high‑fidelity audio through a [Mimi](https://huggingface.co/kyutai/mimi)‑compatible neural audio decoder.

	> TL;DR: We turn time‑synchronous Mimi latents into text‑synchronous token latents \([tᵢ, sᵢ]\), then expand them back to Mimi latents and decode to waveform. This makes token‑level control and alignment with LLM text outputs straightforward.

	## Model overview

	<div align="center">
	<img src="https://i.postimg.cc/V6D84Sxs/Screenshot-2568-08-12-at-16-07-13.png" alt="TextSyncMimi" width="75%" style="margin-left:auto; margin-right:auto; display:block"/>
	</div>


	TextSyncMimi has three components:

	- Cross‑attention encoder: aligns Mimi’s time‑synchronous sequence (length T) to the text sequence (length N), producing one continuous speech latent per text token.
	- Causal decoder: expands the token‑level latents back into a Mimi‑rate latent sequence suitable for a Mimi decoder. The decoder supports streaming.
	- [Mimi backbone](https://huggingface.co/kyutai/mimi)
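
To make the encoder's shape contract concrete, the sketch below runs single-head scaled dot-product cross-attention in plain Python: N text-token queries attend over T Mimi frames and yield exactly N speech latents. It is a toy illustration under stated assumptions, not the released implementation. The function name `cross_attend`, the sizes, and the absence of learned projections are all hypothetical.

```python
import math
import random

def cross_attend(text_queries, mimi_frames):
    """Single-head scaled dot-product cross-attention (toy, no learned weights).

    text_queries: N vectors of size d (one per text token)
    mimi_frames:  T vectors of size d (time-synchronous latents)
    Returns N vectors of size d: one speech latent per text token.
    """
    d = len(text_queries[0])
    latents = []
    for q in text_queries:
        # Similarity of this text token to every Mimi frame
        scores = [sum(qi * ki for qi, ki in zip(q, frame)) / math.sqrt(d)
                  for frame in mimi_frames]
        # Numerically stable softmax over the T frames
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the frames gives the token-level latent
        latents.append([sum(w * frame[j] for w, frame in zip(weights, mimi_frames))
                        for j in range(d)])
    return latents

random.seed(0)
N, T, d = 4, 25, 8  # hypothetical sizes: 4 text tokens, 25 Mimi frames
text_queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
mimi_frames = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
token_latents = cross_attend(text_queries, mimi_frames)
print(len(token_latents), len(token_latents[0]))  # 4 8
```

The output length depends only on the number of text tokens, not on the audio duration, which is what makes the representation text‑synchronous.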

	## Training / Evaluation
	- Losses: (i) L2 distance between predicted and ground‑truth continuous Mimi latents, and (ii) BCE for the stop token during expansion.
	- Training data: LibriSpeech (960 hours) + LibriTTS (585 hours), around 1.5K hours in total.
	- Results: ASR WER on audio reconstructed by each method. The non‑zero WER for ground‑truth audio comes from errors of the ASR system itself.

	| Method | Train data | WER ↓ |
	|------------------|------------------------------------------|------:|
	| Ground‑truth | – | 2.12 |
	| Mimi | – | 2.29 |
	| TASTE | Emilia + LibriTTS | 4.40 |
	| TextSyncMimi v1 | LibriTTS‑R + LibriSpeech | 3.06 |

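
The two training losses listed above can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions: the model card does not say whether the L2 term is squared or how it is averaged, nor how the two terms are weighted, so the mean squared form, the helper names `l2_latent_loss` and `stop_bce_loss`, and `stop_weight` are all hypothetical.

```python
import math

def l2_latent_loss(pred, target):
    """Mean squared L2 distance between predicted and ground-truth latent frames."""
    per_frame = [sum((p - t) ** 2 for p, t in zip(pf, tf))
                 for pf, tf in zip(pred, target)]
    return sum(per_frame) / len(per_frame)

def stop_bce_loss(stop_probs, stop_labels, eps=1e-7):
    """Binary cross-entropy on the per-step stop prediction (label 1 = stop here)."""
    total = 0.0
    for p, y in zip(stop_probs, stop_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(stop_probs)

# Toy example: one text token expanded into three Mimi-rate frames (2-dim latents)
pred = [[0.1, 0.2], [0.0, 0.5], [0.9, 0.4]]
target = [[0.0, 0.2], [0.0, 0.4], [1.0, 0.4]]
stop_probs = [0.1, 0.2, 0.9]  # predicted probability of stopping after each frame
stop_labels = [0, 0, 1]       # expansion should stop after the third frame

stop_weight = 1.0  # assumed weighting; not stated in the model card
loss = l2_latent_loss(pred, target) + stop_weight * stop_bce_loss(stop_probs, stop_labels)
print(round(loss, 4))
```

The stop label marks the last frame generated for a text token, so the BCE term teaches the decoder how many Mimi‑rate frames each token should expand into.
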
	## Usage

	### Loading the Model

	```python
	from transformers import AutoModel

	# trust_remote_code=True lets transformers run the custom model code shipped in the repo
	model = AutoModel.from_pretrained("potsawee/TextSyncMimi-v1", trust_remote_code=True)
	```

	See the code of [Speech Editing with TextSync Mimi](https://huggingface.co/spaces/potsawee/TextSyncMimi-SpeechEditing/blob/main/app.py) for an example of using the model (e.g., encoding, decoding, and swapping).


	## Acknowledgements

	- Built on top of [Kyutai's Mimi](https://huggingface.co/kyutai/mimi) audio codec