potsawee committed
Commit db9c863 · verified · 1 Parent(s): d5b2ce4

Update README.md

Files changed (1)
  1. README.md +27 -118
README.md CHANGED
@@ -9,136 +9,45 @@ tags:
 
  # TextSyncMimi-v1
 
- TextSyncMimi is a text-synchronous neural audio codec model for high-quality text-to-speech synthesis. It extends the Mimi audio codec with text-speech alignment capabilities through cross-attention transformers, enabling controllable and efficient speech generation.
-
- ## Model Description
-
- TextSyncMimi-v1 is built on top of the Mimi audio codec and introduces:
-
- - **Text-Speech Alignment**: Cross-attention transformers that align text representations with speech features
- - **Autoregressive Generation**: Causal attention transformers that generate audio autoregressively
- - **Token-Level Control**: Direct text-token-to-speech-frame alignment for fine-grained control
- - **End Token Prediction**: BCE-based end-token classification for dynamic speech duration
-
- ### Architecture
-
- The model consists of:
-
- 1. **Text Embedding Layer**: Learnable embeddings (vocab_size=128,256, dim=4,096) matching the LLaMA-3 tokenizer
- 2. **Mimi Encoder**: Pre-trained audio encoder from Kyutai's Mimi model
- 3. **Text Projection**: Linear projection from 4,096 to 512 dimensions
- 4. **Cross-Attention Transformer**: 4 layers for text-speech alignment
- 5. **Autoregressive Transformer**: 4 layers for causal speech generation
- 6. **End Token Classifier**: Binary classifier for stopping generation
-
- ### Key Features
-
- - **Sample Rate**: 24,000 Hz
- - **Frame Rate**: 12.5 frames/second
- - **Vocabulary Size**: 128,256 (LLaMA-3 tokenizer)
- - **Hidden Size**: 512
- - **Max Z Tokens per Text Token**: 50 (configurable)
-
  ## Usage
 
- ### Installation
-
- ```bash
- pip install transformers torch soundfile librosa
- ```
-
  ### Loading the Model
 
  ```python
  from transformers import AutoModel, AutoTokenizer
  import torch
 
- # Load model and tokenizer
  model = AutoModel.from_pretrained("your-username/TextSyncMimi-v1", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
-
- # Move to GPU if available
- device = "cuda" if torch.cuda.is_available() else "cpu"
- model = model.to(device)
- model.eval()
- ```
-
-
- ### Generating Speech
-
- ```python
- import torch
- import librosa
- import soundfile as sf
- from transformers import MimiModel
-
- # Assumes `model`, `tokenizer`, and `device` from the previous snippet
-
- # Load the Mimi decoder for waveform reconstruction
- mimi_model = MimiModel.from_pretrained("kyutai/mimi")
- mimi_model.to(device)
- mimi_model.eval()
-
- # Prepare text input
- text = "Hello, this is a test of text to speech synthesis."
- tokens = tokenizer(text, return_tensors="pt", add_special_tokens=False)
- text_token_ids = tokens.input_ids.to(device)
-
- # Prepare reference audio that provides the speaking style
- reference_audio, sr = librosa.load("reference.wav", sr=24000, mono=True)
- audio_inputs = torch.from_numpy(reference_audio).unsqueeze(0).unsqueeze(0).to(device)
-
- # Generate speech
- with torch.no_grad():
-     # Generate z-tokens autoregressively
-     z_tokens_list = model.generate_autoregressive(
-         text_token_ids=text_token_ids,
-         input_values=audio_inputs,
-         max_z_tokens=50,
-         end_token_threshold=0.5,
-         device=device,
-     )
-
-     # Decode z-tokens to audio through the Mimi decoder stack
-     if len(z_tokens_list[0]) > 0:
-         z_tokens_batch = torch.stack(z_tokens_list[0], dim=0).unsqueeze(0)
-         embeddings_bct = z_tokens_batch.transpose(1, 2)
-         embeddings_upsampled = mimi_model.upsample(embeddings_bct)
-         decoder_outputs = mimi_model.decoder_transformer(embeddings_upsampled.transpose(1, 2), return_dict=True)
-         embeddings_after_dec = decoder_outputs.last_hidden_state.transpose(1, 2)
-         audio_tensor = mimi_model.decoder(embeddings_after_dec)
-
-         # Save audio
-         audio_numpy = audio_tensor.squeeze().detach().cpu().numpy()
-         sf.write("output.wav", audio_numpy, 24000)
- ```
-
- ### Speech Editing
-
- TextSyncMimi enables fine-grained speech editing by swapping embeddings at the token level. See the Gradio demo script for examples of swapping speech embeddings between different transcripts.
-
- ## Training
-
- The model was trained on:
-
- - Combined LibriTTS and LibriSpeech datasets
- - 50 epochs with early stopping
- - Batch size: 32
- - Learning rate: 1e-3 with warmup
- - Mixed-precision (FP16) training
- - Loss: combined MSE reconstruction loss + BCE end-token loss
-
-
- ### Loss Function
-
  ```
- total_loss = reconstruction_loss + alpha * clamp(bce_loss - threshold, min=0.0)
- ```
-
- Where:
- - `alpha = 1.0`
- - `bce_threshold = 0.1`
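Read as code, the clamped loss above behaves as follows. This is a minimal sketch of the formula only; in training, `reconstruction_loss` and `bce_loss` would come from MSE over the continuous latents and BCE over the end-token logits, respectively.

```python
def total_loss(reconstruction_loss: float, bce_loss: float,
               alpha: float = 1.0, bce_threshold: float = 0.1) -> float:
    # The BCE term contributes only once it exceeds the threshold;
    # below the threshold it is clamped to zero, so a sufficiently
    # well-calibrated end-token head stops being penalised.
    return reconstruction_loss + alpha * max(bce_loss - bce_threshold, 0.0)

print(total_loss(0.5, 0.05))  # BCE under threshold: prints 0.5
```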
 
139
- ## License
140
 
141
- This model is released under the CC BY 4.0 License.
142
 
143
  ## Acknowledgements
144
 
 
  # TextSyncMimi-v1
 
+ **TextSyncMimi** provides a *text‑synchronous* speech representation designed to plug into LLM‑based speech generation. Instead of operating at a fixed frame rate (time‑synchronous), it represents speech **per text token** and reconstructs high‑fidelity audio through a Mimi‑compatible neural audio decoder.
+
+ > TL;DR: We turn **time‑synchronous** Mimi latents into **text‑synchronous** token latents \([tᵢ, sᵢ]\), then expand them back to Mimi latents and decode to waveform. This makes token‑level control and alignment with LLM text outputs straightforward.
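The "expand back" step can be pictured as a per-token loop that emits frame latents until a stop classifier crosses a threshold (the generation snippet elsewhere in this README uses `end_token_threshold=0.5` and a 50-frame cap). A toy sketch with illustrative names, not the real API:

```python
# Each token latent is expanded into frame latents until the stop
# classifier fires, with a hard cap (cf. max_z_tokens=50).
def expand_token(step_fn, stop_prob_fn, max_frames=50, threshold=0.5):
    frames = []
    for _ in range(max_frames):
        frame = step_fn(frames)              # predict the next frame latent
        frames.append(frame)
        if stop_prob_fn(frame) > threshold:  # BCE stop-token head
            break
    return frames

# Dummy example: the stop head fires on the 3rd frame
frames = expand_token(lambda fs: len(fs), lambda f: 1.0 if f >= 2 else 0.0)
print(len(frames))  # 3
```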
+ ## Model overview
+
+ <div align="center">
+ <img src="https://i.postimg.cc/V6D84Sxs/Screenshot-2568-08-12-at-16-07-13.png" alt="TextSyncMimi" width="60%" style="margin-left:auto; margin-right:auto; display:block"/>
+ </div>
+
+ - **Backbone codec:** Mimi (12.5 Hz latent sequence).
+ - **TextSyncMimi components:**
+   - **Cross‑attention encoder** — aligns Mimi's time‑synchronous sequence (length *T*) to the text sequence (length *N*), producing one continuous speech latent per text token.
+   - **Causal decoder** — expands the token‑level latents back to a Mimi‑rate latent sequence suitable for a Mimi decoder.
+
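The cross-attention alignment can be pictured as text-token queries pooling over the speech frame sequence, yielding one latent per token. A shape-level sketch with illustrative dimensions (random weights, not the trained model):

```python
import torch
import torch.nn as nn

# Toy shapes: T time-synchronous Mimi latents -> N text-synchronous latents
T, N, D = 40, 7, 512
speech = torch.randn(1, T, D)   # Mimi latent sequence at 12.5 Hz
text = torch.randn(1, N, D)     # projected text-token embeddings (queries)

# Text tokens attend over the speech sequence: one pooled latent per token
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
token_latents, _ = attn(query=text, key=speech, value=speech)
print(token_latents.shape)  # torch.Size([1, 7, 512])
```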
+ ## Training / Evaluation
+ - **Losses**: (1) **L2** distance between predicted and ground‑truth continuous Mimi latents, and (2) **BCE** for the stop token during expansion.
+ - **Training data**: LibriSpeech (960 hours) + LibriTTS (585 hours), around 1.5K hours in total.
+ - **Results**: ASR WER on audio reconstructed by different methods (NB: the non-zero WER of the ground-truth audio comes from ASR errors):
+
+ | Method | Train data | WER |
+ |------------------|------------------------------------------|------:|
+ | Ground‑truth | – | 2.12 |
+ | Mimi | – | 2.29 |
+ | TASTE | Emilia + LibriTTS | 4.40 |
+ | **TextSyncMimi v1** | **LibriTTS‑R + LibriSpeech** | **3.06** |
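WER in the table is word-level edit distance divided by reference length; a minimal reference implementation for sanity-checking reported numbers:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # DP row for zero reference words
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(wer("a b c", "a x c"))  # one substitution in three words: 0.333...
```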
 
 
 
 
 
  ## Usage
 
  ### Loading the Model
 
  ```python
  from transformers import AutoModel, AutoTokenizer
  import torch
 
+ # Load model
  model = AutoModel.from_pretrained("your-username/TextSyncMimi-v1", trust_remote_code=True)
  ```
 
+ See `demo_speech_editing.py` for example usage of the model (e.g., encoding and decoding).
 
  ## Acknowledgements