potsawee/TextSyncMimi-v1

+---
+license: cc-by-4.0
+tags:
+- audio
+- text-sync
+- mimi
+- codec
+---
+# TextSyncMimi-v1
+TextSyncMimi is a text-synchronous neural audio codec model for high-quality text-to-speech synthesis. It extends the Mimi audio codec with text-speech alignment capabilities through cross-attention transformers, enabling controllable and efficient speech generation.
+## Model Description
+TextSyncMimi-v1 is built on top of the Mimi audio codec and introduces:
+- **Text-Speech Alignment**: Cross-attention transformers that align text representations with speech features
+- **Autoregressive Generation**: Causal-attention transformers that generate audio autoregressively
+- **Token-Level Control**: Direct text token to speech frame alignment for fine-grained control
+- **End Token Prediction**: End-token classification trained with binary cross-entropy (BCE), so speech duration is determined dynamically
+### Architecture
+The model consists of:
+1. **Text Embedding Layer**: Learnable embeddings (`vocab_size=128256`, `dim=4096`) matching the LLaMA-3 tokenizer
+2. **Mimi Encoder**: Pre-trained audio encoder from Kyutai's Mimi model
+3. **Text Projection**: Linear projection from 4,096 to 512 dimensions
+4. **Cross-Attention Transformer**: 4 layers for text-speech alignment
+5. **Autoregressive Transformer**: 4 layers for causal speech generation
+6. **End Token Classifier**: Binary classifier for stopping generation
+### Key Features
+- **Sample Rate**: 24,000 Hz
+- **Frame Rate**: 12.5 frames/second
+- **Vocabulary Size**: 128,256 (LLaMA-3 tokenizer)
+- **Hidden Size**: 512
+- **Max Z Tokens per Text Token**: 50 (configurable)
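As a quick sanity check on these numbers, the sketch below derives the number of audio samples per frame and the longest stretch of audio a single text token can cover. It assumes each z-token corresponds to one frame at 12.5 frames/second, which the list above does not state explicitly. The constant names are illustrative and not taken from the model's configuration.

```python
# Values copied from the Key Features list; the names are illustrative.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 12.5
MAX_Z_TOKENS_PER_TEXT_TOKEN = 50

# Audio samples represented by one frame.
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)

# Upper bound on audio duration per text token,
# ASSUMING one z-token equals one 12.5 Hz frame.
max_seconds_per_text_token = MAX_Z_TOKENS_PER_TEXT_TOKEN / FRAME_RATE_HZ

print(samples_per_frame)
print(max_seconds_per_text_token)
```

Under that assumption, raising the per-token cap trades longer possible pauses or drawn-out words against a longer worst-case generation loop.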
+## Usage
+### Installation
+```bash
+pip install transformers torch soundfile librosa
+```
+### Loading the Model
+```python
+from transformers import AutoModel, AutoTokenizer
+import torch
+# Load the model (custom code, hence trust_remote_code=True) and the LLaMA-3 tokenizer
+model = AutoModel.from_pretrained("potsawee/TextSyncMimi-v1", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
+# Move to GPU if available
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = model.to(device)
+model.eval()
+```
+### Generating Speech
+```python
+import librosa
+import soundfile as sf
+import torch
+from transformers import MimiModel
+# Load Mimi, whose decoder turns the generated embeddings into a waveform
+mimi_model = MimiModel.from_pretrained("kyutai/mimi")
+mimi_model.to(device)
+mimi_model.eval()
+# Prepare text input
+text = "Hello, this is a test of text to speech synthesis."
+tokens = tokenizer(text, return_tensors="pt", add_special_tokens=False)
+text_token_ids = tokens.input_ids.to(device)
+# Prepare reference audio (for style conditioning):
+# a file that provides the speaking style, loaded as 24 kHz mono
+reference_audio, sr = librosa.load("reference.wav", sr=24000, mono=True)
+# Shape (samples,) -> (batch=1, channels=1, samples)
+audio_inputs = torch.from_numpy(reference_audio).unsqueeze(0).unsqueeze(0).to(device)
+# Generate speech
+with torch.no_grad():
+    # Generate z-tokens autoregressively
+    z_tokens_list = model.generate_autoregressive(
+        text_token_ids=text_token_ids,
+        input_values=audio_inputs,
+        max_z_tokens=50,
+        end_token_threshold=0.5,
+        device=device
+    )
+    # Decode z-tokens to audio
+    if len(z_tokens_list[0]) > 0:
+        z_tokens_batch = torch.stack(z_tokens_list[0], dim=0).unsqueeze(0)
+        embeddings_bct = z_tokens_batch.transpose(1, 2)
+        embeddings_upsampled = mimi_model.upsample(embeddings_bct)
+        decoder_outputs = mimi_model.decoder_transformer(embeddings_upsampled.transpose(1, 2), return_dict=True)
+        embeddings_after_dec = decoder_outputs.last_hidden_state.transpose(1, 2)
+        audio_tensor = mimi_model.decoder(embeddings_after_dec)
+        # Save audio
+        audio_numpy = audio_tensor.squeeze().detach().cpu().numpy()
+        sf.write("output.wav", audio_numpy, 24000)
+```
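The `max_z_tokens` and `end_token_threshold` arguments suggest a per-text-token stopping rule: keep emitting z-tokens until the end-token probability crosses the threshold or the cap is reached. The pure-Python sketch below illustrates that rule under this assumption. `count_z_tokens` and its logit input are hypothetical and not part of the model's API, and the actual `generate_autoregressive` implementation may differ, for example in whether the frame that triggers the end token is kept.

```python
import math


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def count_z_tokens(end_logits, max_z_tokens=50, end_token_threshold=0.5):
    """Count the z-tokens emitted for one text token (hypothetical helper).

    end_logits: per-step end-token logits, one per generated z-token.
    ASSUMPTION: the step whose end probability exceeds the threshold
    is still emitted, and generation stops right after it.
    """
    emitted = 0
    for logit in end_logits:
        if emitted >= max_z_tokens:
            break  # hard cap reached
        emitted += 1
        if sigmoid(logit) > end_token_threshold:
            break  # end-token classifier fired
    return emitted
```

For instance, `count_z_tokens([-2.0, -1.0, 3.0, -5.0])` stops at the third step, where the end probability first exceeds 0.5. A higher threshold makes the classifier less eager to stop and tends to lengthen the output.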
+### Speech Editing
+TextSyncMimi enables fine-grained speech editing by swapping speech embeddings at the text-token level. See the Gradio demo script for examples of swapping embeddings between different transcripts.
+## Training
+The model was trained with the following setup:
+- Data: LibriTTS and LibriSpeech, combined
+- Epochs: 50, with early stopping
+- Batch size: 32
+- Learning rate: 1e-3 with warmup
+- Precision: mixed (FP16)
+- Loss: MSE reconstruction loss combined with a BCE end-token loss
+### Loss Function
+```
+total_loss = reconstruction_loss + alpha * clamp(bce_loss - bce_threshold, min=0.0)
+```
+Where:
+- `alpha = 1.0`
+- `bce_threshold = 0.1`
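The clamp means the end-token term only contributes once the BCE loss exceeds the threshold; below it, training is driven by reconstruction alone. The scalar sketch below reproduces the formula in plain Python to make that behavior concrete. The function name and the example values are illustrative, and the actual training code presumably operates on tensors.

```python
def total_loss(reconstruction_loss, bce_loss, alpha=1.0, bce_threshold=0.1):
    """Scalar version of the combined loss (illustrative, not the training code).

    max(..., 0.0) plays the role of clamp(..., min=0.0) in the formula.
    """
    return reconstruction_loss + alpha * max(bce_loss - bce_threshold, 0.0)


# BCE below the threshold: the end-token term is switched off.
print(total_loss(0.5, 0.05))

# BCE above the threshold: only the excess over the threshold is added.
print(total_loss(0.5, 0.3))
```

One plausible reading of this design is that it stops the end-token classifier from dominating the objective once it is already accurate enough.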
+## License
+This model is released under the CC BY 4.0 License.
+## Acknowledgements
+- Built on top of [Kyutai's Mimi](https://huggingface.co/kyutai/mimi) audio codec