potsawee committed
Commit db9c863 · verified · 1 Parent(s): d5b2ce4

Update README.md

Files changed (1)
  1. README.md +27 -118
README.md CHANGED
@@ -9,136 +9,45 @@ tags:
 
  # TextSyncMimi-v1
 
- TextSyncMimi is a text-synchronous neural audio codec model for high-quality text-to-speech synthesis. It extends the Mimi audio codec with text-speech alignment capabilities through cross-attention transformers, enabling controllable and efficient speech generation.
-
- ## Model Description
-
- TextSyncMimi-v1 is built on top of the Mimi audio codec and introduces:
-
- - **Text-Speech Alignment**: Cross-attention transformers that align text representations with speech features
- - **Autoregressive Generation**: Causal attention transformers that generate audio autoregressively
- - **Token-Level Control**: Direct text-token-to-speech-frame alignment for fine-grained control
- - **End Token Prediction**: BCE-based end-token classification for dynamic speech duration
-
- ### Architecture
-
- The model consists of:
-
- 1. **Text Embedding Layer**: Learnable embeddings (vocab_size=128,256, dim=4,096) matching the LLaMA-3 tokenizer
- 2. **Mimi Encoder**: Pre-trained audio encoder from Kyutai's Mimi model
- 3. **Text Projection**: Linear projection from 4,096 to 512 dimensions
- 4. **Cross-Attention Transformer**: 4 layers for text-speech alignment
- 5. **Autoregressive Transformer**: 4 layers for causal speech generation
- 6. **End Token Classifier**: Binary classifier for stopping generation
-
- ### Key Features
-
- - **Sample Rate**: 24,000 Hz
- - **Frame Rate**: 12.5 frames/second
- - **Vocabulary Size**: 128,256 (LLaMA-3 tokenizer)
- - **Hidden Size**: 512
- - **Max Z Tokens per Text Token**: 50 (configurable)
-
  ## Usage
 
- ### Installation
-
- ```bash
- pip install transformers torch soundfile librosa
- ```
-
  ### Loading the Model
 
  ```python
  from transformers import AutoModel, AutoTokenizer
  import torch
 
- # Load model and tokenizer
  model = AutoModel.from_pretrained("your-username/TextSyncMimi-v1", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
-
- # Move to GPU if available
- device = "cuda" if torch.cuda.is_available() else "cpu"
- model = model.to(device)
- model.eval()
- ```
-
-
- ### Generating Speech
-
- ```python
- import torch
- import librosa
- import soundfile as sf
- from transformers import MimiModel
-
- # Assumes `model`, `tokenizer`, and `device` from the previous snippet
-
- # Load the Mimi decoder for waveform reconstruction
- mimi_model = MimiModel.from_pretrained("kyutai/mimi")
- mimi_model.to(device)
- mimi_model.eval()
-
- # Prepare text input
- text = "Hello, this is a test of text to speech synthesis."
- tokens = tokenizer(text, return_tensors="pt", add_special_tokens=False)
- text_token_ids = tokens.input_ids.to(device)
-
- # Prepare reference audio that provides the speaking style
- reference_audio, sr = librosa.load("reference.wav", sr=24000, mono=True)
- audio_inputs = torch.from_numpy(reference_audio).unsqueeze(0).unsqueeze(0).to(device)
-
- # Generate speech
- with torch.no_grad():
-     # Generate z-tokens autoregressively
-     z_tokens_list = model.generate_autoregressive(
-         text_token_ids=text_token_ids,
-         input_values=audio_inputs,
-         max_z_tokens=50,
-         end_token_threshold=0.5,
-         device=device,
-     )
-
-     # Decode z-tokens to audio through the Mimi decoder stack
-     if len(z_tokens_list[0]) > 0:
-         z_tokens_batch = torch.stack(z_tokens_list[0], dim=0).unsqueeze(0)
-         embeddings_bct = z_tokens_batch.transpose(1, 2)
-         embeddings_upsampled = mimi_model.upsample(embeddings_bct)
-         decoder_outputs = mimi_model.decoder_transformer(embeddings_upsampled.transpose(1, 2), return_dict=True)
-         embeddings_after_dec = decoder_outputs.last_hidden_state.transpose(1, 2)
-         audio_tensor = mimi_model.decoder(embeddings_after_dec)
-
-         # Save audio
-         audio_numpy = audio_tensor.squeeze().detach().cpu().numpy()
-         sf.write("output.wav", audio_numpy, 24000)
- ```
-
- ### Speech Editing
-
- TextSyncMimi enables fine-grained speech editing by swapping embeddings at the token level. See the Gradio demo script for examples of swapping speech embeddings between different transcripts.
-
- ## Training
-
- The model was trained on:
-
- - Combined LibriTTS and LibriSpeech datasets
- - 50 epochs with early stopping
- - Batch size: 32
- - Learning rate: 1e-3 with warmup
- - Mixed-precision (FP16) training
- - Loss: combined MSE reconstruction loss + BCE end-token loss
-
-
- ### Loss Function
-
  ```
- total_loss = reconstruction_loss + alpha * clamp(bce_loss - threshold, min=0.0)
- ```
-
- Where:
- - `alpha = 1.0`
- - `bce_threshold = 0.1`
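Read as code, the clamped loss above behaves as follows. This is a minimal sketch of the formula only; in training, `reconstruction_loss` and `bce_loss` would come from MSE over the continuous latents and BCE over the end-token logits, respectively.

```python
def total_loss(reconstruction_loss: float, bce_loss: float,
               alpha: float = 1.0, bce_threshold: float = 0.1) -> float:
    # The BCE term contributes only once it exceeds the threshold;
    # below the threshold it is clamped to zero, so a sufficiently
    # well-calibrated end-token head stops being penalised.
    return reconstruction_loss + alpha * max(bce_loss - bce_threshold, 0.0)

print(total_loss(0.5, 0.05))  # BCE under threshold: prints 0.5
```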
 
139
- ## License
140
 
141
- This model is released under the CC BY 4.0 License.
142
 
143
  ## Acknowledgements
144
 
 
  # TextSyncMimi-v1
 
+ **TextSyncMimi** provides a *text‑synchronous* speech representation designed to plug into LLM‑based speech generation. Instead of operating at a fixed frame rate (time‑synchronous), it represents speech **per text token** and reconstructs high‑fidelity audio through a Mimi‑compatible neural audio decoder.
+
+ > TL;DR: We turn **time‑synchronous** Mimi latents into **text‑synchronous** token latents \([tᵢ, sᵢ]\), then expand them back to Mimi latents and decode to waveform. This makes token‑level control and alignment with LLM text outputs straightforward.
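The "expand back" step can be pictured as a per-token loop that emits frame latents until a stop classifier crosses a threshold (the generation snippet elsewhere in this README uses `end_token_threshold=0.5` and a 50-frame cap). A toy sketch with illustrative names, not the real API:

```python
# Each token latent is expanded into frame latents until the stop
# classifier fires, with a hard cap (cf. max_z_tokens=50).
def expand_token(step_fn, stop_prob_fn, max_frames=50, threshold=0.5):
    frames = []
    for _ in range(max_frames):
        frame = step_fn(frames)              # predict the next frame latent
        frames.append(frame)
        if stop_prob_fn(frame) > threshold:  # BCE stop-token head
            break
    return frames

# Dummy example: the stop head fires on the 3rd frame
frames = expand_token(lambda fs: len(fs), lambda f: 1.0 if f >= 2 else 0.0)
print(len(frames))  # 3
```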
+ ## Model overview
+
+ <div align="center">
+ <img src="https://i.postimg.cc/V6D84Sxs/Screenshot-2568-08-12-at-16-07-13.png" alt="TextSyncMimi" width="60%" style="margin-left:auto; margin-right:auto; display:block"/>
+ </div>
+
+ - **Backbone codec:** Mimi (12.5 Hz latent sequence).
+ - **TextSyncMimi components:**
+   - **Cross‑attention encoder** — aligns Mimi's time‑synchronous sequence (length *T*) to the text sequence (length *N*), producing one continuous speech latent per text token.
+   - **Causal decoder** — expands the token‑level latents back to a Mimi‑rate latent sequence suitable for a Mimi decoder.
+
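The cross-attention alignment can be pictured as text-token queries pooling over the speech frame sequence, yielding one latent per token. A shape-level sketch with illustrative dimensions (random weights, not the trained model):

```python
import torch
import torch.nn as nn

# Toy shapes: T time-synchronous Mimi latents -> N text-synchronous latents
T, N, D = 40, 7, 512
speech = torch.randn(1, T, D)   # Mimi latent sequence at 12.5 Hz
text = torch.randn(1, N, D)     # projected text-token embeddings (queries)

# Text tokens attend over the speech sequence: one pooled latent per token
attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
token_latents, _ = attn(query=text, key=speech, value=speech)
print(token_latents.shape)  # torch.Size([1, 7, 512])
```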
+ ## Training / Evaluation
+ - **Losses**: (1) **L2** distance between predicted and ground‑truth continuous Mimi latents, and (2) **BCE** for the stop token during expansion.
+ - **Training data**: LibriSpeech (960 hours) + LibriTTS (585 hours), around 1.5K hours in total.
+ - **Results**: ASR WER on audio reconstructed by different methods (NB: the non-zero WER of the ground-truth audio comes from ASR errors):
+
+ | Method | Train data | WER |
+ |------------------|------------------------------------------|------:|
+ | Ground‑truth | – | 2.12 |
+ | Mimi | – | 2.29 |
+ | TASTE | Emilia + LibriTTS | 4.40 |
+ | **TextSyncMimi v1** | **LibriTTS‑R + LibriSpeech** | **3.06** |
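WER in the table is word-level edit distance divided by reference length; a minimal reference implementation for sanity-checking reported numbers:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # DP row for zero reference words
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(wer("a b c", "a x c"))  # one substitution in three words: 0.333...
```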
 
 
 
 
 
  ## Usage
 
  ### Loading the Model
 
  ```python
  from transformers import AutoModel, AutoTokenizer
  import torch
 
+ # Load model
  model = AutoModel.from_pretrained("your-username/TextSyncMimi-v1", trust_remote_code=True)
  ```
 
+ See `demo_speech_editing.py` for example usage of the model (e.g., encoding and decoding).
 
  ## Acknowledgements