potsawee committed d5b2ce4 (verified; parent 3c4f262): Upload README.md with huggingface_hub

---
license: cc-by-4.0
tags:
- audio
- text-sync
- mimi
- codec
---

# TextSyncMimi-v1

TextSyncMimi is a text-synchronous neural audio codec model for high-quality text-to-speech synthesis. It extends the Mimi audio codec with text-speech alignment through cross-attention transformers, enabling controllable and efficient speech generation.

## Model Description

TextSyncMimi-v1 is built on top of the Mimi audio codec and introduces:

- **Text-Speech Alignment**: Cross-attention transformers that align text representations with speech features
- **Autoregressive Generation**: Causal-attention transformers that generate audio autoregressively
- **Token-Level Control**: Direct text-token-to-speech-frame alignment for fine-grained control
- **End Token Prediction**: BCE-based end-token classification for dynamic speech duration

### Architecture

The model consists of:

1. **Text Embedding Layer**: Learnable embeddings (vocab_size=128,256, dim=4,096) matching the LLaMA-3 tokenizer
2. **Mimi Encoder**: Pre-trained audio encoder from Kyutai's Mimi model
3. **Text Projection**: Linear projection from 4,096 to 512 dimensions
4. **Cross-Attention Transformer**: 4 layers for text-speech alignment
5. **Autoregressive Transformer**: 4 layers for causal speech generation
6. **End Token Classifier**: Binary classifier for stopping generation

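The components above can be summarized in a configuration sketch. This is illustrative only: the field names below are assumptions, not the model's actual config class.

```python
from dataclasses import dataclass

@dataclass
class TextSyncMimiConfig:
    """Illustrative summary of the architecture (not the real config class)."""
    vocab_size: int = 128_256          # LLaMA-3 tokenizer vocabulary
    text_embed_dim: int = 4_096        # text embedding dimension
    hidden_size: int = 512             # dimension after the text projection
    cross_attention_layers: int = 4    # text-speech alignment
    autoregressive_layers: int = 4     # causal speech generation
    max_z_tokens_per_text_token: int = 50

cfg = TextSyncMimiConfig()
# The text projection maps text_embed_dim -> hidden_size (4,096 -> 512).
```
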
### Key Features

- **Sample Rate**: 24,000 Hz
- **Frame Rate**: 12.5 frames/second
- **Vocabulary Size**: 128,256 (LLaMA-3 tokenizer)
- **Hidden Size**: 512
- **Max Z Tokens per Text Token**: 50 (configurable)

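At 24,000 Hz and 12.5 frames/second, each frame covers 24,000 / 12.5 = 1,920 samples, which gives a quick way to sanity-check the length of generated audio:

```python
SAMPLE_RATE = 24_000  # Hz
FRAME_RATE = 12.5     # frames per second

# Samples of audio produced per decoded frame
samples_per_frame = int(SAMPLE_RATE / FRAME_RATE)  # 1920

def expected_samples(num_frames: int) -> int:
    """Audio samples produced by decoding `num_frames` z-token frames."""
    return num_frames * samples_per_frame

print(samples_per_frame)     # 1920
print(expected_samples(25))  # 48000, i.e. 2 seconds of audio
```
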
## Usage

### Installation

```bash
pip install transformers torch soundfile librosa
```

### Loading the Model

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModel.from_pretrained("your-username/TextSyncMimi-v1", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
```

### Generating Speech

```python
import torch
import librosa
import soundfile as sf
from transformers import MimiModel

# Load the Mimi decoder for waveform generation
mimi_model = MimiModel.from_pretrained("kyutai/mimi")
mimi_model.to(device)
mimi_model.eval()

# Prepare text input
text = "Hello, this is a test of text to speech synthesis."
tokens = tokenizer(text, return_tensors="pt", add_special_tokens=False)
text_token_ids = tokens.input_ids.to(device)

# Prepare a reference audio file that provides the speaking style
reference_audio, sr = librosa.load("reference.wav", sr=24000, mono=True)
audio_inputs = torch.from_numpy(reference_audio).unsqueeze(0).unsqueeze(0).to(device)

# Generate speech
with torch.no_grad():
    # Generate z-tokens autoregressively
    z_tokens_list = model.generate_autoregressive(
        text_token_ids=text_token_ids,
        input_values=audio_inputs,
        max_z_tokens=50,
        end_token_threshold=0.5,
        device=device,
    )

    # Decode z-tokens to audio
    if len(z_tokens_list[0]) > 0:
        z_tokens_batch = torch.stack(z_tokens_list[0], dim=0).unsqueeze(0)
        embeddings_bct = z_tokens_batch.transpose(1, 2)
        embeddings_upsampled = mimi_model.upsample(embeddings_bct)
        decoder_outputs = mimi_model.decoder_transformer(embeddings_upsampled.transpose(1, 2), return_dict=True)
        embeddings_after_dec = decoder_outputs.last_hidden_state.transpose(1, 2)
        audio_tensor = mimi_model.decoder(embeddings_after_dec)

        # Save audio
        audio_numpy = audio_tensor.squeeze().detach().cpu().numpy()
        sf.write("output.wav", audio_numpy, 24000)
```

### Speech Editing

TextSyncMimi enables fine-grained speech editing by swapping embeddings at the token level. See the Gradio demo script for examples of swapping speech embeddings between different transcripts.

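As a schematic of what token-level swapping looks like (the tensors and frame spans below are made up for illustration; the real demo derives the spans from the text-speech alignment):

```python
import torch

# Hypothetical z-token embedding sequences for two utterances: (frames, hidden)
z_a = torch.randn(40, 512)
z_b = torch.randn(40, 512)

# Frame spans covering the word to swap, as given by the
# text-token -> speech-frame alignment (made-up values here)
span_a = slice(10, 18)
span_b = slice(12, 20)

# Splice utterance B's frames for that word into utterance A;
# the edited sequence is then decoded with the Mimi decoder as usual
z_edited = torch.cat([z_a[: span_a.start], z_b[span_b], z_a[span_a.stop :]], dim=0)
```
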
## Training

The model was trained on the combined LibriTTS and LibriSpeech datasets with:

- 50 epochs with early stopping
- Batch size: 32
- Learning rate: 1e-3 with warmup
- Mixed-precision (FP16) training
- Loss: combined MSE reconstruction loss + BCE end-token loss

### Loss Function

```
total_loss = reconstruction_loss + alpha * clamp(bce_loss - bce_threshold, min=0.0)
```

Where:

- `alpha = 1.0`
- `bce_threshold = 0.1`

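In PyTorch terms the clamped objective is roughly the following (variable names are illustrative). The clamp means the BCE term only contributes once it exceeds the threshold, so the end-token classifier stops being optimized once it is good enough:

```python
import torch

def total_loss(reconstruction_loss: torch.Tensor,
               bce_loss: torch.Tensor,
               alpha: float = 1.0,
               bce_threshold: float = 0.1) -> torch.Tensor:
    # Only the portion of the BCE loss above the threshold is penalized
    return reconstruction_loss + alpha * torch.clamp(bce_loss - bce_threshold, min=0.0)

# Below the threshold the BCE term vanishes:
print(total_loss(torch.tensor(0.5), torch.tensor(0.05)))  # tensor(0.5000)
# Above it, only the excess is added:
print(total_loss(torch.tensor(0.5), torch.tensor(0.3)))   # tensor(0.7000)
```
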
## License

This model is released under the CC BY 4.0 license.

## Acknowledgements

- Built on top of [Kyutai's Mimi](https://huggingface.co/kyutai/mimi) audio codec