mnhatdaous committed
Commit edfcfb2 · 1 Parent(s): 4d32bab

Update README with Hugging Face Space metadata

Files changed (3)
  1. README.md +33 -221
  2. README_HF.md +0 -53
  3. flowae/load/vgg_lpips.pth +0 -0
README.md CHANGED
@@ -1,20 +1,33 @@
- # Learnable-Speech Technical Implementation

- An unofficial implementation that improves on CosyVoice with a learnable encoder and DAC-VAE, with core components adapted from [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice).

- ![Architecture](assets/image.png)

- ## Overview

- This repository provides an implementation of the Learnable-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.

- ## Key Features
 
 

- - [x] **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
- - [x] **Flow matching AE**: Flow matching training for autoencoders
- - [x] **Immiscible assignment**: Supports immiscible noise assignment during training (see the sketch after this list)
- - [x] **Contrastive Flow matching**: Supports contrastive flow matching training
- - [ ] **Checkpoint release**: Release of the LLM and contrastive FM checkpoints
 
 
  - [ ] **MeanFlow**: Meanflow for FM model
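As context for the immiscible-assignment and contrastive items above, here is a minimal illustrative Python sketch of both ideas. It is not this repo's implementation: the batch-level pairing, the rolled-batch negatives, and the `lam` weight are assumptions about the general techniques.

```python
import torch
from scipy.optimize import linear_sum_assignment

def immiscible_pairing(x: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Immiscible assignment: reorder `noise` so noise[i] is the
    minimum-cost batch match for x[i] before noise is added."""
    cost = torch.cdist(x.flatten(1), noise.flatten(1))  # [B, B] pairwise L2
    _, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return noise[torch.as_tensor(cols, device=noise.device)]

def contrastive_fm_loss(v_pred: torch.Tensor, v_target: torch.Tensor,
                        lam: float = 0.05) -> torch.Tensor:
    """Contrastive flow matching: regress onto the true velocity while
    pushing away from a mismatched one (negatives via a batch roll)."""
    pos = (v_pred - v_target).pow(2).mean()
    neg = (v_pred - v_target.roll(shifts=1, dims=0)).pow(2).mean()
    return pos - lam * neg

# toy usage
x, noise = torch.randn(8, 64), torch.randn(8, 64)
noise = immiscible_pairing(x, noise)
```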
  ## Architecture
@@ -27,217 +40,16 @@ Converts raw audio into discrete representations using the FSQ (S3Tokenizer) fra

  Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

- > **Note**: This implementation uses standard DAC-VAE instead of Flow-VAE.
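For intuition on the FSQ tokenizer stage, here is a minimal finite-scalar-quantization sketch. It is illustrative only; the level count and the straight-through rounding are generic assumptions about FSQ, not the S3Tokenizer internals.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Finite scalar quantization: bound each dimension, round it to one of
    `levels` evenly spaced values (odd counts keep the grid symmetric), and
    keep gradients flowing with a straight-through estimator."""
    half = (levels - 1) / 2
    z = torch.tanh(z) * half                  # bound to (-half, half)
    z_q = z + (torch.round(z) - z).detach()   # nearest level, straight-through
    return z_q / half                         # normalized codes in [-1, 1]

codes = fsq_quantize(torch.randn(2, 25, 8))  # e.g. [batch, frames, dims]
```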
-
- ## Implementation Pipeline
-
- ### 1. Model Training
-
- #### BPE tokens to FSQ tokens
-
- - Based on the FSQ (S3Tokenizer) framework
- - Uses an autoregressive transformer with a learnable speaker extractor to predict the FSQ tokens
-
- #### FSQ tokens to DAC-VAE latent
-
- - Based on the CosyVoice2 flow matching decoder
- - Learns continuous latent representations from discrete tokens
-
- ### 2. Feature Extraction
-
- Before training the main model:
-
- 1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
- 2. Generate continuous latent representations using the trained DAC-VAE; a pretrained checkpoint is provided: [DAC-VAE](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)
-    - Note: this model is trained at a scale where one FSQ token spans a 3x frame-rate factor in the DAC-VAE latent; a 2x-factor version will be released soon
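A quick way to sanity-check that factor on one extracted pair (a sketch; the tensor layouts below are assumptions, so adjust the indexing to your extractor's actual output):

```python
import torch

tokens = torch.load("dataset_root/audio_name_fsq.pt")     # assumed shape [..., num_tokens]
latent = torch.load("dataset_root/audio_name_latent.pt")  # assumed shape [..., num_frames]

ratio = latent.shape[-1] / tokens.shape[-1]
print(f"tokens={tokens.shape[-1]}, frames={latent.shape[-1]}, ratio={ratio:.2f}")  # expect ~3.0
```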

- ### 3. Two-Stage Training

- Train the models sequentially:
 
 

- - **Stage 1**: BPE tokens → Discrete FSQ tokens
- - **Stage 2**: Discrete FSQ tokens → DAC-VAE continuous latent space
-
- ## Getting Started
-
- ### Prerequisites
-
- ```bash
- # Install the project dependencies
- pip install -r requirements.txt
- ```
-
- ### Training Pipeline
-
- 1. **Extracting FSQ tokens**
-
- ```bash
- pip install s3tokenizer
- s3tokenizer --wav_scp data.scp \
-     --device "cuda" \
-     --output_dir "./data" \
-     --batch_size 32 \
-     --model "speech_tokenizer_v2_25hz"
- ```
-
- Alternatively, you can install the tokenizer from this repo. It uses filelist.txt to drive extraction, where each line contains an audio file path (see files_test.txt for an example):
-
- ```bash
- cd speech/tools/S3Tokenizer
- pip3 install .
- # example command to run
- torchrun --nproc_per_node=4 --nnodes=1 --rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" `which s3tokenizer` \
-     --root_path /data/dataset/ \
-     --model speech_tokenizer_v2_25hz \
-     --device "cuda" \
-     --batch_size 64 \
-     --file_list /speech/files_test.txt \
-     --skip_existing
- ```
-
- 2. **Extracting DAC-VAE latents**
-
- ```bash
- cd dac-vae
- python extract_dac_latents.py --checkpoint checkpoint.pt --config config.yml --root_path dataset --output_dir dataset/dac
- ```
-
- After processing, your dataset root should contain the following files:
-
- ```
- dataset_root/
- ├── audio_name.wav
- ├── audio_name.txt
- ├── audio_name_fsq.pt
- ├── audio_name_latent.pt
- ├── another_audio.wav
- ├── another_audio.txt
- ├── another_audio_fsq.pt
- ├── another_audio_latent.pt
- └── ...
- ```
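Before launching training, it can help to confirm that every utterance has all four files. A small hypothetical helper (not part of the repo):

```python
from pathlib import Path

def collect_utterances(root: str):
    """Yield (wav, txt, fsq, latent) tuples for utterances with all four files."""
    for wav in sorted(Path(root).glob("*.wav")):
        stem = wav.with_suffix("")  # path without the .wav extension
        txt = Path(f"{stem}.txt")
        fsq = Path(f"{stem}_fsq.pt")
        latent = Path(f"{stem}_latent.pt")
        if txt.exists() and fsq.exists() and latent.exists():
            yield wav, txt, fsq, latent

print(len(list(collect_utterances("dataset_root"))), "complete utterances")
```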
-
- 3. **Stage 1: Autoregressive Transformer**
-
- ```bash
- #!/bin/bash
- pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
-
- export CUDA_VISIBLE_DEVICES="0"
- num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
- job_id=1986
- dist_backend="nccl"
- num_workers=2
- prefetch=100
- train_engine=torch_ddp
- model=llm
-
- torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
-     train.py \
-     --train_engine $train_engine \
-     --config config.yaml \
-     --train_data data/data.list \
-     --cv_data data/data.list \
-     --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
-     --model $model \
-     --model_dir /data/checkpoint/$model/ \
-     --num_workers ${num_workers} \
-     --prefetch ${prefetch} \
-     --pin_memory \
-     --use_amp \
-     --comet_disabled
- ```
-
- 4. **Stage 2: Flow matching decoder**
-
- ```bash
- #!/bin/bash
- pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
- export CUDA_VISIBLE_DEVICES="0"
- num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
- job_id=1986
- dist_backend="nccl"
- num_workers=2
- prefetch=100
- train_engine=torch_ddp
- model=flow  # Stage 2 trains the flow matching decoder, not the llm
-
- torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
-     train.py \
-     --train_engine $train_engine \
-     --config config.yaml \
-     --train_data data/data.list \
-     --cv_data data/data.list \
-     --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
-     --model $model \
-     --model_dir /data/checkpoint/$model/ \
-     --num_workers ${num_workers} \
-     --prefetch ${prefetch} \
-     --pin_memory \
-     --use_amp \
-     --comet_disabled
- ```
-
- ## Project Structure
-
- ```
- minimax-speech/
- ├── assets/
- ├── dac-vae/
- ├── flowae/
- ├── speech/
- │   ├── llm/
- │   └── flow/
- └── README.md
- ```
-
- ## Related Projects
-
- This implementation builds upon several key projects:
-
- - **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: Core model architectures and training pipelines
- - **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: Audio tokenization framework
- - **Learnable-Speech**: Original technical report and methodology
-
- ## Citation
-
- If you use this code in your research, please cite:
-
- ```bibtex
- @article{minimax-speech,
-   title={Learnable-Speech},
-   author={Learnable team},
-   year={2025},
-   url={https://arxiv.org/pdf/2505.07916}
- }
-
- @misc{cosyvoice2,
-   title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
-   author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
-   year={2024},
-   url={https://github.com/FunAudioLLM/CosyVoice}
- }
- ```
-
- ## License
-
- This project follows the licensing terms of its dependencies:
-
- - CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- - FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
-
- ## Acknowledgments
-
- - **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: This implementation extensively uses code and architectures from CosyVoice2
- - **[FSQ](https://github.com/xingchensong/S3Tokenizer)**: For the FSQ implementation
- - **Learnable team**: For the technical report and methodology
- - **FunAudioLLM team**: For the excellent CosyVoice2 codebase
-
- ## Contributing
-
- Contributions are welcome! Please feel free to submit a Pull Request.
-
- ## Disclaimer
-
- The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.
 
+ ---
+ title: Learnable Speech
+ emoji: 🎤
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ pinned: false
+ license: apache-2.0
+ app_port: 7860
+ ---

+ # Learnable-Speech: High-Quality 24kHz Speech Synthesis

+ An unofficial implementation that improves on CosyVoice with a learnable encoder and DAC-VAE.

+ ## Demo

+ This Space provides a demo interface for the Learnable-Speech model. It currently shows a placeholder implementation. To use the actual trained model, you would need to:

+ 1. Train the model using the provided training pipeline
+ 2. Upload the trained checkpoints
+ 3. Replace the placeholder inference code with actual model loading and inference

+ ## Features
+
+ - [x] **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
+ - [x] **Flow matching AE**: Flow matching training for autoencoders
+ - [x] **Immiscible assignment**: Supports immiscible noise assignment during training
+ - [x] **Contrastive Flow matching**: Supports contrastive flow matching training
+ - [ ] **Checkpoint release**: Release of the LLM and contrastive FM checkpoints
  - [ ] **MeanFlow**: Meanflow for FM model

  ## Architecture

  Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

+ ## Links
+ - [GitHub Repository](https://github.com/primepake/learnable-speech)
+ - [Technical Paper](https://arxiv.org/pdf/2505.07916)
+ - [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)

+ ## Usage

+ 1. Enter text in the text box
+ 2. Select a speaker ID (0-10)
+ 3. Click "Generate Speech" to synthesize audio

+ **Note**: This is currently a placeholder demo. The actual model requires training first.
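If the placeholder app behind the Docker port above is a Gradio interface, it could be called programmatically along these lines (a sketch only: the Space id, endpoint name, and argument order are all assumptions):

```python
from gradio_client import Client

client = Client("mnhatdaous/learnable-speech")  # assumed Space id
audio_path = client.predict(
    "Hello from Learnable-Speech",  # text to synthesize
    3,                              # speaker ID (0-10)
    api_name="/predict",            # assumed endpoint name
)
print(audio_path)  # path to the generated audio file
```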
 
README_HF.md DELETED
@@ -1,53 +0,0 @@
- ---
- title: Learnable Speech
- emoji: 🎤
- colorFrom: blue
- colorTo: purple
- sdk: docker
- pinned: false
- license: apache-2.0
- app_port: 7860
- ---
-
- # Learnable-Speech: High-Quality 24kHz Speech Synthesis
-
- An unofficial implementation that improves on CosyVoice with a learnable encoder and DAC-VAE.
-
- ## Demo
-
- This Space provides a demo interface for the Learnable-Speech model. It currently shows a placeholder implementation. To use the actual trained model, you would need to:
-
- 1. Train the model using the provided training pipeline
- 2. Upload the trained checkpoints
- 3. Replace the placeholder inference code with actual model loading and inference
-
- ## Features
-
- - **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
- - **Flow matching AE**: Flow matching training for autoencoders
- - **Immiscible assignment**: Supports immiscible noise assignment during training
- - **Contrastive Flow matching**: Supports contrastive flow matching training
-
- ## Architecture
-
- ### Stage 1: Audio to Discrete Tokens
-
- Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
-
- ### Stage 2: Discrete Tokens to Continuous Latent Space
-
- Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
-
- ## Links
-
- - [GitHub Repository](https://github.com/primepake/learnable-speech)
- - [Technical Paper](https://arxiv.org/pdf/2505.07916)
- - [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)
-
- ## Usage
-
- 1. Enter text in the text box
- 2. Select a speaker ID (0-10)
- 3. Click "Generate Speech" to synthesize audio
-
- **Note**: This is currently a placeholder demo. The actual model requires training first.
 
flowae/load/vgg_lpips.pth CHANGED
Binary files a/flowae/load/vgg_lpips.pth and b/flowae/load/vgg_lpips.pth differ