mnhatdaous committed
Commit edfcfb2 · 1 Parent(s): 4d32bab

Update README with Hugging Face Space metadata

Files changed (3)
  1. README.md +33 -221
  2. README_HF.md +0 -53
  3. flowae/load/vgg_lpips.pth +0 -0
README.md CHANGED
@@ -1,20 +1,33 @@
- # Learnable-Speech Technical Implementation

- An unofficial implementation that improves on CosyVoice with a learnable encoder and DAC-VAE, with core components adapted from [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice).

- ![Architecture](assets/image.png)

- ## Overview

- This repository provides an implementation of the Learnable-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.

- ## Key Features
 
 

- - [x] **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
- - [x] **Flow matching AE**: Flow matching training for autoencoders
- - [x] **Immiscible assignment**: Supports immiscible noise assignment during training (see the sketch after this list)
- - [x] **Contrastive Flow matching**: Supports contrastive flow matching training
- - [ ] **Checkpoint release**: Release of the LLM and contrastive FM checkpoints
 
 
  - [ ] **MeanFlow**: Meanflow for FM model
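As context for the immiscible-assignment and contrastive items above, here is a minimal illustrative Python sketch of both ideas. It is not this repo's implementation: the batch-level pairing, the rolled-batch negatives, and the `lam` weight are assumptions about the general techniques.

```python
import torch
from scipy.optimize import linear_sum_assignment

def immiscible_pairing(x: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Immiscible assignment: reorder `noise` so noise[i] is the
    minimum-cost batch match for x[i] before noise is added."""
    cost = torch.cdist(x.flatten(1), noise.flatten(1))  # [B, B] pairwise L2
    _, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return noise[torch.as_tensor(cols, device=noise.device)]

def contrastive_fm_loss(v_pred: torch.Tensor, v_target: torch.Tensor,
                        lam: float = 0.05) -> torch.Tensor:
    """Contrastive flow matching: regress onto the true velocity while
    pushing away from a mismatched one (negatives via a batch roll)."""
    pos = (v_pred - v_target).pow(2).mean()
    neg = (v_pred - v_target.roll(shifts=1, dims=0)).pow(2).mean()
    return pos - lam * neg

# toy usage
x, noise = torch.randn(8, 64), torch.randn(8, 64)
noise = immiscible_pairing(x, noise)
```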
  ## Architecture
@@ -27,217 +40,16 @@ Converts raw audio into discrete representations using the FSQ (S3Tokenizer) fra

  Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

- > **Note**: This implementation uses standard DAC-VAE instead of Flow-VAE.
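For intuition on the FSQ tokenizer stage, here is a minimal finite-scalar-quantization sketch. It is illustrative only; the level count and the straight-through rounding are generic assumptions about FSQ, not the S3Tokenizer internals.

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Finite scalar quantization: bound each dimension, round it to one of
    `levels` evenly spaced values (odd counts keep the grid symmetric), and
    keep gradients flowing with a straight-through estimator."""
    half = (levels - 1) / 2
    z = torch.tanh(z) * half                  # bound to (-half, half)
    z_q = z + (torch.round(z) - z).detach()   # nearest level, straight-through
    return z_q / half                         # normalized codes in [-1, 1]

codes = fsq_quantize(torch.randn(2, 25, 8))  # e.g. [batch, frames, dims]
```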
-
- ## Implementation Pipeline
-
- ### 1. Model Training
-
- #### BPE tokens to FSQ tokens
-
- - Based on the FSQ (S3Tokenizer) framework
- - Uses an autoregressive transformer with a learnable speaker extractor to predict the FSQ tokens
-
- #### FSQ tokens to DAC-VAE latent
-
- - Based on the CosyVoice2 flow matching decoder
- - Learns continuous latent representations from discrete tokens
-
- ### 2. Feature Extraction
-
- Before training the main model:
-
- 1. Extract discrete tokens using the trained FSQ [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer)
- 2. Generate continuous latent representations using the trained DAC-VAE; a pretrained checkpoint is provided: [DAC-VAE](https://github.com/primepake/learnable-speech/releases/tag/dac-vae)
-    - Note: this model is trained at a scale where one FSQ token spans a 3x frame-rate factor in the DAC-VAE latent; a 2x-factor version will be released soon
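A quick way to sanity-check that factor on one extracted pair (a sketch; the tensor layouts below are assumptions, so adjust the indexing to your extractor's actual output):

```python
import torch

tokens = torch.load("dataset_root/audio_name_fsq.pt")     # assumed shape [..., num_tokens]
latent = torch.load("dataset_root/audio_name_latent.pt")  # assumed shape [..., num_frames]

ratio = latent.shape[-1] / tokens.shape[-1]
print(f"tokens={tokens.shape[-1]}, frames={latent.shape[-1]}, ratio={ratio:.2f}")  # expect ~3.0
```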

- ### 3. Two-Stage Training

- Train the models sequentially:
 
 

- - **Stage 1**: BPE tokens → Discrete FSQ tokens
- - **Stage 2**: Discrete FSQ tokens → DAC-VAE continuous latent space
-
- ## Getting Started
-
- ### Prerequisites
-
- ```bash
- # Install the project dependencies
- pip install -r requirements.txt
- ```
-
- ### Training Pipeline
-
- 1. **Extracting FSQ tokens**
-
- ```bash
- pip install s3tokenizer
- s3tokenizer --wav_scp data.scp \
-     --device "cuda" \
-     --output_dir "./data" \
-     --batch_size 32 \
-     --model "speech_tokenizer_v2_25hz"
- ```
-
- Alternatively, you can install the tokenizer from this repo. It uses filelist.txt to drive extraction, where each line contains an audio file path (see files_test.txt for an example):
-
- ```bash
- cd speech/tools/S3Tokenizer
- pip3 install .
- # example command to run
- torchrun --nproc_per_node=4 --nnodes=1 --rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" `which s3tokenizer` \
-     --root_path /data/dataset/ \
-     --model speech_tokenizer_v2_25hz \
-     --device "cuda" \
-     --batch_size 64 \
-     --file_list /speech/files_test.txt \
-     --skip_existing
- ```
-
- 2. **Extracting DAC-VAE latents**
-
- ```bash
- cd dac-vae
- python extract_dac_latents.py --checkpoint checkpoint.pt --config config.yml --root_path dataset --output_dir dataset/dac
- ```
-
- After processing, your dataset root should contain the following files:
-
- ```
- dataset_root/
- ├── audio_name.wav
- ├── audio_name.txt
- ├── audio_name_fsq.pt
- ├── audio_name_latent.pt
- ├── another_audio.wav
- ├── another_audio.txt
- ├── another_audio_fsq.pt
- ├── another_audio_latent.pt
- └── ...
- ```
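Before launching training, it can help to confirm that every utterance has all four files. A small hypothetical helper (not part of the repo):

```python
from pathlib import Path

def collect_utterances(root: str):
    """Yield (wav, txt, fsq, latent) tuples for utterances with all four files."""
    for wav in sorted(Path(root).glob("*.wav")):
        stem = wav.with_suffix("")  # path without the .wav extension
        txt = Path(f"{stem}.txt")
        fsq = Path(f"{stem}_fsq.pt")
        latent = Path(f"{stem}_latent.pt")
        if txt.exists() and fsq.exists() and latent.exists():
            yield wav, txt, fsq, latent

print(len(list(collect_utterances("dataset_root"))), "complete utterances")
```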
-
- 3. **Stage 1: Autoregressive Transformer**
-
- ```bash
- #!/bin/bash
- pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
-
- export CUDA_VISIBLE_DEVICES="0"
- num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
- job_id=1986
- dist_backend="nccl"
- num_workers=2
- prefetch=100
- train_engine=torch_ddp
- model=llm
-
- torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
-     train.py \
-     --train_engine $train_engine \
-     --config config.yaml \
-     --train_data data/data.list \
-     --cv_data data/data.list \
-     --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
-     --model $model \
-     --model_dir /data/checkpoint/$model/ \
-     --num_workers ${num_workers} \
-     --prefetch ${prefetch} \
-     --pin_memory \
-     --use_amp \
-     --comet_disabled
- ```
-
- 4. **Stage 2: Flow matching decoder**
-
- ```bash
- #!/bin/bash
- pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
- export CUDA_VISIBLE_DEVICES="0"
- num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
- job_id=1986
- dist_backend="nccl"
- num_workers=2
- prefetch=100
- train_engine=torch_ddp
- model=flow  # Stage 2 trains the flow matching decoder, not the llm
-
- torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
-     train.py \
-     --train_engine $train_engine \
-     --config config.yaml \
-     --train_data data/data.list \
-     --cv_data data/data.list \
-     --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
-     --model $model \
-     --model_dir /data/checkpoint/$model/ \
-     --num_workers ${num_workers} \
-     --prefetch ${prefetch} \
-     --pin_memory \
-     --use_amp \
-     --comet_disabled
- ```
-
- ## Project Structure
-
- ```
- minimax-speech/
- ├── assets/
- ├── dac-vae/
- ├── flowae/
- ├── speech/
- │   ├── llm/
- │   └── flow/
- └── README.md
- ```
-
- ## Related Projects
-
- This implementation builds upon several key projects:
-
- - **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: Core model architectures and training pipelines
- - **[Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: Audio tokenization framework
- - **Learnable-Speech**: Original technical report and methodology
-
- ## Citation
-
- If you use this code in your research, please cite:
-
- ```bibtex
- @article{minimax-speech,
-   title={Learnable-Speech},
-   author={Learnable team},
-   year={2025},
-   url={https://arxiv.org/pdf/2505.07916}
- }
-
- @misc{cosyvoice2,
-   title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
-   author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
-   year={2024},
-   url={https://github.com/FunAudioLLM/CosyVoice}
- }
- ```
-
- ## License
-
- This project follows the licensing terms of its dependencies:
-
- - CosyVoice2 components: [Check CosyVoice2 License](https://github.com/FunAudioLLM/CosyVoice/blob/main/LICENSE)
- - FSQ components: [Apache 2.0 License](https://github.com/xingchensong/S3Tokenizer/blob/main/LICENSE)
-
- ## Acknowledgments
-
- - **[CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)**: This implementation extensively uses code and architectures from CosyVoice2
- - **[FSQ](https://github.com/xingchensong/S3Tokenizer)**: For the FSQ implementation
- - **Learnable team**: For the technical report and methodology
- - **FunAudioLLM team**: For the excellent CosyVoice2 codebase
-
- ## Contributing
-
- Contributions are welcome! Please feel free to submit a Pull Request.
-
- ## Disclaimer
-
- The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.
 
+ ---
+ title: Learnable Speech
+ emoji: 🎤
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker
+ pinned: false
+ license: apache-2.0
+ app_port: 7860
+ ---

+ # Learnable-Speech: High-Quality 24kHz Speech Synthesis

+ An unofficial implementation that improves on CosyVoice with a learnable encoder and DAC-VAE.

+ ## Demo

+ This Space provides a demo interface for the Learnable-Speech model. It currently shows a placeholder implementation. To use the actual trained model, you would need to:

+ 1. Train the model using the provided training pipeline
+ 2. Upload the trained checkpoints
+ 3. Replace the placeholder inference code with actual model loading and inference

+ ## Features
+
+ - [x] **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
+ - [x] **Flow matching AE**: Flow matching training for autoencoders
+ - [x] **Immiscible assignment**: Supports immiscible noise assignment during training
+ - [x] **Contrastive Flow matching**: Supports contrastive flow matching training
+ - [ ] **Checkpoint release**: Release of the LLM and contrastive FM checkpoints
  - [ ] **MeanFlow**: Meanflow for FM model

  ## Architecture

  Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

+ ## Links
+ - [GitHub Repository](https://github.com/primepake/learnable-speech)
+ - [Technical Paper](https://arxiv.org/pdf/2505.07916)
+ - [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)

+ ## Usage

+ 1. Enter text in the text box
+ 2. Select a speaker ID (0-10)
+ 3. Click "Generate Speech" to synthesize audio

+ **Note**: This is currently a placeholder demo. The actual model requires training first.
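If the placeholder app behind the Docker port above is a Gradio interface, it could be called programmatically along these lines (a sketch only: the Space id, endpoint name, and argument order are all assumptions):

```python
from gradio_client import Client

client = Client("mnhatdaous/learnable-speech")  # assumed Space id
audio_path = client.predict(
    "Hello from Learnable-Speech",  # text to synthesize
    3,                              # speaker ID (0-10)
    api_name="/predict",            # assumed endpoint name
)
print(audio_path)  # path to the generated audio file
```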
 
README_HF.md DELETED
@@ -1,53 +0,0 @@
- ---
- title: Learnable Speech
- emoji: 🎤
- colorFrom: blue
- colorTo: purple
- sdk: docker
- pinned: false
- license: apache-2.0
- app_port: 7860
- ---
-
- # Learnable-Speech: High-Quality 24kHz Speech Synthesis
-
- An unofficial implementation that improves on CosyVoice with a learnable encoder and DAC-VAE.
-
- ## Demo
-
- This Space provides a demo interface for the Learnable-Speech model. It currently shows a placeholder implementation. To use the actual trained model, you would need to:
-
- 1. Train the model using the provided training pipeline
- 2. Upload the trained checkpoints
- 3. Replace the placeholder inference code with actual model loading and inference
-
- ## Features
-
- - **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
- - **Flow matching AE**: Flow matching training for autoencoders
- - **Immiscible assignment**: Supports immiscible noise assignment during training
- - **Contrastive Flow matching**: Supports contrastive flow matching training
-
- ## Architecture
-
- ### Stage 1: Audio to Discrete Tokens
-
- Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
-
- ### Stage 2: Discrete Tokens to Continuous Latent Space
-
- Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
-
- ## Links
-
- - [GitHub Repository](https://github.com/primepake/learnable-speech)
- - [Technical Paper](https://arxiv.org/pdf/2505.07916)
- - [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)
-
- ## Usage
-
- 1. Enter text in the text box
- 2. Select a speaker ID (0-10)
- 3. Click "Generate Speech" to synthesize audio
-
- **Note**: This is currently a placeholder demo. The actual model requires training first.
 
flowae/load/vgg_lpips.pth CHANGED
Binary files a/flowae/load/vgg_lpips.pth and b/flowae/load/vgg_lpips.pth differ