---
title: Learnable Speech
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
---
# Learnable-Speech: High-Quality 24kHz Speech Synthesis

An unofficial implementation that improves on CosyVoice with a learnable encoder and DAC-VAE.
## Demo

This Space provides a demo interface for the Learnable-Speech model. It currently runs a placeholder implementation. To use the actual trained model, you need to:

1. Train the model using the provided training pipeline
2. Upload the trained checkpoints
3. Replace the placeholder inference code with actual model loading and inference
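
Step 3 amounts to swapping one function. The sketch below shows what a placeholder of this kind might look like; it is not the code running in this Space. `placeholder_synthesize` and `save_wav` are hypothetical names, and the sine tone simply stands in for model output so that the rest of the pipeline (input validation, 24kHz WAV writing) can be exercised before checkpoints exist.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # matches the 24kHz output described in this README


def placeholder_synthesize(text: str, speaker_id: int) -> list[float]:
    """Hypothetical stand-in for the model: returns a sine tone, not speech."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if not 0 <= speaker_id <= 10:
        raise ValueError("speaker_id must be in 0..10")
    # 50 ms of audio per character, capped at 5 seconds.
    n_samples = min(len(text) * SAMPLE_RATE // 20, 5 * SAMPLE_RATE)
    freq_hz = 220.0 + 20.0 * speaker_id  # pitch varies with the speaker ID
    return [
        0.3 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        for i in range(n_samples)
    ]


def save_wav(samples: list[float], path: str) -> None:
    """Write float samples in [-1, 1] as mono 16-bit PCM at SAMPLE_RATE."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(SAMPLE_RATE)
        f.writeframes(pcm)
```

Once trained checkpoints are uploaded, only the body of `placeholder_synthesize` would need to change, as long as the real inference call keeps the same inputs and returns samples at 24kHz.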
## Features

- [x] **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
- [x] **Flow matching AE**: Flow matching training for autoencoders
- [x] **Immiscible assignment**: Immiscible noise assignment during training
- [x] **Contrastive Flow matching**: Contrastive flow matching training
- [ ] **Checkpoint release**: Release the LLM and Contrastive FM checkpoints
- [ ] **MeanFlow**: MeanFlow for the FM model
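
To make two of the training features above concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not this repository's training code: the function names, the straight-line path from noise `x0` to data `x1`, and the weight `lam` are all hypothetical choices. Immiscible assignment is shown as re-pairing noise samples with data samples inside a batch so that the total distance is minimal, and the contrastive term pushes the predicted velocity away from the target of a different pair.

```python
import itertools


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def immiscible_assign(data, noise):
    """Reorder `noise` so the summed squared distance to `data` is minimal.

    Brute force over all permutations: exact, but only viable for tiny
    batches. A real implementation would use a linear assignment solver.
    """
    best = min(
        itertools.permutations(range(len(noise))),
        key=lambda perm: sum(sq_dist(data[i], noise[j]) for i, j in enumerate(perm)),
    )
    return [noise[j] for j in best]


def contrastive_fm_loss(pred, x0, x1, neg_x0, neg_x1, lam=0.05):
    """Flow matching loss with a contrastive term, for a single sample.

    Positive target: x1 - x0, the velocity of the straight path from noise
    x0 to data x1. Negative target: the same quantity for another pair
    drawn from the batch. `lam` (assumed value) weights the negative term.
    """
    pos = [b - a for a, b in zip(x0, x1)]
    neg = [b - a for a, b in zip(neg_x0, neg_x1)]
    return sq_dist(pred, pos) - lam * sq_dist(pred, neg)


# Usage: each data point gets paired with its nearest noise sample.
data = [[0, 0], [10, 10]]
noise = [[9, 9], [1, 1]]
paired_noise = immiscible_assign(data, noise)  # [[1, 1], [9, 9]]
```

The brute-force search keeps the example dependency-free; its cost grows factorially with batch size, so it is only meant to show what the assignment step optimizes.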
## Architecture

### Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.

### Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
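
The quantization idea behind Stage 1 can be sketched in a few lines. This is a simplified illustration of finite scalar quantization (FSQ), not the S3Tokenizer implementation: `fsq_quantize` is a hypothetical name, the encoder that produces the latent vector is omitted, and eight dimensions with three levels each is an assumed configuration chosen only for the example.

```python
import math


def fsq_quantize(z, levels):
    """Return (codes, token_id) for one latent vector.

    Each dimension is squashed with tanh, scaled to its number of levels,
    and rounded to an integer code in 0..levels[i]-1. The codes are then
    read as digits of a mixed-radix number to get a single token ID.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        codes.append(round(math.tanh(value) * half + half))
    token_id, base = 0, 1
    for code, n_levels in zip(codes, levels):
        token_id += code * base
        base *= n_levels
    return codes, token_id


# Usage with the assumed configuration: 8 dimensions, 3 levels each.
levels = [3] * 8
codes, token_id = fsq_quantize([0.0] * 8, levels)
```

With this configuration the implicit codebook has 3**8 = 6561 entries and needs no learned embedding table. During training, the rounding step would typically be paired with a straight-through gradient estimator, which this sketch leaves out.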
## Links

- [GitHub Repository](https://github.com/primepake/learnable-speech)
- [Technical Paper](https://arxiv.org/pdf/2505.07916)
- [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)
## Usage

1. Enter text in the text box
2. Select a speaker ID (0-10)
3. Click "Generate Speech" to synthesize audio

**Note**: This is currently a placeholder demo. The actual model requires training first.