---
title: Learnable Speech
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
---
# Learnable-Speech: High-Quality 24kHz Speech Synthesis
An unofficial implementation based on CosyVoice, improved with a learnable encoder and DAC-VAE.
## Demo
This Space provides a demo interface for the Learnable-Speech model. It currently runs a placeholder implementation. To use an actual trained model, you need to:
1. Train the model using the provided training pipeline
2. Upload the trained checkpoints
3. Replace the placeholder inference code with actual model loading and inference
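As a rough illustration of step 3, the app could choose between the placeholder and a real model based on whether checkpoints have been uploaded. This is a minimal sketch: the `checkpoints` directory, the file suffixes, and the function names are all assumptions, not part of the repository.

```python
from pathlib import Path

# Hypothetical location; the real path depends on where checkpoints are uploaded.
CHECKPOINT_DIR = Path("checkpoints")


def find_checkpoints(directory: Path = CHECKPOINT_DIR,
                     suffixes=(".pt", ".safetensors")) -> list[Path]:
    """Return checkpoint files found in `directory`, sorted by name."""
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.suffix in suffixes)


def select_backend(directory: Path = CHECKPOINT_DIR) -> str:
    """Return "model" when checkpoints are present, otherwise "placeholder"."""
    return "model" if find_checkpoints(directory) else "placeholder"
```

With a check like this, the demo keeps working before any checkpoint is uploaded and switches to real inference once one appears.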
## Features
- [x] **24kHz Audio Support**: High-quality audio generation at 24kHz sampling rate
- [x] **Flow matching AE**: Flow matching training for autoencoders
- [x] **Immiscible assignment**: Immiscible noise assignment during training
- [x] **Contrastive flow matching**: Contrastive training for the flow matching model
- [ ] **Checkpoint release**: Release the LLM and contrastive FM checkpoints
- [ ] **MeanFlow**: MeanFlow for the FM model
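
The training-related features above can be sketched with plain Python lists. This follows the generic formulations from the literature and is not taken from this repository's code: the linear noise-to-data path, the brute-force assignment, the weight `lam`, and all function names are assumptions made for illustration.

```python
import itertools


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def immiscible_assign(data, noise):
    """Reorder `noise` so the total squared distance to `data` is minimal.

    Brute force over permutations, so only viable for tiny batches.
    """
    best = min(
        itertools.permutations(range(len(noise))),
        key=lambda perm: sum(sq_dist(data[i], noise[j]) for i, j in enumerate(perm)),
    )
    return [noise[j] for j in best]


def interpolate(x0, x1, t):
    """Point on the linear path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the linear path, which the model is trained to predict."""
    return [b - a for a, b in zip(x0, x1)]


def contrastive_fm_loss(pred, pos_target, neg_target, lam=0.05):
    """Pull the prediction toward its own target and push it away from
    the target of a different sample. `lam` is an arbitrary example weight."""
    return sq_dist(pred, pos_target) - lam * sq_dist(pred, neg_target)
```

For realistic batch sizes, the permutation search would be replaced by a linear assignment solver such as SciPy's `linear_sum_assignment`.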
## Architecture
### Stage 1: Audio to Discrete Tokens
Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
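
To make the quantization step concrete, here is a minimal sketch of finite scalar quantization (FSQ) on a single latent vector. The level counts and function names are hypothetical and do not reflect the S3Tokenizer configuration. In training, a straight-through estimator would carry gradients past the rounding, which this sketch omits.

```python
import math

# Hypothetical setup: three latent dimensions with five levels each.
# Odd level counts keep the rounding symmetric around zero.
LEVELS = (5, 5, 5)


def fsq_quantize(z, levels=LEVELS):
    """Bound each dimension with tanh, then round to an integer code
    in [-(n - 1) / 2, (n - 1) / 2] for that dimension's level count n."""
    codes = []
    for value, n in zip(z, levels):
        half = (n - 1) / 2
        codes.append(round(math.tanh(value) * half))
    return codes


def codes_to_token(codes, levels=LEVELS):
    """Pack per-dimension codes into one token id using mixed-radix encoding."""
    token, base = 0, 1
    for code, n in zip(codes, levels):
        token += (code + (n - 1) // 2) * base
        base *= n
    return token
```

With these example levels the implicit codebook has 5 × 5 × 5 = 125 entries, so token ids range from 0 to 124.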
### Stage 2: Discrete Tokens to Continuous Latent Space
Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
## Links
- [GitHub Repository](https://github.com/primepake/learnable-speech)
- [Technical Paper](https://arxiv.org/pdf/2505.07916)
- [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)
## Usage
1. Enter text in the text box
2. Select a speaker ID (0-10)
3. Click "Generate Speech" to synthesize audio
**Note**: This is currently a placeholder demo. The actual model requires training first.
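
For illustration only, a placeholder backend matching the steps above might validate its inputs and return a short test tone at 24kHz. The function name, the tone frequencies, and the raw 16-bit PCM return format are assumptions. The Space's actual placeholder code may differ.

```python
import math
import struct

SAMPLE_RATE = 24_000          # 24kHz output, as described in Features
SPEAKER_IDS = range(0, 11)    # speaker IDs 0-10 inclusive


def placeholder_synthesize(text: str, speaker_id: int, seconds: float = 0.5) -> bytes:
    """Validate inputs and return a sine tone as 16-bit mono PCM (not real speech)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if speaker_id not in SPEAKER_IDS:
        raise ValueError("speaker_id must be between 0 and 10")
    # Hypothetical choice: vary the pitch per speaker so each ID sounds different.
    freq = 220.0 + 20.0 * speaker_id
    n = int(SAMPLE_RATE * seconds)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)
```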