---
title: Learnable Speech
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
---
# Learnable-Speech: High-Quality 24kHz Speech Synthesis
An unofficial implementation that builds on CosyVoice, adding a learnable encoder and a DAC-VAE.
## Demo
This Space provides a demo interface for the Learnable-Speech model. It currently runs a placeholder implementation. To use an actual trained model, you need to:
1. Train the model using the provided training pipeline
2. Upload the trained checkpoints
3. Replace the placeholder inference code with actual model loading and inference
## Features
- [x] **24kHz Audio Support**: High-quality audio generation at 24kHz sampling rate
- [x] **Flow matching AE**: Flow matching training for autoencoders
- [x] **Immiscible assignment**: Supports immiscible noise assignment during training
- [x] **Contrastive Flow matching**: Supports contrastive training
- [ ] **Checkpoint release**: Release the LLM and Contrastive FM checkpoints
- [ ] **MeanFlow**: MeanFlow for the FM model
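To make the flow matching and immiscible assignment items more concrete, here is a minimal sketch in plain Python. It assumes "immiscible assignment" refers to the batch-wise idea of pairing each data sample with a nearby noise sample before training, and it assumes a linear interpolation path for flow matching. The function names `assign_noise` and `flow_matching_pair` are hypothetical and are not taken from this repository.

```python
import itertools

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_noise(data_batch, noise_batch):
    """Reorder noise_batch so the total squared distance to data_batch is minimal.

    Brute force over all permutations, so only usable for tiny batches.
    A real training loop would likely use a linear assignment solver instead.
    """
    n = len(data_batch)
    best = min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(
            squared_distance(data_batch[i], noise_batch[perm[i]]) for i in range(n)
        ),
    )
    return [noise_batch[j] for j in best]

def flow_matching_pair(noise, data, t):
    """Return the interpolated point x_t and the regression target velocity.

    Assumes the linear path x_t = (1 - t) * noise + t * data,
    whose velocity is the constant data - noise.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity

data = [[0.0, 0.0], [10.0, 10.0]]
noise = [[9.0, 9.0], [1.0, 1.0]]
paired_noise = assign_noise(data, noise)  # each data point gets its nearest noise
x_t, target = flow_matching_pair(paired_noise[0], data[0], t=0.5)
```

In this sketch the network would be trained to predict `target` from `x_t` and `t`. The assignment step only changes which noise sample is paired with which data sample, not the noise distribution itself.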
## Architecture
### Stage 1: Audio to Discrete Tokens
Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
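As a rough illustration of finite scalar quantization (FSQ), the sketch below bounds each coordinate of a vector with `tanh`, rounds it to a small set of integer levels, and packs the result into a single token index. The level counts and function names are assumptions chosen for the example, and the sketch handles odd level counts only. It is not the S3Tokenizer implementation.

```python
import math

def fsq_quantize(z, levels):
    """Round each coordinate of z to one of levels[i] integer values.

    For an odd level count L, the codes lie in {-(L-1)/2, ..., (L-1)/2}.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (n_levels - 1) // 2
        bounded = math.tanh(value) * half  # squashes the value into (-half, half)
        codes.append(int(round(bounded)))
    return codes

def fsq_index(codes, levels):
    """Pack per-dimension codes into one integer token (mixed-radix encoding)."""
    index, base = 0, 1
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index += (code + half) * base  # shift the code into 0..n_levels-1
        base *= n_levels
    return index

levels = [3, 3, 3]  # hypothetical: 3 levels per dimension, 27 possible tokens
codes = fsq_quantize([-3.0, 0.1, 2.5], levels)
token = fsq_index(codes, levels)
```

Because the codebook is implicit in the rounding grid, there are no codebook vectors to learn. The number of distinct tokens is the product of the level counts.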
### Stage 2: Discrete Tokens to Continuous Latent Space
Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
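For readers unfamiliar with VAEs, the continuous latent is typically sampled with the reparameterization trick. The sketch below shows that step only, assuming the encoder outputs a mean and a log-variance per latent dimension. The function names are hypothetical and the DAC-VAE used in this project may differ in its details.

```python
import math
import random

def reparameterize(mean, log_var, eps):
    """Compute z = mean + std * eps, where std = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mean, log_var, eps)]

def sample_latent(mean, log_var, rng=None):
    """Draw eps from a standard normal and return one latent sample."""
    rng = rng or random.Random()
    eps = [rng.gauss(0.0, 1.0) for _ in mean]
    return reparameterize(mean, log_var, eps)

# With a fixed eps the result is deterministic, which makes the formula easy to check.
z = reparameterize([1.0, -2.0], [0.0, 0.0], [0.5, -0.5])
```

Keeping the randomness in `eps` rather than in the encoder output is what lets gradients flow through `mean` and `log_var` during training.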
## Links
- [GitHub Repository](https://github.com/primepake/learnable-speech)
- [Technical Paper](https://arxiv.org/pdf/2505.07916)
- [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)
## Usage
1. Enter text in the text box
2. Select a speaker ID (0-10)
3. Click "Generate Speech" to synthesize audio
**Note**: This is currently a placeholder demo. The actual model requires training first.