---
title: Learnable Speech
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
---

# Learnable-Speech: High-Quality 24kHz Speech Synthesis

An unofficial implementation based on improvements of CosyVoice with learnable encoder and DAC-VAE.

## Demo

This Space provides a demo interface for the Learnable-Speech model. Currently, it shows a placeholder implementation. To use the actual trained model, you would need to:

1. Train the model using the provided training pipeline
2. Upload the trained checkpoints
3. Replace the placeholder inference code with actual model loading and inference

## Features

- [x] **24kHz Audio Support**: High-quality audio generation at 24kHz sampling rate  
- [x] **Flow matching AE**: Flow matching training for autoencoders  
- [x] **Immiscible assignment**: Support immiscible adding noise while training  
- [x] **Contrastive Flow matching**: Support Contrastive training  
- [ ] **Checkpoint release**: Release LLM and Contrastive FM checkpoint  
- [ ] **MeanFlow**: Meanflow for FM model

## Architecture

### Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.

### Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

## Links

- [GitHub Repository](https://github.com/primepake/learnable-speech)
- [Technical Paper](https://arxiv.org/pdf/2505.07916)
- [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)

## Usage

1. Enter text in the text box
2. Select a speaker ID (0-10)
3. Click "Generate Speech" to synthesize audio

**Note**: This is currently a placeholder demo. The actual model requires training first.