---
title: Learnable Speech
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
---

# Learnable-Speech: High-Quality 24kHz Speech Synthesis

An unofficial implementation that builds on CosyVoice, adding a learnable encoder and DAC-VAE.

## Demo

This Space provides a demo interface for the Learnable-Speech model. It currently runs a placeholder implementation, not a trained model. To use a trained model, you need to:

1. Train the model using the provided training pipeline
2. Upload the trained checkpoints
3. Replace the placeholder inference code with real model loading and inference

## Features

- [x] **24kHz Audio Support**: High-quality audio generation at a 24kHz sampling rate
- [x] **Flow matching AE**: Flow matching training for autoencoders
- [x] **Immiscible assignment**: Immiscible noise assignment during training
- [x] **Contrastive Flow matching**: Contrastive flow matching training
- [ ] **Checkpoint release**: Release the LLM and contrastive FM checkpoints
- [ ] **MeanFlow**: MeanFlow for the FM model
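
To make the flow matching and immiscible assignment items more concrete, here is a minimal sketch in plain Python. It assumes a linear interpolation path between noise and data, and it uses a brute-force search over permutations for the noise-to-data assignment, which is only practical for tiny batches. The function names `flow_matching_pair` and `immiscible_assignment` are hypothetical and do not come from this repository, which may implement both ideas differently.

```python
import itertools


def flow_matching_pair(x0, x1, t):
    """Linear path between noise x0 and data x1 (an assumed path choice).

    Returns the interpolated point x_t and the constant target velocity
    x1 - x0 that a model would be trained to regress at (x_t, t).
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def immiscible_assignment(data, noise):
    """Pick, for each data sample, a distinct noise sample from the batch
    so that the total squared distance is minimal.

    Brute force over all permutations: illustrative only, since the cost
    grows factorially with batch size.
    """
    best = min(
        itertools.permutations(range(len(noise))),
        key=lambda perm: sum(
            squared_distance(data[i], noise[j]) for i, j in enumerate(perm)
        ),
    )
    return list(best)


# Each data sample is paired with its nearest available noise sample,
# then the pair is turned into a flow matching training example.
data = [[0.0], [10.0]]
noise = [[9.0], [1.0]]
pairing = immiscible_assignment(data, noise)
x_t, target = flow_matching_pair(noise[pairing[0]], data[0], t=0.5)
```

A real training loop would replace the brute-force search with a linear assignment solver and would operate on batched tensors instead of Python lists.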

## Architecture

### Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
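
As a rough illustration of what finite scalar quantization (FSQ) does, the sketch below bounds each latent dimension with `tanh`, rounds it to one of a small number of integer levels, and combines the per-dimension digits into a single token index. The level counts and the function names `fsq_quantize` and `digits_to_token` are assumptions made for this example. They are not the configuration or API of S3Tokenizer, and the sketch only handles odd level counts.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each dimension of z to an integer digit in [0, levels[i] - 1].

    Assumes every entry of `levels` is odd, so the grid is symmetric around 0.
    """
    digits = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) // 2
        bounded = math.tanh(value) * half      # squashed into (-half, half)
        digits.append(int(round(bounded)) + half)  # shifted to 0..num_levels-1
    return digits


def digits_to_token(digits, levels):
    """Combine per-dimension digits into one index (mixed-radix encoding)."""
    token, base = 0, 1
    for digit, num_levels in zip(digits, levels):
        token += digit * base
        base *= num_levels
    return token


levels = [3, 5, 3]               # illustrative level counts, 3 * 5 * 3 = 45 codes
digits = fsq_quantize([0.0, 10.0, -10.0], levels)
token = digits_to_token(digits, levels)
```

During training, FSQ is usually paired with a straight-through estimator so gradients can pass through the rounding step. That part is omitted here.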

### Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

## Links

- [GitHub Repository](https://github.com/primepake/learnable-speech)
- [Technical Paper](https://arxiv.org/pdf/2505.07916)
- [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)

## Usage

1. Enter text in the text box
2. Select a speaker ID (0-10)
3. Click "Generate Speech" to synthesize audio

**Note**: This is currently a placeholder demo. The actual model requires training first.