Spaces:

mnhatdaous
/

learnable-speech

mnhatdaous

Update README with Hugging Face Space metadata

edfcfb2 3 months ago

title: Learnable Speech
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860

Learnable-Speech: High-Quality 24kHz Speech Synthesis

An unofficial implementation that builds on CosyVoice, improving it with a learnable encoder and DAC-VAE.

Demo

This Space provides a demo interface for the Learnable-Speech model. Currently, it shows a placeholder implementation. To use the actual trained model, you would need to:

Train the model using the provided training pipeline
Upload the trained checkpoints
Replace the placeholder inference code with actual model loading and inference

Features

24kHz Audio Support: High-quality audio generation at 24kHz sampling rate
Flow matching AE: Flow matching training for autoencoders
Immiscible assignment: Supports immiscible noise assignment during training
Contrastive flow matching: Supports contrastive training
Checkpoint release: Release of the LLM and contrastive FM checkpoints
MeanFlow: MeanFlow for the FM model
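
The flow matching features listed above can be illustrated with a small, dependency-free sketch. This is not the repository's training code: the function names, the brute-force assignment, and the weight `lam` are assumptions made for illustration. It shows a linear noise-to-data path, an immiscible-style reassignment of noise samples to their nearest data samples within a batch, and a contrastive loss that pulls the predicted velocity toward its own target while pushing it away from the target of another pair.

```python
import itertools

def interpolate(x0, x1, t):
    """Point on the linear path x_t = (1 - t) * x0 + t * x1 (x0 = noise, x1 = data)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the linear path, d x_t / dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def immiscible_assign(data, noise):
    """Reorder noise so the total squared distance to data is minimal.

    Brute force over permutations, so only usable for tiny batches;
    a real pipeline would use a linear assignment solver instead.
    """
    best = min(
        itertools.permutations(range(len(noise))),
        key=lambda p: sum(sq_dist(data[i], noise[j]) for i, j in enumerate(p)),
    )
    return [noise[j] for j in best]

def contrastive_fm_loss(pred, pos_target, neg_target, lam=0.05):
    """Match the positive target, move away from a negative one (lam is a hypothetical weight)."""
    return sq_dist(pred, pos_target) - lam * sq_dist(pred, neg_target)

data = [[0.0, 0.0], [10.0, 10.0]]
noise = [[9.0, 9.0], [1.0, 1.0]]
print(immiscible_assign(data, noise))  # each data point gets its closest noise sample
```

In this sketch the negative target would be the velocity of a different (noise, data) pair drawn from the same batch.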

Architecture

Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
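
As a rough illustration of finite scalar quantization (FSQ), the sketch below bounds each latent dimension with `tanh`, rounds it to a small set of integer codes, and packs the codes into a single token index. The level counts and function names are hypothetical, only odd level counts are handled, and this is not the S3Tokenizer implementation.

```python
import math

def fsq_quantize(z, levels):
    """Round each dimension of z to one of levels[i] integer codes (odd levels assumed)."""
    codes = []
    for value, n in zip(z, levels):
        half = (n - 1) / 2                 # codes lie in {-half, ..., +half}
        bounded = math.tanh(value) * half  # squash into (-half, +half)
        codes.append(int(round(bounded)))
    return codes

def codes_to_token(codes, levels):
    """Pack per-dimension codes into one token index with mixed-radix encoding."""
    token = 0
    for code, n in zip(codes, levels):
        token = token * n + (code + (n - 1) // 2)  # shift code to 0..n-1
    return token

levels = [3, 3, 3]                         # hypothetical: 3 * 3 * 3 = 27 possible tokens
codes = fsq_quantize([0.0, 5.0, -5.0], levels)
print(codes, codes_to_token(codes, levels))
```

Because the codebook is implied by the rounding grid, no codebook embeddings need to be learned in this scheme.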

Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
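
For readers unfamiliar with VAEs, the continuous latent is typically sampled with the reparameterization trick and regularized with a KL term. The sketch below shows both for a diagonal Gaussian; it is a generic textbook formulation with hypothetical function names, not the DAC-VAE code used by this project.

```python
import math
import random

def reparameterize(mu, logvar, eps=None):
    """Sample z = mu + sigma * eps, with sigma = exp(0.5 * logvar)."""
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]

def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

# With logvar = 0 the standard deviation is 1, so z is simply mu + eps.
print(reparameterize([1.0, 2.0], [0.0, 0.0], eps=[0.5, -1.0]))
```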

Usage

Enter text in the text box
Select a speaker ID (0-10)
Click "Generate Speech" to synthesize audio

Note: This is currently a placeholder demo. The actual model requires training first.
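
To make the placeholder behaviour concrete, here is a minimal sketch of what a stand-in for the synthesis step could look like. It is an assumption, not the Space's actual code: the function name `placeholder_synthesize`, the tone frequencies, and the duration rule are invented for illustration. It validates the inputs described above (non-empty text, speaker ID 0-10) and returns a 24 kHz mono WAV containing a sine tone instead of speech.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000               # 24 kHz, as stated in the feature list
MAX_SPEAKER_ID = 10                # the UI offers speaker IDs 0-10
SAMPLES_PER_CHAR = SAMPLE_RATE // 20   # hypothetical: 50 ms of audio per character

def placeholder_synthesize(text, speaker_id):
    """Return WAV bytes with a tone whose pitch depends on the speaker ID."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if not 0 <= speaker_id <= MAX_SPEAKER_ID:
        raise ValueError(f"speaker_id must be in 0..{MAX_SPEAKER_ID}")
    n = len(text) * SAMPLES_PER_CHAR
    freq = 200.0 + 20.0 * speaker_id   # arbitrary pitch per speaker
    samples = [
        int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
        for i in range(n)
    ]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)              # mono
        w.setsampwidth(2)              # 16-bit PCM
        w.setframerate(SAMPLE_RATE)
        w.writeframes(struct.pack(f"<{n}h", *samples))
    return buf.getvalue()

wav_bytes = placeholder_synthesize("hello", speaker_id=3)
print(len(wav_bytes), "bytes of 24 kHz audio")
```

Once trained checkpoints are available, the body of such a function is where real model loading and inference would replace the tone generator.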