metadata
title: Learnable Speech
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860

Learnable-Speech: High-Quality 24kHz Speech Synthesis

An unofficial implementation that builds on CosyVoice, adding a learnable encoder and a DAC-VAE.

Demo

This Space provides a demo interface for the Learnable-Speech model. Currently, it shows a placeholder implementation. To use the actual trained model, you would need to:

  1. Train the model using the provided training pipeline
  2. Upload the trained checkpoints
  3. Replace the placeholder inference code with actual model loading and inference

Features

  • 24kHz Audio Support: High-quality audio generation at a 24kHz sampling rate
  • Flow matching AE: Flow matching training for autoencoders
  • Immiscible assignment: Immiscible noise assignment during training
  • Contrastive Flow matching: Contrastive training for the flow matching model
  • Checkpoint release: LLM and Contrastive FM checkpoints
  • MeanFlow: MeanFlow for the FM model
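
The training features listed above can be illustrated with a small sketch. It is not taken from this repository: the function names, the brute-force assignment, and the weight `lam` are assumptions chosen for readability. It shows a linear flow matching path and its target velocity, an immiscible-style reordering of noise samples so that each one is paired with a nearby data sample, and a contrastive term that pushes the prediction away from the target of a different pair.

```python
import itertools

def interpolate(x0, x1, t):
    """Point on the linear path x_t = (1 - t) * x0 + t * x1 (x0 noise, x1 data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the linear path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def immiscible_assign(noise, data):
    """Reorder noise so the total squared distance to data is minimal.

    Brute force over permutations, so only usable for tiny batches; a real
    pipeline would more likely use a linear assignment solver (assumption).
    """
    n = len(data)
    best = min(
        itertools.permutations(range(n)),
        key=lambda p: sum(sq_dist(noise[p[i]], data[i]) for i in range(n)),
    )
    return [noise[j] for j in best]

def contrastive_fm_loss(pred, target, negative_target, lam=0.05):
    """Flow matching error minus a weighted distance to another pair's target.

    lam is a hypothetical weight, not a value from this project.
    """
    return sq_dist(pred, target) - lam * sq_dist(pred, negative_target)
```

In a training step under these assumptions, a batch of noise would first be reordered with `immiscible_assign`, then each pair would be interpolated at a random `t`, and the model's predicted velocity would be scored with `contrastive_fm_loss` against its own target and the target of another pair in the batch.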

Architecture

Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
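
To make the idea of finite scalar quantization (FSQ) concrete, here is a minimal sketch. It does not reproduce S3Tokenizer: the function names and the level counts are hypothetical, and it assumes an odd number of levels per dimension so the grid is symmetric around zero. Each dimension is squashed into a bounded range and rounded to an integer level, and the per-dimension codes are packed into a single token id.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each dimension of z to one of levels[i] integer values.

    Assumes every entry of levels is odd, giving codes in
    -(L - 1) / 2 .. (L - 1) / 2 for a dimension with L levels.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) // 2
        bounded = half * math.tanh(value)  # squash into (-half, half)
        codes.append(round(bounded))       # snap to the nearest integer level
    return codes

def codes_to_token(codes, levels):
    """Pack per-dimension codes into one token id (mixed-radix number)."""
    token, base = 0, 1
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        token += (code + half) * base  # shift code to 0 .. n_levels - 1
        base *= n_levels
    return token
```

With levels `[3, 5, 5]` this sketch yields a codebook of 3 × 5 × 5 = 75 possible tokens. During training, FSQ implementations typically pass gradients through the rounding step with a straight-through estimator, which is omitted here.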

Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
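
As a rough illustration of the VAE side of this stage, the sketch below shows the standard reparameterized sampling step and the KL term against a standard normal prior. It assumes some encoder has already produced a mean `mu` and log-variance `log_var` per latent dimension; that encoder, and the names used here, are hypothetical and not taken from this repository.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), sigma = exp(log_var / 2)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )

# Example: draw one latent vector from a hypothetical 3-dimensional posterior.
rng = random.Random(0)
z = reparameterize([0.0, 1.0, -1.0], [0.0, -2.0, -2.0], rng)
```

Sampling through `mu + sigma * eps` keeps the latent continuous and differentiable with respect to the encoder outputs, which is what allows a continuous latent space to be learned.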

Links

Usage

  1. Enter text in the text box
  2. Select a speaker ID (0-10)
  3. Click "Generate Speech" to synthesize audio

Note: This is currently a placeholder demo. The actual model requires training first.
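
For orientation, a placeholder back end for the steps above might look like the following sketch. It is an assumption, not the Space's actual code: `placeholder_synthesize` and its parameters are hypothetical. It returns a sine tone as 24kHz mono WAV bytes whose length depends on the text and whose pitch depends on the speaker ID. A trained model would replace the tone generation.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # 24kHz, the sampling rate stated in this README

def placeholder_synthesize(text, speaker_id, ms_per_char=50):
    """Return WAV bytes containing a tone; no real speech synthesis happens."""
    if not 0 <= speaker_id <= 10:
        raise ValueError("speaker_id must be in 0..10")
    n_chars = max(len(text), 1)
    n_samples = SAMPLE_RATE * ms_per_char * n_chars // 1000
    freq = 220.0 + 20.0 * speaker_id  # arbitrary pitch per speaker
    frames = bytearray()
    for n in range(n_samples):
        sample = 0.3 * math.sin(2.0 * math.pi * freq * n / SAMPLE_RATE)
        frames += struct.pack("<h", int(sample * 32767))  # 16-bit little-endian PCM
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))
    return buf.getvalue()
```

Swapping in the real model would then amount to replacing the body of this function with checkpoint loading and inference while keeping the same inputs (text, speaker ID) and output (24kHz audio).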