---
title: Learnable Speech
emoji: 🎤
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
app_port: 7860
---

# Learnable-Speech: High-Quality 24kHz Speech Synthesis

An unofficial implementation that improves on CosyVoice with a learnable encoder and DAC-VAE.

## Demo

This Space provides a demo interface for the Learnable-Speech model. It currently runs a placeholder implementation. To use the actual trained model, you need to:

1. Train the model using the provided training pipeline
2. Upload the trained checkpoints
3. Replace the placeholder inference code with actual model loading and inference
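
To make step 3 concrete, here is a minimal sketch of what a placeholder inference function could look like. It is not the code shipped in this Space: the function name `placeholder_synthesize`, the tone-based output, and the duration and pitch rules are all assumptions made for illustration. Only the 24kHz sample rate and the 0-10 speaker ID range come from this README. Swapping in the trained model would mean replacing the body with checkpoint loading and real inference while keeping the same inputs and WAV output.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # the 24kHz output rate stated in this README


def placeholder_synthesize(text: str, speaker_id: int = 0) -> bytes:
    """Hypothetical stand-in for model inference: returns a mono WAV file as bytes.

    The tone length grows with the text and the pitch depends on the speaker ID,
    so the interface can be exercised before any checkpoint exists.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    if not 0 <= speaker_id <= 10:
        raise ValueError("speaker_id must be in the range 0-10")

    duration_s = min(0.05 * len(text), 5.0)  # assumed: 50 ms per character, capped at 5 s
    freq_hz = 220.0 + 20.0 * speaker_id      # assumed: one pitch per speaker
    n_samples = int(SAMPLE_RATE * duration_s)

    # 16-bit PCM sine wave at half of full scale.
    pcm = b"".join(
        struct.pack(
            "<h",
            int(0.5 * 32767 * math.sin(2.0 * math.pi * freq_hz * i / SAMPLE_RATE)),
        )
        for i in range(n_samples)
    )

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Calling `placeholder_synthesize("hello", speaker_id=3)` returns bytes that can be written straight to a `.wav` file or handed to an audio widget.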

## Features

- [x] 24kHz Audio Support: High-quality audio generation at a 24kHz sampling rate
- [x] Flow matching AE: Flow matching training for autoencoders
- [x] Immiscible assignment: Supports immiscible noise assignment during training
- [x] Contrastive Flow matching: Supports contrastive training
- [ ] Checkpoint release: Release the LLM and Contrastive FM checkpoints
- [ ] MeanFlow: MeanFlow for the FM model
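
As a rough illustration of the immiscible assignment and flow matching items above, the sketch below pairs each training sample with a nearby noise sample inside a batch and then builds a flow matching training pair. It assumes "immiscible assignment" means a batch-wise assignment that minimizes the total squared distance between data and noise, and that flow matching uses the linear interpolant with `x0` as noise and `x1` as data. The function names are hypothetical and this is not the repository's code. The brute-force search over permutations is only viable for tiny batches; a real training loop would more likely use a linear assignment solver such as `scipy.optimize.linear_sum_assignment`.

```python
import itertools
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def immiscible_assign(data, noise):
    """Reorder `noise` so the total squared distance to `data` is minimal.

    Brute force over all permutations, for illustration only.
    """
    n = len(data)
    best_perm = min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(
            squared_distance(data[i], noise[j]) for i, j in enumerate(perm)
        ),
    )
    return [noise[j] for j in best_perm]


def flow_matching_pair(x0, x1, t):
    """Return the interpolant x_t = (1 - t) * x0 + t * x1 and the velocity target x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


# Toy batch: 4 samples with 2 dimensions each.
rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)]
noise = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)]

assigned = immiscible_assign(data, noise)
# Each data sample is now paired with its assigned noise sample.
x_t, target = flow_matching_pair(assigned[0], data[0], t=0.3)
```

The assignment only reorders the noise within the batch, so the noise samples themselves are unchanged; what changes is which noise sample each data sample is trained against.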

## Architecture

### Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
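
To show what the quantization step of such a tokenizer does, here is a minimal sketch of finite scalar quantization (FSQ) on a single latent vector. It bounds each dimension with `tanh`, rounds it to one of a fixed number of levels, and combines the per-dimension digits into one token index. The function name `fsq_quantize` and the choice of 8 dimensions with 3 levels each are assumptions for illustration; the actual S3Tokenizer also includes the audio encoder that produces the latent, which is omitted here.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each component of `z` to one of `levels[i]` values.

    Returns (digits, index): the per-dimension level in 0..levels[i]-1 and
    the single token index obtained by reading the digits as a mixed-radix number.
    """
    digits = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2.0
        bounded = math.tanh(value) * half     # squash into (-half, half)
        digits.append(round(bounded + half))  # shift to (0, num_levels - 1) and round

    index, base = 0, 1
    for digit, num_levels in zip(digits, levels):
        index += digit * base
        base *= num_levels
    return digits, index


# Illustrative configuration: 8 dimensions with 3 levels each gives 3**8 = 6561 codes.
levels = [3] * 8
digits, token = fsq_quantize([0.0] * 8, levels)
```

Because every dimension is rounded independently, there is no learned codebook to look up: the set of possible tokens is fixed by `levels` alone.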

### Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
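
For readers unfamiliar with VAEs, the sketch below shows the two generic pieces that make a VAE latent space continuous: sampling a latent with the reparameterization trick and the KL penalty against a standard normal prior. It assumes a diagonal Gaussian posterior and uses hypothetical function names; it does not reproduce the DAC-VAE used in this project or how tokens are fed into it.

```python
import math
import random


def reparameterize(mean, log_var, rng):
    """Sample z = mean + sigma * eps elementwise, with eps drawn from N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


def kl_to_standard_normal(mean, log_var):
    """KL divergence from N(mean, sigma^2) to N(0, 1), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var)
    )


rng = random.Random(0)
mean, log_var = [0.5, -1.0], [0.0, -2.0]  # made-up posterior parameters
z = reparameterize(mean, log_var, rng)
penalty = kl_to_standard_normal(mean, log_var)
```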

## Links

- [GitHub Repository](https://github.com/primepake/learnable-speech)
- [Technical Paper](https://arxiv.org/pdf/2505.07916)
- [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice)

## Usage

1. Enter text in the text box
2. Select a speaker ID (0-10)
3. Click "Generate Speech" to synthesize audio

Note: This is currently a placeholder demo. The actual model must be trained before it can be used here.