---
license: apache-2.0
datasets:
  - amphion/Emilia-Dataset
language:
  - en
  - zh
tags:
  - speech-synthesis
  - pytorch
author:
  - name: Kangxiang Xia
    email: xkx@mail.nwpu.edu.cn
organization:
  - name: ASLP@NPU
    url: http://www.npu-aslp.org/
links:
  - name: Paper
    url: https://arxiv.org/abs/2412.16846
  - name: GitHub Repo
    url: https://github.com/xkx-hub/KALL-E
  - name: Demo Page
    url: https://nwpu-aslp.feishu.cn/wiki/TfLEwoITwiTReakgfnPczGfunzh
---

πŸŽ™οΈ KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction


## News

- [2025.09.17] πŸŽ‰ πŸŽ‰ πŸŽ‰ The KALL-E paper has been updated on arXiv. Read it now!
- [2025.08.05] πŸ”₯ πŸ”₯ πŸ”₯ We release the inference code of KALL-E!

## Overview

This repository contains the inference utilities for KALL-E, a text-to-speech system that predicts continuous speech representations using a single autoregressive language model.

### System Overview

- **Autoregressive Language Modeling:** Uses an autoregressive approach for next-distribution prediction in text-to-speech synthesis.
- **Continuous Speech Distribution:** Directly models and predicts continuous speech distributions conditioned on text, avoiding reliance on diffusion-based components.
- **FlowVAE:** Employs FlowVAE to extract continuous speech distributions from waveforms, rather than using discrete speech tokens.
- **Single AR Language Model:** Uses a single autoregressive language model to predict continuous speech distributions from text, constrained by a Kullback-Leibler (KL) divergence loss.
- **Simplified Paradigm:** Offers a more straightforward and effective approach to using continuous speech representations in TTS.
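The next-distribution idea above can be sketched with a toy example. This is not the official implementation: it only illustrates, with NumPy and a diagonal-Gaussian assumption, how a predicted frame distribution could be scored against a FlowVAE target distribution via a closed-form KL divergence. All names and shapes here are illustrative.

```python
# Minimal sketch (not the official KALL-E code) of next-distribution
# prediction: the AR model outputs the parameters (mean, log-variance)
# of a continuous Gaussian over the next speech frame, and training
# minimizes the KL divergence to the FlowVAE target distribution.
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over the feature dimension."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

# Toy frame: predicted distribution vs. a hypothetical FlowVAE target.
dim = 8
rng = np.random.default_rng(0)
pred_mu, pred_logvar = rng.normal(size=dim), np.zeros(dim)
tgt_mu, tgt_logvar = rng.normal(size=dim), np.zeros(dim)

loss = gaussian_kl(tgt_mu, tgt_logvar, pred_mu, pred_logvar)
print(float(loss))  # non-negative per-frame KL loss
```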

## Key Features

- **Random Speaker Voices:** When no speaker prompt is provided, the model generates a random voice, either female or male.

- **⚑ Blazing-Fast Synthesis:** Generate up to 5 seconds of audio with a single click in the web UI.

- **Context-Aware Synthesis:** KALL-E excels at generating expressive, context-aware speech, handling complex linguistic and emotional features with ease.

## Environment Setup

- Python >= 3.9
- PyTorch with CUDA support
- Transformers == 4.49.0
- NumPy
- SciPy
- alias-free-torch
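A typical setup for the dependencies above might look like the following. The environment name and the use of conda are assumptions; pick the PyTorch build that matches your CUDA version (see pytorch.org for the right index URL).

```shell
# Create an isolated environment (name is arbitrary) and install
# the dependencies listed above.
conda create -n kalle python=3.9 -y
conda activate kalle
pip install torch transformers==4.49.0 numpy scipy alias-free-torch
```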

Then clone the code from GitHub:

```shell
git clone https://github.com/xkx-hub/KALL-E.git
cd KALL-E
```

## Usage

### 1. Model Download

You need to download the model checkpoints in advance and place them as follows:

```
KALL-E
β”œβ”€β”€ ckpt
β”‚   β”œβ”€β”€ flowvae.pt
β”‚   └── model.pt
β”œβ”€β”€ ......
β”œβ”€β”€ model.py
└── infer.py
```
### 2. Unconditional Generation

```shell
python infer.py --target_text "<ka li E> is a text-to-speech system that predicts continuous speech representations using a single autoregressive language model."
```

### 3. Conditional Generation

```shell
python infer.py \
    --target_text "<ka li E> is a text-to-speech system that predicts continuous speech representations using a single autoregressive language model." \
    --prompt_text "oh that's crazy!" \
    --prompt_wav_path ./test.wav
```

### 4. Web Demo

```shell
python web.py
```