---
license: apache-2.0
datasets:
  - amphion/Emilia-Dataset
language:
  - en
  - zh
tags:
  - speech-synthesis
  - pytorch
author:
  - name: Kangxiang Xia
    email: xkx@mail.nwpu.edu.cn
organization:
  - name: ASLP@NPU
    url: http://www.npu-aslp.org/
links:
  - name: Paper
    url: https://arxiv.org/abs/2412.16846
  - name: GitHub Repo
    url: https://github.com/xkx-hub/KALL-E
  - name: Demo Page
    url: https://nwpu-aslp.feishu.cn/wiki/TfLEwoITwiTReakgfnPczGfunzh
---

πŸŽ™οΈ KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction


## News

- [2025.09.17] πŸŽ‰ πŸŽ‰ πŸŽ‰ The KALL-E paper has been updated on arXiv. Read it now!
- [2025.08.05] πŸ”₯ πŸ”₯ πŸ”₯ We release the inference code of KALL-E!

## Overview

This repository contains the inference utilities for KALL-E, a text-to-speech system that predicts continuous speech representations using a single autoregressive language model.

### System Overview

- **Autoregressive Language Modeling:** Uses an autoregressive approach for next-distribution prediction in text-to-speech synthesis.
- **Continuous Speech Distribution:** Directly models and predicts continuous speech distributions conditioned on text, avoiding reliance on diffusion-based components.
- **FlowVAE:** Employs FlowVAE to extract continuous speech distributions from waveforms, rather than using discrete speech tokens.
- **Single AR Language Model:** Uses a single autoregressive language model to predict continuous speech distributions from text, constrained by a Kullback-Leibler (KL) divergence loss.
- **Simplified Paradigm:** Offers a more straightforward and effective approach to using continuous speech representations in TTS.
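The next-distribution idea above can be sketched with a toy example. This is not the official implementation: it only illustrates, with NumPy and a diagonal-Gaussian assumption, how a predicted frame distribution could be scored against a FlowVAE target distribution via a closed-form KL divergence. All names and shapes here are illustrative.

```python
# Minimal sketch (not the official KALL-E code) of next-distribution
# prediction: the AR model outputs the parameters (mean, log-variance)
# of a continuous Gaussian over the next speech frame, and training
# minimizes the KL divergence to the FlowVAE target distribution.
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over the feature dimension."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

# Toy frame: predicted distribution vs. a hypothetical FlowVAE target.
dim = 8
rng = np.random.default_rng(0)
pred_mu, pred_logvar = rng.normal(size=dim), np.zeros(dim)
tgt_mu, tgt_logvar = rng.normal(size=dim), np.zeros(dim)

loss = gaussian_kl(tgt_mu, tgt_logvar, pred_mu, pred_logvar)
print(float(loss))  # non-negative per-frame KL loss
```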

## Key Features

- **Random Speaker Voices:** When no speaker prompt is provided, the model generates a random voice, either female or male.

- **⚑ Blazing-Fast Synthesis:** Generate up to 5 seconds of audio with a single click in the web UI.

- **Context-Aware Synthesis:** KALL-E excels at generating expressive, context-aware speech, handling complex linguistic and emotional features with ease.

## Environment Setup

- Python >= 3.9
- PyTorch with CUDA support
- Transformers == 4.49.0
- NumPy
- SciPy
- alias-free-torch
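A typical setup for the dependencies above might look like the following. The environment name and the use of conda are assumptions; pick the PyTorch build that matches your CUDA version (see pytorch.org for the right index URL).

```shell
# Create an isolated environment (name is arbitrary) and install
# the dependencies listed above.
conda create -n kalle python=3.9 -y
conda activate kalle
pip install torch transformers==4.49.0 numpy scipy alias-free-torch
```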

Then clone the code from GitHub:

```shell
git clone https://github.com/xkx-hub/KALL-E.git
cd KALL-E
```

## Usage

### 1. Model Download

You need to download the model checkpoints in advance and place them as follows:

```
KALL-E
β”œβ”€β”€ ckpt
β”‚   β”œβ”€β”€ flowvae.pt
β”‚   └── model.pt
β”œβ”€β”€ ......
β”œβ”€β”€ model.py
└── infer.py
```
### 2. Unconditional Generation

```shell
python infer.py --target_text "<ka li E> is a text-to-speech system that predicts continuous speech representations using a single autoregressive language model."
```

### 3. Conditional Generation

```shell
python infer.py \
    --target_text "<ka li E> is a text-to-speech system that predicts continuous speech representations using a single autoregressive language model." \
    --prompt_text "oh that's crazy!" \
    --prompt_wav_path ./test.wav
```

### 4. Web Demo

```shell
python web.py
```