# Kawiri TTS 🎙️
Kawiri TTS is a high-fidelity, dual-stage Text-to-Speech (TTS) framework designed to generate natural, expressive, and high-resolution speech. It diverges from older end-to-end architectures (like legacy VITS) in favor of a modern, continuous Flow-Matching approach paired with a highly compressed Variational Autoencoder (VAE) Vocoder.
This architecture natively targets high-resolution audio (44.1 kHz) and integrates cutting-edge sequence modeling techniques like Ring Attention (Ringformer) and iSTFTNet to optimize both long-form generation and synthesis speed.
## 🧠 System Architecture
The model splits the complex task of speech generation into two distinct stages: translating text (phonemes) into a compressed latent representation, and then decoding that latent representation into the final raw audio waveform.
### Stage 1: The VAE Vocoder (iSTFTNet-based Acoustic Bottleneck)
Instead of forcing the acoustic model to output high-dimensional linear spectrograms, or generating waveform samples step-by-step through heavy transposed convolutions, we use an advanced Variational Autoencoder (`vae2_bottleneck.json`).
- Latent Compression: The VAE compresses 100-band Mel-spectrograms into a low-dimensional, continuous latent space. This acts as an acoustic "bottleneck" that captures the core characteristics of the voice and phonetics.
- iSTFTNet Decoder: To reconstruct the audio from the latent space, the decoder leverages iSTFTNet (Inverse Short-Time Fourier Transform Network). Rather than predicting the raw 1D waveform directly, the network predicts the phase and magnitude (spectrograms) in the frequency domain, followed by a mathematical iSTFT operation. This significantly speeds up training and inference while dramatically reducing metallic artifacts at 44.1 kHz.
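The final decoding step described above can be sketched as follows. This is a minimal illustration, assuming PyTorch; `mag` and `phase` stand in for the outputs the real iSTFTNet decoder would predict through its convolutional stack, and the STFT parameters match the configuration listed later in this README:

```python
import torch

def istft_decode(mag, phase, n_fft=2048, hop=441, win=1764):
    """Reconstruct a waveform from predicted magnitude and phase
    spectrograms -- the closed-form last step of an iSTFTNet-style
    decoder (no transposed convolutions needed)."""
    # Combine magnitude and phase into a complex spectrogram.
    spec = mag * torch.exp(1j * phase)
    window = torch.hann_window(win)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       win_length=win, window=window)

# Hypothetical network outputs: (batch, n_fft // 2 + 1, frames)
mag = torch.rand(1, 1025, 100)
phase = torch.rand(1, 1025, 100) * 2 * torch.pi
wav = istft_decode(mag, phase)   # (batch, samples)
```

Because the iSTFT is a fixed mathematical operation, the network only has to learn frequency-domain predictions, which is what makes this decoder fast at 44.1 kHz.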
### Stage 2: ARCHI-TTS (Continuous Flow-Matching with Ringformer)
The core acoustic modeling is handled by a Flow-Matching framework (`archi_tts.json`), which predicts the learned latents from text.
- Continuous Normalizing Flows: ARCHI-TTS utilizes Continuous Normalizing Flows (Flow-Matching) to map a simple Gaussian noise distribution to our VAE's target latent distribution.
- Ring Attention & Ringformer: To handle very long audio segments without running out of GPU memory, the model's core transformer blocks employ Ring Attention (Ringformer). By distributing the Key-Value (KV) computations in a ring topology across chunks or devices, the model can contextually understand and generate much longer, cohesive narratives (e.g., audiobooks) without breaking the context window.
- Condition Encoder: A dedicated neural Condition Encoder fuses multiple inputs—phoneme sequences, language IDs, and speaker embeddings—into a dense, unified conditioning signal. This signal heavily guides the Flow-Matching vector field, dictating the exact prosody, accent, and timbre of the generated speech.
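The chunked-KV idea behind Ring Attention can be sketched on a single device. This is an illustrative simulation, not the project's implementation: K/V blocks are consumed one at a time with a streaming (online) softmax, so the full attention matrix is never materialized; in true Ring Attention each block would live on a different device and rotate around the ring:

```python
import torch

def chunked_attention(q, k, v, chunk=64):
    """Numerically exact attention computed over K/V chunks with a
    streaming softmax (running max `m` and normalizer `l`)."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    m = torch.full(q.shape[:-1], float("-inf"))  # running max
    l = torch.zeros(q.shape[:-1])                # running normalizer
    for start in range(0, k.shape[-2], chunk):
        kc = k[..., start:start + chunk, :]
        vc = v[..., start:start + chunk, :]
        s = (q @ kc.transpose(-1, -2)) * scale   # (..., Tq, chunk)
        m_new = torch.maximum(m, s.max(dim=-1).values)
        alpha = torch.exp(m - m_new)             # rescale old state
        p = torch.exp(s - m_new[..., None])
        out = out * alpha[..., None] + p @ vc
        l = l * alpha + p.sum(dim=-1)
        m = m_new
    return out / l[..., None]
```

The result matches ordinary softmax attention exactly; only peak memory changes, which is what lets the context window stretch across long audiobook-length sequences.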
## 📉 Loss Functions & Optimization
Kawiri TTS combines several loss functions to ensure acoustic fidelity and accurate text-to-speech alignment:
- Auxiliary Latent-Phoneme CTC Loss: Because phonemes and audio frames are naturally unaligned, the VAE relies on an auxiliary Connectionist Temporal Classification (CTC) loss during training (`vae2_ctc_weight=0.1`). With a `vae2_ctc_upsample_factor=2` applied to the encoder's mean latents, the model intrinsically learns to bind phonetic boundaries to specific frames in the compressed latent space.
- Flow-Matching Vector Field Loss: In Stage 2, the primary objective is to minimize the Mean Squared Error (MSE) between the predicted vector field and the optimal transport path from noise to the target latent.
- Reconstruction & GAN Losses: The VAE uses L1 Mel-spectrogram reconstruction loss alongside multi-period and multi-scale Discriminator losses (Feature Matching and Adversarial Hinge losses) to ensure the iSTFTNet output sounds indistinguishable from real human speech.
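The Flow-Matching vector-field objective above can be written down compactly. This is a sketch of the standard conditional flow-matching loss (rectified-flow form), not the project's exact code; `model(xt, t, cond)` is a hypothetical signature for the Stage 2 network:

```python
import torch

def flow_matching_loss(model, x1, cond):
    """With x0 ~ N(0, I) and x_t = (1 - t) * x0 + t * x1, the optimal
    transport path has constant velocity (x1 - x0); the model is
    trained by MSE against that target vector field."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))
    xt = (1 - t) * x0 + t * x1                       # point on the path
    target = x1 - x0                                 # path velocity
    pred = model(xt, t, cond)
    return torch.mean((pred - target) ** 2)
```

At inference, integrating the learned vector field from noise (e.g., with a few Euler steps) produces the VAE latent that Stage 1 then decodes to audio.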
## 🎛 Audio & Mel-Spectrogram Specifications
The audio pipeline expects the following strict parameters matching the vocoder/VAE2 configuration:
- Sample Rate: `44100` Hz (all raw audio inputs must be resampled to 44.1 kHz prior to training)
- Mel Bins: `100`
- FFT Size (`n_fft`): `2048`
- Window Size (`win_length`): `1764`
- Hop Size (`hop_length`): `441`
- Frequencies: `fmin: 0`, `fmax: null`
## 📂 Data Preparation & Training Workflow
To efficiently train the models, data is rigorously pre-processed into a strict pipe-separated (`|`) format: `audio_path|phonemes|lang_id|speaker_id|raw_text`.
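A minimal parser for one manifest line, assuming integer language and speaker IDs (adjust the casts if your IDs are strings); `parse_manifest_line` is an illustrative helper, not part of the repository:

```python
def parse_manifest_line(line):
    """Split one pipe-separated manifest line into its five fields.
    maxsplit=4 keeps any literal '|' characters inside raw_text."""
    audio_path, phonemes, lang_id, speaker_id, raw_text = \
        line.rstrip("\n").split("|", 4)
    return {
        "audio_path": audio_path,
        "phonemes": phonemes,
        "lang_id": int(lang_id),       # assumption: integer IDs
        "speaker_id": int(speaker_id),
        "raw_text": raw_text,
    }

entry = parse_manifest_line("wavs/0001.wav|k a w i r i|0|3|Kawiri speaks.")
```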
1. Dump Mels: Raw audio files are converted into Mel-spectrogram tensors and stored in a designated dump folder structure (e.g., `dump_100/train/mels/`).
2. Train the VAE/iSTFTNet: `python train.py -c configs/vae2_bottleneck.json -m vae_normal`
3. Dump the Latents: Freeze the VAE weights and extract the latent representations for the entire dataset (`dump_latents_vae2/`).
4. Train the ARCHI-TTS Flow-Matching Model: `python train.py -c configs/archi_tts.json -m archi_stage`
## ⚠️ Known Limitations & Future Work
- Word Error Rate (WER) Issues: The model currently shows occasional stability issues affecting Word Error Rate (WER): in complex sentences it may mispronounce, hallucinate, or skip specific words due to the continuous nature of the flow-matching alignment. This is a known limitation of the current alignment formulation and is expected to improve with future architectural updates.
- EMA Alignment: Training utilizes an Exponential Moving Average (EMA) shadow model via Accelerate. Care must be taken to ensure the EMA wrapper is initialized on the correct CUDA device to prevent device mismatch errors during cross-GPU synchronizations.
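The device-pinning concern above can be illustrated with a minimal, framework-agnostic EMA sketch (the actual training loop wraps this via Accelerate; the `EMA` class here is illustrative). The shadow copy is moved to the source model's device at construction time, which avoids cross-device mismatch during synchronization:

```python
import copy
import torch

class EMA:
    """Minimal EMA shadow-model sketch."""
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model)
        # Pin the shadow to the training model's device up front.
        device = next(model.parameters()).device
        self.shadow.to(device)
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow = decay * shadow + (1 - decay) * param
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p.detach(), 1.0 - self.decay)

model = torch.nn.Linear(4, 4)
ema = EMA(model, decay=0.9)
```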
## 🛠 Installation

```bash
conda create -n ringformer python=3.11
pip install triton-nightly==3.0.0.post20240716052845 --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu130
python train.py -c configs/vae2_bottleneck.json -m vae_normal
```