VITS-2 + LARoPE + CFM -- LJSpeech

Single-speaker English text-to-speech model based on VITS-2 extended with LARoPE (Length-Aware Rotary Position Embedding) and OT-CFM (Optimal-Transport Conditional Flow Matching). Trained for 800,000 steps on LJSpeech-1.1.

Model Description

This model combines three lines of TTS research:

  1. VITS-2 (Kong et al., 2023) -- End-to-end single-stage TTS with adversarial learning, monotonic alignment search, stochastic duration predictor, and HiFi-GAN decoder.
  2. LARoPE (Kim et al., 2025) -- Rotary position embedding that normalizes positions by sequence length, replacing Shaw's relative position encoding in the text encoder. Improves alignment stability on long utterances.
  3. OT-CFM (Mehta et al., 2024 / Kim et al., 2025) -- Replaces normalizing flows with a conditional flow matching module that learns straight-line optimal transport paths from noise to target, enabling high-quality synthesis in just 4 Euler ODE steps.

The model takes normalized English text (character-level, no phoneme conversion) and directly outputs 22.05 kHz waveforms.

Architecture

Component Summary

| Component | Architecture | Dims / Config | Parameters |
|---|---|---|---|
| Text Encoder | 6-layer Transformer + LARoPE | hidden=192, FFN=768, heads=2, kernel=3 | ~3.8M |
| Posterior Encoder | 16-layer WaveNet | hidden=192, kernel=5, dilation=1, in=80 mel | ~4.5M |
| CFM (Flow) | U-Net vector field estimator | channels=[256,256], 1 transformer block/level, 2 mid blocks | ~7.2M |
| HiFi-GAN Decoder | 4-stage upsampling + MRF | upsample=[8,8,2,2], init_ch=512, ResBlock1 | ~13.6M |
| Duration Predictor | 3-layer Conv1D + noise input | filter=256, kernel=3, dropout=0.1 | ~0.6M |
| Duration Discriminator | 5-layer Conv1D classifier | filter=256, kernel=3, dropout=0.1 | ~0.7M |
| MAS | Numba JIT dynamic programming | Gaussian noise injection, decay=2e-6 | 0 |
| Total (generator) | -- | -- | ~30.4M |

External Discriminators (training only)

| Discriminator | Architecture | Parameters |
|---|---|---|
| Multi-Period Discriminator (MPD) | 5 sub-discriminators, periods=[2,3,5,7,11] | ~16.8M |
| Multi-Scale Discriminator (MSD) | 3 sub-discriminators, scales=[1,2,4] | ~6.0M |

LARoPE: Length-Aware Rotary Position Embedding

Standard RoPE applies rotation based on absolute position index, which causes attention patterns to degrade on sequences longer than those seen during training. LARoPE normalizes positions by total sequence length before computing rotation angles:

angles = gamma * (position / sequence_length) * theta_j

Where:

  • gamma = 10.0 (scaling factor)
  • theta_j = 1 / (10000 ^ (2j/d)) (standard RoPE frequency bands)
  • position / sequence_length maps all positions to [0, 1] regardless of utterance length

This makes the model invariant to absolute sequence length and improves text-speech alignment convergence. Applied to both query and key projections in all 6 text encoder attention layers.

Configuration: "pos_encoding": "larope" in config.json (set to "shaw" for original VITS-2 behavior).
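As a concrete illustration, here is a minimal NumPy sketch of the length-normalized angles and the pairwise rotation, using gamma = 10.0 and the standard RoPE frequency bands from above (function names are ours, not the repo's):

```python
import numpy as np

def larope_angles(seq_len: int, dim: int, gamma: float = 10.0) -> np.ndarray:
    """LARoPE rotation angles: gamma * (position / seq_len) * theta_j."""
    positions = np.arange(seq_len)[:, None] / seq_len          # normalized to [0, 1)
    theta = 1.0 / (10000.0 ** (2 * np.arange(dim // 2) / dim))  # RoPE frequency bands
    return gamma * positions * theta                            # shape (seq_len, dim//2)

def apply_rotary(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate consecutive (even, odd) feature pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because positions are normalized, position 25 of a 50-token utterance receives exactly the same angles as position 50 of a 100-token one, which is what makes attention patterns length-invariant.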

OT-CFM: Conditional Flow Matching

Instead of the normalizing flow (4 affine coupling layers with transformer blocks) used in base VITS-2, this model uses Optimal-Transport Conditional Flow Matching:

Training: Sample timestep t ~ U(0,1), interpolate between noise z_0 ~ N(0,I) and posterior sample z_1:

z_t = (1 - (1 - sigma_min) * t) * z_0 + t * z_1
u_t = z_1 - (1 - sigma_min) * z_0    (target vector field)

A U-Net estimates the vector field v_theta(z_t, t, cond) and is trained with MSE loss against u_t.
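The interpolant and its target field can be sketched in a few lines of NumPy (a toy illustration of the training objective, not the repo's implementation; in the real model the prediction comes from the U-Net conditioned on the text encoder output):

```python
import numpy as np

SIGMA_MIN = 1e-4

def cfm_path(z0: np.ndarray, z1: np.ndarray, t: float):
    """OT-CFM interpolant z_t and target vector field u_t at time t."""
    z_t = (1.0 - (1.0 - SIGMA_MIN) * t) * z0 + t * z1
    u_t = z1 - (1.0 - SIGMA_MIN) * z0   # constant in t: straight-line path
    return z_t, u_t

def cfm_loss(v_pred: np.ndarray, u_t: np.ndarray) -> float:
    """MSE between the estimated and target vector fields."""
    return float(np.mean((v_pred - u_t) ** 2))
```

Note that u_t is exactly d z_t / d t and does not depend on t: the conditional paths are straight lines, which is why so few ODE steps suffice at inference time.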

Inference: 4-step Euler ODE integration from z_0 ~ N(0,I) to z_1:

n_steps = 4
dt = 1.0 / n_steps
for i in range(n_steps):                 # z starts from z_0 ~ N(0, I)
    t = i * dt
    v = estimator(z, t, cond, mask)      # predicted vector field
    z = z + dt * v

The U-Net vector field estimator has:

  • Encoder: 2 levels, each with ResnetBlock1D + TransformerBlock1D (4-head attention, GELU FFN) + Downsample
  • Bottleneck: 2 mid blocks (ResnetBlock1D + TransformerBlock1D)
  • Decoder: 2 levels with skip connections + Upsample
  • Sinusoidal timestep embedding (scale=1000) injected via MLP into every ResnetBlock
  • sigma_min = 1e-4

Configuration: "flow_type": "cfm" in config.json (set to "nf" for original normalizing flows).

Training Procedure

Optimizers

Three independent AdamW optimizers train simultaneously from step 0:

| Optimizer | Scope | Parameters |
|---|---|---|
| optim_g | Text encoder, posterior encoder, CFM, HiFi-GAN decoder, duration predictor | ~30.4M |
| optim_d | Multi-Period Discriminator + Multi-Scale Discriminator | ~22.8M |
| optim_dp_d | Duration Discriminator | ~0.7M |

All optimizers share the same hyperparameters:

  • Learning rate: 2e-4
  • Betas: (0.8, 0.99)
  • Epsilon: 1e-9
  • Weight decay: 0.01
  • LR scheduler: ExponentialLR, gamma=0.999875 (per epoch)
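The per-epoch exponential decay is very gentle; a trivial sketch using the constants above makes the schedule concrete:

```python
BASE_LR = 2e-4
GAMMA = 0.999875  # ExponentialLR decay factor, applied once per epoch

def lr_at_epoch(epoch: int) -> float:
    """Learning rate after `epoch` scheduler steps."""
    return BASE_LR * GAMMA ** epoch
```

Over the full run (~705 epochs) the rate only decays by roughly 8%, so training effectively proceeds at a near-constant learning rate.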

Training Configuration

| Setting | Value |
|---|---|
| Total steps | 800,000 |
| Batch size | 32 |
| Precision | Mixed (FP16 via GradScaler) |
| Segment size | 8192 samples (32 mel frames) |
| Gradient accumulation | 1 |
| Seed | 1234 |
| Duration | ~8 days |

Loss Function

The total generator loss combines seven components:

L_gen = c_mel * L_mel + c_cfm * L_cfm + c_kl_bridge * L_kl_bridge + L_adv + lambda_fm * L_fm + L_dp_adv + L_dp_mse

| Loss | Weight | Description |
|---|---|---|
| L_mel | c_mel = 45 | L1 mel-spectrogram reconstruction |
| L_cfm | c_cfm = 1.0 | OT-CFM vector field MSE |
| L_kl_bridge | c_kl_bridge = 0.1 | Auxiliary KL: posterior -> expanded prior (helps MAS alignment) |
| L_adv | 1.0 | LSGAN generator adversarial loss (MPD + MSD) |
| L_fm | lambda_fm = 2.0 | Feature matching loss (MPD + MSD) |
| L_dp_adv | 1.0 | Duration predictor adversarial loss |
| L_dp_mse | 1.0 | Duration predictor MSE (predicted vs. MAS durations) |

Discriminator losses (separate backward passes):

  • L_disc: LSGAN discriminator loss on real/fake audio (MPD + MSD)
  • L_dp_disc: LSGAN discriminator loss on real/fake durations
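Put together, the generator objective is just a weighted sum of the terms in the table above; a minimal sketch (the dictionary keys are illustrative names of ours, not the repo's):

```python
WEIGHTS = {
    "mel": 45.0,       # c_mel
    "cfm": 1.0,        # c_cfm
    "kl_bridge": 0.1,  # c_kl_bridge
    "adv": 1.0,        # generator adversarial
    "fm": 2.0,         # lambda_fm, feature matching
    "dp_adv": 1.0,     # duration predictor adversarial
    "dp_mse": 1.0,     # duration predictor MSE
}

def generator_loss(losses: dict) -> float:
    """Weighted sum of the individual generator loss terms."""
    return sum(WEIGHTS[name] * value for name, value in losses.items())
```

The large c_mel = 45 means mel reconstruction dominates the objective early in training, before the adversarial terms become informative.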

Training Data

LJSpeech-1.1 -- a public domain speech dataset:

  • 13,100 short audio clips
  • Single female English speaker (Linda Johnson)
  • Reading passages from 7 non-fiction books
  • Sampling rate: 22,050 Hz
  • Total duration: ~24 hours

Data split:

| Split | Utterances | Purpose |
|---|---|---|
| Train | 12,500 | Model training |
| Validation | 100 | Whisper CER evaluation |
| Test | 500 | Held-out evaluation |

Audio processing:

  • 80-band mel spectrogram (VITS-2 change from 513-bin linear spec)
  • FFT size: 1024, hop length: 256, window: 1024
  • Mel range: 0 Hz -- Nyquist (no fmax cap)
  • Character-level text (no phoneme conversion, per VITS-2 paper)
  • No blank token interspersion (VITS-2 change from VITS)
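The STFT settings determine how waveform segments map to mel frames; a quick sanity check of the numbers above:

```python
SAMPLE_RATE = 22050
N_FFT, HOP_LENGTH, WIN_LENGTH = 1024, 256, 1024
N_MELS = 80

def samples_to_frames(n_samples: int) -> int:
    """Number of mel frames produced from a waveform segment (ignoring padding)."""
    return n_samples // HOP_LENGTH

# The 8192-sample training segment corresponds to 32 mel frames,
# and each frame advances by about 11.6 ms of audio.
```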

Evaluation: CER Progression

Character Error Rate measured by running Whisper large-v3 on synthesized validation utterances:

| Epoch | Step | CER (%) | Notes |
|---|---|---|---|
| 5 | ~6k | 96.6 | Early training, unintelligible |
| 20 | ~23k | 82.4 | Beginning to form words |
| 275 | ~316k | 79.6 | Still high, alignment improving |
| 705 | ~800k | 14.35 | Best CER (final checkpoint) |

CER validation runs every 5 epochs using the validation split (100 utterances). The best-CER checkpoint is saved automatically.

Note: CER is measured via Whisper transcription, not human evaluation. No MOS (Mean Opinion Score) evaluation has been performed.
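CER itself is just character-level edit distance normalized by reference length; a self-contained sketch of the metric (the pipeline uses Whisper transcripts as the hypothesis side; the helper below is ours, for illustration):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))                 # distances for the empty-prefix row
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution / match
        prev = cur
    return prev[n] / max(m, 1)
```

For example, cer("kitten", "sitting") is 3 edits over 6 reference characters, i.e. 0.5 (50%).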

Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 3090 Ti (24 GB VRAM) |
| CPU | AMD Ryzen 9 5900X (12-core) |
| RAM | 62 GB DDR4 |
| OS | Ubuntu 22.04 LTS |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.4.1 |
| Training time | ~8 days |

TensorBoard Logs

Full training logs are included in the logs/ directory. To visualize:

pip install tensorboard
tensorboard --logdir logs/

Metrics tracked:

  • loss/mel -- Mel spectrogram reconstruction loss
  • loss/cfm -- OT-CFM vector field MSE
  • loss/kl_bridge -- Auxiliary KL bridge loss
  • loss/adv_g -- Generator adversarial loss
  • loss/fm -- Feature matching loss
  • loss/dp_adv -- Duration predictor adversarial loss
  • loss/dp_mse -- Duration predictor MSE loss
  • loss/dp_disc -- Duration discriminator loss
  • loss/disc -- Audio discriminator loss
  • lr/gen -- Generator learning rate
  • cer/whisper -- Character error rate (periodic)

Audio Samples

Five synthesized samples (Sample 1 through Sample 5) from the final checkpoint (step 800k) are embedded as audio players on the model page.

How to Use

Python (with huggingface_hub)

import torch
import json
import numpy as np
import scipy.io.wavfile as wavfile
from huggingface_hub import hf_hub_download

# Download model files
repo_id = "jonathansilvasantos/vits2-larope-csbe-ljspeech"
config_path = hf_hub_download(repo_id, "config.json")
ckpt_path = hf_hub_download(repo_id, "checkpoints/vits2_final.pt")

# Load config
with open(config_path) as f:
    config = json.load(f)

# Build model (requires this repo's source code)
from src.models.vits2 import SynthesizerTrn
from src.text.symbols import NUM_SYMBOLS, SYMBOL_TO_ID
from src.text.text_processing import normalize_text, text_to_ids

model = SynthesizerTrn(
    n_vocab=NUM_SYMBOLS,
    spec_channels=config["data"]["n_mel_channels"],
    segment_size=config["train"]["segment_size"] // config["data"]["hop_length"],
    **config["model"],
).cuda().eval()

# Load weights
state = torch.load(ckpt_path, map_location="cuda")
model.load_state_dict(state["model"])

# Synthesize
text = "The quick brown fox jumps over the lazy dog."
normalized = normalize_text(text)
ids = text_to_ids(normalized, SYMBOL_TO_ID)
x = torch.LongTensor([ids]).cuda()
x_lengths = torch.LongTensor([len(ids)]).cuda()

with torch.no_grad():
    audio, attn = model.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8)

# Save to WAV
audio_np = audio.squeeze().cpu().numpy()
audio_np = audio_np / max(np.abs(audio_np).max(), 1e-8)  # peak-normalize (guard against all-zero output)
wavfile.write("output.wav", 22050, (audio_np * 32767).astype(np.int16))

CLI (using the repo's inference script)

git clone https://github.com/jonathandasilvasantos/vits-2.git
cd vits-2
pip install -r requirements.txt

# Download checkpoint
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('jonathansilvasantos/vits2-larope-csbe-ljspeech', 'checkpoints/vits2_final.pt', local_dir='.')"

# Run inference
python inference.py --checkpoint checkpoints/vits2_final.pt --text "Hello world."

Available Checkpoints

| File | Step | Size | Description |
|---|---|---|---|
| checkpoints/vits2_final.pt | 800,000 | 1.1 GB | Final checkpoint (includes model + discriminators + optimizers) |
| checkpoints/vits2_best_cer.pt | ~800k | 1.1 GB | Best CER checkpoint (14.35% CER, Whisper large-v3) |
| checkpoints/vits2-larope-csbe.pt | ~350k | 1.0 GB | Earlier checkpoint (model weights only, no discriminators/optimizers) |

All checkpoints contain the model key with the SynthesizerTrn state dict. The final and best-CER checkpoints also include mpd, msd, optim_g, optim_d, optim_dp_d, and scaler states for training resumption.

Limitations

  • Single speaker only -- trained on LJSpeech (one female English speaker), no multi-speaker support
  • English only -- character set covers lowercase English letters and basic punctuation
  • Character-level input -- no phoneme conversion; may struggle with unusual spellings or abbreviations
  • No MOS evaluation -- quality assessed via Whisper CER only, no human listening tests
  • 22.05 kHz output -- lower than modern 44.1/48 kHz TTS systems
  • GPU required -- inference requires CUDA (no CPU fallback implemented)
  • No streaming -- generates full utterance before outputting audio

References

  1. VITS-2: Kong, J., Park, J., Kim, B., Kim, J., Kong, D., & Kim, S. (2023). VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design. INTERSPEECH 2023. arXiv:2307.16430

  2. LARoPE: Kim, H., Lee, J., Yang, J., & Morton, J. (2025). Length-Aware Rotary Position Embedding for Text-Speech Alignment. arXiv:2509.11084

  3. SupertonicTTS (CSBE): Kim, H., Yang, J., Yu, Y., Ji, S., Morton, J., Bous, F., Byun, J., & Lee, J. (2025). SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System. arXiv:2503.23108

  4. Matcha-TTS (OT-CFM): Mehta, S., Tu, R., Beskow, J., Székely, É., & Henter, G. E. (2024). Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching. ICASSP 2024. arXiv:2309.03199

  5. VITS: Kim, J., Kong, J., & Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. ICML 2021. arXiv:2106.06103

  6. LJSpeech: Ito, K. & Johnson, L. (2017). The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/

  7. HiFi-GAN: Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS 2020. arXiv:2010.05646

  8. RoFormer (RoPE): Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing. arXiv:2104.09864
