VITS-2 + LARoPE + CFM -- LJSpeech
Single-speaker English text-to-speech model based on VITS-2 extended with LARoPE (Length-Aware Rotary Position Embedding) and OT-CFM (Optimal-Transport Conditional Flow Matching). Trained for 800,000 steps on LJSpeech-1.1.
Model Description
This model combines three lines of TTS research:
- VITS-2 (Kong et al., 2023) -- End-to-end single-stage TTS with adversarial learning, monotonic alignment search, stochastic duration predictor, and HiFi-GAN decoder.
- LARoPE (Kim et al., 2025) -- Rotary position embedding that normalizes positions by sequence length, replacing Shaw's relative position encoding in the text encoder. Improves alignment stability on long utterances.
- OT-CFM (Mehta et al., 2024 / Kim et al., 2025) -- Replaces normalizing flows with a conditional flow matching module that learns straight-line optimal transport paths from noise to target, enabling high-quality synthesis in just 4 Euler ODE steps.
The model takes normalized English text (character-level, no phoneme conversion) and directly outputs 22.05 kHz waveforms.
Architecture
Component Summary
| Component | Architecture | Dims / Config | Parameters |
|---|---|---|---|
| Text Encoder | 6-layer Transformer + LARoPE | hidden=192, FFN=768, heads=2, kernel=3 | ~3.8M |
| Posterior Encoder | 16-layer WaveNet | hidden=192, kernel=5, dilation=1, in=80 mel | ~4.5M |
| CFM (Flow) | U-Net vector field estimator | channels=[256,256], 1 transformer block/level, 2 mid blocks | ~7.2M |
| HiFi-GAN Decoder | 4-stage upsampling + MRF | upsample=[8,8,2,2], init_ch=512, ResBlock1 | ~13.6M |
| Duration Predictor | 3-layer Conv1D + noise input | filter=256, kernel=3, dropout=0.1 | ~0.6M |
| Duration Discriminator | 5-layer Conv1D classifier | filter=256, kernel=3, dropout=0.1 | ~0.7M |
| MAS | Numba JIT dynamic programming | Gaussian noise injection, decay=2e-6 | 0 |
| Total (generator) | -- | -- | ~30.4M |
External Discriminators (training only)
| Discriminator | Architecture | Parameters |
|---|---|---|
| Multi-Period Discriminator (MPD) | 5 sub-discriminators, periods=[2,3,5,7,11] | ~16.8M |
| Multi-Scale Discriminator (MSD) | 3 sub-discriminators, scales=[1,2,4] | ~6.0M |
LARoPE: Length-Aware Rotary Position Embedding
Standard RoPE applies rotation based on absolute position index, which causes attention patterns to degrade on sequences longer than those seen during training. LARoPE normalizes positions by total sequence length before computing rotation angles:
angles = gamma * (position / sequence_length) * theta_j
Where:
- `gamma = 10.0` (scaling factor)
- `theta_j = 1 / (10000 ^ (2j/d))` (standard RoPE frequency bands)
- `position / sequence_length` maps all positions to [0, 1] regardless of utterance length
This makes the model invariant to absolute sequence length and improves text-speech alignment convergence. Applied to both query and key projections in all 6 text encoder attention layers.
Configuration: "pos_encoding": "larope" in config.json (set to "shaw" for original VITS-2 behavior).
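The angle formula above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the repo's code; `gamma` and the frequency bands follow the definitions given above:

```python
import numpy as np

def larope_angles(seq_len: int, d: int, gamma: float = 10.0) -> np.ndarray:
    """Length-aware rotation angles: positions are normalized to [0, 1]."""
    positions = np.arange(seq_len) / seq_len                # position / sequence_length
    theta = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))    # standard RoPE frequency bands
    return gamma * positions[:, None] * theta[None, :]      # shape: (seq_len, d/2)
```

Because positions are normalized, the angle at a given *relative* position is identical regardless of utterance length, which is the source of the length invariance described above.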
OT-CFM: Conditional Flow Matching
Instead of the normalizing flow (4 affine coupling layers with transformer blocks) used in base VITS-2, this model uses Optimal-Transport Conditional Flow Matching:
Training: Sample timestep t ~ U(0,1), interpolate between noise z_0 ~ N(0,I) and posterior sample z_1:
z_t = (1 - (1 - sigma_min) * t) * z_0 + t * z_1
u_t = z_1 - (1 - sigma_min) * z_0 (target vector field)
A U-Net estimates the vector field v_theta(z_t, t, cond) and is trained with MSE loss against u_t.
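The training objective above can be sketched as follows. The estimator signature `estimator(z_t, t, cond)` is a hypothetical simplification of the actual U-Net interface:

```python
import torch

def cfm_training_loss(estimator, z1, cond, sigma_min=1e-4):
    """One OT-CFM training step: MSE between predicted and target vector field."""
    z0 = torch.randn_like(z1)                        # z_0 ~ N(0, I)
    t = torch.rand(z1.shape[0], device=z1.device)    # t ~ U(0, 1), one per batch item
    t_ = t.view(-1, *([1] * (z1.dim() - 1)))         # broadcast t over feature dims
    zt = (1 - (1 - sigma_min) * t_) * z0 + t_ * z1   # straight-line interpolation
    ut = z1 - (1 - sigma_min) * z0                   # target vector field u_t
    vt = estimator(zt, t, cond)                      # predicted vector field v_theta
    return torch.mean((vt - ut) ** 2)
```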
Inference: 4-step Euler ODE integration from z_0 ~ N(0,I) to z_1:
for i in 0..3:
t = i / 4
v = estimator(z, t, cond, mask)
z = z + (1/4) * v
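The loop above amounts to fixed-step Euler integration of the learned ODE. A minimal runnable sketch, again with a hypothetical `estimator(z, t, cond)` signature:

```python
import torch

def euler_cfm_sample(estimator, shape, cond, n_steps=4, device="cpu"):
    """Fixed-step Euler integration of the vector field from noise to data."""
    z = torch.randn(shape, device=device)            # z_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = estimator(z, t, cond)                    # vector field at (z, t)
        z = z + dt * v                               # Euler update
    return z
```

With only 4 steps this is far cheaper than the dozens of solver steps a diffusion model typically needs, which is the main efficiency argument for OT-CFM's straight-line paths.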
The U-Net vector field estimator has:
- Encoder: 2 levels, each with ResnetBlock1D + TransformerBlock1D (4-head attention, GELU FFN) + Downsample
- Bottleneck: 2 mid blocks (ResnetBlock1D + TransformerBlock1D)
- Decoder: 2 levels with skip connections + Upsample
- Sinusoidal timestep embedding (scale=1000) injected via MLP into every ResnetBlock
sigma_min = 1e-4
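The sinusoidal timestep embedding (scale=1000) mentioned above can be sketched like this; the exact dimension split is an assumption, following the common sin/cos-concatenation convention:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int, scale: float = 1000.0) -> torch.Tensor:
    """Sinusoidal embedding of continuous timesteps t in [0, 1], scaled by 1000."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = scale * t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)
```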
Configuration: "flow_type": "cfm" in config.json (set to "nf" for original normalizing flows).
Training Procedure
Optimizers
Three independent AdamW optimizers train simultaneously from step 0:
| Optimizer | Scope | Parameters |
|---|---|---|
| `optim_g` | Text encoder, posterior encoder, CFM, HiFi-GAN decoder, duration predictor | ~30.4M |
| `optim_d` | Multi-Period Discriminator + Multi-Scale Discriminator | ~22.8M |
| `optim_dp_d` | Duration Discriminator | ~0.7M |
All optimizers share the same hyperparameters:
- Learning rate: 2e-4
- Betas: (0.8, 0.99)
- Epsilon: 1e-9
- Weight decay: 0.01
- LR scheduler: ExponentialLR, gamma=0.999875 (per epoch)
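These hyperparameters map directly onto PyTorch's `AdamW` and `ExponentialLR`. A sketch with small `nn.Linear` modules standing in for the three real parameter groups (the module names here are placeholders, not the repo's classes):

```python
import torch

# Placeholder modules standing in for generator, audio discriminators,
# and duration discriminator.
gen, disc, dur_disc = (torch.nn.Linear(8, 8) for _ in range(3))

def make_optimizer(module):
    return torch.optim.AdamW(module.parameters(), lr=2e-4,
                             betas=(0.8, 0.99), eps=1e-9, weight_decay=0.01)

optim_g, optim_d, optim_dp_d = map(make_optimizer, (gen, disc, dur_disc))

# One scheduler per optimizer, stepped once per epoch.
scheds = [torch.optim.lr_scheduler.ExponentialLR(o, gamma=0.999875)
          for o in (optim_g, optim_d, optim_dp_d)]
```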
Training Configuration
| Setting | Value |
|---|---|
| Total steps | 800,000 |
| Batch size | 32 |
| Precision | Mixed (FP16 via GradScaler) |
| Segment size | 8192 samples (32 mel frames) |
| Gradient accumulation | 1 |
| Seed | 1234 |
| Duration | ~8 days |
Loss Function
The total generator loss combines seven components:
L_gen = L_mel + L_cfm + L_kl_bridge + L_adv + L_fm + L_dp_adv + L_dp_mse
| Loss | Weight | Description |
|---|---|---|
| `L_mel` | c_mel = 45 | L1 mel-spectrogram reconstruction |
| `L_cfm` | c_cfm = 1.0 | OT-CFM vector field MSE |
| `L_kl_bridge` | c_kl_bridge = 0.1 | Auxiliary KL: posterior -> expanded prior (helps MAS alignment) |
| `L_adv` | 1.0 | LSGAN generator adversarial loss (MPD + MSD) |
| `L_fm` | lambda_fm = 2.0 | Feature matching loss (MPD + MSD) |
| `L_dp_adv` | 1.0 | Duration predictor adversarial loss |
| `L_dp_mse` | 1.0 | Duration predictor MSE (predicted vs. MAS durations) |
Discriminator losses (separate backward passes):
- `L_disc`: LSGAN discriminator loss on real/fake audio (MPD + MSD)
- `L_dp_disc`: LSGAN discriminator loss on real/fake durations
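Combining the weights from the table above is a simple weighted sum. A sketch (the dictionary keys are illustrative names, not the repo's variable names):

```python
def total_generator_loss(losses: dict) -> float:
    """Weighted sum of the seven generator loss terms (weights from the table above)."""
    weights = {"mel": 45.0, "cfm": 1.0, "kl_bridge": 0.1,
               "adv": 1.0, "fm": 2.0, "dp_adv": 1.0, "dp_mse": 1.0}
    return sum(weights[k] * losses[k] for k in weights)
```

Note how heavily `c_mel = 45` dominates: a unit of mel reconstruction error counts 45x more than a unit of adversarial loss, a weighting inherited from VITS/HiFi-GAN training recipes.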
Training Data
LJSpeech-1.1 -- a public domain speech dataset:
- 13,100 short audio clips
- Single female English speaker (Linda Johnson)
- Reading passages from 7 non-fiction books
- Sampling rate: 22,050 Hz
- Total duration: ~24 hours
Data split:
| Split | Utterances | Purpose |
|---|---|---|
| Train | 12,500 | Model training |
| Validation | 100 | Whisper CER evaluation |
| Test | 500 | Held-out evaluation |
Audio processing:
- 80-band mel spectrogram (VITS-2 change from 513-bin linear spec)
- FFT size: 1024, hop length: 256, window: 1024
- Mel range: 0 Hz -- Nyquist (no fmax cap)
- Character-level text (no phoneme conversion, per VITS-2 paper)
- No blank token interspersion (VITS-2 change from VITS)
Evaluation: CER Progression
Character Error Rate measured by running Whisper large-v3 on synthesized validation utterances:
| Epoch | Step | CER (%) | Notes |
|---|---|---|---|
| 5 | ~6k | 96.6 | Early training, unintelligible |
| 20 | ~23k | 82.4 | Beginning to form words |
| 275 | ~316k | 79.6 | Still high, alignment improving |
| 705 | ~800k | 14.35 | Best CER (final checkpoint) |
CER validation runs every 5 epochs using the validation split (100 utterances). The best-CER checkpoint is saved automatically.
Note: CER is measured via Whisper transcription, not human evaluation. No MOS (Mean Opinion Score) evaluation has been performed.
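For reference, character error rate is Levenshtein edit distance over characters divided by reference length. A minimal self-contained implementation (the repo's actual evaluation transcribes synthesized audio with Whisper large-v3 and compares against the ground-truth text; this sketch only shows the metric itself):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over characters, divided by reference length."""
    dp = list(range(len(hypothesis) + 1))  # dp[j] = distance(ref[:i], hyp[:j])
    for i, rc in enumerate(reference, 1):
        prev, dp[0] = dp[0], i
        for j, hc in enumerate(hypothesis, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (rc != hc))  # substitution (or match)
            prev = cur
    return dp[-1] / len(reference)
```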
Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 3090 Ti (24 GB VRAM) |
| CPU | AMD Ryzen 9 5900X (12-core) |
| RAM | 62 GB DDR4 |
| OS | Ubuntu 22.04 LTS |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.4.1 |
| Training time | ~8 days |
TensorBoard Logs
Full training logs are included in the logs/ directory. To visualize:
pip install tensorboard
tensorboard --logdir logs/
Metrics tracked:
- `loss/mel` -- Mel spectrogram reconstruction loss
- `loss/cfm` -- OT-CFM vector field MSE
- `loss/kl_bridge` -- Auxiliary KL bridge loss
- `loss/adv_g` -- Generator adversarial loss
- `loss/fm` -- Feature matching loss
- `loss/dp_adv` -- Duration predictor adversarial loss
- `loss/dp_mse` -- Duration predictor MSE loss
- `loss/dp_disc` -- Duration discriminator loss
- `loss/disc` -- Audio discriminator loss
- `lr/gen` -- Generator learning rate
- `cer/whisper` -- Character error rate (periodic)
Audio Samples
Synthesized samples from the final checkpoint (step 800k):
How to Use
Python (with huggingface_hub)
import torch
import json
import numpy as np
import scipy.io.wavfile as wavfile
from huggingface_hub import hf_hub_download
# Download model files
repo_id = "jonathansilvasantos/vits2-larope-csbe-ljspeech"
config_path = hf_hub_download(repo_id, "config.json")
ckpt_path = hf_hub_download(repo_id, "checkpoints/vits2_final.pt")
# Load config
with open(config_path) as f:
config = json.load(f)
# Build model (requires this repo's source code)
from src.models.vits2 import SynthesizerTrn
from src.text.symbols import NUM_SYMBOLS
from src.text.text_processing import normalize_text, text_to_ids
from src.text.symbols import SYMBOL_TO_ID
model = SynthesizerTrn(
n_vocab=NUM_SYMBOLS,
spec_channels=config["data"]["n_mel_channels"],
segment_size=config["train"]["segment_size"] // config["data"]["hop_length"],
**config["model"],
).cuda().eval()
# Load weights
state = torch.load(ckpt_path, map_location="cuda")
model.load_state_dict(state["model"])
# Synthesize
text = "The quick brown fox jumps over the lazy dog."
normalized = normalize_text(text)
ids = text_to_ids(normalized, SYMBOL_TO_ID)
x = torch.LongTensor([ids]).cuda()
x_lengths = torch.LongTensor([len(ids)]).cuda()
with torch.no_grad():
audio, attn = model.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8)
# Save to WAV
audio_np = audio.squeeze().cpu().numpy()
audio_np = audio_np / max(np.abs(audio_np).max(), 1e-8)  # peak-normalize (guard against all-zero output)
wavfile.write("output.wav", 22050, (audio_np * 32767).astype(np.int16))
CLI (using the repo's inference script)
git clone https://github.com/jonathandasilvasantos/vits-2.git
cd vits-2
pip install -r requirements.txt
# Download checkpoint
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('jonathansilvasantos/vits2-larope-csbe-ljspeech', 'checkpoints/vits2_final.pt', local_dir='.')"
# Run inference
python inference.py --checkpoint checkpoints/vits2_final.pt --text "Hello world."
Available Checkpoints
| File | Step | Size | Description |
|---|---|---|---|
| `checkpoints/vits2_final.pt` | 800,000 | 1.1 GB | Final checkpoint (includes model + discriminators + optimizers) |
| `checkpoints/vits2_best_cer.pt` | ~800k | 1.1 GB | Best CER checkpoint (14.35% CER, Whisper large-v3) |
| `checkpoints/vits2-larope-csbe.pt` | ~350k | 1.0 GB | Earlier checkpoint (model weights only, no discriminators/optimizers) |
All checkpoints contain the model key with the SynthesizerTrn state dict. The final and best-CER checkpoints also include mpd, msd, optim_g, optim_d, optim_dp_d, and scaler states for training resumption.
Limitations
- Single speaker only -- trained on LJSpeech (one female English speaker), no multi-speaker support
- English only -- character set covers lowercase English letters and basic punctuation
- Character-level input -- no phoneme conversion; may struggle with unusual spellings or abbreviations
- No MOS evaluation -- quality assessed via Whisper CER only, no human listening tests
- 22.05 kHz output -- lower than modern 44.1/48 kHz TTS systems
- GPU required -- inference requires CUDA (no CPU fallback implemented)
- No streaming -- generates full utterance before outputting audio
References
VITS-2: Kong, J., Park, J., Kim, B., Kim, J., Kong, D., & Kim, S. (2023). VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design. INTERSPEECH 2023. arXiv:2307.16430
LARoPE: Kim, H., Lee, J., Yang, J., & Morton, J. (2025). Length-Aware Rotary Position Embedding for Text-Speech Alignment. arXiv:2509.11084
SupertonicTTS (CSBE): Kim, H., Yang, J., Yu, Y., Ji, S., Morton, J., Bous, F., Byun, J., & Lee, J. (2025). SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System. arXiv:2503.23108
Matcha-TTS (OT-CFM): Mehta, S., Tu, R., Beskow, J., Székely, É., & Henter, G. E. (2024). Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching. ICASSP 2024. arXiv:2309.03199
VITS: Kim, J., Kong, J., & Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. ICML 2021. arXiv:2106.06103
LJSpeech: Ito, K. & Johnson, L. (2017). The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/
HiFi-GAN: Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS 2020. arXiv:2010.05646
RoFormer (RoPE): Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing. arXiv:2104.09864