VITS-2 + LARoPE + CFM -- LJSpeech
Single-speaker English text-to-speech model based on VITS-2 extended with LARoPE (Length-Aware Rotary Position Embedding) and OT-CFM (Optimal-Transport Conditional Flow Matching). Trained for 800,000 steps on LJSpeech-1.1.
Model Description
This model combines three lines of TTS research:
- VITS-2 (Kong et al., 2023) -- End-to-end single-stage TTS with adversarial learning, monotonic alignment search, stochastic duration predictor, and HiFi-GAN decoder.
- LARoPE (Kim et al., 2025) -- Rotary position embedding that normalizes positions by sequence length, replacing Shaw's relative position encoding in the text encoder. Improves alignment stability on long utterances.
- OT-CFM (Mehta et al., 2024 / Kim et al., 2025) -- Replaces normalizing flows with a conditional flow matching module that learns straight-line optimal transport paths from noise to target, enabling high-quality synthesis in just 4 Euler ODE steps.
The model takes normalized English text (character-level, no phoneme conversion) and directly outputs 22.05 kHz waveforms.
Architecture
Component Summary
| Component | Architecture | Dims / Config | Parameters |
|---|---|---|---|
| Text Encoder | 6-layer Transformer + LARoPE | hidden=192, FFN=768, heads=2, kernel=3 | ~3.8M |
| Posterior Encoder | 16-layer WaveNet | hidden=192, kernel=5, dilation=1, in=80 mel | ~4.5M |
| CFM (Flow) | U-Net vector field estimator | channels=[256,256], 1 transformer block/level, 2 mid blocks | ~7.2M |
| HiFi-GAN Decoder | 4-stage upsampling + MRF | upsample=[8,8,2,2], init_ch=512, ResBlock1 | ~13.6M |
| Duration Predictor | 3-layer Conv1D + noise input | filter=256, kernel=3, dropout=0.1 | ~0.6M |
| Duration Discriminator | 5-layer Conv1D classifier | filter=256, kernel=3, dropout=0.1 | ~0.7M |
| MAS | Numba JIT dynamic programming | Gaussian noise injection, decay=2e-6 | 0 |
| Total (generator) | -- | -- | ~30.4M |
External Discriminators (training only)
| Discriminator | Architecture | Parameters |
|---|---|---|
| Multi-Period Discriminator (MPD) | 5 sub-discriminators, periods=[2,3,5,7,11] | ~16.8M |
| Multi-Scale Discriminator (MSD) | 3 sub-discriminators, scales=[1,2,4] | ~6.0M |
LARoPE: Length-Aware Rotary Position Embedding
Standard RoPE applies rotation based on absolute position index, which causes attention patterns to degrade on sequences longer than those seen during training. LARoPE normalizes positions by total sequence length before computing rotation angles:
angles = gamma * (position / sequence_length) * theta_j
Where:
- `gamma = 10.0` (scaling factor)
- `theta_j = 1 / (10000 ^ (2j/d))` (standard RoPE frequency bands)
- `position / sequence_length` maps all positions to [0, 1] regardless of utterance length
This makes the model invariant to absolute sequence length and improves text-speech alignment convergence. Applied to both query and key projections in all 6 text encoder attention layers.
Configuration: "pos_encoding": "larope" in config.json (set to "shaw" for original VITS-2 behavior).
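The angle formula above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the repo's code; `gamma` and the frequency bands follow the definitions given above:

```python
import numpy as np

def larope_angles(seq_len: int, d: int, gamma: float = 10.0) -> np.ndarray:
    """Length-aware rotation angles: positions are normalized to [0, 1]."""
    positions = np.arange(seq_len) / seq_len                # position / sequence_length
    theta = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))    # standard RoPE frequency bands
    return gamma * positions[:, None] * theta[None, :]      # shape: (seq_len, d/2)
```

Because positions are normalized, the angle at a given *relative* position is identical regardless of utterance length, which is the source of the length invariance described above.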
OT-CFM: Conditional Flow Matching
Instead of the normalizing flow (4 affine coupling layers with transformer blocks) used in base VITS-2, this model uses Optimal-Transport Conditional Flow Matching:
Training: Sample timestep t ~ U(0,1), interpolate between noise z_0 ~ N(0,I) and posterior sample z_1:
z_t = (1 - (1 - sigma_min) * t) * z_0 + t * z_1
u_t = z_1 - (1 - sigma_min) * z_0 (target vector field)
A U-Net estimates the vector field v_theta(z_t, t, cond) and is trained with MSE loss against u_t.
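The training objective above can be sketched as follows. The estimator signature `estimator(z_t, t, cond)` is a hypothetical simplification of the actual U-Net interface:

```python
import torch

def cfm_training_loss(estimator, z1, cond, sigma_min=1e-4):
    """One OT-CFM training step: MSE between predicted and target vector field."""
    z0 = torch.randn_like(z1)                        # z_0 ~ N(0, I)
    t = torch.rand(z1.shape[0], device=z1.device)    # t ~ U(0, 1), one per batch item
    t_ = t.view(-1, *([1] * (z1.dim() - 1)))         # broadcast t over feature dims
    zt = (1 - (1 - sigma_min) * t_) * z0 + t_ * z1   # straight-line interpolation
    ut = z1 - (1 - sigma_min) * z0                   # target vector field u_t
    vt = estimator(zt, t, cond)                      # predicted vector field v_theta
    return torch.mean((vt - ut) ** 2)
```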
Inference: 4-step Euler ODE integration from z_0 ~ N(0,I) to z_1:
for i in 0..3:
t = i / 4
v = estimator(z, t, cond, mask)
z = z + (1/4) * v
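The loop above amounts to fixed-step Euler integration of the learned ODE. A minimal runnable sketch, again with a hypothetical `estimator(z, t, cond)` signature:

```python
import torch

def euler_cfm_sample(estimator, shape, cond, n_steps=4, device="cpu"):
    """Fixed-step Euler integration of the vector field from noise to data."""
    z = torch.randn(shape, device=device)            # z_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        v = estimator(z, t, cond)                    # vector field at (z, t)
        z = z + dt * v                               # Euler update
    return z
```

With only 4 steps this is far cheaper than the dozens of solver steps a diffusion model typically needs, which is the main efficiency argument for OT-CFM's straight-line paths.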
The U-Net vector field estimator has:
- Encoder: 2 levels, each with ResnetBlock1D + TransformerBlock1D (4-head attention, GELU FFN) + Downsample
- Bottleneck: 2 mid blocks (ResnetBlock1D + TransformerBlock1D)
- Decoder: 2 levels with skip connections + Upsample
- Sinusoidal timestep embedding (scale=1000) injected via MLP into every ResnetBlock
sigma_min = 1e-4
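The sinusoidal timestep embedding (scale=1000) mentioned above can be sketched like this; the exact dimension split is an assumption, following the common sin/cos-concatenation convention:

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int, scale: float = 1000.0) -> torch.Tensor:
    """Sinusoidal embedding of continuous timesteps t in [0, 1], scaled by 1000."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = scale * t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)
```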
Configuration: "flow_type": "cfm" in config.json (set to "nf" for original normalizing flows).
Training Procedure
Optimizers
Three independent AdamW optimizers train simultaneously from step 0:
| Optimizer | Scope | Parameters |
|---|---|---|
| `optim_g` | Text encoder, posterior encoder, CFM, HiFi-GAN decoder, duration predictor | ~30.4M |
| `optim_d` | Multi-Period Discriminator + Multi-Scale Discriminator | ~22.8M |
| `optim_dp_d` | Duration Discriminator | ~0.7M |
All optimizers share the same hyperparameters:
- Learning rate: 2e-4
- Betas: (0.8, 0.99)
- Epsilon: 1e-9
- Weight decay: 0.01
- LR scheduler: ExponentialLR, gamma=0.999875 (per epoch)
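These hyperparameters map directly onto PyTorch's `AdamW` and `ExponentialLR`. A sketch with small `nn.Linear` modules standing in for the three real parameter groups (the module names here are placeholders, not the repo's classes):

```python
import torch

# Placeholder modules standing in for generator, audio discriminators,
# and duration discriminator.
gen, disc, dur_disc = (torch.nn.Linear(8, 8) for _ in range(3))

def make_optimizer(module):
    return torch.optim.AdamW(module.parameters(), lr=2e-4,
                             betas=(0.8, 0.99), eps=1e-9, weight_decay=0.01)

optim_g, optim_d, optim_dp_d = map(make_optimizer, (gen, disc, dur_disc))

# One scheduler per optimizer, stepped once per epoch.
scheds = [torch.optim.lr_scheduler.ExponentialLR(o, gamma=0.999875)
          for o in (optim_g, optim_d, optim_dp_d)]
```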
Training Configuration
| Setting | Value |
|---|---|
| Total steps | 800,000 |
| Batch size | 32 |
| Precision | Mixed (FP16 via GradScaler) |
| Segment size | 8192 samples (32 mel frames) |
| Gradient accumulation | 1 |
| Seed | 1234 |
| Duration | ~8 days |
Loss Function
The total generator loss combines seven components:
L_gen = L_mel + L_cfm + L_kl_bridge + L_adv + L_fm + L_dp_adv + L_dp_mse
| Loss | Weight | Description |
|---|---|---|
| `L_mel` | c_mel = 45 | L1 mel-spectrogram reconstruction |
| `L_cfm` | c_cfm = 1.0 | OT-CFM vector field MSE |
| `L_kl_bridge` | c_kl_bridge = 0.1 | Auxiliary KL: posterior -> expanded prior (helps MAS alignment) |
| `L_adv` | 1.0 | LSGAN generator adversarial loss (MPD + MSD) |
| `L_fm` | lambda_fm = 2.0 | Feature matching loss (MPD + MSD) |
| `L_dp_adv` | 1.0 | Duration predictor adversarial loss |
| `L_dp_mse` | 1.0 | Duration predictor MSE (predicted vs. MAS durations) |
Discriminator losses (separate backward passes):
- `L_disc`: LSGAN discriminator loss on real/fake audio (MPD + MSD)
- `L_dp_disc`: LSGAN discriminator loss on real/fake durations
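Combining the weights from the table above is a simple weighted sum. A sketch (the dictionary keys are illustrative names, not the repo's variable names):

```python
def total_generator_loss(losses: dict) -> float:
    """Weighted sum of the seven generator loss terms (weights from the table above)."""
    weights = {"mel": 45.0, "cfm": 1.0, "kl_bridge": 0.1,
               "adv": 1.0, "fm": 2.0, "dp_adv": 1.0, "dp_mse": 1.0}
    return sum(weights[k] * losses[k] for k in weights)
```

Note how heavily `c_mel = 45` dominates: a unit of mel reconstruction error counts 45x more than a unit of adversarial loss, a weighting inherited from VITS/HiFi-GAN training recipes.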
Training Data
LJSpeech-1.1 -- a public domain speech dataset:
- 13,100 short audio clips
- Single female English speaker (Linda Johnson)
- Reading passages from 7 non-fiction books
- Sampling rate: 22,050 Hz
- Total duration: ~24 hours
Data split:
| Split | Utterances | Purpose |
|---|---|---|
| Train | 12,500 | Model training |
| Validation | 100 | Whisper CER evaluation |
| Test | 500 | Held-out evaluation |
Audio processing:
- 80-band mel spectrogram (VITS-2 change from 513-bin linear spec)
- FFT size: 1024, hop length: 256, window: 1024
- Mel range: 0 Hz -- Nyquist (no fmax cap)
- Character-level text (no phoneme conversion, per VITS-2 paper)
- No blank token interspersion (VITS-2 change from VITS)
Evaluation: CER Progression
Character Error Rate measured by running Whisper large-v3 on synthesized validation utterances:
| Epoch | Step | CER (%) | Notes |
|---|---|---|---|
| 5 | ~6k | 96.6 | Early training, unintelligible |
| 20 | ~23k | 82.4 | Beginning to form words |
| 275 | ~316k | 79.6 | Still high, alignment improving |
| 705 | ~800k | 14.35 | Best CER (final checkpoint) |
CER validation runs every 5 epochs using the validation split (100 utterances). The best-CER checkpoint is saved automatically.
Note: CER is measured via Whisper transcription, not human evaluation. No MOS (Mean Opinion Score) evaluation has been performed.
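For reference, character error rate is Levenshtein edit distance over characters divided by reference length. A minimal self-contained implementation (the repo's actual evaluation transcribes synthesized audio with Whisper large-v3 and compares against the ground-truth text; this sketch only shows the metric itself):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over characters, divided by reference length."""
    dp = list(range(len(hypothesis) + 1))  # dp[j] = distance(ref[:i], hyp[:j])
    for i, rc in enumerate(reference, 1):
        prev, dp[0] = dp[0], i
        for j, hc in enumerate(hypothesis, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (rc != hc))  # substitution (or match)
            prev = cur
    return dp[-1] / len(reference)
```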
Hardware
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 3090 Ti (24 GB VRAM) |
| CPU | AMD Ryzen 9 5900X (12-core) |
| RAM | 62 GB DDR4 |
| OS | Ubuntu 22.04 LTS |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.4.1 |
| Training time | ~8 days |
TensorBoard Logs
Full training logs are included in the logs/ directory. To visualize:
pip install tensorboard
tensorboard --logdir logs/
Metrics tracked:
- `loss/mel` -- Mel spectrogram reconstruction loss
- `loss/cfm` -- OT-CFM vector field MSE
- `loss/kl_bridge` -- Auxiliary KL bridge loss
- `loss/adv_g` -- Generator adversarial loss
- `loss/fm` -- Feature matching loss
- `loss/dp_adv` -- Duration predictor adversarial loss
- `loss/dp_mse` -- Duration predictor MSE loss
- `loss/dp_disc` -- Duration discriminator loss
- `loss/disc` -- Audio discriminator loss
- `lr/gen` -- Generator learning rate
- `cer/whisper` -- Character error rate (periodic)
Audio Samples
Synthesized samples from the final checkpoint (step 800k):
How to Use
Python (with huggingface_hub)
import torch
import json
import numpy as np
import scipy.io.wavfile as wavfile
from huggingface_hub import hf_hub_download
# Download model files
repo_id = "jonathansilvasantos/vits2-larope-csbe-ljspeech"
config_path = hf_hub_download(repo_id, "config.json")
ckpt_path = hf_hub_download(repo_id, "checkpoints/vits2_final.pt")
# Load config
with open(config_path) as f:
config = json.load(f)
# Build model (requires this repo's source code)
from src.models.vits2 import SynthesizerTrn
from src.text.symbols import NUM_SYMBOLS
from src.text.text_processing import normalize_text, text_to_ids
from src.text.symbols import SYMBOL_TO_ID
model = SynthesizerTrn(
n_vocab=NUM_SYMBOLS,
spec_channels=config["data"]["n_mel_channels"],
segment_size=config["train"]["segment_size"] // config["data"]["hop_length"],
**config["model"],
).cuda().eval()
# Load weights
state = torch.load(ckpt_path, map_location="cuda")
model.load_state_dict(state["model"])
# Synthesize
text = "The quick brown fox jumps over the lazy dog."
normalized = normalize_text(text)
ids = text_to_ids(normalized, SYMBOL_TO_ID)
x = torch.LongTensor([ids]).cuda()
x_lengths = torch.LongTensor([len(ids)]).cuda()
with torch.no_grad():
audio, attn = model.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8)
# Save to WAV
audio_np = audio.squeeze().cpu().numpy()
audio_np = audio_np / max(np.abs(audio_np).max(), 1e-8)  # peak-normalize (guard against all-zero output)
wavfile.write("output.wav", 22050, (audio_np * 32767).astype(np.int16))
CLI (using the repo's inference script)
git clone https://github.com/jonathandasilvasantos/vits-2.git
cd vits-2
pip install -r requirements.txt
# Download checkpoint
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('jonathansilvasantos/vits2-larope-csbe-ljspeech', 'checkpoints/vits2_final.pt', local_dir='.')"
# Run inference
python inference.py --checkpoint checkpoints/vits2_final.pt --text "Hello world."
Available Checkpoints
| File | Step | Size | Description |
|---|---|---|---|
| `checkpoints/vits2_final.pt` | 800,000 | 1.1 GB | Final checkpoint (includes model + discriminators + optimizers) |
| `checkpoints/vits2_best_cer.pt` | ~800k | 1.1 GB | Best CER checkpoint (14.35% CER, Whisper large-v3) |
| `checkpoints/vits2-larope-csbe.pt` | ~350k | 1.0 GB | Earlier checkpoint (model weights only, no discriminators/optimizers) |
All checkpoints contain the model key with the SynthesizerTrn state dict. The final and best-CER checkpoints also include mpd, msd, optim_g, optim_d, optim_dp_d, and scaler states for training resumption.
Limitations
- Single speaker only -- trained on LJSpeech (one female English speaker), no multi-speaker support
- English only -- character set covers lowercase English letters and basic punctuation
- Character-level input -- no phoneme conversion; may struggle with unusual spellings or abbreviations
- No MOS evaluation -- quality assessed via Whisper CER only, no human listening tests
- 22.05 kHz output -- lower than modern 44.1/48 kHz TTS systems
- GPU required -- inference requires CUDA (no CPU fallback implemented)
- No streaming -- generates full utterance before outputting audio
References
VITS-2: Kong, J., Park, J., Kim, B., Kim, J., Kong, D., & Kim, S. (2023). VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design. INTERSPEECH 2023. arXiv:2307.16430
LARoPE: Kim, H., Lee, J., Yang, J., & Morton, J. (2025). Length-Aware Rotary Position Embedding for Text-Speech Alignment. arXiv:2509.11084
SupertonicTTS (CSBE): Kim, H., Yang, J., Yu, Y., Ji, S., Morton, J., Bous, F., Byun, J., & Lee, J. (2025). SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System. arXiv:2503.23108
Matcha-TTS (OT-CFM): Mehta, S., Tu, R., Beskow, J., Székely, É., & Henter, G. E. (2024). Matcha-TTS: A Fast TTS Architecture with Conditional Flow Matching. ICASSP 2024. arXiv:2309.03199
VITS: Kim, J., Kong, J., & Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech. ICML 2021. arXiv:2106.06103
LJSpeech: Ito, K. & Johnson, L. (2017). The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/
HiFi-GAN: Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS 2020. arXiv:2010.05646
RoFormer (RoPE): Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing. arXiv:2104.09864