Dialogs-RU · Expressive Russian TTS (VITS2)

An expressive, multi-speaker, emotion-conditioned Russian text-to-speech model. It is a VITS2 model (multi-band iSTFT decoder) trained on the Dialogs corpus — a studio-quality, expressive, conversational Russian speech dataset.

This is the exact model described in the paper "Dialogs: a studio-quality expressive conversational Russian speech corpus for dialog assistants" (Ilya Shigabeev & Ilya Latyshev, Langswap), trained as a proof of concept that the corpus supports expressive, dialog-style TTS.

👉 Try it in your browser (free CPU demo): frappuccino/dialogs-ru-tts

Highlights

3 studio voices: Masha (F), Sveta (F), Dima (M)
13 emotional styles: neutral, happy, surprise, arrogance, yawn, fear, laughing, whisper, disgust, angry, sad, tongue-twister, poem
22.05 kHz output; faster than real time on CPU (RTF ≈ 0.2–0.4, i.e. about 0.2–0.4 s of compute per second of audio)
Plain Russian text in; stress is optional (auto-placed with ruaccent, or marked manually with +)

Conditioning

speaker_id	voice	gender
0	Masha / Маша	female
1	Sveta / Света	female
2	Dima / Дима	male

emotion_id	style	emotion_id	style
0	neutral	7	whisper
1	happy	8	disgust
2	surprise	9	angry
3	arrogance	10	sad
4	yawn	11	tongue-twister
5	fear	12	poem
6	laughing
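
For scripting, the two tables above can be kept as plain lookup structures. The sketch below is a convenience wrapper, not part of the released code: the names SPEAKERS, EMOTIONS and resolve_ids are hypothetical, and only the ID values come from the tables.

```python
# Hypothetical helper: ID values mirror the tables above; the names are illustrative.
SPEAKERS = {"masha": 0, "sveta": 1, "dima": 2}

EMOTIONS = [
    "neutral", "happy", "surprise", "arrogance", "yawn", "fear", "laughing",
    "whisper", "disgust", "angry", "sad", "tongue-twister", "poem",
]  # list index == emotion_id


def resolve_ids(voice: str, style: str) -> tuple[int, int]:
    """Translate human-readable names into (speaker_id, emotion_id)."""
    voice_key = voice.strip().lower()
    style_key = style.strip().lower()
    if voice_key not in SPEAKERS:
        raise ValueError(f"unknown voice {voice!r}; expected one of {sorted(SPEAKERS)}")
    if style_key not in EMOTIONS:
        raise ValueError(f"unknown style {style!r}; expected one of {EMOTIONS}")
    return SPEAKERS[voice_key], EMOTIONS.index(style_key)
```

For example, resolve_ids("Dima", "whisper") gives (2, 7), which can be used directly as the speaker_id and emotion_id conditioning values.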

Usage

pip install -r requirements.txt
python inference.py

from scipy.io.wavfile import write
from tts import DialogsTTS

tts = DialogsTTS()                      # downloads weights on first run
sr, audio, used = tts.synthesize(
    "В 2024 году цена выросла на 7,5%.",  # numbers/dates/etc. are spelled out
    speaker_id=0,                         # 0=Masha, 1=Sveta, 2=Dima
    emotion_id=1,                         # 1=happy
    normalize=True,                       # rutextnorm: numbers, dates, abbreviations
    auto_stress=True,                     # ruaccent: place '+' stress automatically
)
write("out.wav", sr, audio)
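
To compare styles for one voice, the call above can be looped over all 13 emotion IDs. This is a sketch that assumes only the synthesize keyword arguments and the (sr, audio, used) return shape shown above; render_all_styles and the output file naming are hypothetical. The synthesize callable is passed in so the loop does not depend on how the model is loaded.

```python
from typing import Any, Callable, Iterator

NUM_EMOTIONS = 13  # emotion_id 0..12, per the conditioning table


def render_all_styles(
    synthesize: Callable[..., tuple],
    text: str,
    speaker_id: int = 0,
) -> Iterator[tuple[int, int, Any]]:
    """Yield (emotion_id, sample_rate, audio) for every emotion style.

    `synthesize` is expected to behave like DialogsTTS.synthesize in the
    usage example: it returns a (sample_rate, audio, used) triple.
    """
    for emotion_id in range(NUM_EMOTIONS):
        sr, audio, _used = synthesize(
            text,
            speaker_id=speaker_id,
            emotion_id=emotion_id,
            normalize=True,
            auto_stress=True,
        )
        yield emotion_id, sr, audio


# Intended use with the real model (not executed here):
#   from scipy.io.wavfile import write
#   from tts import DialogsTTS
#   tts = DialogsTTS()
#   for eid, sr, audio in render_all_styles(tts.synthesize, "Привет!", speaker_id=2):
#       write(f"dima_{eid:02d}.wav", sr, audio)
```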

Text front-end (normalization & stress)

DialogsTTS.synthesize runs a two-stage front-end before tokenization:

Normalization (normalize=True, default) — rutextnorm spells out numbers, dates, money, units, fractions and abbreviations (7,5% → «семь целых и пять десятых процента», №5 → «номер пять»). It is vendored at text/rutextnorm.py (single file, MIT, zero dependencies).
Stress (auto_stress=True, default) — ruaccent places a + before each stressed vowel. The model was trained with these marks; you can also write them yourself (e.g. з+амок vs зам+ок).

The result is lowercased and mapped through the 73-symbol vocabulary (characters outside it are dropped).
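
The final step (lowercasing, then dropping out-of-vocabulary characters) can be sketched as follows. The symbol set here is an illustrative stand-in, not the model's actual 73-symbol table, and encode and check_stress_marks are hypothetical names. The only behaviour taken from the description above is that + precedes a stressed vowel and that unknown characters are dropped.

```python
# Illustrative stand-in vocabulary: NOT the model's real 73-symbol table.
PUNCTUATION = " !,-.:;?+"  # '+' is the stress mark
LATIN = "abcdefghijklmnopqrstuvwxyz"
CYRILLIC = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
SYMBOLS = PUNCTUATION + LATIN + CYRILLIC
SYMBOL_TO_ID = {ch: i for i, ch in enumerate(SYMBOLS)}

RUSSIAN_VOWELS = set("аеёиоуыэюя")


def check_stress_marks(text: str) -> bool:
    """Return True if every '+' is directly followed by a Russian vowel."""
    lowered = text.lower()
    for i, ch in enumerate(lowered):
        if ch == "+":
            if i + 1 >= len(lowered) or lowered[i + 1] not in RUSSIAN_VOWELS:
                return False
    return True


def encode(text: str) -> list[int]:
    """Lowercase the text and map it to IDs, dropping unknown characters."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]
```

With this stand-in table, encode("З+амок €") yields seven IDs: the six characters of з+амок plus the space, while € is silently dropped. Running check_stress_marks on hand-marked text before synthesis catches a misplaced +, such as one in front of a consonant.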

Model details

Architecture: VITS2 with a multi-band iSTFT generator (4 subbands), mel posterior encoder, transformer flows, a deterministic duration predictor (the stochastic duration predictor, SDP, is not used), and a duration discriminator.
Conditioning: speaker embedding (192-d) concatenated with emotion embedding (64-d) → 256-d global condition.
Sample rate: 22.05 kHz · vocabulary: 73 symbols (punctuation, Latin and Cyrillic letters, and the stress mark +).
Training: ~615k steps on a single RTX 4090, batch size 16, following VITS2 hyperparameter defaults.
Checkpoint: averaged_G_615000.pth (weight-averaged generator).
Evaluation (paper, proof-of-concept): UTMOS ≈ 3.36; MOS overall ≈ 2.83.
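
The conditioning entry above amounts to two embedding lookups followed by a concatenation. The sketch below shows only the shape arithmetic, using randomly initialised tables in plain Python; it is not the model's code, and the table and function names are hypothetical.

```python
import random

SPEAKER_DIM, EMOTION_DIM = 192, 64  # dimensions stated in the model details
NUM_SPEAKERS, NUM_EMOTIONS = 3, 13

rng = random.Random(0)


def make_table(rows: int, dim: int) -> list[list[float]]:
    """Random stand-in for a learned embedding table (rows x dim)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


speaker_table = make_table(NUM_SPEAKERS, SPEAKER_DIM)
emotion_table = make_table(NUM_EMOTIONS, EMOTION_DIM)


def global_condition(speaker_id: int, emotion_id: int) -> list[float]:
    """Concatenate the speaker and emotion vectors into one 256-d vector."""
    return speaker_table[speaker_id] + emotion_table[emotion_id]
```

In this sketch the two parts are concatenated rather than summed, so changing emotion_id alters only the last 64 entries of the 256-d vector and leaves the 192 speaker entries untouched.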

Limitations

Per-speaker data is modest and unbalanced (Masha 9.9 h, Dima 6.2 h, Sveta 4.4 h), so this single-corpus model is a proof of concept rather than a production system. The strongest styles are neutral, happy and sad; rarer styles (whisper, poem, tongue-twister, yawn) are subtler. For production use, the authors recommend mixing Dialogs with a larger Russian corpus.

Citation

@inproceedings{shigabeev2025dialogs,
  title     = {Dialogs: a studio-quality expressive conversational Russian speech corpus for dialog assistants},
  author    = {Shigabeev, Ilya and Latyshev, Ilya},
  year      = {2025},
  note      = {Langswap}
}

License

OpenRAIL — free use including commercial applications, subject to the use-based restrictions of the license.
