Dialogs-RU · Expressive Russian TTS (VITS2)

An expressive, multi-speaker, emotion-conditioned Russian text-to-speech model. It is a VITS2 model (multi-band iSTFT decoder) trained on the Dialogs corpus — a studio-quality, expressive, conversational Russian speech dataset.

This is the exact model described in the paper "Dialogs: a studio-quality expressive conversational Russian speech corpus for dialog assistants" (Ilya Shigabeev & Ilya Latyshev, Langswap), trained as a proof of concept that the corpus supports expressive, dialog-style TTS.

👉 Try it in your browser (free CPU demo): frappuccino/dialogs-ru-tts

Highlights

  • 3 studio voices: Masha (F), Sveta (F), Dima (M)
  • 13 emotional styles: neutral, happy, surprise, arrogance, yawn, fear, laughing, whisper, disgust, angry, sad, tongue-twister, poem
  • 22.05 kHz output; runs faster than real time on CPU (RTF ≈ 0.2–0.4)
  • Plain Russian text in; stress is optional (auto-placed with ruaccent, or marked manually with +)
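The real-time factor (RTF) quoted above is synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. A minimal sketch of the calculation follows; `real_time_factor` is an illustrative helper, not part of the repository.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = 22050) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Example: 5 s of audio (110250 samples at 22.05 kHz) generated in 1.5 s of compute.
rtf = real_time_factor(1.5, 110_250)
```

To measure it in practice, one would time the synthesis call with `time.perf_counter()` and pass the length of the returned audio array as `num_samples`.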

Conditioning

speaker_id  voice          gender
0           Masha / Маша   female
1           Sveta / Света  female
2           Dima / Дима    male

emotion_id  style
0           neutral
1           happy
2           surprise
3           arrogance
4           yawn
5           fear
6           laughing
7           whisper
8           disgust
9           angry
10          sad
11          tongue-twister
12          poem
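The ID tables above can be mirrored in code so that callers pass names rather than raw integers. This is a minimal sketch; `SPEAKERS`, `EMOTIONS` and `resolve_ids` are illustrative names and are not part of the released package.

```python
# Hypothetical helper: name -> ID lookups mirroring the conditioning tables.
SPEAKERS = {"masha": 0, "sveta": 1, "dima": 2}

EMOTIONS = {
    name: idx
    for idx, name in enumerate([
        "neutral", "happy", "surprise", "arrogance", "yawn", "fear",
        "laughing", "whisper", "disgust", "angry", "sad",
        "tongue-twister", "poem",
    ])
}


def resolve_ids(speaker: str, emotion: str) -> tuple[int, int]:
    """Return (speaker_id, emotion_id) for case-insensitive names."""
    try:
        return SPEAKERS[speaker.lower()], EMOTIONS[emotion.lower()]
    except KeyError as err:
        raise ValueError(f"unknown speaker or emotion: {err}") from None
```

The returned pair can then be passed as the `speaker_id` and `emotion_id` arguments of the synthesis call.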

Usage

From the command line:

pip install -r requirements.txt
python inference.py

From Python:

from scipy.io.wavfile import write
from tts import DialogsTTS

tts = DialogsTTS()                      # downloads weights on first run
sr, audio, used = tts.synthesize(
    "В 2024 году цена выросла на 7,5%.",  # numbers/dates/etc. are spelled out
    speaker_id=0,                         # 0=Masha, 1=Sveta, 2=Dima
    emotion_id=1,                         # 1=happy
    normalize=True,                       # rutextnorm: numbers, dates, abbreviations
    auto_stress=True,                     # ruaccent: place '+' stress automatically
)
write("out.wav", sr, audio)

Text front-end (normalization & stress)

DialogsTTS.synthesize runs a two-stage front-end before tokenization:

  1. Normalization (normalize=True, default): rutextnorm spells out numbers, dates, money, units, fractions and abbreviations (7,5% → «семь целых и пять десятых процента», №5 → «номер пять»). It is vendored at text/rutextnorm.py (single file, MIT, zero dependencies).
  2. Stress (auto_stress=True, default): ruaccent places a + before each stressed vowel. The model was trained with these marks; you can also write them yourself to disambiguate homographs (e.g. з+амок, "castle", vs зам+ок, "lock").

The result is lowercased and mapped through the 73-symbol vocabulary (characters outside it are dropped).
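The final mapping step can be sketched as follows. The symbol list here is a small illustrative subset, not the model's actual 73-symbol vocabulary, and `text_to_ids` is a hypothetical name; the real tokenizer in the repository may differ in symbol order and other details.

```python
# Illustrative subset only; the real vocabulary has 73 symbols (and includes Latin).
SYMBOLS = list(" +,.!?-") + list("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")
SYMBOL_TO_ID = {s: i for i, s in enumerate(SYMBOLS)}


def text_to_ids(text: str) -> list[int]:
    """Lowercase the text and keep only characters present in the vocabulary."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


# '№' and '5' are outside this subset, so they are silently dropped.
ids = text_to_ids("З+амок №5!")
```

Because unknown characters are dropped without warning, running normalization first matters: digits and symbols such as № would otherwise vanish from the input instead of being spoken.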

Model details

  • Architecture: VITS2 with a multi-band iSTFT generator (4 subbands), mel posterior encoder, transformer flows, deterministic duration predictor (no SDP), and a duration discriminator.
  • Conditioning: speaker embedding (192-d) concatenated with emotion embedding (64-d) → 256-d global condition.
  • Sample rate: 22.05 kHz · vocabulary: 73 symbols (punctuation + Latin + Cyrillic, + = stress).
  • Training: ~615k steps on a single RTX 4090, batch size 16, following VITS2 hyperparameter defaults.
  • Checkpoint: averaged_G_615000.pth (weight-averaged generator).
  • Evaluation (paper, proof-of-concept): UTMOS ≈ 3.36; MOS overall ≈ 2.83.
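The global condition described above is a concatenation of two learned lookups. The sketch below illustrates only the shapes, using randomly initialised plain-Python tables as stand-ins for the trained embedding layers; all names are hypothetical and the actual model code may be organised differently.

```python
import random

SPEAKER_DIM, EMOTION_DIM = 192, 64
NUM_SPEAKERS, NUM_EMOTIONS = 3, 13

rng = random.Random(0)


def make_table(rows: int, dim: int) -> list[list[float]]:
    """Stand-in for a learned embedding table (random values, not real weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


speaker_table = make_table(NUM_SPEAKERS, SPEAKER_DIM)
emotion_table = make_table(NUM_EMOTIONS, EMOTION_DIM)


def global_condition(speaker_id: int, emotion_id: int) -> list[float]:
    """Concatenate the speaker (192-d) and emotion (64-d) vectors into one 256-d vector."""
    return speaker_table[speaker_id] + emotion_table[emotion_id]


g = global_condition(speaker_id=0, emotion_id=1)  # Masha, happy
```

Concatenation keeps the speaker and emotion dimensions separate in the condition vector, so any of the 3 voices can in principle be paired with any of the 13 styles.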

Limitations

Per-speaker data is modest and unbalanced (Masha 9.9 h, Dima 6.2 h, Sveta 4.4 h), so this single-corpus model is a proof of concept rather than a production system. The strongest styles are neutral, happy and sad; rarer styles (whisper, poem, tongue-twister, yawn) are subtler. For production use the authors recommend mixing Dialogs with a larger Russian corpus.
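To make the imbalance concrete, the per-speaker hours quoted above can be turned into a total and percentage shares. This is simple arithmetic on the figures in this section, not additional data from the paper.

```python
# Hours of speech per speaker, as quoted in the Limitations paragraph.
HOURS = {"Masha": 9.9, "Dima": 6.2, "Sveta": 4.4}

total = sum(HOURS.values())  # about 20.5 hours overall
shares = {name: round(100 * h / total, 1) for name, h in HOURS.items()}
# shares -> {"Masha": 48.3, "Dima": 30.2, "Sveta": 21.5}
```

Masha accounts for nearly half of the data, more than twice Sveta's share.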

Citation

@inproceedings{shigabeev2025dialogs,
  title     = {Dialogs: a studio-quality expressive conversational Russian speech corpus for dialog assistants},
  author    = {Shigabeev, Ilya and Latyshev, Ilya},
  year      = {2025},
  note      = {Langswap}
}

License

OpenRAIL — free use including commercial applications, subject to the use-based restrictions of the license.
