Dialogs-RU · Expressive Russian TTS (VITS2)
An expressive, multi-speaker, emotion-conditioned Russian text-to-speech model. It is a VITS2 model (multi-band iSTFT decoder) trained on the Dialogs corpus — a studio-quality, expressive, conversational Russian speech dataset.
This is the exact model described in the paper "Dialogs: a studio-quality expressive conversational Russian speech corpus for dialog assistants" (Ilya Shigabeev & Ilya Latyshev, Langswap), trained as a proof of concept that the corpus supports expressive, dialog-style TTS.
👉 Try it in your browser (free CPU demo): frappuccino/dialogs-ru-tts
Highlights
- 3 studio voices: Masha (F), Sveta (F), Dima (M)
- 13 emotional styles: neutral, happy, surprise, arrogance, yawn, fear, laughing, whisper, disgust, angry, sad, tongue-twister, poem
- 22.05 kHz output; synthesizes faster than real time on CPU (RTF ≈ 0.2–0.4)
- Plain Russian text in; stress is optional (auto-placed with `ruaccent`, or marked manually with `+`)
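
The real-time factor (RTF) quoted above is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. Below is a minimal sketch for measuring it on your own machine. It assumes a callable that takes text and returns `(sample_rate, samples)`; `measure_rtf` and `fake_synthesize` are illustrative names, not part of this repository.

```python
import time
from typing import Callable, List, Sequence, Tuple


def measure_rtf(
    synthesize: Callable[[str], Tuple[int, Sequence[float]]], text: str
) -> float:
    """Return wall-clock synthesis time divided by the generated audio duration."""
    start = time.perf_counter()
    sample_rate, samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesizer returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> Tuple[int, List[float]]:
    """Stand-in for a real model (assumption): one second of silence at 22.05 kHz."""
    return 22050, [0.0] * 22050


rtf = measure_rtf(fake_synthesize, "привет")
print(f"RTF = {rtf:.4f}")
```

With the real model, the stub could be swapped for something like `lambda t: tts.synthesize(t, speaker_id=0, emotion_id=0)[:2]`, since `synthesize` returns `(sr, audio, used)`.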
Conditioning
| speaker_id | voice | gender |
|---|---|---|
| 0 | Masha / Маша | female |
| 1 | Sveta / Света | female |
| 2 | Dima / Дима | male |
| emotion_id | style | emotion_id | style |
|---|---|---|---|
| 0 | neutral | 7 | whisper |
| 1 | happy | 8 | disgust |
| 2 | surprise | 9 | angry |
| 3 | arrogance | 10 | sad |
| 4 | yawn | 11 | tongue-twister |
| 5 | fear | 12 | poem |
| 6 | laughing | | |
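
For scripting, the two tables can be mirrored as lookup dictionaries so that callers pass names rather than raw integers. This is a small sketch: the IDs come from the tables above, while `SPEAKERS`, `EMOTIONS` and `resolve_ids` are hypothetical helper names that the repository does not necessarily provide.

```python
from typing import Tuple

SPEAKERS = {"masha": 0, "sveta": 1, "dima": 2}

# List position equals emotion_id, following the table above.
EMOTIONS = [
    "neutral", "happy", "surprise", "arrogance", "yawn", "fear", "laughing",
    "whisper", "disgust", "angry", "sad", "tongue-twister", "poem",
]
EMOTION_IDS = {name: idx for idx, name in enumerate(EMOTIONS)}


def resolve_ids(speaker: str, emotion: str) -> Tuple[int, int]:
    """Map human-readable names to a (speaker_id, emotion_id) pair."""
    try:
        return SPEAKERS[speaker.lower()], EMOTION_IDS[emotion.lower()]
    except KeyError as err:
        raise ValueError(f"unknown speaker or emotion: {err}") from None


print(resolve_ids("Dima", "whisper"))  # (2, 7)
```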
Usage
```bash
pip install -r requirements.txt
python inference.py
```

```python
from scipy.io.wavfile import write

from tts import DialogsTTS

tts = DialogsTTS()  # downloads weights on first run

sr, audio, used = tts.synthesize(
    "В 2024 году цена выросла на 7,5%.",  # numbers/dates/etc. are spelled out
    speaker_id=0,      # 0=Masha, 1=Sveta, 2=Dima
    emotion_id=1,      # 1=happy
    normalize=True,    # rutextnorm: numbers, dates, abbreviations
    auto_stress=True,  # ruaccent: place '+' stress automatically
)
write("out.wav", sr, audio)
```
Text front-end (normalization & stress)
DialogsTTS.synthesize runs a two-stage front-end before tokenization:
- Normalization (`normalize=True`, default): `rutextnorm` spells out numbers, dates, money, units, fractions and abbreviations (`7,5%` → «семь целых и пять десятых процента», `№5` → «номер пять»). It is vendored at `text/rutextnorm.py` (single file, MIT, zero dependencies).
- Stress (`auto_stress=True`, default): `ruaccent` places a `+` before each stressed vowel. The model was trained with these marks; you can also write them yourself to disambiguate homographs (e.g. `з+амок`, "castle", vs `зам+ок`, "lock").
The result is lowercased and mapped through the 73-symbol vocabulary (characters outside it are dropped).
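
The final lowercasing and filtering step can be sketched as follows. The symbol inventory here is an illustrative assumption (some punctuation, `+`, Latin and Cyrillic letters); it is not the model's actual 73-symbol table, and `text_to_ids` / `ids_to_text` are hypothetical names.

```python
import string
from typing import List

# Illustrative inventory only: NOT the model's real 73-symbol vocabulary.
PUNCTUATION = " !,-.:;?"
CYRILLIC = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
SYMBOLS = list(PUNCTUATION) + ["+"] + list(string.ascii_lowercase) + list(CYRILLIC)
SYMBOL_TO_ID = {symbol: idx for idx, symbol in enumerate(SYMBOLS)}


def text_to_ids(text: str) -> List[int]:
    """Lowercase the text, then silently drop characters outside the vocabulary."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


def ids_to_text(ids: List[int]) -> str:
    """Inverse mapping, useful for inspecting what the model actually receives."""
    return "".join(SYMBOLS[i] for i in ids)


# Digits and '№' are not in the vocabulary, so they vanish without normalization.
print(ids_to_text(text_to_ids("Зам+ок №5!")))  # зам+ок !
```

Because out-of-vocabulary characters are dropped rather than rejected, skipping normalization would most likely lose digits and symbols silently rather than raise an error.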
Model details
- Architecture: VITS2 with a multi-band iSTFT generator (4 subbands), mel posterior encoder, transformer flows, deterministic duration predictor (no SDP), and a duration discriminator.
- Conditioning: speaker embedding (192-d) concatenated with emotion embedding (64-d) → 256-d global condition.
- Sample rate: 22.05 kHz · vocabulary: 73 symbols (punctuation + Latin + Cyrillic, `+` = stress).
- Training: ~615k steps on a single RTX 4090, batch size 16, following VITS2 hyperparameter defaults.
- Checkpoint: `averaged_G_615000.pth` (weight-averaged generator).
- Evaluation (paper, proof-of-concept): UTMOS ≈ 3.36; MOS overall ≈ 2.83.
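
The conditioning scheme listed above (192-d speaker vector concatenated with a 64-d emotion vector) can be illustrated with a framework-free sketch. The tables below are randomly initialised stand-ins for learned embedding layers, and all names are hypothetical; only the dimensions and table sizes come from this card.

```python
import random
from typing import List

NUM_SPEAKERS, SPEAKER_DIM = 3, 192
NUM_EMOTIONS, EMOTION_DIM = 13, 64

rng = random.Random(0)


def make_table(rows: int, dim: int) -> List[List[float]]:
    """Random lookup table standing in for a learned embedding layer."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


speaker_table = make_table(NUM_SPEAKERS, SPEAKER_DIM)
emotion_table = make_table(NUM_EMOTIONS, EMOTION_DIM)


def global_condition(speaker_id: int, emotion_id: int) -> List[float]:
    """Concatenate the speaker and emotion vectors into one conditioning vector."""
    return speaker_table[speaker_id] + emotion_table[emotion_id]


g = global_condition(speaker_id=2, emotion_id=7)  # Dima, whisper
print(len(g))  # 256
```

Concatenation keeps the two factors in separate slices of the vector, so the same emotion vector is shared across all three voices.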
Limitations
Per-speaker data is modest and unbalanced (Masha 9.9 h, Dima 6.2 h, Sveta 4.4 h), so this single-corpus model is a proof of concept rather than a production system. The strongest styles are neutral, happy and sad; rarer styles (whisper, poem, tongue-twister, yawn) are rendered more subtly. For production use, the authors recommend mixing Dialogs with a larger Russian corpus.
Links
- 📚 Dataset: langswap/dialogs-ru-emotional-conversations
- 🕹️ Demo Space: frappuccino/dialogs-ru-tts
- 💻 Training code: github.com/shigabeev/vits2-emotional
- 🔤 Text normalizer: github.com/shigabeev/russian_tts_normalization (vendored at `text/rutextnorm.py`, MIT)
Citation
```bibtex
@inproceedings{shigabeev2025dialogs,
  title  = {Dialogs: a studio-quality expressive conversational Russian speech corpus for dialog assistants},
  author = {Shigabeev, Ilya and Latyshev, Ilya},
  year   = {2025},
  note   = {Langswap}
}
```
License
OpenRAIL — free use including commercial applications, subject to the use-based restrictions of the license.