File size: 5,531 Bytes
a60b21f
 
 
 
 
 
 
 
 
 
 
 
483b6a3
a60b21f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2b115c1
 
 
 
 
 
 
 
a60b21f
2b115c1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a60b21f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
<div align="center">

<a href="https://ibb.co/wN1LS7K"><img width="320" height="173" alt="Screenshot-2024-01-15-at-8-14-08-PM" src="https://github.com/user-attachments/assets/af22f00d-e9d6-49e1-98b1-7efeac900f9a" /></a>

<h1>MahaTTS v2: An Open-Source Large Speech Generation Model</h1>
a <a href = "https://black.dubverse.ai">Dubverse Black</a> initiative <br> <br>

<!-- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qkZz2km-PX75P0f6mUb2y5e-uzub27NW?usp=sharing) -->
</div>

------
## Description
We introduce [MahaTTS v2](https://arxiv.org/abs/2508.14049) (Paper), a multi-speaker text-to-speech (TTS) system that has been trained on 50k hours of Indic and global languages. 
We have followed a text-to-semantic-to-acoustic approach, leveraging wav2vec2 tokens, this gives out-the-box generalization to unseen low-resourced languages. 
We have open sourced the first version (MahaTTS), which was trained on English and Indic languages as two separate models on 9k and 400 hours of open source datasets. 
In MahaTTS v2, we have collected over 20k+ hours of training data into a single multilingual cross-lingual model. 
We have used gemma as the backbone for text-to-semantic modeling and a conditional flow model for semantics to mel spectogram generation, using a BigVGAN vocoder to generate the final audio waveform. 
The model has shown great robustness and quality results compared to the previous version.
We are also open sourcing the ability to finetune on your own voice.

### With this release:
- generate voices in multiple seen and unseen speaker identities (voice cloning)
- generate voices in multiple langauges (multilingual and cross-lingual voice cloning)
- copy the style of speech from one speaker to another (cross-lingual voice cloning with prosody and intonation transfer)
- Train your own large scale pretraining or finetuning Models.

### MahaTTS Architecture

<img width="1023" height="859" alt="Screenshot 2025-07-10 at 4 04 08β€―PM" src="https://github.com/user-attachments/assets/4d44cc35-4b66-41a1-b4fd-415af35eda87" />




## Installation

```bash
git lfs install
git clone --recurse-submodules https://huggingface.co/Dubverse/MahaTTSv2
pip install -r MahaTTSv2/requirements.txt
```

```bash
import sys
sys.path.append("MahaTTSv2/")
import os
import torch
import subprocess
from inference import infer, prepare_inputs, load_t2s_model, load_cfm, create_wav_header

device = "cuda"# if torch.cuda.is_available() else "cpu"
print("Using device", device)

# Model checkpoints
m1_checkpoint = "MahaTTSv2/pretrained_checkpoint/m1_gemma_benchmark_1_latest_weights.pt"
m2_checkpoint = "MahaTTSv2/pretrained_checkpoint/m2.pt"
vocoder_checkpoint = 'MahaTTSv2/pretrained_checkpoint/700_580k_multilingual_infer_ready/'

global FM, vocoder, m2, mu, std, m1

# Load models
FM, vocoder, m2, mu, std = load_cfm(m2_checkpoint, vocoder_checkpoint, device)
m1 = load_t2s_model(m1_checkpoint, device)


def generate_audio(text, language):

    ref_clips = [
        'speakers/female1/train_hindifemale_02794.wav',
        'speakers/female1/train_hindifemale_04167.wav',
        'speakers/female1/train_hindifemale_02795.wav'
        ]

    text_ids, code_ids, language_code, ref_mels_m1, ref_mels_m2 = prepare_inputs(
        text.lower(),
        ref_clips_m1=ref_clips,
        ref_clips_m2=ref_clips,
        language=language,
        device=device
    )

    audio_wav = infer(m1, m2, vocoder, FM, mu, std, text_ids, code_ids, language_code, ref_mels_m1, ref_mels_m2, device)
    return 24000,audio_wav

```



### Model Params
|      Model                | Parameters | Model Type |       Output      |  
|:-------------------------:|:----------:|------------|:-----------------:|
|   Text to Semantic (M1)   |    510 M   | Causal LM  |   10,001 Tokens   |
|  Semantic to MelSpec(M2)  |    71 M    |   FLOW     |   100x Melspec    |
|      BigVGAN Vocoder      |    112 M   |    GAN     |   Audio Waveform  |


## 🌐 Supported Languages

The following languages are currently supported:

| Language         | Status |
|------------------|:------:|
| Assamese (in)    | βœ…     |
| Bengali (in)     | βœ…     |
| Bhojpuri (in)    | βœ…     |
| Bodo (in)        | βœ…     |
| Dogri (in)       | βœ…     |
| Odia (in)        | βœ…     |
| English (en)     | βœ…     |
| French (fr)      | βœ…     |
| Gujarati (in)    | βœ…     |
| German (de)      | βœ…     |
| Hindi (in)       | βœ…     |
| Italian (it)     | βœ…     |
| Kannada (in)     | βœ…     |
| Malayalam (in)   | βœ…     |
| Marathi (in)     | βœ…     |
| Telugu (in)      | βœ…     |
| Punjabi (in)     | βœ…     |
| Rajasthani (in)  | βœ…     |
| Sanskrit (in)    | βœ…     |
| Spanish (es)     | βœ…     |
| Tamil (in)       | βœ…     |
| Telugu (in)      | βœ…     |


## TODO:
1. Addind Training Instructions.
2. Add a colab for the same.


## License
MahaTTS is licensed under the Apache 2.0 License. 

## πŸ™ Appreciation

- [Tortoise-tts](https://github.com/neonbjb/tortoise-tts) for inspiring the architecture
- [M4t Seamless](https://github.com/facebookresearch/seamless_communication) [AudioLM](https://arxiv.org/abs/2209.03143) and many other ground-breaking papers that enabled the development of MahaTTS
- [BIGVGAN](https://github.com/NVIDIA/BigVGAN) out of the box vocoder
- [Flow training](https://github.com/shivammehta25/Matcha-TTS) for training Flow model
- [Huggingface](https://huggingface.co/docs/transformers/index) for related training and inference code