rasenganai committed · verified · Commit a60b21f · 1 parent: 8f40e60

Update README.md

<div align="center">

<a href="https://ibb.co/wN1LS7K"><img width="320" height="173" alt="Screenshot-2024-01-15-at-8-14-08-PM" src="https://github.com/user-attachments/assets/af22f00d-e9d6-49e1-98b1-7efeac900f9a" /></a>

<h1>MahaTTS v2: An Open-Source Large Speech Generation Model</h1>
a <a href="https://black.dubverse.ai">Dubverse Black</a> initiative <br> <br>

<!-- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qkZz2km-PX75P0f6mUb2y5e-uzub27NW?usp=sharing) -->
</div>

------
## Description
We introduce MahaTTS v2, a multi-speaker text-to-speech (TTS) system trained on 50k hours of Indic and global languages.
It follows a text-to-semantic-to-acoustic approach that leverages wav2vec2 tokens, which gives out-of-the-box generalization to unseen low-resource languages.
We previously open-sourced the first version (MahaTTS), which was trained on English and Indic languages as two separate models on 9k and 400 hours of open-source datasets.
In MahaTTS v2, we have collected over 20k hours of training data into a single multilingual, cross-lingual model.
We use Gemma as the backbone for text-to-semantic modeling and a conditional flow model for semantic-to-mel-spectrogram generation, with a BigVGAN vocoder producing the final audio waveform.
The model shows much greater robustness and quality than the previous version.
We are also open-sourcing the ability to fine-tune on your own voice.
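
The three-stage pipeline described above (text to semantic tokens, semantic tokens to mel spectrogram, mel to waveform) can be sketched structurally. All function names and shapes below are illustrative placeholders, not the actual MahaTTS v2 API:

```python
# Toy structural sketch of the MahaTTS v2 pipeline described above.
# Every function here is a hypothetical stand-in, not the real package API.

def text_to_semantic(text: str) -> list[int]:
    """M1 (Gemma-backbone causal LM): text -> semantic token ids (10,001-token space)."""
    # Placeholder: map each character into the 10,001-id token space.
    return [ord(c) % 10001 for c in text]

def semantic_to_mel(tokens: list[int]) -> list[list[float]]:
    """M2 (conditional flow model): semantic tokens -> 100-bin mel spectrogram frames."""
    return [[t / 10001.0] * 100 for t in tokens]

def vocode(mel: list[list[float]]) -> list[float]:
    """BigVGAN vocoder: mel spectrogram -> audio waveform (flattened in this toy)."""
    return [v for frame in mel for v in frame]

def synthesize(text: str) -> list[float]:
    # The full chain: text -> semantic tokens -> mel frames -> waveform.
    return vocode(semantic_to_mel(text_to_semantic(text)))

audio = synthesize("hello")
print(len(audio))  # 5 tokens x 100 mel bins -> 500 values in this toy sketch
```

The point of the sketch is the staged interface: because M1 emits discrete wav2vec2-style tokens, each stage can be trained and swapped independently.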

### With this release:
- generate voices in multiple seen and unseen speaker identities (voice cloning)
- generate voices in multiple languages (multilingual and cross-lingual voice cloning)
- copy the style of speech from one speaker to another (cross-lingual voice cloning with prosody and intonation transfer)
- train your own large-scale pretraining or fine-tuning models

### MahaTTS Architecture

<img width="1023" height="859" alt="MahaTTS v2 architecture diagram" src="https://github.com/user-attachments/assets/4d44cc35-4b66-41a1-b4fd-415af35eda87" />

<!-- ## Installation -->
<!--
```bash
pip install git+https://github.com/dubverse-ai/MahaTTSv2.git
``` -->

### Model Params
| Model                    | Parameters | Model Type | Output         |
|:------------------------:|:----------:|------------|:--------------:|
| Text to Semantic (M1)    | 510 M      | Causal LM  | 10,001 Tokens  |
| Semantic to MelSpec (M2) | 71 M       | Flow       | 100x Melspec   |
| BigVGAN Vocoder          | 112 M      | GAN        | Audio Waveform |
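
As a quick sanity check on the table above, the three stages together come to 693 M parameters:

```python
# Parameter counts (in millions) taken from the Model Params table above.
params_m = {
    "Text to Semantic (M1)": 510,
    "Semantic to MelSpec (M2)": 71,
    "BigVGAN Vocoder": 112,
}

total = sum(params_m.values())
print(f"{total} M")  # 693 M -- the full pipeline stays under 0.7B parameters
```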

## 🌐 Supported Languages

The following languages are currently supported:

| Language         | Status |
|------------------|:------:|
| Assamese (in)    |   ✅   |
| Bengali (in)     |   ✅   |
| Bhojpuri (in)    |   ✅   |
| Bodo (in)        |   ✅   |
| Dogri (in)       |   ✅   |
| English (en)     |   ✅   |
| French (fr)      |   ✅   |
| German (de)      |   ✅   |
| Gujarati (in)    |   ✅   |
| Hindi (in)       |   ✅   |
| Italian (it)     |   ✅   |
| Kannada (in)     |   ✅   |
| Malayalam (in)   |   ✅   |
| Marathi (in)     |   ✅   |
| Odia (in)        |   ✅   |
| Punjabi (in)     |   ✅   |
| Rajasthani (in)  |   ✅   |
| Sanskrit (in)    |   ✅   |
| Spanish (es)     |   ✅   |
| Tamil (in)       |   ✅   |
| Telugu (in)      |   ✅   |

## TODO:
1. Add training instructions.
2. Add a Colab notebook for the same.

## License
MahaTTS is licensed under the Apache 2.0 License.

## 🙏 Appreciation

- [Tortoise-tts](https://github.com/neonbjb/tortoise-tts) for inspiring the architecture
- [Seamless M4T](https://github.com/facebookresearch/seamless_communication), [AudioLM](https://arxiv.org/abs/2209.03143), and many other ground-breaking works that enabled the development of MahaTTS
- [BigVGAN](https://github.com/NVIDIA/BigVGAN) for the out-of-the-box vocoder
- [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) for the flow-model training recipe
- [Huggingface](https://huggingface.co/docs/transformers/index) for related training and inference code