<div align="center">

<a href="https://ibb.co/wN1LS7K"><img width="320" height="173" alt="Screenshot-2024-01-15-at-8-14-08-PM" src="https://github.com/user-attachments/assets/af22f00d-e9d6-49e1-98b1-7efeac900f9a" /></a>

<h1>MahaTTS v2: An Open-Source Large Speech Generation Model</h1>
a <a href="https://black.dubverse.ai">Dubverse Black</a> initiative <br> <br>

<!-- [](https://colab.research.google.com/drive/1qkZz2km-PX75P0f6mUb2y5e-uzub27NW?usp=sharing) -->

</div>

------
## Description

We introduce MahaTTS v2, a multi-speaker text-to-speech (TTS) system trained on 50k hours of Indic and global languages.
It follows a text-to-semantic-to-acoustic approach, leveraging wav2vec2 tokens, which gives out-of-the-box generalization to unseen low-resource languages.
We previously open-sourced the first version (MahaTTS), which was trained on English and Indic languages as two separate models, on 9k and 400 hours of open-source datasets respectively.
For MahaTTS v2, we have collected over 20k hours of training data and trained a single multilingual, cross-lingual model.
We use Gemma as the backbone for text-to-semantic modeling and a conditional flow model for semantic-to-mel-spectrogram generation, with a BigVGAN vocoder producing the final audio waveform.
The model shows significantly improved robustness and quality compared to the previous version.
We are also open-sourcing the ability to finetune on your own voice.

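The three-stage pipeline described above (text → semantic tokens → mel spectrogram → waveform) can be sketched as follows. These are toy stubs that only illustrate the data flow; the function names, token counts, and shapes are placeholders, not the actual MahaTTS v2 API.

```python
# Illustrative sketch of the MahaTTS v2 three-stage pipeline.
# Toy stubs showing the data flow only -- NOT the real MahaTTS API;
# all names, token ranges, and shapes here are placeholders.

def text_to_semantic(text: str) -> list[int]:
    """Stage 1 (M1): a causal LM (Gemma backbone) autoregressively
    predicts wav2vec2-style semantic tokens from text.
    Stub: map each character to a fake token id in [0, 10000]."""
    return [ord(c) % 10001 for c in text]

def semantic_to_mel(tokens: list[int], n_mels: int = 100) -> list[list[float]]:
    """Stage 2 (M2): a conditional flow model maps semantic tokens
    to mel-spectrogram frames. Stub: one fake frame per token."""
    return [[float(t) / 10001.0] * n_mels for t in tokens]

def vocode(mel: list[list[float]]) -> list[float]:
    """Stage 3: a BigVGAN vocoder upsamples mel frames to a waveform.
    Stub: pretend each frame expands to 256 audio samples."""
    return [frame[0] for frame in mel for _ in range(256)]

wav = vocode(semantic_to_mel(text_to_semantic("namaste")))
print(len(wav))  # 7 chars -> 7 fake frames -> 7 * 256 samples
```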
### With this release, you can:

- generate voices in multiple seen and unseen speaker identities (voice cloning)
- generate voices in multiple languages (multilingual and cross-lingual voice cloning)
- copy the style of speech from one speaker to another (cross-lingual voice cloning with prosody and intonation transfer)
- train your own large-scale pretraining or finetuning models

### MahaTTS Architecture

<img width="1023" height="859" alt="Screenshot 2025-07-10 at 4 04 08 PM" src="https://github.com/user-attachments/assets/4d44cc35-4b66-41a1-b4fd-415af35eda87" />

<!-- ## Installation -->
<!--
```bash
pip install git+https://github.com/dubverse-ai/MahaTTSv2.git
``` -->

### Model Params

| Model                     | Parameters | Model Type | Output            |
|:-------------------------:|:----------:|------------|:-----------------:|
| Text to Semantic (M1)     | 510 M      | Causal LM  | 10,001 tokens     |
| Semantic to MelSpec (M2)  | 71 M       | Flow       | 100x Melspec      |
| BigVGAN Vocoder           | 112 M      | GAN        | Audio waveform    |

## Supported Languages

The following languages are currently supported:

| Language         | Status |
|------------------|:------:|
| Assamese (in)    |   ✅   |
| Bengali (in)     |   ✅   |
| Bhojpuri (in)    |   ✅   |
| Bodo (in)        |   ✅   |
| Dogri (in)       |   ✅   |
| Odia (in)        |   ✅   |
| English (en)     |   ✅   |
| French (fr)      |   ✅   |
| Gujarati (in)    |   ✅   |
| German (de)      |   ✅   |
| Hindi (in)       |   ✅   |
| Italian (it)     |   ✅   |
| Kannada (in)     |   ✅   |
| Malayalam (in)   |   ✅   |
| Marathi (in)     |   ✅   |
| Telugu (in)      |   ✅   |
| Punjabi (in)     |   ✅   |
| Rajasthani (in)  |   ✅   |
| Sanskrit (in)    |   ✅   |
| Spanish (es)     |   ✅   |
| Tamil (in)       |   ✅   |

## TODO

1. Add training instructions.
2. Add a Colab notebook for the same.

## License

MahaTTS is licensed under the Apache 2.0 License.

## Appreciation

- [Tortoise-TTS](https://github.com/neonbjb/tortoise-tts) for inspiring the architecture
- [Seamless M4T](https://github.com/facebookresearch/seamless_communication), [AudioLM](https://arxiv.org/abs/2209.03143), and many other ground-breaking papers that enabled the development of MahaTTS
- [BigVGAN](https://github.com/NVIDIA/BigVGAN) for the out-of-the-box vocoder
- [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS) for the flow-model training code
- [Hugging Face](https://huggingface.co/docs/transformers/index) for related training and inference code