---
license: cc-by-4.0
---

# Fastspeech2 Model using Hybrid Segmentation (HS)

This repository contains a Fastspeech2 model for 16 Indian languages (both male and female voices), implemented using Hybrid Segmentation (HS) for speech synthesis. The model generates mel-spectrograms from text inputs, which can then be used to synthesize speech.

This repository is large: we use [Git LFS](https://git-lfs.com/) because of GitHub's file-size constraints. Please install the latest Git LFS from that link; the commands we used are provided below:

```shell
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
```

Language model files are uploaded using Git LFS, so please run:

```shell
git lfs fetch --all
git lfs pull
```

to get the original files in your directory.

## Model Files

The model for each language includes the following files:

- `config.yaml`: Configuration file for the Fastspeech2 model.
- `energy_stats.npz`: Energy statistics for normalization during synthesis.
- `feats_stats.npz`: Feature statistics for normalization during synthesis.
- `feats_type`: Feature type information.
- `pitch_stats.npz`: Pitch statistics for normalization during synthesis.
- `model.pth`: Pre-trained Fastspeech2 model weights.

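The `*_stats.npz` files can be inspected with NumPy. Below is a minimal sketch, assuming each file stores normalization statistics as named arrays; the `mean`/`std` keys and the values are hypothetical, written to a dummy file purely for illustration:

```python
import numpy as np

# Illustration only: create a dummy stats file with the layout we assume
# (mean and standard deviation arrays), then read it back.
np.savez("pitch_stats_demo.npz", mean=np.array([120.0]), std=np.array([25.0]))

stats = np.load("pitch_stats_demo.npz")
mean, std = stats["mean"], stats["std"]
print(f"pitch mean={mean[0]:.1f}, std={std[0]:.1f}")

# Normalization at synthesis time typically follows (value - mean) / std.
raw_pitch = 145.0
normalized = (raw_pitch - mean[0]) / std[0]
print(f"normalized pitch: {normalized:.2f}")  # 1.00 for these dummy values
```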
## Installation

1. Install [Miniconda](https://docs.conda.io/projects/miniconda/en/latest/) first, then create a conda environment using the provided `environment.yml` file:

   ```shell
   conda env create -f environment.yml
   ```

2. Activate the conda environment (the environment name is set inside `environment.yml`):

   ```shell
   conda activate tts-hs-hifigan
   ```

3. Install PyTorch separately (you can install the specific version based on your requirements):

   ```shell
   conda install pytorch cudatoolkit
   pip install torchaudio
   pip install numpy==1.23.0
   ```

## Vocoder

To generate WAV files from mel-spectrograms, you can use a vocoder of your choice. One popular option is the [HiFi-GAN](https://github.com/jik876/hifi-gan) vocoder (clone that repository into the current working directory). Please refer to the documentation of the vocoder you choose for installation and usage instructions.

(**We used the HiFi-GAN vocoder and provide vocoders fine-tuned on Aryan and Dravidian languages.**)

## Usage

The directory paths are relative. If needed, update the folder/file paths in **text_preprocess_for_inference.py** and **inference.py** wherever required.

**Please give the language and gender in lowercase, and put the sample text between quotes. Adjust the output speed using the alpha parameter (higher values give slower output, and vice versa). The output argument is optional; the provided name will be used for the output file.**

Use the inference script to synthesize speech from text inputs:

```shell
python inference.py --sample_text "Your input text here" --language <language> --gender <gender> --alpha <alpha> --output_file <file_name.wav OR path/to/file_name.wav>
```

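To synthesize many utterances, the command line above can be scripted. Below is a minimal sketch in which the `build_command` helper is hypothetical (not part of this repository); it constructs the argument list and mirrors the default output-file naming, and each resulting list can be passed to `subprocess.run()`:

```python
import shlex

# Hypothetical helper: builds the inference.py command line for one
# language/gender pair; falls back to the default
# <language>_<gender>_output.wav name when no output file is given.
def build_command(text, language, gender, alpha=1, output_file=None):
    if output_file is None:
        output_file = f"{language}_{gender}_output.wav"
    return [
        "python", "inference.py",
        "--sample_text", text,
        "--language", language,
        "--gender", gender,
        "--alpha", str(alpha),
        "--output_file", output_file,
    ]

cmd = build_command("Your input text here", "hindi", "male")
print(shlex.join(cmd))  # the full command, ready for subprocess.run(cmd)
```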
**Example:**

```shell
python inference.py --sample_text "श्रीलंका और पाकिस्तान में खेला जा रहा एशिया कप अब तक का सबसे विवादित टूर्नामेंट होता जा रहा है।" --language hindi --gender male --alpha 1 --output_file male_hindi_output.wav
```

The file will be stored as `male_hindi_output.wav` in the current working directory. If the **--output_file** argument is not given, the output will be stored as `<language>_<gender>_output.wav` in the current working directory.

### Citation

If you use this Fastspeech2 model in your research or work, please consider citing:

> COPYRIGHT 2023, Speech Technology Consortium, Bhashini, MeitY, and by Hema A Murthy & S Umesh, DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING and ELECTRICAL ENGINEERING, IIT MADRAS. ALL RIGHTS RESERVED.

Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg