| # MFA-based extraction for FastSpeech | |
| ## Prepare | |
| Run every command from the root of the repository, i.e. `TensorFlowTTS/`. | |
| 0. (Optional) Modify the MFA scripts to work with your language (see the pretrained models at https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html) | |
| 1. Download the pretrained MFA model and lexicon, then extract the TextGrids: | |
| - ``` | |
| bash examples/mfa_extraction/scripts/prepare_mfa.sh | |
| ``` | |
| - ``` | |
| python examples/mfa_extraction/run_mfa.py \ | |
| --corpus_directory ./libritts \ | |
| --output_directory ./mfa/parsed \ | |
| --jobs 8 | |
| ``` | |
| After this step, the TextGrids are located in `./mfa/parsed`. | |
| 2. Extract durations from the TextGrid files: | |
| - ``` | |
| python examples/mfa_extraction/txt_grid_parser.py \ | |
| --yaml_path examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml \ | |
| --dataset_path ./libritts \ | |
| --text_grid_path ./mfa/parsed \ | |
| --output_durations_path ./libritts/durations \ | |
| --sample_rate 24000 | |
| ``` | |
| - Dataset structure after finishing this step: | |
| ``` | |
| |- TensorFlowTTS/ | |
| | |- LibriTTS/ | |
| | |- |- train-clean-100/ | |
| | |- |- SPEAKERS.txt | |
| | |- |- ... | |
| | |- dataset/ | |
| | |- |- 200/ | |
| | |- |- |- 200_124139_000001_000000.txt | |
| | |- |- |- 200_124139_000001_000000.wav | |
| | |- |- |- ... | |
| | |- |- 250/ | |
| | |- |- ... | |
| | |- |- durations/ | |
| | |- |- train.txt | |
| | |- tensorflow_tts/ | |
| | |- models/ | |
| | |- ... | |
| ``` | |
| 3. (Optional) Add your own dataset parser based on `tensorflow_tts/processor/experiment/example_dataset.py` (if the base processor does not match your dataset). | |
| 4. Run preprocessing and normalization (steps 4 and 5 in `examples/fastspeech2_libritts/README.MD`). | |
| 5. Run `fix_mismatch.py` to fix differences of a few frames between the audio and duration files: | |
| - ``` | |
| python examples/mfa_extraction/fix_mismatch.py \ | |
| --base_path ./dump \ | |
| --trimmed_dur_path ./dataset/trimmed-durations \ | |
| --dur_path ./dataset/durations | |
| ``` | |
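
Steps 2 and 5 can be pictured with the minimal sketch below. It is not the code in `txt_grid_parser.py` or `fix_mismatch.py`. The function names are hypothetical, and `hop_size=300` is an assumed value that should be taken from your own YAML config. The idea is that each phoneme interval from a TextGrid is converted to a number of mel frames, and the durations are then adjusted so that their sum equals the number of frames in the extracted mel spectrogram.

```python
def intervals_to_frames(intervals, sample_rate=24000, hop_size=300):
    """Convert (start_sec, end_sec) phoneme intervals to per-phoneme frame counts.

    hop_size=300 is an assumption for illustration; use the value from your config.
    """
    frames_per_second = sample_rate / hop_size
    durations = []
    for start, end in intervals:
        # Round the boundaries rather than the lengths so that
        # rounding errors do not accumulate over the utterance.
        start_frame = int(round(start * frames_per_second))
        end_frame = int(round(end * frames_per_second))
        durations.append(end_frame - start_frame)
    return durations


def fix_total(durations, n_mel_frames):
    """Return a copy of durations whose sum equals n_mel_frames."""
    if not durations:
        raise ValueError("durations must not be empty")
    fixed = list(durations)
    diff = n_mel_frames - sum(fixed)
    if diff >= 0:
        # Too few frames: extend the last phoneme.
        fixed[-1] += diff
        return fixed
    # Too many frames: remove the excess starting from the last phoneme.
    excess = -diff
    index = len(fixed) - 1
    while excess > 0 and index >= 0:
        cut = min(fixed[index], excess)
        fixed[index] -= cut
        excess -= cut
        index -= 1
    return fixed


if __name__ == "__main__":
    durations = intervals_to_frames([(0.0, 0.1), (0.1, 0.25), (0.25, 0.5)])
    print(durations)                 # per-phoneme frame counts
    print(fix_total(durations, 42))  # padded to match a 42-frame mel
```

A mismatch of a few frames is expected because trimming and framing of the audio do not line up exactly with the aligner's time stamps.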
| ## Problems with MFA extraction | |
| MFA appears to have problems with trimmed files. In my experiments it works better with ~100 ms of silence at the start and end of each file. | |
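
One way to add that silence before running MFA is sketched below with the standard-library `wave` module. The helper name `pad_with_silence` is hypothetical and not part of this repository, and the sketch assumes signed PCM WAV files (16-bit or wider), where zero bytes represent silence.

```python
import wave


def pad_with_silence(src, dst, pad_ms=100):
    """Write a copy of src to dst with pad_ms of silence at both ends.

    Assumes signed PCM samples (16-bit or wider), for which zero bytes are silence.
    """
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())

    n_pad_frames = int(params.framerate * pad_ms / 1000)
    silence = b"\x00" * (n_pad_frames * params.nchannels * params.sampwidth)

    with wave.open(str(dst), "wb") as writer:
        writer.setparams(params)
        # writeframes updates the frame count in the header on close.
        writer.writeframes(silence + frames + silence)
```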
| Short files can produce many false positives, such as alignments that contain only silence (seen with LibriTTS), so I would keep only samples longer than 2 s. | |
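
A minimal sketch of such a length filter, using only the standard library, is shown below. The function names are hypothetical and the 2 s threshold is the rule of thumb from the text above, not a value enforced by the scripts in this folder.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def select_long_enough(wav_paths, min_seconds=2.0):
    """Keep only the files that are strictly longer than min_seconds."""
    return [path for path in wav_paths if wav_duration_seconds(path) > min_seconds]
```

For example, `select_long_enough(sorted(pathlib.Path("./libritts").rglob("*.wav")))` would return the files worth passing to MFA, assuming the corpus lives in `./libritts` as in the commands above.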