haoxiangsnr committed (verified)
Commit 53cf2e4 · 1 Parent(s): 9c38bf7

Add files using upload-large-folder tool

This view is limited to the first 50 files because the commit contains too many changes.

Files changed (50)
  1. Amphion/egs/metrics/README.md +174 -0
  2. Amphion/egs/metrics/run.sh +132 -0
  3. Amphion/egs/svc/TransformerSVC/exp_config.json +108 -0
  4. Amphion/egs/svc/VitsSVC/README.md +125 -0
  5. Amphion/egs/tta/README.md +19 -0
  6. Amphion/egs/tta/audioldm/exp_config.json +90 -0
  7. Amphion/egs/tta/audioldm/run_train.sh +26 -0
  8. Amphion/egs/tta/audioldm/run_train_latent_4_10_78.sh +26 -0
  9. Amphion/egs/tta/autoencoderkl/run_train_latent_4_10_78.sh +26 -0
  10. Amphion/egs/tts/FastSpeech2/prepare_mfa.sh +29 -0
  11. Amphion/egs/tts/FastSpeech2/run.sh +155 -0
  12. Amphion/egs/tts/NaturalSpeech2/run_inference.sh +49 -0
  13. Amphion/egs/tts/VALLE/README.md +207 -0
  14. Amphion/egs/tts/VALLE/prompt_examples/5142_33396_000002_000004.wav +0 -0
  15. Amphion/egs/tts/VALLE/prompt_examples/7176_92135_000004_000000.normalized.txt +1 -0
  16. Amphion/egs/tts/VITS/exp_config.json +34 -0
  17. Amphion/egs/vocoder/gan/bigvgan_large/exp_config.json +70 -0
  18. Amphion/egs/vocoder/gan/hifigan/exp_config.json +59 -0
  19. Amphion/egs/vocoder/gan/hifigan/run.sh +141 -0
  20. Amphion/egs/vocoder/gan/nsfhifigan/run.sh +141 -0
  21. Amphion/models/__pycache__/__init__.cpython-310.pyc +0 -0
  22. Amphion/models/base/__init__.py +7 -0
  23. Amphion/models/base/new_dataset.py +50 -0
  24. Amphion/models/base/new_trainer.py +727 -0
  25. Amphion/models/codec/ns3_codec/__pycache__/facodec.cpython-310.pyc +0 -0
  26. Amphion/models/codec/ns3_codec/alias_free_torch/__pycache__/__init__.cpython-310.pyc +0 -0
  27. Amphion/models/codec/ns3_codec/alias_free_torch/__pycache__/resample.cpython-310.pyc +0 -0
  28. Amphion/models/codec/ns3_codec/quantize/__pycache__/rvq.cpython-310.pyc +0 -0
  29. Amphion/models/codec/ns3_codec/quantize/rvq.py +87 -0
  30. Amphion/models/svc/transformer/transformer.py +82 -0
  31. Amphion/models/tts/fastspeech2/fs2.py +548 -0
  32. Amphion/models/tts/fastspeech2/fs2_inference.py +193 -0
  33. Amphion/models/tts/naturalspeech2/__init__.py +0 -0
  34. Amphion/models/tts/naturalspeech2/wavenet.py +206 -0
  35. Amphion/models/tts/valle/valle.py +794 -0
  36. Amphion/models/tts/valle/valle_inference.py +237 -0
  37. Amphion/models/tts/valle/valle_trainer.py +367 -0
  38. Amphion/models/tts/vits/__init__.py +0 -0
  39. Amphion/models/tts/vits/vits.py +379 -0
  40. Amphion/models/tts/vits/vits_dataset.py +140 -0
  41. Amphion/models/tts/vits/vits_trainer.py +439 -0
  42. Amphion/models/vocoders/autoregressive/autoregressive_vocoder_trainer.py +0 -0
  43. Amphion/modules/activation_functions/gated_activation_unit.py +61 -0
  44. Amphion/modules/base/base_module.py +75 -0
  45. Amphion/modules/diffusion/__init__.py +7 -0
  46. Amphion/modules/duration_predictor/__init__.py +0 -0
  47. Amphion/modules/duration_predictor/standard_duration_predictor.py +53 -0
  48. Amphion/modules/duration_predictor/stochastic_duration_predictor.py +120 -0
  49. Amphion/modules/general/scaling.py +1349 -0
  50. Amphion/modules/norms/norm.py +173 -0
Amphion/egs/metrics/README.md ADDED
@@ -0,0 +1,174 @@
+ # Amphion Evaluation Recipe
+
+ ## Supported Evaluation Metrics
+
+ So far, Amphion Evaluation supports the following objective metrics:
+
+ - **F0 Modeling**:
+   - F0 Pearson Coefficients (FPC)
+   - F0 Periodicity Root Mean Square Error (PeriodicityRMSE)
+   - F0 Root Mean Square Error (F0RMSE)
+   - Voiced/Unvoiced F1 Score (V/UV F1)
+ - **Energy Modeling**:
+   - Energy Root Mean Square Error (EnergyRMSE)
+   - Energy Pearson Coefficients (EnergyPC)
+ - **Intelligibility**:
+   - Character Error Rate (CER) based on [Whisper](https://github.com/openai/whisper)
+   - Word Error Rate (WER) based on [Whisper](https://github.com/openai/whisper)
+ - **Spectrogram Distortion**:
+   - Frechet Audio Distance (FAD)
+   - Mel Cepstral Distortion (MCD)
+   - Multi-Resolution STFT Distance (MSTFT)
+   - Perceptual Evaluation of Speech Quality (PESQ)
+   - Short Time Objective Intelligibility (STOI)
+   - Scale Invariant Signal to Distortion Ratio (SISDR)
+   - Scale Invariant Signal to Noise Ratio (SISNR)
+ - **Speaker Similarity**:
+   - Cosine similarity based on:
+     - [RawNet3](https://github.com/Jungjee/RawNet)
+     - [Resemblyzer](https://github.com/resemble-ai/Resemblyzer)
+     - [WavLM](https://huggingface.co/microsoft/wavlm-base-plus-sv)
+
+ We provide a recipe to demonstrate how to evaluate your generated audios objectively. There are three steps in total:
+
+ 1. Pretrained Models Preparation
+ 2. Audio Data Preparation
+ 3. Evaluation
+
+ ## 1. Pretrained Models Preparation
+
+ If you want to calculate `RawNet3`-based speaker similarity, you need to download the pretrained model first, as illustrated [here](../../pretrained/README.md).
+
+ ## 2. Audio Data Preparation
+
+ Prepare the reference audios and the generated audios in two folders: `ref_dir` contains the reference audios and `gen_dir` contains the generated audios. Here is an example.
+
+ ```plaintext
+ ┣ {ref_dir}
+ ┃ ┣ sample1.wav
+ ┃ ┣ sample2.wav
+ ┣ {gen_dir}
+ ┃ ┣ sample1.wav
+ ┃ ┣ sample2.wav
+ ```
+
+ You have to make sure that each pairwise **reference audio and generated audio are named the same**, as illustrated above (sample1 to sample1, sample2 to sample2).
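+
+ A quick way to check that the two folders pair up (a sketch; it assumes flat folders of audio files):
+
+ ```bash
+ # Compare the sorted file listings of the two folders; any output means a mismatch
+ diff <(ls {ref_dir} | sort) <(ls {gen_dir} | sort) && echo "All pairs matched."
+ ```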
+
+ ## 3. Evaluation
+
+ Run `run.sh` with the specified reference folder, generated folder, dump folder, and metrics.
+
+ ```bash
+ cd Amphion
+ sh egs/metrics/run.sh \
+     --reference_folder [Your path to the reference audios] \
+     --generated_folder [Your path to the generated audios] \
+     --dump_folder [Your path to dump the objective results] \
+     --metrics [The metrics you need] \
+     --fs [Optional. To calculate all metrics in the specified sampling rate] \
+     --similarity_model [Optional. To choose the model for calculating the speaker similarity. Currently "rawnet", "wavlm" and "resemblyzer" are available. Default to "wavlm"] \
+     --similarity_mode [Optional. To choose the mode for calculating the speaker similarity. "pairwith" for calculating a series of ground truth / prediction audio pairs to obtain the speaker similarity, and "overall" for computing the average score over all possible pairs between the reference folder and the generated folder. Default to "pairwith"] \
+     --intelligibility_mode [Optional. To choose the mode for computing CER and WER. "gt_audio" means using the recognition result of the reference audio as the target, "gt_content" means using the transcription as the target. Default to "gt_audio"] \
+     --ltr_path [Optional. Path to the transcription file] \
+     --language [Optional. Language for computing CER and WER. Default to "english"]
+ ```
+
+ As for the metrics, an example is provided below:
+
+ ```bash
+ --metrics "mcd pesq fad"
+ ```
+
+ All currently available metric keywords are listed below:
+
+ | Keys | Description |
+ | ------------------------- | ------------------------------------------ |
+ | `fpc` | F0 Pearson Coefficients |
+ | `f0_periodicity_rmse` | F0 Periodicity Root Mean Square Error |
+ | `f0rmse` | F0 Root Mean Square Error |
+ | `v_uv_f1` | Voiced/Unvoiced F1 Score |
+ | `energy_rmse` | Energy Root Mean Square Error |
+ | `energy_pc` | Energy Pearson Coefficients |
+ | `cer` | Character Error Rate |
+ | `wer` | Word Error Rate |
+ | `similarity` | Speaker Similarity |
+ | `fad` | Frechet Audio Distance |
+ | `mcd` | Mel Cepstral Distortion |
+ | `mstft` | Multi-Resolution STFT Distance |
+ | `pesq` | Perceptual Evaluation of Speech Quality |
+ | `si_sdr` | Scale Invariant Signal to Distortion Ratio |
+ | `si_snr` | Scale Invariant Signal to Noise Ratio |
+ | `stoi` | Short Time Objective Intelligibility |
+
+ For example, if you want to calculate the speaker similarity between the synthesized audio and the reference audio with the same content, run:
+
+ ```bash
+ sh egs/metrics/run.sh \
+     --reference_folder [Your path to the reference audios] \
+     --generated_folder [Your path to the generated audios] \
+     --dump_folder [Your path to dump the objective results] \
+     --metrics "similarity" \
+     --similarity_model [Optional. To choose the model for calculating the speaker similarity. Currently "rawnet", "wavlm" and "resemblyzer" are available. Default to "wavlm"] \
+     --similarity_mode "pairwith"
+ ```
+
+ If you don't have reference audio with the same content, run the following to get the content-free similarity score:
+
+ ```bash
+ sh egs/metrics/run.sh \
+     --reference_folder [Your path to the reference audios] \
+     --generated_folder [Your path to the generated audios] \
+     --dump_folder [Your path to dump the objective results] \
+     --metrics "similarity" \
+     --similarity_model [Optional. To choose the model for calculating the speaker similarity. Currently "rawnet", "wavlm" and "resemblyzer" are available. Default to "wavlm"] \
+     --similarity_mode "overall"
+ ```
+
+ ## Troubleshooting
+
+ ### FAD (Using Offline Models)
+
+ If your system is unable to access huggingface.co from the terminal, you might run into an error like "OSError: Can't load tokenizer for ...". To work around this, follow these steps to use local models:
+
+ 1. Download the [bert-base-uncased](https://huggingface.co/bert-base-uncased), [roberta-base](https://huggingface.co/roberta-base), and [facebook/bart-base](https://huggingface.co/facebook/bart-base) models from `huggingface.co`. Ensure that the models are complete and uncorrupted. Place these directories within `Amphion/pretrained`. For a detailed file structure reference, see [this README](../../pretrained/README.md#optional-model-dependencies-for-evaluation) under `Amphion/pretrained`.
+ 2. Inside the `Amphion/pretrained` directory, create a bash script with the content outlined below. This script will automatically update the tokenizer paths used by your system:
+ ```bash
+ #!/bin/bash
+
+ BERT_DIR="bert-base-uncased"
+ ROBERTA_DIR="roberta-base"
+ BART_DIR="facebook/bart-base"
+ PYTHON_SCRIPT="[YOUR ENV PATH]/lib/python3.9/site-packages/laion_clap/training/data.py"
+
+ update_tokenizer_path() {
+     local dir_name=$1
+     local tokenizer_variable=$2
+     local full_path
+
+     if [ -d "$dir_name" ]; then
+         full_path=$(realpath "$dir_name")
+         if [ -f "$PYTHON_SCRIPT" ]; then
+             sed -i "s|${tokenizer_variable}.from_pretrained(\".*\")|${tokenizer_variable}.from_pretrained(\"$full_path\")|" "$PYTHON_SCRIPT"
+             echo "Updated ${tokenizer_variable} path to $full_path."
+         else
+             echo "Error: The specified Python script does not exist."
+             exit 1
+         fi
+     else
+         echo "Error: The directory $dir_name does not exist in the current directory."
+         exit 1
+     fi
+ }
+
+ update_tokenizer_path "$BERT_DIR" "BertTokenizer"
+ update_tokenizer_path "$ROBERTA_DIR" "RobertaTokenizer"
+ update_tokenizer_path "$BART_DIR" "BartTokenizer"
+
+ echo "BERT, BART and RoBERTa Python script paths have been updated."
+ ```
+
+ 3. The script provided is intended to adjust the tokenizer paths in the `data.py` file, found under `/lib/python3.9/site-packages/laion_clap/training/` within your environment. For those using conda, you can determine your environment path by running `conda info --envs`. Then, substitute `[YOUR ENV PATH]` in the script with this path. If your environment is configured differently, you'll need to update the `PYTHON_SCRIPT` variable to correctly point to the `data.py` file.
+ 4. Run the script. If it executes successfully, the tokenizer paths will be updated, allowing them to be loaded locally.
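+
+ If you are unsure of the exact `data.py` path, one way to locate it (a sketch, assuming `laion_clap` is importable in the active environment) is:
+
+ ```bash
+ python -c "import laion_clap, os; print(os.path.join(os.path.dirname(laion_clap.__file__), 'training', 'data.py'))"
+ ```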
+
+ ### WavLM-based Speaker Similarity (Using Offline Models)
+
+ If your system is unable to access huggingface.co from the terminal and you want to calculate `WavLM`-based speaker similarity, you need to download the pretrained model first, as illustrated [here](../../pretrained/README.md).
Amphion/egs/metrics/run.sh ADDED
@@ -0,0 +1,132 @@
+ # Copyright (c) 2023 Amphion.
+ #
+ # This source code is licensed under the MIT license found in the
+ # LICENSE file in the root directory of this source tree.
+
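+ # Usage (a sketch; see egs/metrics/README.md for the full option list):
+ #   sh egs/metrics/run.sh \
+ #     --reference_folder <ref_dir> --generated_folder <gen_dir> \
+ #     --dump_folder <dump_dir> --metrics "mcd pesq stoi"
+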
+ ######## Build Experiment Environment ###########
+ exp_dir=$(cd `dirname $0`; pwd)
+ work_dir=$(dirname $(dirname $exp_dir))
+
+ export WORK_DIR=$work_dir
+ export PYTHONPATH=$work_dir
+ export PYTHONIOENCODING=UTF-8
+
+ ######## Parse the Given Parameters from the Command ###########
+ options=$(getopt -o c:n:s --long gpu:,reference_folder:,generated_folder:,dump_folder:,metrics:,fs:,align_method:,energy_db_scale:,f0_subtract_mean:,similarity_model:,similarity_mode:,ltr_path:,intelligibility_mode:,language: -- "$@")
+ eval set -- "$options"
+
+ while true; do
+   case $1 in
+   # Visible GPU machines. The default value is "0".
+   --gpu) shift; gpu=$1 ; shift ;;
+   # Reference Audio Folder
+   --reference_folder) shift; ref_dir=$1 ; shift ;;
+   # Generated Audio Folder
+   --generated_folder) shift; deg_dir=$1 ; shift ;;
+   # Result Dumping Folder
+   --dump_folder) shift; dump_dir=$1 ; shift ;;
+   # Metrics to Compute
+   --metrics) shift; metrics=$1 ; shift ;;
+   # Sampling Rate
+   --fs) shift; fs=$1 ; shift ;;
+
+   # Method for aligning F0. The default value is "dtw".
+   --align_method) shift; align_method=$1 ; shift ;;
+   # Whether to subtract the mean when normalizing F0. The default value is "True".
+   --f0_subtract_mean) shift; f0_subtract_mean=$1 ; shift ;;
+   # Whether to use dB scale when normalizing energy. The default value is "True".
+   --energy_db_scale) shift; energy_db_scale=$1 ; shift ;;
+
+   # Model for computing speaker similarity. The default value is "wavlm".
+   --similarity_model) shift; similarity_model=$1 ; shift ;;
+   # Mode for computing speaker similarity. The default value is "pairwith".
+   --similarity_mode) shift; similarity_mode=$1 ; shift ;;
+
+   # Path for the transcript.
+   --ltr_path) shift; ltr_path=$1 ; shift ;;
+   # Mode for computing CER and WER. The default value is "gt_audio".
+   --intelligibility_mode) shift; intelligibility_mode=$1 ; shift ;;
+   # Language for computing CER and WER. The default value is "english".
+   --language) shift; language=$1 ; shift ;;
+
+   --) shift ; break ;;
+   *) echo "Invalid option: $1"; exit 1 ;;
+   esac
+ done
+
+ ### Value check ###
+ if [ -z "$ref_dir" ]; then
+   echo "[Error] Please specify the reference_folder"
+   exit 1
+ fi
+
+ if [ -z "$deg_dir" ]; then
+   echo "[Error] Please specify the generated_folder"
+   exit 1
+ fi
+
+ if [ -z "$dump_dir" ]; then
+   echo "[Error] Please specify the dump_folder"
+   exit 1
+ fi
+
+ if [ -z "$metrics" ]; then
+   echo "[Error] Please specify the metrics"
+   exit 1
+ fi
+
+ if [ -z "$gpu" ]; then
+   gpu="0"
+ fi
+
+ if [ -z "$fs" ]; then
+   fs="None"
+ fi
+
+ if [ -z "$align_method" ]; then
+   align_method="dtw"
+ fi
+
+ if [ -z "$energy_db_scale" ]; then
+   energy_db_scale="True"
+ fi
+
+ if [ -z "$f0_subtract_mean" ]; then
+   f0_subtract_mean="True"
+ fi
+
+ if [ -z "$similarity_model" ]; then
+   similarity_model="wavlm"
+ fi
+
+ if [ -z "$similarity_mode" ]; then
+   similarity_mode="pairwith"
+ fi
+
+ if [ -z "$ltr_path" ]; then
+   ltr_path="None"
+ fi
+
+ if [ -z "$intelligibility_mode" ]; then
+   intelligibility_mode="gt_audio"
+ fi
+
+ if [ -z "$language" ]; then
+   language="english"
+ fi
+
+ ######## Calculate Objective Metrics ###########
+ CUDA_VISIBLE_DEVICES=$gpu python "$work_dir"/bins/calc_metrics.py \
+   --ref_dir $ref_dir \
+   --deg_dir $deg_dir \
+   --dump_dir $dump_dir \
+   --metrics $metrics \
+   --fs $fs \
+   --align_method $align_method \
+   --db_scale $energy_db_scale \
+   --f0_subtract_mean $f0_subtract_mean \
+   --similarity_model $similarity_model \
+   --similarity_mode $similarity_mode \
+   --ltr_path $ltr_path \
+   --intelligibility_mode $intelligibility_mode \
+   --language $language
Amphion/egs/svc/TransformerSVC/exp_config.json ADDED
@@ -0,0 +1,108 @@
+ {
+     "base_config": "config/transformer.json",
+     "model_type": "TransformerSVC",
+     "dataset": [
+         "m4singer",
+         "opencpop",
+         "opensinger",
+         "svcc",
+         "vctk"
+     ],
+     "dataset_path": {
+         // TODO: Fill in your dataset path
+         "m4singer": "[M4Singer dataset path]",
+         "opencpop": "[Opencpop dataset path]",
+         "opensinger": "[OpenSinger dataset path]",
+         "svcc": "[SVCC dataset path]",
+         "vctk": "[VCTK dataset path]"
+     },
+     // TODO: Fill in the output log path. The default value is "Amphion/ckpts/svc"
+     "log_dir": "ckpts/svc",
+     "preprocess": {
+         // TODO: Fill in the output data path. The default value is "Amphion/data"
+         "processed_dir": "data",
+         // Config for feature extraction
+         "extract_mel": true,
+         "extract_pitch": true,
+         "extract_energy": true,
+         "extract_whisper_feature": true,
+         "extract_contentvec_feature": true,
+         "extract_wenet_feature": false,
+         "whisper_batch_size": 30, // decrease it if your GPU is out of memory
+         "contentvec_batch_size": 1,
+         // Fill in the content-based pretrained model's path
+         "contentvec_file": "pretrained/contentvec/checkpoint_best_legacy_500.pt",
+         "wenet_model_path": "pretrained/wenet/20220506_u2pp_conformer_exp/final.pt",
+         "wenet_config": "pretrained/wenet/20220506_u2pp_conformer_exp/train.yaml",
+         "whisper_model": "medium",
+         "whisper_model_path": "pretrained/whisper/medium.pt",
+         // Config for feature usage
+         "use_mel": true,
+         "use_min_max_norm_mel": true,
+         "use_frame_pitch": true,
+         "use_frame_energy": true,
+         "use_spkid": true,
+         "use_whisper": true,
+         "use_contentvec": true,
+         "use_wenet": false,
+         "n_mel": 100,
+         "sample_rate": 24000
+     },
+     "model": {
+         "condition_encoder": {
+             // Config for feature usage
+             "use_whisper": true,
+             "use_contentvec": true,
+             "use_wenet": false,
+             "whisper_dim": 1024,
+             "contentvec_dim": 256,
+             "wenet_dim": 512,
+             "use_singer_encoder": false,
+             "pitch_min": 50,
+             "pitch_max": 1100
+         },
+         "transformer": {
+             // 'conformer' or 'transformer'
+             "type": "conformer",
+             "input_dim": 384,
+             "output_dim": 100,
+             "n_heads": 2,
+             "n_layers": 6,
+             "filter_channels": 512,
+             "dropout": 0.1
+         }
+     },
+     "train": {
+         "batch_size": 64,
+         "gradient_accumulation_step": 1,
+         "max_epoch": -1, // -1 means no limit
+         "save_checkpoint_stride": [50, 50],
+         "keep_last": [5, -1],
+         "run_eval": [false, true],
+         "adamw": {
+             "lr": 4.0e-4
+         },
+         "reducelronplateau": {
+             "factor": 0.8,
+             "patience": 10,
+             "min_lr": 1.0e-4
+         },
+         "dataloader": {
+             "num_worker": 8,
+             "pin_memory": true
+         },
+         "sampler": {
+             "holistic_shuffle": false,
+             "drop_last": true
+         }
+     }
+ }
Amphion/egs/svc/VitsSVC/README.md ADDED
@@ -0,0 +1,125 @@
+ # VITS for Singing Voice Conversion
+
+ This is an implementation of VITS as an acoustic model for end-to-end singing voice conversion. Adapted from [so-vits-svc](https://github.com/svc-develop-team/so-vits-svc), the SoftVC content encoder is used to extract content features from the source audio. These feature vectors are directly fed into VITS without the need for conversion to a text-based intermediate representation.
+
+ There are four stages in total:
+
+ 1. Data preparation
+ 2. Features extraction
+ 3. Training
+ 4. Inference/conversion
+
+ > **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
+ > ```bash
+ > cd Amphion
+ > ```
+
+ ## 1. Data Preparation
+
+ ### Dataset Download
+
+ By default, we utilize five datasets for training: M4Singer, Opencpop, OpenSinger, SVCC, and VCTK. How to download them is detailed [here](../../datasets/README.md).
+
+ ### Configuration
+
+ Specify the dataset paths in `exp_config.json`. Note that you can change the `dataset` list to use your preferred datasets.
+
+ ```json
+     "dataset": [
+         "m4singer",
+         "opencpop",
+         "opensinger",
+         "svcc",
+         "vctk"
+     ],
+     "dataset_path": {
+         // TODO: Fill in your dataset path
+         "m4singer": "[M4Singer dataset path]",
+         "opencpop": "[Opencpop dataset path]",
+         "opensinger": "[OpenSinger dataset path]",
+         "svcc": "[SVCC dataset path]",
+         "vctk": "[VCTK dataset path]"
+     },
+ ```
+
+ ## 2. Features Extraction
+
+ ### Content-based Pretrained Models Download
+
+ By default, we utilize ContentVec and Whisper to extract content features. How to download them is detailed [here](../../../pretrained/README.md).
+
+ ### Configuration
+
+ Specify the dataset path and the output path for saving the processed data and the training model in `exp_config.json`:
+
+ ```json
+     // TODO: Fill in the output log path. The default value is "Amphion/ckpts/svc"
+     "log_dir": "ckpts/svc",
+     "preprocess": {
+         // TODO: Fill in the output data path. The default value is "Amphion/data"
+         "processed_dir": "data",
+         ...
+     },
+ ```
+
+ ### Run
+
+ Run the `run.sh` as the preprocess stage (set `--stage 1`):
+
+ ```bash
+ sh egs/svc/VitsSVC/run.sh --stage 1
+ ```
+
+ > **NOTE:** `CUDA_VISIBLE_DEVICES` is set as `"0"` by default. You can change it when running `run.sh` by specifying, e.g., `--gpu "1"`.
+
+ ## 3. Training
+
+ ### Configuration
+
+ We provide the default hyperparameters in `exp_config.json`. They work on a single NVIDIA 24GB GPU. You can adjust them based on your GPU machines.
+
+ ```json
+ "train": {
+     "batch_size": 32,
+     ...
+     "adamw": {
+         "lr": 2.0e-4
+     },
+     ...
+ }
+ ```
+
+ ### Run
+
+ Run the `run.sh` as the training stage (set `--stage 2`). Specify an experimental name to run the following command. The tensorboard logs and checkpoints will be saved in `Amphion/ckpts/svc/[YourExptName]`.
+
+ ```bash
+ sh egs/svc/VitsSVC/run.sh --stage 2 --name [YourExptName]
+ ```
+
+ > **NOTE:** `CUDA_VISIBLE_DEVICES` is set as `"0"` by default. You can change it when running `run.sh` by specifying, e.g., `--gpu "0,1,2,3"`.
+
+ ## 4. Inference/Conversion
+
+ ### Run
+
+ For inference/conversion, you need to specify the following configurations when running `run.sh`:
+
+ | Parameters | Description | Example |
+ | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+ | `--infer_expt_dir` | The experiment directory which contains `checkpoint` | `[Your path to save logs and checkpoints]/[YourExptName]` |
+ | `--infer_output_dir` | The output directory to save inferred audios. | `[Your path to save logs and checkpoints]/[YourExptName]/result` |
+ | `--infer_source_file` or `--infer_source_audio_dir` | The inference source (can be a JSON file or a directory). | The `infer_source_file` could be `[Your path to save processed data]/[YourDataset]/test.json`, and the `infer_source_audio_dir` is a folder which includes several audio files (*.wav, *.mp3 or *.flac). |
+ | `--infer_target_speaker` | The target speaker you want to convert into. You can refer to `[Your path to save logs and checkpoints]/[YourExptName]/singers.json` to choose a trained speaker. | For the Opencpop dataset, the speaker name would be `opencpop_female1`. |
+ | `--infer_key_shift` | How many semitones you want to transpose. | `"autoshift"` (by default), `3`, `-3`, etc. |
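+
+ To see which speakers a trained model supports, you can simply inspect that file (the path below assumes the default `log_dir`):
+
+ ```bash
+ cat ckpts/svc/[YourExptName]/singers.json
+ ```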
+
+ For example, if you want to make `opencpop_female1` sing the songs in `[Your Audios Folder]`, just run:
+
+ ```bash
+ sh egs/svc/VitsSVC/run.sh --stage 3 --gpu "0" \
+     --infer_expt_dir Amphion/ckpts/svc/[YourExptName] \
+     --infer_output_dir Amphion/ckpts/svc/[YourExptName]/result \
+     --infer_source_audio_dir [Your Audios Folder] \
+     --infer_target_speaker "opencpop_female1" \
+     --infer_key_shift "autoshift"
+ ```
Amphion/egs/tta/README.md ADDED
@@ -0,0 +1,19 @@
+ # Amphion Text-to-Audio (TTA) Recipe
+
+ ## Quick Start
+
+ We provide a **[beginner recipe](RECIPE.md)** to demonstrate how to train a cutting-edge TTA model. Specifically, it is designed as a latent diffusion model like [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830).
+
+ ## Supported Model Architectures
+
+ So far, Amphion has supported a latent-diffusion-based text-to-audio model:
+
+ <br>
+ <div align="center">
+ <img src="../../imgs/tta/DiffusionTTA.png" width="65%">
+ </div>
+ <br>
+
+ Similar to [AUDIT](https://arxiv.org/abs/2304.00830), we implement it with two-stage training:
+ 1. Training the VAE, which is called `AutoencoderKL` in Amphion.
+ 2. Training the conditional latent diffusion model, which is called `AudioLDM` in Amphion.
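+
+ For example, with the run scripts in this recipe (a sketch; the two scripts below ship with this commit, and the stage-2 config's `autoencoder_path` must point at the stage-1 checkpoint):
+
+ ```bash
+ cd Amphion
+ # Stage 1: train the VAE (AutoencoderKL)
+ sh egs/tta/autoencoderkl/run_train_latent_4_10_78.sh
+ # Stage 2: train the conditional latent diffusion model (AudioLDM)
+ sh egs/tta/audioldm/run_train_latent_4_10_78.sh
+ ```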
Amphion/egs/tta/audioldm/exp_config.json ADDED
@@ -0,0 +1,90 @@
+ {
+     "base_config": "egs/tta/audioldm/exp_config_base.json",
+     "dataset": [
+         "AudioCaps"
+     ],
+     "preprocess": {
+         // Specify the output root path to save the processed data
+         "processed_dir": "data",
+         // For example: "/home/TTADataset/processed_data"
+
+         // Features
+         "use_spkid": false,
+         "use_uv": false,
+         "use_frame_pitch": false,
+         "use_phone_pitch": false,
+         "use_frame_energy": false,
+         "use_phone_energy": false,
+         "use_mel": false,
+         "use_audio": false,
+         "use_label": false,
+         "use_one_hot": false,
+         // Features for text-to-audio
+         "use_caption": true,
+         "use_melspec": true,
+         "use_wav": false,
+         // Feature directories
+         "melspec_dir": "mel",
+         "wav_dir": "wav"
+     },
+     // Specify the output root path to save model ckpts and logs
+     "log_dir": "ckpts/tta",
+     // For example: "/home/TTADataset/processed_data/logs"
+
+     // Model
+     "model": {
+         "audioldm": {
+             "image_size": 32,
+             "in_channels": 4,
+             "out_channels": 4,
+             "model_channels": 256,
+             "attention_resolutions": [4, 2, 1],
+             "num_res_blocks": 2,
+             "channel_mult": [1, 2, 4],
+             "num_heads": 8,
+             "use_spatial_transformer": true,
+             "transformer_depth": 1,
+             "context_dim": 768,
+             "use_checkpoint": true,
+             "legacy": false
+         },
+         "autoencoderkl": {
+             "ch": 128,
+             "ch_mult": [1, 1, 2, 2, 4],
+             "num_res_blocks": 2,
+             "in_channels": 1,
+             "z_channels": 4,
+             "out_ch": 1,
+             "double_z": true
+         },
+         "noise_scheduler": {
+             "num_train_timesteps": 1000,
+             "beta_start": 0.00085,
+             "beta_end": 0.012,
+             "beta_schedule": "scaled_linear",
+             "clip_sample": false,
+             "steps_offset": 1,
+             "set_alpha_to_one": false,
+             "skip_prk_steps": true,
+             "prediction_type": "epsilon"
+         },
+         // Path to the trained stage-1 AutoencoderKL checkpoint
+         "autoencoder_path": "ckpts/tta/autoencoder_kl_debug/checkpoints/step-0445000_loss-0.3306.pt"
+     },
+
+     // Training
+     "train": {
+         "adam": {
+             "lr": 5.0e-5
+         },
+         "ddp": false,
+         "random_seed": 12345,
+         "batch_size": 12,
+         "epochs": 50000,
+         "max_steps": 1000000,
+         "total_training_steps": 800000,
+         "save_summary_steps": 1000,
+         "save_checkpoints_steps": 5000,
+         "valid_interval": 5000,
+         "keep_checkpoint_max": 100
+     }
+ }
Amphion/egs/tta/audioldm/run_train.sh ADDED
@@ -0,0 +1,26 @@
+ # Copyright (c) 2023 Amphion.
+ #
+ # This source code is licensed under the MIT license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ ######## Build Experiment Environment ###########
+ exp_dir=$(cd `dirname $0`; pwd)
+ work_dir=$(dirname $(dirname $(dirname $exp_dir)))
+
+ export WORK_DIR=$work_dir
+ export PYTHONPATH=$work_dir
+ export PYTHONIOENCODING=UTF-8
+
+ ######## Set Experiment Configuration ###########
+ exp_config="$exp_dir/exp_config.json"
+ exp_name="audioldm_debug_latent_size_4_5_39"
+
+ num_workers=8
+ export CUDA_VISIBLE_DEVICES="0"
+
+ ######## Train Model ###########
+ python "${work_dir}"/bins/tta/train_tta.py \
+     --config=$exp_config \
+     --num_workers=$num_workers \
+     --exp_name=$exp_name \
+     --stdout_interval=25
Amphion/egs/tta/audioldm/run_train_latent_4_10_78.sh ADDED
@@ -0,0 +1,26 @@
+ # Copyright (c) 2023 Amphion.
+ #
+ # This source code is licensed under the MIT license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ ######## Build Experiment Environment ###########
+ exp_dir=$(cd `dirname $0`; pwd)
+ work_dir=$(dirname $(dirname $(dirname $exp_dir)))
+
+ export WORK_DIR=$work_dir
+ export PYTHONPATH=$work_dir
+ export PYTHONIOENCODING=UTF-8
+
+ ######## Set Experiment Configuration ###########
+ exp_config="$exp_dir/exp_config_latent_4_10_78.json"
+ exp_name="audioldm_debug_latent_size_4_10_78"
+
+ num_workers=8
+ export CUDA_VISIBLE_DEVICES="0"
+
+ ######## Train Model ###########
+ python "${work_dir}"/bins/tta/train_tta.py \
+     --config=$exp_config \
+     --num_workers=$num_workers \
+     --exp_name=$exp_name \
+     --stdout_interval=25
Amphion/egs/tta/autoencoderkl/run_train_latent_4_10_78.sh ADDED
@@ -0,0 +1,26 @@
+ # Copyright (c) 2023 Amphion.
+ #
+ # This source code is licensed under the MIT license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ ######## Build Experiment Environment ###########
+ exp_dir=$(cd `dirname $0`; pwd)
+ work_dir=$(dirname $(dirname $(dirname $exp_dir)))
+
+ export WORK_DIR=$work_dir
+ export PYTHONPATH=$work_dir
+ export PYTHONIOENCODING=UTF-8
+
+ ######## Set Experiment Configuration ###########
+ exp_config="$exp_dir/exp_config_latent_4_10_78.json"
+ exp_name="autoencoder_kl_debug_latent_size_4_10_78"
+
+ num_workers=8
+ export CUDA_VISIBLE_DEVICES="0"
+
+ ######## Train Model ###########
+ python "${work_dir}"/bins/tta/train_tta.py \
+     --config=$exp_config \
+     --num_workers=$num_workers \
+     --exp_name=$exp_name \
+     --stdout_interval=25
Amphion/egs/tts/FastSpeech2/prepare_mfa.sh ADDED
@@ -0,0 +1,29 @@
+ #!/bin/bash
+ # Copyright (c) 2023 Amphion.
+ #
+ # This source code is licensed under the MIT license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ # Note: this script is run from the Amphion root (egs/tts/FastSpeech2/run.sh invokes it
+ # automatically in stage 1) and expects a `pretrained` directory in the working directory.
+
+ # Navigate to the 'pretrained' directory
+ cd pretrained || { echo "Failed to change directory to 'pretrained'"; exit 1; }
+
+ # Create and navigate to the 'mfa' directory
+ mkdir -p mfa && cd mfa || { echo "Failed to create or change directory to 'mfa'"; exit 1; }
+
+ # Define the MFA file URL and the file name
+ mfa_url="https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.1.0-beta.2/montreal-forced-aligner_linux.tar.gz"
+ mfa_file="montreal-forced-aligner_linux.tar.gz"
+
+ # Download MFA if it doesn't exist
+ if [ ! -f "$mfa_file" ]; then
+     wget "$mfa_url" || { echo "Failed to download MFA"; exit 1; }
+ fi
+
+ # Extract MFA
+ tar -zxvf "$mfa_file" || { echo "Failed to extract MFA"; exit 1; }
+
+ # Optionally, remove the tar.gz file after extraction
+ rm "$mfa_file"
+
+ echo "MFA setup completed successfully."
Amphion/egs/tts/FastSpeech2/run.sh ADDED
@@ -0,0 +1,155 @@
+ # Copyright (c) 2023 Amphion.
+ #
+ # This source code is licensed under the MIT license found in the
+ # LICENSE file in the root directory of this source tree.
+
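+ # Usage (a sketch):
+ #   sh egs/tts/FastSpeech2/run.sh --stage 1                          # preprocess (sets up MFA if needed)
+ #   sh egs/tts/FastSpeech2/run.sh --stage 2 --name [YourExptName]    # train
+ #   sh egs/tts/FastSpeech2/run.sh --stage 3 --infer_expt_dir [dir] \
+ #     --vocoder_dir [vocoder] --infer_mode "single" --infer_text "..."  # inference
+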
+ ######## Build Experiment Environment ###########
+ exp_dir=$(cd `dirname $0`; pwd)
+ work_dir=$(dirname $(dirname $(dirname $exp_dir)))
+
+ export WORK_DIR=$work_dir
+ export PYTHONPATH=$work_dir
+ export PYTHONIOENCODING=UTF-8
+
+ cd $work_dir/modules/monotonic_align
+ mkdir -p monotonic_align
+ python setup.py build_ext --inplace
+ cd $work_dir
+
+ mfa_dir=$work_dir/pretrained/mfa
+ echo $mfa_dir
+
+ ######## Parse the Given Parameters from the Command ###########
+ # options=$(getopt -o c:n:s --long gpu:,config:,infer_expt_dir:,infer_output_dir:,infer_source_file:,infer_source_audio_dir:,infer_target_speaker:,infer_key_shift:,infer_vocoder_dir:,name:,stage: -- "$@")
+ options=$(getopt -o c:n:s --long gpu:,config:,infer_expt_dir:,infer_output_dir:,infer_mode:,infer_dataset:,infer_testing_set:,infer_text:,name:,stage:,vocoder_dir: -- "$@")
+ eval set -- "$options"
+
+ while true; do
+   case $1 in
+   # Experimental Configuration File
+   -c | --config) shift; exp_config=$1 ; shift ;;
+   # Experimental Name
+   -n | --name) shift; exp_name=$1 ; shift ;;
+   # Running Stage
+   -s | --stage) shift; running_stage=$1 ; shift ;;
+   # Visible GPU machines. The default value is "0".
+   --gpu) shift; gpu=$1 ; shift ;;
+
+   # [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
+   --infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
+   # [Only for Inference] The output dir to save inferred audios. Its default value is "$infer_expt_dir/result"
+   --infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
+   # [Only for Inference] The inference mode. It can be "batch" to generate speech by batch, or "single" to generate a single clip of speech.
+   --infer_mode) shift; infer_mode=$1 ; shift ;;
+   # [Only for Inference] The inference dataset. It is only used when the inference mode is "batch".
+   --infer_dataset) shift; infer_dataset=$1 ; shift ;;
+   # [Only for Inference] The inference testing set. It is only used when the inference mode is "batch". It can be the "test" set split from the dataset, or the "golden_test" set carefully selected from the testing set.
+   --infer_testing_set) shift; infer_testing_set=$1 ; shift ;;
+   # [Only for Inference] The text to be synthesized. It is only used when the inference mode is "single".
+   --infer_text) shift; infer_text=$1 ; shift ;;
+   # [Only for Inference] The directory of the vocoder.
+   --vocoder_dir) shift; vocoder_dir=$1 ; shift ;;
+
+   --) shift ; break ;;
+   *) echo "Invalid option: $1"; exit 1 ;;
+   esac
+ done
+
+
+ ### Value check ###
+ if [ -z "$running_stage" ]; then
+   echo "[Error] Please specify the running stage"
+   exit 1
+ fi
+
+ if [ -z "$exp_config" ]; then
+   exp_config="${exp_dir}"/exp_config.json
+ fi
+ echo "Experimental Configuration File: $exp_config"
+
+ if [ -z "$gpu" ]; then
+   gpu="0"
+ fi
+
+ ######## Features Extraction ###########
+ if [ $running_stage -eq 1 ]; then
+   if [ ! -d "$mfa_dir/montreal-forced-aligner" ]; then
+     bash ${exp_dir}/prepare_mfa.sh
+   fi
+   CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/tts/preprocess.py \
+     --config=$exp_config \
+     --num_workers=4 \
+     --prepare_alignment=true
+ fi
+
+ ######## Training ###########
+ if [ $running_stage -eq 2 ]; then
+   if [ -z "$exp_name" ]; then
+     echo "[Error] Please specify the experiments name"
+     exit 1
+   fi
+   echo "Experimental Name: $exp_name"
+
+   CUDA_VISIBLE_DEVICES=$gpu accelerate launch "${work_dir}"/bins/tts/train.py \
+     --config $exp_config \
+     --exp_name $exp_name \
+     --log_level debug
+ fi
+
+ ######## Inference ###########
+ if [ $running_stage -eq 3 ]; then
+   if [ -z "$infer_expt_dir" ]; then
+     echo "[Error] Please specify the experiment directory. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
+     exit 1
+   fi
+
+   if [ -z "$infer_output_dir" ]; then
+     infer_output_dir="$infer_expt_dir/result"
+   fi
+
+   if [ -z "$vocoder_dir" ]; then
+     echo "[Error] Please specify the vocoder directory to reconstruct waveform from mel spectrogram."
+     exit 1
+   fi
+
+   if [ -z "$infer_mode" ]; then
+     echo "[Error] Please specify the inference mode, e.g., 'batch', 'single'"
+     exit 1
+   fi
+
+   if [ "$infer_mode" = "batch" ] && [ -z "$infer_dataset" ]; then
+     echo "[Error] Please specify the dataset used in inference when the inference mode is batch"
+     exit 1
+   fi
+
+   if [ "$infer_mode" = "batch" ] && [ -z "$infer_testing_set" ]; then
+     echo "[Error] Please specify the testing set used in inference when the inference mode is batch"
+     exit 1
+   fi
+
+   if [ "$infer_mode" = "single" ] && [ -z "$infer_text" ]; then
+     echo "[Error] Please specify the text to be synthesized when the inference mode is single"
+     exit 1
+   fi
+
+   if [ "$infer_mode" = "single" ]; then
+     echo "Text: ${infer_text}"
+     infer_dataset=None
+     infer_testing_set=None
+   elif [ "$infer_mode" = "batch" ]; then
+     infer_text=''
+   fi
+
+   CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/tts/inference.py \
+     --config $exp_config \
+     --acoustics_dir $infer_expt_dir \
+     --output_dir $infer_output_dir \
+     --mode $infer_mode \
+     --dataset $infer_dataset \
+     --testing_set $infer_testing_set \
+     --text "$infer_text" \
+     --log_level debug \
+     --vocoder_dir $vocoder_dir
+
+ fi
Amphion/egs/tts/NaturalSpeech2/run_inference.sh ADDED
@@ -0,0 +1,49 @@
+ # Copyright (c) 2023 Amphion.
+ #
+ # This source code is licensed under the MIT license found in the
+ # LICENSE file in the root directory of this source tree.
+
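+ # Usage (a sketch; the checkpoint and reference-audio paths below are defaults and may need editing):
+ #   sh egs/tts/NaturalSpeech2/run_inference.sh --text "Your text to synthesize."
+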
+ ######## Build Experiment Environment ###########
+ exp_dir=$(cd `dirname $0`; pwd)
+ work_dir=$(dirname $(dirname $(dirname $exp_dir)))
+
+ export WORK_DIR=$work_dir
+ export PYTHONPATH=$work_dir
+ export PYTHONIOENCODING=UTF-8
+
+ ######## Set Experiment Configuration ###########
+ exp_config="$exp_dir/exp_config.json"
+ exp_name="ns2_libritts"
+ ref_audio="$work_dir/egs/tts/NaturalSpeech2/prompt_example/ref_audio.wav"
+ checkpoint_path="$work_dir/ckpts/tts/naturalspeech2_libritts/checkpoint/epoch-0089_step-0512912_loss-6.367693"
+ output_dir="$work_dir/output"
+ mode="single"
+
+ export CUDA_VISIBLE_DEVICES="0"
+
+ ######## Parse Command Line Arguments ###########
+ while [[ $# -gt 0 ]]; do
+   key="$1"
+
+   case $key in
+     --text)
+       text="$2"
+       shift # past argument
+       shift # past value
+       ;;
+     *) # unknown option
+       shift # past argument
+       ;;
+   esac
+ done
+
+ ######## Run Inference ###########
+ python "${work_dir}"/bins/tts/inference.py \
+     --config=$exp_config \
+     --text="$text" \
+     --mode=$mode \
+     --checkpoint_path=$checkpoint_path \
+     --ref_audio=$ref_audio \
+     --output_dir=$output_dir
Amphion/egs/tts/VALLE/README.md ADDED
@@ -0,0 +1,207 @@
+ # VALL-E Recipe
+
+ In this recipe, we will show how to train [VALL-E](https://arxiv.org/abs/2301.02111) using Amphion's infrastructure. VALL-E is a zero-shot TTS architecture that uses a neural codec language model with discrete codes.
+
+ There are four stages in total:
+
+ 1. Data preparation
+ 2. Features extraction
+ 3. Training
+ 4. Inference
+
+ > **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
+ > ```bash
+ > cd Amphion
+ > ```
+
+ ## 1. Data Preparation
+
+ ### Dataset Download
+
+ You can use commonly used TTS datasets to train the VALL-E model, e.g., LibriTTS. We strongly recommend using LibriTTS to train VALL-E for the first time. How to download the dataset is detailed [here](../../datasets/README.md).
+
+ ### Configuration
+
+ After downloading the dataset, you can set the dataset paths in `exp_config.json`. Note that you can change the `dataset` list to use your preferred datasets.
+
+ ```json
+     "dataset": [
+         "libritts",
+     ],
+     "dataset_path": {
+         // TODO: Fill in your dataset path
+         "libritts": "[LibriTTS dataset path]",
+     },
+ ```
+
+ ## 2. Features Extraction
+
+ ### Configuration
+
+ Specify the `processed_dir` and the `log_dir` for saving the processed data and the checkpoints in `exp_config.json`:
+
+ ```json
+     // TODO: Fill in the output log path. The default value is "Amphion/ckpts/tts"
+     "log_dir": "ckpts/tts",
+     "preprocess": {
+         // TODO: Fill in the output data path. The default value is "Amphion/data"
+         "processed_dir": "data",
+         ...
+     },
+ ```
+
+ ### Run
+
+ Run the `run.sh` as the preprocess stage (set `--stage 1`):
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 1
+ ```
+
+ > **NOTE:** `CUDA_VISIBLE_DEVICES` is set as `"0"` by default. You can change it when running `run.sh` by specifying, e.g., `--gpu "1"`.
+
+ ## 3. Training
+
+ ### Configuration
+
+ We provide the default hyperparameters in the `exp_config.json`. They can work on a single NVIDIA 24GB GPU. You can adjust them based on your GPU machines.
+
+ ```json
+ "train": {
+     "batch_size": 4,
+ }
+ ```
+
+ ### Train From Scratch
+
+ Run the `run.sh` as the training stage (set `--stage 2`). Specify an experimental name to run the following command. The tensorboard logs and checkpoints will be saved in `Amphion/ckpts/tts/[YourExptName]`.
+
+ Specifically, VALL-E needs to train an autoregressive (AR) model and then a non-autoregressive (NAR) model. So, you can set `--model_train_stage 1` to train the AR model, and set `--model_train_stage 2` to train the NAR model, where `--ar_model_ckpt_dir` should be set as the checkpoint path to the trained AR model.
+
+ To train an AR model, just run:
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName]
+ ```
+
+ To train a NAR model, just run:
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 --ar_model_ckpt_dir [ARModelPath] --name [YourExptName]
+ ```
+ <!-- > **NOTE:** To train a NAR model, `--checkpoint_path` should be set as the checkpoint path to the trained AR model. -->
+
+ ### Train From Existing Source
+
+ We support training from existing sources for various purposes. You can resume training the model from a checkpoint or fine-tune a model from another checkpoint.
+
+ By setting `--resume true`, the training will resume from the **latest checkpoint** of the current `[YourExptName]` by default. For example, if you want to resume training from the latest checkpoint in `Amphion/ckpts/tts/[YourExptName]/checkpoint`:
+
+ To resume training the AR model, just run:
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName] \
+     --resume true
+ ```
+
+ To resume training the NAR model, just run:
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 --ar_model_ckpt_dir [ARModelPath] --name [YourExptName] \
+     --resume true
+ ```
+
+ You can also choose a **specific checkpoint** for retraining with the `--resume_from_ckpt_path` argument. For example, if you want to resume training from the checkpoint `Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]`:
+
+ To resume training the AR model, just run:
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName] \
+     --resume true \
+     --resume_from_ckpt_path "Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificARCheckpoint]"
+ ```
+
+ To resume training the NAR model, just run:
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 --ar_model_ckpt_dir [ARModelPath] --name [YourExptName] \
+     --resume true \
+     --resume_from_ckpt_path "Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificNARCheckpoint]"
+ ```
+
+ If you want to **fine-tune from another checkpoint**, just use `--resume_type` and set it to `"finetune"`. For example, if you want to fine-tune the model from the checkpoint `Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]`:
+
+ To fine-tune the AR model, just run:
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 1 --name [YourExptName] \
+     --resume true \
+     --resume_from_ckpt_path "Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificARCheckpoint]" \
+     --resume_type "finetune"
+ ```
+
+ To fine-tune the NAR model, just run:
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 --ar_model_ckpt_dir [ARModelPath] --name [YourExptName] \
+     --resume true \
+     --resume_from_ckpt_path "Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificNARCheckpoint]" \
+     --resume_type "finetune"
+ ```
+
+ > **NOTE:** The `--resume_type` is set as `"resume"` by default. It's not necessary to specify it when resuming training.
+ >
+ > The difference between `"resume"` and `"finetune"` is that `"finetune"` will **only** load the pretrained model weights from the checkpoint, while `"resume"` will load all the training states (including optimizer, scheduler, etc.) from the checkpoint.
+
+ > **NOTE:** `CUDA_VISIBLE_DEVICES` is set as `"0"` by default. You can change it when running `run.sh` by specifying, e.g., `--gpu "0,1,2,3"`.
+
+ ## 4. Inference
+
+ ### Configuration
+
+ For inference, you need to specify the following configurations when running `run.sh`:
+
+ | Parameters | Description | Example |
+ | ---------------------- | ---------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
+ | `--infer_expt_dir` | The experiment directory of the NAR model which contains `checkpoint`. | `Amphion/ckpts/tts/[YourExptName]` |
+ | `--infer_output_dir` | The output directory to save inferred audios. | `Amphion/ckpts/tts/[YourExptName]/result` |
+ | `--infer_mode` | The inference mode, e.g., "`single`", "`batch`". | "`single`" to generate a clip of speech, "`batch`" to generate a batch of speech at a time. |
+ | `--infer_text` | The text to be synthesized. | "`This is a clip of generated speech with the given text from a TTS model.`" |
+ | `--infer_text_prompt` | The text prompt for inference. | The text prompt should be aligned with the audio prompt. |
+ | `--infer_audio_prompt` | The audio prompt for inference. | The audio prompt should be aligned with the text prompt. |
+ | `--test_list_file` | The test list file used for batch inference. | The format of the test list file is `text\|text_prompt\|audio_prompt`. |
+
+ ### Run
+
+ For example, if you want to generate a single clip of speech, just run:
+
+ ```bash
+ sh egs/tts/VALLE/run.sh --stage 3 --gpu "0" \
+     --infer_expt_dir Amphion/ckpts/tts/[YourExptName] \
+     --infer_output_dir Amphion/ckpts/tts/[YourExptName]/result \
+     --infer_mode "single" \
+     --infer_text "This is a clip of generated speech with the given text from a TTS model." \
+     --infer_text_prompt "But even the unsuccessful dramatist has his moments." \
+     --infer_audio_prompt egs/tts/VALLE/prompt_examples/7176_92135_000004_000000.wav
+ ```
+
+ We have released pre-trained VALL-E models, so you can download a pre-trained model and then generate speech following the above inference instructions. Specifically,
+ 1. The pre-trained VALL-E trained on [LibriTTS](https://github.com/open-mmlab/Amphion/tree/main/egs/datasets#libritts) can be downloaded [here](https://huggingface.co/amphion/valle-libritts).
+ 2. The pre-trained VALL-E trained on part of [Libri-light](https://ai.meta.com/tools/libri-light/) (about 6k hours) can be downloaded [here](https://huggingface.co/amphion/valle_librilight_6k).
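+
+ One way to fetch a pre-trained checkpoint (a sketch, assuming `git-lfs` is installed; the target directory is just an example):
+
+ ```bash
+ git lfs install
+ git clone https://huggingface.co/amphion/valle-libritts ckpts/tts/valle_libritts
+ ```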
+
+ If you use VALL-E in your work, please cite:
+
+ ```bibtex
+ @article{wang2023neural,
+     title={Neural codec language models are zero-shot text to speech synthesizers},
+     author={Wang, Chengyi and Chen, Sanyuan and Wu, Yu and Zhang, Ziqiang and Zhou, Long and Liu, Shujie and Chen, Zhuo and Liu, Yanqing and Wang, Huaming and Li, Jinyu and others},
+     journal={arXiv preprint arXiv:2301.02111},
+     year={2023}
+ }
+ ```
Amphion/egs/tts/VALLE/prompt_examples/5142_33396_000002_000004.wav ADDED
Binary file (144 kB).
Amphion/egs/tts/VALLE/prompt_examples/7176_92135_000004_000000.normalized.txt ADDED
@@ -0,0 +1 @@
+ But even the unsuccessful dramatist has his moments.
Amphion/egs/tts/VITS/exp_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+     "base_config": "config/vits.json",
+     "model_type": "VITS",
+     "dataset": [
+         "LJSpeech",
+         //"hifitts"
+     ],
+     "dataset_path": {
+         // TODO: Fill in your dataset path
+         "LJSpeech": "[LJSpeech dataset path]",
+         //"hifitts": "[Hi-Fi TTS dataset path]"
+     },
+     // TODO: Fill in the output log path. The default value is "Amphion/ckpts/tts"
+     "log_dir": "ckpts/tts",
+     "preprocess": {
+         //"extract_audio": true,
+         "use_phone": true,
+         // linguistic features
+         "extract_phone": true,
+         "phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
+         // TODO: Fill in the output data path. The default value is "Amphion/data"
+         "processed_dir": "data",
+         "sample_rate": 22050, // target sampling rate
+         "valid_file": "valid.json", // validation set
+         //"use_spkid": true // use speaker ID to train a multi-speaker TTS model
+     },
+     "model": {
+         //"n_speakers": 10 // number of speakers, greater than or equal to the number of speakers in the dataset(s) used. The default value is 0 if not specified.
+     },
+     "train": {
+         "batch_size": 16,
+         //"multi_speaker_training": true
+     }
+ }
Amphion/egs/vocoder/gan/bigvgan_large/exp_config.json ADDED
@@ -0,0 +1,70 @@
+ {
+     "base_config": "egs/vocoder/gan/exp_config_base.json",
+     "preprocess": {
+         // Acoustic features
+         "extract_mel": true,
+         "extract_audio": true,
+
+         // Features used for model training
+         "use_mel": true,
+         "use_audio": true
+     },
+     "model": {
+         "generator": "bigvgan",
+         "bigvgan": {
+             "resblock": "1",
+             "activation": "snakebeta",
+             "snake_logscale": true,
+             "upsample_rates": [4, 4, 2, 2, 2, 2],
+             "upsample_kernel_sizes": [8, 8, 4, 4, 4, 4],
+             "upsample_initial_channel": 1536,
+             "resblock_kernel_sizes": [3, 7, 11],
+             "resblock_dilation_sizes": [
+                 [1, 3, 5],
+                 [1, 3, 5],
+                 [1, 3, 5]
+             ]
+         }
+     },
+     "train": {
+         "criterions": [
+             "feature",
+             "discriminator",
+             "generator",
+             "mel"
+         ]
+     },
+     "inference": {
+         "batch_size": 1
+     }
+ }
Amphion/egs/vocoder/gan/hifigan/exp_config.json ADDED
@@ -0,0 +1,59 @@
+ {
+     "base_config": "egs/vocoder/gan/exp_config_base.json",
+     "preprocess": {
+         // Acoustic features
+         "extract_mel": true,
+         "extract_audio": true,
+
+         // Features used for model training
+         "use_mel": true,
+         "use_audio": true
+     },
+     "model": {
+         "generator": "hifigan",
+         "hifigan": {
+             "resblock": "2",
+             "upsample_rates": [8, 8, 4],
+             "upsample_kernel_sizes": [16, 16, 8],
+             "upsample_initial_channel": 256,
+             "resblock_kernel_sizes": [3, 5, 7],
+             "resblock_dilation_sizes": [
+                 [1, 2],
+                 [2, 6],
+                 [3, 12]
+             ]
+         }
+     },
+     "train": {
+         "criterions": [
+             "feature",
+             "discriminator",
+             "generator",
+             "mel"
+         ]
+     },
+     "inference": {
+         "batch_size": 1
+     }
+ }
Amphion/egs/vocoder/gan/hifigan/run.sh ADDED
@@ -0,0 +1,141 @@
+ # Copyright (c) 2023 Amphion.
+ #
+ # This source code is licensed under the MIT license found in the
+ # LICENSE file in the root directory of this source tree.
+
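+ # Usage (a sketch):
+ #   sh egs/vocoder/gan/hifigan/run.sh --stage 1                         # feature extraction
+ #   sh egs/vocoder/gan/hifigan/run.sh --stage 2 --name [YourExptName]   # training
+ #   sh egs/vocoder/gan/hifigan/run.sh --stage 3 --infer_expt_dir [dir] \
+ #     --infer_mode "infer_from_dataset"                                 # inference
+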
6
+ ######## Build Experiment Environment ###########
7
+ exp_dir=$(cd `dirname $0`; pwd)
8
+ work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))
9
+
10
+ export WORK_DIR=$work_dir
11
+ export PYTHONPATH=$work_dir
12
+ export PYTHONIOENCODING=UTF-8
13
+
14
+ ######## Parse the Given Parameters from the Commond ###########
15
+ options=$(getopt -o c:n:s --long gpu:,config:,name:,stage:,checkpoint:,resume_type:,main_process_port:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
16
+ eval set -- "$options"
17
+
18
+ while true; do
19
+ case $1 in
20
+ # Experimental Configuration File
21
+ -c | --config) shift; exp_config=$1 ; shift ;;
22
+ # Experimental Name
23
+ -n | --name) shift; exp_name=$1 ; shift ;;
24
+ # Running Stage
25
+ -s | --stage) shift; running_stage=$1 ; shift ;;
26
+ # Visible GPU devices. The default value is "0".
27
+ --gpu) shift; gpu=$1 ; shift ;;
28
+
29
+ # [Only for Training] The specific checkpoint path that you want to resume from.
30
+ --checkpoint) shift; checkpoint=$1 ; shift ;;
31
+ # [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
32
+ --resume_type) shift; resume_type=$1 ; shift ;;
33
+ # [Only for Training] `main_process_port` for multi-GPU training
34
+ --main_process_port) shift; main_process_port=$1 ; shift ;;
35
+
36
+ # [Only for Inference] The inference mode
37
+ --infer_mode) shift; infer_mode=$1 ; shift ;;
38
+ # [Only for Inference] The datasets to run inference on
39
+ --infer_datasets) shift; infer_datasets=$1 ; shift ;;
40
+ # [Only for Inference] The feature dir for inference
41
+ --infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
42
+ # [Only for Inference] The audio dir for inference
43
+ --infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
44
+ # [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
45
+ --infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
46
+ # [Only for Inference] The output dir to save inferred audios. Its default value is "$expt_dir/result"
47
+ --infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
48
+
49
+ --) shift ; break ;;
50
+ *) echo "Invalid option: $1" exit 1 ;;
51
+ esac
52
+ done
53
+
54
+
55
+ ### Value check ###
56
+ if [ -z "$running_stage" ]; then
57
+ echo "[Error] Please specify the running stage"
58
+ exit 1
59
+ fi
60
+
61
+ if [ -z "$exp_config" ]; then
62
+ exp_config="${exp_dir}"/exp_config.json
63
+ fi
64
+ echo "Exprimental Configuration File: $exp_config"
65
+
66
+ if [ -z "$gpu" ]; then
67
+ gpu="0"
68
+ fi
69
+
70
+ if [ -z "$main_process_port" ]; then
71
+ main_process_port=29500
72
+ fi
73
+ echo "Main Process Port: $main_process_port"
74
+
75
+ ######## Features Extraction ###########
76
+ if [ $running_stage -eq 1 ]; then
77
+ CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
78
+ --config $exp_config \
79
+ --num_workers 8
80
+ fi
81
+
82
+ ######## Training ###########
83
+ if [ $running_stage -eq 2 ]; then
84
+ if [ -z "$exp_name" ]; then
85
+ echo "[Error] Please specify the experiments name"
86
+ exit 1
87
+ fi
88
+ echo "Exprimental Name: $exp_name"
89
+
90
+ CUDA_VISIBLE_DEVICES=$gpu accelerate launch \
91
+ --main_process_port "$main_process_port" \
92
+ "${work_dir}"/bins/vocoder/train.py \
93
+ --config "$exp_config" \
94
+ --exp_name "$exp_name" \
95
+ --log_level info \
96
+ --checkpoint "$checkpoint" \
97
+ --resume_type "$resume_type"
98
+ fi
99
+
100
+ ######## Inference/Conversion ###########
101
+ if [ $running_stage -eq 3 ]; then
102
+ if [ -z "$infer_expt_dir" ]; then
103
+ echo "[Error] Please specify the experimental directionary. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
104
+ exit 1
105
+ fi
106
+
107
+ if [ -z "$infer_output_dir" ]; then
108
+ infer_output_dir="$infer_expt_dir/result"
109
+ fi
110
+
111
+ if [ $infer_mode = "infer_from_dataset" ]; then
112
+ CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
113
+ --config $exp_config \
114
+ --infer_mode $infer_mode \
115
+ --infer_datasets $infer_datasets \
116
+ --vocoder_dir $infer_expt_dir \
117
+ --output_dir $infer_output_dir \
118
+ --log_level debug
119
+ fi
120
+
121
+ if [ $infer_mode = "infer_from_feature" ]; then
122
+ CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
123
+ --config $exp_config \
124
+ --infer_mode $infer_mode \
125
+ --feature_folder $infer_feature_dir \
126
+ --vocoder_dir $infer_expt_dir \
127
+ --output_dir $infer_output_dir \
128
+ --log_level debug
129
+ fi
130
+
131
+ if [ $infer_mode = "infer_from_audio" ]; then
132
+ CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
133
+ --config $exp_config \
134
+ --infer_mode $infer_mode \
135
+ --audio_folder $infer_audio_dir \
136
+ --vocoder_dir $infer_expt_dir \
137
+ --output_dir $infer_output_dir \
138
+ --log_level debug
139
+ fi
140
+
141
+ fi
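
The script above dispatches on `--stage`: 1 extracts features, 2 trains, and 3 runs inference in one of three modes. Below is a hedged Python wrapper for driving those stages; the experiment name and paths are placeholders, and the wrapper itself is not part of Amphion:

```python
# Illustrative driver for the three run.sh stages above; not shipped with
# Amphion. All names and paths are placeholders.
import subprocess

RUN_SH = "egs/vocoder/gan/hifigan/run.sh"  # path taken from the diff header above

def run_stage(stage, extra=None):
    cmd = ["bash", RUN_SH, "--stage", str(stage)] + (extra or [])
    subprocess.run(cmd, check=True)

run_stage(1)  # feature extraction
run_stage(2, ["--name", "my_hifigan_exp", "--gpu", "0"])  # training
run_stage(3, [  # inference from raw audio
    "--infer_mode", "infer_from_audio",
    "--infer_audio_dir", "/path/to/wavs",
    "--infer_expt_dir", "ckpts/vocoder/my_hifigan_exp",
])
```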
Amphion/egs/vocoder/gan/nsfhifigan/run.sh ADDED
@@ -0,0 +1,141 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ ######## Build Experiment Environment ###########
7
+ exp_dir=$(cd `dirname $0`; pwd)
8
+ work_dir=$(dirname $(dirname $(dirname $(dirname $exp_dir))))
9
+
10
+ export WORK_DIR=$work_dir
11
+ export PYTHONPATH=$work_dir
12
+ export PYTHONIOENCODING=UTF-8
13
+
14
+ ######## Parse the Given Parameters from the Command ###########
15
+ options=$(getopt -o c:n:s: --long gpu:,config:,name:,stage:,checkpoint:,resume_type:,main_process_port:,infer_mode:,infer_datasets:,infer_feature_dir:,infer_audio_dir:,infer_expt_dir:,infer_output_dir: -- "$@")
16
+ eval set -- "$options"
17
+
18
+ while true; do
19
+ case $1 in
20
+ # Experimental Configuration File
21
+ -c | --config) shift; exp_config=$1 ; shift ;;
22
+ # Experimental Name
23
+ -n | --name) shift; exp_name=$1 ; shift ;;
24
+ # Running Stage
25
+ -s | --stage) shift; running_stage=$1 ; shift ;;
26
+ # Visible GPU devices. The default value is "0".
27
+ --gpu) shift; gpu=$1 ; shift ;;
28
+
29
+ # [Only for Training] The specific checkpoint path that you want to resume from.
30
+ --checkpoint) shift; checkpoint=$1 ; shift ;;
31
+ # [Only for Training] `resume` for loading all the things (including model weights, optimizer, scheduler, and random states). `finetune` for loading only the model weights.
32
+ --resume_type) shift; resume_type=$1 ; shift ;;
33
+ # [Only for Training] `main_process_port` for multi-GPU training
34
+ --main_process_port) shift; main_process_port=$1 ; shift ;;
35
+
36
+ # [Only for Inference] The inference mode
37
+ --infer_mode) shift; infer_mode=$1 ; shift ;;
38
+ # [Only for Inference] The datasets to run inference on
39
+ --infer_datasets) shift; infer_datasets=$1 ; shift ;;
40
+ # [Only for Inference] The feature dir for inference
41
+ --infer_feature_dir) shift; infer_feature_dir=$1 ; shift ;;
42
+ # [Only for Inference] The audio dir for inference
43
+ --infer_audio_dir) shift; infer_audio_dir=$1 ; shift ;;
44
+ # [Only for Inference] The experiment dir. The value is like "[Your path to save logs and checkpoints]/[YourExptName]"
45
+ --infer_expt_dir) shift; infer_expt_dir=$1 ; shift ;;
46
+ # [Only for Inference] The output dir to save inferred audios. Its default value is "$expt_dir/result"
47
+ --infer_output_dir) shift; infer_output_dir=$1 ; shift ;;
48
+
49
+ --) shift ; break ;;
50
+ *) echo "Invalid option: $1" exit 1 ;;
51
+ esac
52
+ done
53
+
54
+
55
+ ### Value check ###
56
+ if [ -z "$running_stage" ]; then
57
+ echo "[Error] Please specify the running stage"
58
+ exit 1
59
+ fi
60
+
61
+ if [ -z "$exp_config" ]; then
62
+ exp_config="${exp_dir}"/exp_config.json
63
+ fi
64
+ echo "Exprimental Configuration File: $exp_config"
65
+
66
+ if [ -z "$gpu" ]; then
67
+ gpu="0"
68
+ fi
69
+
70
+ if [ -z "$main_process_port" ]; then
71
+ main_process_port=29500
72
+ fi
73
+ echo "Main Process Port: $main_process_port"
74
+
75
+ ######## Features Extraction ###########
76
+ if [ $running_stage -eq 1 ]; then
77
+ CUDA_VISIBLE_DEVICES=$gpu python "${work_dir}"/bins/vocoder/preprocess.py \
78
+ --config $exp_config \
79
+ --num_workers 8
80
+ fi
81
+
82
+ ######## Training ###########
83
+ if [ $running_stage -eq 2 ]; then
84
+ if [ -z "$exp_name" ]; then
85
+ echo "[Error] Please specify the experiments name"
86
+ exit 1
87
+ fi
88
+ echo "Exprimental Name: $exp_name"
89
+
90
+ CUDA_VISIBLE_DEVICES=$gpu accelerate launch \
91
+ --main_process_port "$main_process_port" \
92
+ "${work_dir}"/bins/vocoder/train.py \
93
+ --config "$exp_config" \
94
+ --exp_name "$exp_name" \
95
+ --log_level info \
96
+ --checkpoint "$checkpoint" \
97
+ --resume_type "$resume_type"
98
+ fi
99
+
100
+ ######## Inference/Conversion ###########
101
+ if [ $running_stage -eq 3 ]; then
102
+ if [ -z "$infer_expt_dir" ]; then
103
+ echo "[Error] Please specify the experimental directionary. The value is like [Your path to save logs and checkpoints]/[YourExptName]"
104
+ exit 1
105
+ fi
106
+
107
+ if [ -z "$infer_output_dir" ]; then
108
+ infer_output_dir="$infer_expt_dir/result"
109
+ fi
110
+
111
+ if [ $infer_mode = "infer_from_dataset" ]; then
112
+ CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
113
+ --config $exp_config \
114
+ --infer_mode $infer_mode \
115
+ --infer_datasets $infer_datasets \
116
+ --vocoder_dir $infer_expt_dir \
117
+ --output_dir $infer_output_dir \
118
+ --log_level debug
119
+ fi
120
+
121
+ if [ $infer_mode = "infer_from_feature" ]; then
122
+ CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
123
+ --config $exp_config \
124
+ --infer_mode $infer_mode \
125
+ --feature_folder $infer_feature_dir \
126
+ --vocoder_dir $infer_expt_dir \
127
+ --output_dir $infer_output_dir \
128
+ --log_level debug
129
+ fi
130
+
131
+ if [ $infer_mode = "infer_from_audio" ]; then
132
+ CUDA_VISIBLE_DEVICES=$gpu accelerate launch "$work_dir"/bins/vocoder/inference.py \
133
+ --config $exp_config \
134
+ --infer_mode $infer_mode \
135
+ --audio_folder $infer_audio_dir \
136
+ --vocoder_dir $infer_expt_dir \
137
+ --output_dir $infer_output_dir \
138
+ --log_level debug
139
+ fi
140
+
141
+ fi
Amphion/models/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (186 Bytes).
Amphion/models/base/__init__.py ADDED
@@ -0,0 +1,7 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ from .new_trainer import BaseTrainer
7
+ from .new_inference import BaseInference
Amphion/models/base/new_dataset.py ADDED
@@ -0,0 +1,50 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import json
7
+ import os
8
+ from abc import abstractmethod
9
+ from pathlib import Path
10
+
11
+ import json5
12
+ import torch
13
+ import yaml
14
+
15
+
16
+ # TODO: for training and validating
17
+ class BaseDataset(torch.utils.data.Dataset):
18
+ r"""Base dataset for training and validating."""
19
+
20
+ def __init__(self, args, cfg, is_valid=False):
21
+ pass
22
+
23
+
24
+ class BaseTestDataset(torch.utils.data.Dataset):
25
+ r"""Test dataset for inference."""
26
+
27
+ def __init__(self, args=None, cfg=None, infer_type="from_dataset"):
28
+ assert infer_type in ["from_dataset", "from_file"]
29
+
30
+ self.args = args
31
+ self.cfg = cfg
32
+ self.infer_type = infer_type
33
+
34
+ @abstractmethod
35
+ def __getitem__(self, index):
36
+ pass
37
+
38
+ def __len__(self):
39
+ return len(self.metadata)
40
+
41
+ def get_metadata(self):
42
+ path = Path(self.args.source)
43
+ if path.suffix == ".json" or path.suffix == ".jsonc":
44
+ metadata = json5.load(open(self.args.source, "r"))
45
+ elif path.suffix == ".yaml" or path.suffix == ".yml":
46
+ metadata = yaml.full_load(open(self.args.source, "r"))
47
+ else:
48
+ raise ValueError(f"Unsupported file type: {path.suffix}")
49
+
50
+ return metadata
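
`BaseTestDataset.__len__` assumes the subclass populates `self.metadata`, typically via `get_metadata()`, which loads `args.source` as JSON/JSON5 or YAML. A minimal sketch of a concrete subclass; the metadata schema (a top-level list of utterance entries) is an assumption:

```python
# Sketch of a concrete test dataset; assumes args.source points to a JSON or
# YAML file whose top level is a list of utterance entries.
from models.base.new_dataset import BaseTestDataset

class MyTestDataset(BaseTestDataset):
    def __init__(self, args=None, cfg=None, infer_type="from_file"):
        super().__init__(args, cfg, infer_type)
        self.metadata = self.get_metadata()  # required by __len__ in the base class

    def __getitem__(self, index):
        # Real code would load acoustic features here; we return the raw entry.
        return self.metadata[index]
```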
Amphion/models/base/new_trainer.py ADDED
@@ -0,0 +1,727 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import json
7
+ import os
8
+ import random
9
+ import shutil
10
+ import time
11
+ from abc import abstractmethod
12
+ from pathlib import Path
13
+
14
+ import accelerate
15
+ import json5
16
+ import numpy as np
17
+ import torch
18
+ from accelerate.logging import get_logger
19
+ from accelerate.utils import ProjectConfiguration
20
+ from torch.utils.data import ConcatDataset, DataLoader
21
+ from tqdm import tqdm
22
+
23
+ from models.base.base_sampler import build_samplers
24
+ from optimizer.optimizers import NoamLR
25
+
26
+
27
+ class BaseTrainer(object):
28
+ r"""The base trainer for all tasks. Any trainer should inherit from this class."""
29
+
30
+ def __init__(self, args=None, cfg=None):
31
+ super().__init__()
32
+
33
+ self.args = args
34
+ self.cfg = cfg
35
+
36
+ cfg.exp_name = args.exp_name
37
+
38
+ # init with accelerate
39
+ self._init_accelerator()
40
+ self.accelerator.wait_for_everyone()
41
+
42
+ # Use accelerate logger for distributed training
43
+ with self.accelerator.main_process_first():
44
+ self.logger = get_logger(args.exp_name, log_level=args.log_level)
45
+
46
+ # Log some info
47
+ self.logger.info("=" * 56)
48
+ self.logger.info("||\t\t" + "New training process started." + "\t\t||")
49
+ self.logger.info("=" * 56)
50
+ self.logger.info("\n")
51
+ self.logger.debug(f"Using {args.log_level.upper()} logging level.")
52
+ self.logger.info(f"Experiment name: {args.exp_name}")
53
+ self.logger.info(f"Experiment directory: {self.exp_dir}")
54
+ self.checkpoint_dir = os.path.join(self.exp_dir, "checkpoint")
55
+ if self.accelerator.is_main_process:
56
+ os.makedirs(self.checkpoint_dir, exist_ok=True)
57
+ self.logger.debug(f"Checkpoint directory: {self.checkpoint_dir}")
58
+
59
+ # init counts
60
+ self.batch_count: int = 0
61
+ self.step: int = 0
62
+ self.epoch: int = 0
63
+ self.max_epoch = (
64
+ self.cfg.train.max_epoch if self.cfg.train.max_epoch > 0 else float("inf")
65
+ )
66
+ self.logger.info(
67
+ "Max epoch: {}".format(
68
+ self.max_epoch if self.max_epoch < float("inf") else "Unlimited"
69
+ )
70
+ )
71
+
72
+ # Check values
73
+ if self.accelerator.is_main_process:
74
+ self.__check_basic_configs()
75
+ # Set runtime configs
76
+ self.save_checkpoint_stride = self.cfg.train.save_checkpoint_stride
77
+ self.checkpoints_path = [
78
+ [] for _ in range(len(self.save_checkpoint_stride))
79
+ ]
80
+ self.keep_last = [
81
+ i if i > 0 else float("inf") for i in self.cfg.train.keep_last
82
+ ]
83
+ self.run_eval = self.cfg.train.run_eval
84
+
85
+ # set random seed
86
+ with self.accelerator.main_process_first():
87
+ start = time.monotonic_ns()
88
+ self._set_random_seed(self.cfg.train.random_seed)
89
+ end = time.monotonic_ns()
90
+ self.logger.debug(
91
+ f"Setting random seed done in {(end - start) / 1e6:.2f}ms"
92
+ )
93
+ self.logger.debug(f"Random seed: {self.cfg.train.random_seed}")
94
+
95
+ # setup data_loader
96
+ with self.accelerator.main_process_first():
97
+ self.logger.info("Building dataset...")
98
+ start = time.monotonic_ns()
99
+ self.train_dataloader, self.valid_dataloader = self._build_dataloader()
100
+ end = time.monotonic_ns()
101
+ self.logger.info(f"Building dataset done in {(end - start) / 1e6:.2f}ms")
102
+
103
+ # setup model
104
+ with self.accelerator.main_process_first():
105
+ self.logger.info("Building model...")
106
+ start = time.monotonic_ns()
107
+ self.model = self._build_model()
108
+ end = time.monotonic_ns()
109
+ self.logger.debug(self.model)
110
+ self.logger.info(f"Building model done in {(end - start) / 1e6:.2f}ms")
111
+ self.logger.info(
112
+ f"Model parameters: {self.__count_parameters(self.model)/1e6:.2f}M"
113
+ )
114
+ # optimizer & scheduler
115
+ with self.accelerator.main_process_first():
116
+ self.logger.info("Building optimizer and scheduler...")
117
+ start = time.monotonic_ns()
118
+ self.optimizer = self._build_optimizer()
119
+ self.scheduler = self._build_scheduler()
120
+ end = time.monotonic_ns()
121
+ self.logger.info(
122
+ f"Building optimizer and scheduler done in {(end - start) / 1e6:.2f}ms"
123
+ )
124
+
125
+ # accelerate prepare
126
+ self.logger.info("Initializing accelerate...")
127
+ start = time.monotonic_ns()
128
+ self._accelerator_prepare()
129
+ end = time.monotonic_ns()
130
+ self.logger.info(f"Initializing accelerate done in {(end - start) / 1e6:.2f}ms")
131
+
132
+ # create criterion
133
+ with self.accelerator.main_process_first():
134
+ self.logger.info("Building criterion...")
135
+ start = time.monotonic_ns()
136
+ self.criterion = self._build_criterion()
137
+ end = time.monotonic_ns()
138
+ self.logger.info(f"Building criterion done in {(end - start) / 1e6:.2f}ms")
139
+
140
+ # Resume or Finetune
141
+ with self.accelerator.main_process_first():
142
+ if args.resume:
143
+ if args.resume_from_ckpt_path == "":
144
+ ## Automatically resume according to the current experiment name
145
+ self.logger.info(
146
+ "Automatically resuming from latest checkpoint in {}...".format(
147
+ self.checkpoint_dir
148
+ )
149
+ )
150
+ start = time.monotonic_ns()
151
+ ckpt_path = self._load_model(
152
+ checkpoint_dir=self.checkpoint_dir, resume_type=args.resume_type
153
+ )
154
+ end = time.monotonic_ns()
155
+ self.logger.info(
156
+ f"Resuming from checkpoint done in {(end - start) / 1e6:.2f}ms"
157
+ )
158
+ self.checkpoints_path = json.load(
159
+ open(os.path.join(ckpt_path, "ckpts.json"), "r")
160
+ )
161
+ else:
162
+ ## Resume from the given checkpoint path
163
+ if not os.path.exists(args.resume_from_ckpt_path):
164
+ raise ValueError(
165
+ "[Error] The resumed checkpoint path {} don't exist.".format(
166
+ args.resume_from_ckpt_path
167
+ )
168
+ )
169
+ self.logger.info(
170
+ "Resuming from {}...".format(args.resume_from_ckpt_path)
171
+ )
172
+ start = time.monotonic_ns()
173
+ ckpt_path = self._load_model(
174
+ checkpoint_path=args.resume_from_ckpt_path,
175
+ resume_type=args.resume_type,
176
+ )
177
+ end = time.monotonic_ns()
178
+ self.logger.info(
179
+ f"Resuming from checkpoint done in {(end - start) / 1e6:.2f}ms"
180
+ )
181
+
182
+ # save config file path
183
+ self.config_save_path = os.path.join(self.exp_dir, "args.json")
184
+
185
+ def _accelerator_prepare(self):
186
+ (
187
+ self.train_dataloader,
188
+ self.valid_dataloader,
189
+ self.model,
190
+ self.optimizer,
191
+ self.scheduler,
192
+ ) = self.accelerator.prepare(
193
+ self.train_dataloader,
194
+ self.valid_dataloader,
195
+ self.model,
196
+ self.optimizer,
197
+ self.scheduler,
198
+ )
199
+
200
+ ### Following are abstract methods that should be implemented in child classes ###
201
+ @abstractmethod
202
+ def _build_dataset(self):
203
+ r"""Build dataset for model training/validating/evaluating."""
204
+ pass
205
+
206
+ @staticmethod
207
+ @abstractmethod
208
+ def _build_criterion():
209
+ r"""Build criterion function for model loss calculation."""
210
+ pass
211
+
212
+ @abstractmethod
213
+ def _build_model(self):
214
+ r"""Build model for training/validating/evaluating."""
215
+ pass
216
+
217
+ @abstractmethod
218
+ def _forward_step(self, batch):
219
+ r"""One forward step of the neural network. This abstract method is trying to
220
+ unify ``_train_step`` and ``_valid_step`` and avoid redundant implementation.
221
+ However, for special case that using different forward step pattern for
222
+ training and validating, you could just override this method with ``pass`` and
223
+ implement ``_train_step`` and ``_valid_step`` separately.
224
+ """
225
+ pass
226
+
227
+ @abstractmethod
228
+ def _save_auxiliary_states(self):
229
+ r"""To save some auxiliary states when saving model's ckpt"""
230
+ pass
231
+
232
+ ### Abstract methods end ###
233
+
234
+ ### THIS IS MAIN ENTRY ###
235
+ def train_loop(self):
236
+ r"""Training loop. The public entry of training process."""
237
+ # Wait everyone to prepare before we move on
238
+ self.accelerator.wait_for_everyone()
239
+ # dump config file
240
+ if self.accelerator.is_main_process:
241
+ self.__dump_cfg(self.config_save_path)
242
+ self.model.train()
243
+ self.optimizer.zero_grad()
244
+ # Wait to ensure good to go
245
+ self.accelerator.wait_for_everyone()
246
+ while self.epoch < self.max_epoch:
247
+ self.logger.info("\n")
248
+ self.logger.info("-" * 32)
249
+ self.logger.info("Epoch {}: ".format(self.epoch))
250
+
251
+ ### TODO: change the return values of _train_epoch() to a loss dict, or (total_loss, loss_dict)
252
+ ### It's inconvenient for the model with multiple losses
253
+ # Do training & validating epoch
254
+ train_loss = self._train_epoch()
255
+ self.logger.info(" |- Train/Loss: {:.6f}".format(train_loss))
256
+ valid_loss = self._valid_epoch()
257
+ self.logger.info(" |- Valid/Loss: {:.6f}".format(valid_loss))
258
+ self.accelerator.log(
259
+ {"Epoch/Train Loss": train_loss, "Epoch/Valid Loss": valid_loss},
260
+ step=self.epoch,
261
+ )
262
+
263
+ self.accelerator.wait_for_everyone()
264
+ # TODO: what is scheduler?
265
+ self.scheduler.step(valid_loss)  # FIXME: is stepping once per epoch on valid_loss correct?
266
+
267
+ # Check if hit save_checkpoint_stride and run_eval
268
+ run_eval = False
269
+ if self.accelerator.is_main_process:
270
+ save_checkpoint = False
271
+ hit_idx = []
272
+ for i, num in enumerate(self.save_checkpoint_stride):
273
+ if self.epoch % num == 0:
274
+ save_checkpoint = True
275
+ hit_idx.append(i)
276
+ run_eval |= self.run_eval[i]
277
+
278
+ self.accelerator.wait_for_everyone()
279
+ if self.accelerator.is_main_process and save_checkpoint:
280
+ path = os.path.join(
281
+ self.checkpoint_dir,
282
+ "epoch-{:04d}_step-{:07d}_loss-{:.6f}".format(
283
+ self.epoch, self.step, train_loss
284
+ ),
285
+ )
286
+ self.tmp_checkpoint_save_path = path
287
+ self.accelerator.save_state(path)
288
+ print(f"save checkpoint in {path}")
289
+ json.dump(
290
+ self.checkpoints_path,
291
+ open(os.path.join(path, "ckpts.json"), "w"),
292
+ ensure_ascii=False,
293
+ indent=4,
294
+ )
295
+ self._save_auxiliary_states()
296
+
297
+ # Remove old checkpoints
298
+ to_remove = []
299
+ for idx in hit_idx:
300
+ self.checkpoints_path[idx].append(path)
301
+ while len(self.checkpoints_path[idx]) > self.keep_last[idx]:
302
+ to_remove.append((idx, self.checkpoints_path[idx].pop(0)))
303
+
304
+ # Search conflicts
305
+ total = set()
306
+ for i in self.checkpoints_path:
307
+ total |= set(i)
308
+ do_remove = set()
309
+ for idx, path in to_remove[::-1]:
310
+ if path in total:
311
+ self.checkpoints_path[idx].insert(0, path)
312
+ else:
313
+ do_remove.add(path)
314
+
315
+ # Remove old checkpoints
316
+ for path in do_remove:
317
+ shutil.rmtree(path, ignore_errors=True)
318
+ self.logger.debug(f"Remove old checkpoint: {path}")
319
+
320
+ self.accelerator.wait_for_everyone()
321
+ if run_eval:
322
+ # TODO: run evaluation
323
+ pass
324
+
325
+ # Update info for each epoch
326
+ self.epoch += 1
327
+
328
+ # Finish training and save final checkpoint
329
+ self.accelerator.wait_for_everyone()
330
+ if self.accelerator.is_main_process:
331
+ self.accelerator.save_state(
332
+ os.path.join(
333
+ self.checkpoint_dir,
334
+ "final_epoch-{:04d}_step-{:07d}_loss-{:.6f}".format(
335
+ self.epoch, self.step, valid_loss
336
+ ),
337
+ )
338
+ )
339
+ self._save_auxiliary_states()
340
+
341
+ self.accelerator.end_training()
342
+
343
+ ### Following are methods that can be used directly in child classes ###
344
+ def _train_epoch(self):
345
+ r"""Training epoch. Should return average loss of a batch (sample) over
346
+ one epoch. See ``train_loop`` for usage.
347
+ """
348
+ self.model.train()
349
+ epoch_sum_loss: float = 0.0
350
+ epoch_step: int = 0
351
+ for batch in tqdm(
352
+ self.train_dataloader,
353
+ desc=f"Training Epoch {self.epoch}",
354
+ unit="batch",
355
+ colour="GREEN",
356
+ leave=False,
357
+ dynamic_ncols=True,
358
+ smoothing=0.04,
359
+ disable=not self.accelerator.is_main_process,
360
+ ):
361
+ # Do training step and BP
362
+ with self.accelerator.accumulate(self.model):
363
+ loss = self._train_step(batch)
364
+ self.accelerator.backward(loss)
365
+ self.optimizer.step()
366
+ self.optimizer.zero_grad()
367
+ self.batch_count += 1
368
+
369
+ # Update info for each step
370
+ # TODO: does `step` count backward passes or batches?
371
+ if self.batch_count % self.cfg.train.gradient_accumulation_step == 0:
372
+ epoch_sum_loss += loss
373
+ self.accelerator.log(
374
+ {
375
+ "Step/Train Loss": loss,
376
+ "Step/Learning Rate": self.optimizer.param_groups[0]["lr"],
377
+ },
378
+ step=self.step,
379
+ )
380
+ self.step += 1
381
+ epoch_step += 1
382
+
383
+ self.accelerator.wait_for_everyone()
384
+ return (
385
+ epoch_sum_loss
386
+ / len(self.train_dataloader)
387
+ * self.cfg.train.gradient_accumulation_step
388
+ )
389
+
390
+ @torch.inference_mode()
391
+ def _valid_epoch(self):
392
+ r"""Testing epoch. Should return average loss of a batch (sample) over
393
+ one epoch. See ``train_loop`` for usage.
394
+ """
395
+ self.model.eval()
396
+ epoch_sum_loss = 0.0
397
+ for batch in tqdm(
398
+ self.valid_dataloader,
399
+ desc=f"Validating Epoch {self.epoch}",
400
+ unit="batch",
401
+ colour="GREEN",
402
+ leave=False,
403
+ dynamic_ncols=True,
404
+ smoothing=0.04,
405
+ disable=not self.accelerator.is_main_process,
406
+ ):
407
+ batch_loss = self._valid_step(batch)
408
+ epoch_sum_loss += batch_loss.item()
409
+
410
+ self.accelerator.wait_for_everyone()
411
+ return epoch_sum_loss / len(self.valid_dataloader)
412
+
413
+ def _train_step(self, batch):
414
+ r"""Training forward step. Should return average loss of a sample over
415
+ one batch. Invoking ``_forward_step`` is recommended except for special cases.
416
+ See ``_train_epoch`` for usage.
417
+ """
418
+ return self._forward_step(batch)
419
+
420
+ @torch.inference_mode()
421
+ def _valid_step(self, batch):
422
+ r"""Testing forward step. Should return average loss of a sample over
423
+ one batch. Provoke ``_forward_step`` is recommended except for special case.
424
+ See ``_test_epoch`` for usage.
425
+ """
426
+ return self._forward_step(batch)
427
+
428
+ def _load_model(
429
+ self,
430
+ checkpoint_dir: str = None,
431
+ checkpoint_path: str = None,
432
+ resume_type: str = "",
433
+ ):
434
+ r"""Load model from checkpoint. If checkpoint_path is None, it will
435
+ load the latest checkpoint in checkpoint_dir. If checkpoint_path is not
436
+ None, it will load the checkpoint specified by checkpoint_path. **Only use this
437
+ method after** ``accelerator.prepare()``.
438
+ """
439
+ if checkpoint_path is None:
440
+ ls = [str(i) for i in Path(checkpoint_dir).glob("*")]
441
+ ls.sort(key=lambda x: int(x.split("_")[-3].split("-")[-1]), reverse=True)
442
+ checkpoint_path = ls[0]
443
+ self.logger.info("Resume from {}...".format(checkpoint_path))
444
+
445
+ if resume_type in ["resume", ""]:
446
+ # Load all the things, including model weights, optimizer, scheduler, and random states.
447
+ self.accelerator.load_state(input_dir=checkpoint_path)
448
+
449
+ # set epoch and step
450
+ self.epoch = int(checkpoint_path.split("_")[-3].split("-")[-1]) + 1
451
+ self.step = int(checkpoint_path.split("_")[-2].split("-")[-1]) + 1
452
+
453
+ elif resume_type == "finetune":
454
+ # Load only the model weights
455
+ accelerate.load_checkpoint_and_dispatch(
456
+ self.accelerator.unwrap_model(self.model),
457
+ os.path.join(checkpoint_path, "pytorch_model.bin"),
458
+ )
459
+ self.logger.info("Load model weights for finetune...")
460
+
461
+ else:
462
+ raise ValueError("Resume_type must be `resume` or `finetune`.")
463
+
464
+ return checkpoint_path
465
+
466
+ def _build_dataloader(self):
467
+ Dataset, Collator = self._build_dataset()
468
+
469
+ # build dataset instance for each dataset and combine them by ConcatDataset
470
+ datasets_list = []
471
+ for dataset in self.cfg.dataset:
472
+ subdataset = Dataset(self.cfg, dataset, is_valid=False)
473
+ datasets_list.append(subdataset)
474
+ train_dataset = ConcatDataset(datasets_list)
475
+ train_collate = Collator(self.cfg)
476
+ _, batch_sampler = build_samplers(train_dataset, self.cfg, self.logger, "train")
477
+ self.logger.debug(f"train batch_sampler: {list(batch_sampler)}")
478
+ self.logger.debug(f"length: {train_dataset.cumulative_sizes}")
479
+ # TODO: use config instead of (sampler, shuffle, drop_last, batch_size)
480
+ train_loader = DataLoader(
481
+ train_dataset,
482
+ # shuffle=True,
483
+ collate_fn=train_collate,
484
+ batch_sampler=batch_sampler,
485
+ num_workers=self.cfg.train.dataloader.num_worker,
486
+ pin_memory=self.cfg.train.dataloader.pin_memory,
487
+ )
488
+
489
+ # Build valid dataloader
490
+ datasets_list = []
491
+ for dataset in self.cfg.dataset:
492
+ subdataset = Dataset(self.cfg, dataset, is_valid=True)
493
+ datasets_list.append(subdataset)
494
+ valid_dataset = ConcatDataset(datasets_list)
495
+ valid_collate = Collator(self.cfg)
496
+ _, batch_sampler = build_samplers(valid_dataset, self.cfg, self.logger, "valid")
497
+ self.logger.debug(f"valid batch_sampler: {list(batch_sampler)}")
498
+ self.logger.debug(f"length: {valid_dataset.cumulative_sizes}")
499
+ valid_loader = DataLoader(
500
+ valid_dataset,
501
+ collate_fn=valid_collate,
502
+ batch_sampler=batch_sampler,
503
+ num_workers=self.cfg.train.dataloader.num_worker,
504
+ pin_memory=self.cfg.train.dataloader.pin_memory,
505
+ )
506
+ return train_loader, valid_loader
507
+
508
+ @staticmethod
509
+ def _set_random_seed(seed):
510
+ r"""Set random seed for all possible random modules."""
511
+ random.seed(seed)
512
+ np.random.seed(seed)
513
+ torch.random.manual_seed(seed)
514
+
515
+ def _check_nan(self, loss, y_pred, y_gt):
516
+ if torch.any(torch.isnan(loss)):
517
+ self.logger.error("Fatal Error: Training is down since loss has Nan!")
518
+ self.logger.error("loss = {:.6f}".format(loss.item()), in_order=True)
519
+
520
+ ### y_pred ###
521
+ if torch.any(torch.isnan(y_pred)):
522
+ self.logger.error(
523
+ f"y_pred has Nan: {torch.any(torch.isnan(y_pred))}", in_order=True
524
+ )
525
+ self.logger.error(f"y_pred: {y_pred}", in_order=True)
526
+ else:
527
+ self.logger.debug(
528
+ f"y_pred has Nan: {torch.any(torch.isnan(y_pred))}", in_order=True
529
+ )
530
+ self.logger.debug(f"y_pred: {y_pred}", in_order=True)
531
+
532
+ ### y_gt ###
533
+ if torch.any(torch.isnan(y_gt)):
534
+ self.logger.error(
535
+ f"y_gt has Nan: {torch.any(torch.isnan(y_gt))}", in_order=True
536
+ )
537
+ self.logger.error(f"y_gt: {y_gt}", in_order=True)
538
+ else:
539
+ self.logger.debug(
540
+ f"y_gt has nan: {torch.any(torch.isnan(y_gt))}", in_order=True
541
+ )
542
+ self.logger.debug(f"y_gt: {y_gt}", in_order=True)
543
+
544
+ self.accelerator.end_training()
545
+ raise RuntimeError("Loss has Nan! See log for more info.")
546
+
547
+ ### Protected methods end ###
548
+
549
+ ## Following are private methods ##
550
+ def _build_optimizer(self):
551
+ r"""Build optimizer for model."""
552
+ # Make case-insensitive matching
553
+ if self.cfg.train.optimizer.lower() == "adadelta":
554
+ optimizer = torch.optim.Adadelta(
555
+ self.model.parameters(), **self.cfg.train.adadelta
556
+ )
557
+ self.logger.info("Using Adadelta optimizer.")
558
+ elif self.cfg.train.optimizer.lower() == "adagrad":
559
+ optimizer = torch.optim.Adagrad(
560
+ self.model.parameters(), **self.cfg.train.adagrad
561
+ )
562
+ self.logger.info("Using Adagrad optimizer.")
563
+ elif self.cfg.train.optimizer.lower() == "adam":
564
+ optimizer = torch.optim.Adam(self.model.parameters(), **self.cfg.train.adam)
565
+ self.logger.info("Using Adam optimizer.")
566
+ elif self.cfg.train.optimizer.lower() == "adamw":
567
+ optimizer = torch.optim.AdamW(
568
+ self.model.parameters(), **self.cfg.train.adamw
569
+ )
570
+ elif self.cfg.train.optimizer.lower() == "sparseadam":
571
+ optimizer = torch.optim.SparseAdam(
572
+ self.model.parameters(), **self.cfg.train.sparseadam
573
+ )
574
+ elif self.cfg.train.optimizer.lower() == "adamax":
575
+ optimizer = torch.optim.Adamax(
576
+ self.model.parameters(), **self.cfg.train.adamax
577
+ )
578
+ elif self.cfg.train.optimizer.lower() == "asgd":
579
+ optimizer = torch.optim.ASGD(self.model.parameters(), **self.cfg.train.asgd)
580
+ elif self.cfg.train.optimizer.lower() == "lbfgs":
581
+ optimizer = torch.optim.LBFGS(
582
+ self.model.parameters(), **self.cfg.train.lbfgs
583
+ )
584
+ elif self.cfg.train.optimizer.lower() == "nadam":
585
+ optimizer = torch.optim.NAdam(
586
+ self.model.parameters(), **self.cfg.train.nadam
587
+ )
588
+ elif self.cfg.train.optimizer.lower() == "radam":
589
+ optimizer = torch.optim.RAdam(
590
+ self.model.parameters(), **self.cfg.train.radam
591
+ )
592
+ elif self.cfg.train.optimizer.lower() == "rmsprop":
593
+ optimizer = torch.optim.RMSprop(
594
+ self.model.parameters(), **self.cfg.train.rmsprop
595
+ )
596
+ elif self.cfg.train.optimizer.lower() == "rprop":
597
+ optimizer = torch.optim.Rprop(
598
+ self.model.parameters(), **self.cfg.train.rprop
599
+ )
600
+ elif self.cfg.train.optimizer.lower() == "sgd":
601
+ optimizer = torch.optim.SGD(self.model.parameters(), **self.cfg.train.sgd)
602
+ else:
603
+ raise NotImplementedError(
604
+ f"Optimizer {self.cfg.train.optimizer} not supported yet!"
605
+ )
606
+ return optimizer
607
+
608
+ def _build_scheduler(self):
609
+ r"""Build scheduler for optimizer."""
610
+ # Make case-insensitive matching
611
+ if self.cfg.train.scheduler.lower() == "lambdalr":
612
+ scheduler = torch.optim.lr_scheduler.LambdaLR(
613
+ self.optimizer, **self.cfg.train.lambdalr
614
+ )
615
+ elif self.cfg.train.scheduler.lower() == "multiplicativelr":
616
+ scheduler = torch.optim.lr_scheduler.MultiplicativeLR(
617
+ self.optimizer, **self.cfg.train.multiplicativelr
618
+ )
619
+ elif self.cfg.train.scheduler.lower() == "steplr":
620
+ scheduler = torch.optim.lr_scheduler.StepLR(
621
+ self.optimizer, **self.cfg.train.steplr
622
+ )
623
+ elif self.cfg.train.scheduler.lower() == "multisteplr":
624
+ scheduler = torch.optim.lr_scheduler.MultiStepLR(
625
+ self.optimizer, **self.cfg.train.multisteplr
626
+ )
627
+ elif self.cfg.train.scheduler.lower() == "constantlr":
628
+ scheduler = torch.optim.lr_scheduler.ConstantLR(
629
+ self.optimizer, **self.cfg.train.constantlr
630
+ )
631
+ elif self.cfg.train.scheduler.lower() == "linearlr":
632
+ scheduler = torch.optim.lr_scheduler.LinearLR(
633
+ self.optimizer, **self.cfg.train.linearlr
634
+ )
635
+ elif self.cfg.train.scheduler.lower() == "exponentiallr":
636
+ scheduler = torch.optim.lr_scheduler.ExponentialLR(
637
+ self.optimizer, **self.cfg.train.exponentiallr
638
+ )
639
+ elif self.cfg.train.scheduler.lower() == "polynomiallr":
640
+ scheduler = torch.optim.lr_scheduler.PolynomialLR(
641
+ self.optimizer, **self.cfg.train.polynomiallr
642
+ )
643
+ elif self.cfg.train.scheduler.lower() == "cosineannealinglr":
644
+ scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
645
+ self.optimizer, **self.cfg.train.cosineannealinglr
646
+ )
647
+ elif self.cfg.train.scheduler.lower() == "sequentiallr":
648
+ scheduler = torch.optim.lr_scheduler.SequentialLR(
649
+ self.optimizer, **self.cfg.train.sequentiallr
650
+ )
651
+ elif self.cfg.train.scheduler.lower() == "reducelronplateau":
652
+ scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
653
+ self.optimizer, **self.cfg.train.reducelronplateau
654
+ )
655
+ elif self.cfg.train.scheduler.lower() == "cycliclr":
656
+ scheduler = torch.optim.lr_scheduler.CyclicLR(
657
+ self.optimizer, **self.cfg.train.cycliclr
658
+ )
659
+ elif self.cfg.train.scheduler.lower() == "onecyclelr":
660
+ scheduler = torch.optim.lr_scheduler.OneCycleLR(
661
+ self.optimizer, **self.cfg.train.onecyclelr
662
+ )
663
+ elif self.cfg.train.scheduler.lower() == "cosineannearingwarmrestarts":
664
+ scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
665
+ self.optimizer, **self.cfg.train.cosineannearingwarmrestarts
666
+ )
667
+ elif self.cfg.train.scheduler.lower() == "noamlr":
668
+ scheduler = NoamLR(self.optimizer, **self.cfg.train.lr_scheduler)
669
+ else:
670
+ raise NotImplementedError(
671
+ f"Scheduler {self.cfg.train.scheduler} not supported yet!"
672
+ )
673
+ return scheduler
674
+
675
+ def _init_accelerator(self):
676
+ self.exp_dir = os.path.join(
677
+ os.path.abspath(self.cfg.log_dir), self.args.exp_name
678
+ )
679
+ project_config = ProjectConfiguration(
680
+ project_dir=self.exp_dir,
681
+ logging_dir=os.path.join(self.exp_dir, "log"),
682
+ )
683
+ self.accelerator = accelerate.Accelerator(
684
+ gradient_accumulation_steps=self.cfg.train.gradient_accumulation_step,
685
+ log_with=self.cfg.train.tracker,
686
+ project_config=project_config,
687
+ )
688
+ if self.accelerator.is_main_process:
689
+ os.makedirs(project_config.project_dir, exist_ok=True)
690
+ os.makedirs(project_config.logging_dir, exist_ok=True)
691
+ with self.accelerator.main_process_first():
692
+ self.accelerator.init_trackers(self.args.exp_name)
693
+
694
+ def __check_basic_configs(self):
695
+ if self.cfg.train.gradient_accumulation_step <= 0:
696
+ self.logger.fatal("Invalid gradient_accumulation_step value!")
697
+ self.logger.error(
698
+ f"Invalid gradient_accumulation_step value: {self.cfg.train.gradient_accumulation_step}. It should be positive."
699
+ )
700
+ self.accelerator.end_training()
701
+ raise ValueError(
702
+ f"Invalid gradient_accumulation_step value: {self.cfg.train.gradient_accumulation_step}. It should be positive."
703
+ )
704
+ # TODO: check other values
705
+
706
+ @staticmethod
707
+ def __count_parameters(model):
708
+ model_param = 0.0
709
+ if isinstance(model, dict):
710
+ for key, value in model.items():
711
+ model_param += sum(p.numel() for p in model[key].parameters())
712
+ else:
713
+ model_param = sum(p.numel() for p in model.parameters())
714
+ return model_param
715
+
716
+ def __dump_cfg(self, path):
717
+ os.makedirs(os.path.dirname(path), exist_ok=True)
718
+ json5.dump(
719
+ self.cfg,
720
+ open(path, "w"),
721
+ indent=4,
722
+ sort_keys=True,
723
+ ensure_ascii=False,
724
+ quote_keys=True,
725
+ )
726
+
727
+ ### Private methods end ###
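
To make the contract of `BaseTrainer` concrete, here is a minimal hypothetical subclass showing the pieces a task must supply; `MyDataset` and `MyCollator` are stand-ins for real dataset/collator classes, and the linear model is a toy placeholder:

```python
# Sketch (assumptions marked) of the minimum a task trainer must supply on
# top of BaseTrainer above: dataset/collator classes, a model, a criterion,
# and a _forward_step that returns a scalar loss.
import torch.nn as nn
from models.base.new_trainer import BaseTrainer

class ToyTrainer(BaseTrainer):
    def _build_dataset(self):
        # Must return (DatasetClass, CollatorClass); both are task-specific.
        return MyDataset, MyCollator  # hypothetical classes

    @staticmethod
    def _build_criterion():
        return nn.MSELoss()

    def _build_model(self):
        return nn.Linear(80, 80)  # stand-in for a real acoustic model

    def _forward_step(self, batch):
        pred = self.model(batch["input"])
        return self.criterion(pred, batch["target"])

    def _save_auxiliary_states(self):
        pass  # nothing extra to persist in this toy example

# trainer = ToyTrainer(args, cfg); trainer.train_loop()
```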
Amphion/models/codec/ns3_codec/__pycache__/facodec.cpython-310.pyc ADDED
Binary file (24.2 kB).
Amphion/models/codec/ns3_codec/alias_free_torch/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (277 Bytes).
Amphion/models/codec/ns3_codec/alias_free_torch/__pycache__/resample.cpython-310.pyc ADDED
Binary file (1.97 kB).
Amphion/models/codec/ns3_codec/quantize/__pycache__/rvq.cpython-310.pyc ADDED
Binary file (2.51 kB).
Amphion/models/codec/ns3_codec/quantize/rvq.py ADDED
@@ -0,0 +1,87 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import math
7
+ import torch
8
+ from torch import nn
9
+ from .fvq import FactorizedVectorQuantize
10
+
11
+
12
+ class ResidualVQ(nn.Module):
13
+ """Follows Algorithm 1. in https://arxiv.org/pdf/2107.03312.pdf"""
14
+
15
+ def __init__(self, *, num_quantizers, codebook_size, **kwargs):
16
+ super().__init__()
17
+ VQ = FactorizedVectorQuantize
18
+ if isinstance(codebook_size, int):
19
+ codebook_size = [codebook_size] * num_quantizers
20
+ self.layers = nn.ModuleList(
21
+ [VQ(codebook_size=2**size, **kwargs) for size in codebook_size]
22
+ )
23
+ self.num_quantizers = num_quantizers
24
+ self.quantizer_dropout = kwargs.get("quantizer_dropout", 0.0)
25
+ self.dropout_type = kwargs.get("dropout_type", None)
26
+
27
+ def forward(self, x, n_quantizers=None):
28
+ quantized_out = 0.0
29
+ residual = x
30
+
31
+ all_losses = []
32
+ all_indices = []
33
+ all_quantized = []
34
+
35
+ if n_quantizers is None:
36
+ n_quantizers = self.num_quantizers
37
+ if self.training:
38
+ n_quantizers = torch.ones((x.shape[0],)) * self.num_quantizers + 1
39
+ if self.dropout_type == "linear":
40
+ dropout = torch.randint(1, self.num_quantizers + 1, (x.shape[0],))
41
+ elif self.dropout_type == "exp":
42
+ dropout = torch.randint(
43
+ 1, int(math.log2(self.num_quantizers)), (x.shape[0],)
44
+ )
45
+ dropout = torch.pow(2, dropout)
46
+ n_dropout = int(x.shape[0] * self.quantizer_dropout)
47
+ n_quantizers[:n_dropout] = dropout[:n_dropout]
48
+ n_quantizers = n_quantizers.to(x.device)
49
+
50
+ for idx, layer in enumerate(self.layers):
51
+ if not self.training and idx >= n_quantizers:
52
+ break
53
+ quantized, indices, loss = layer(residual)
54
+
55
+ mask = (
56
+ torch.full((x.shape[0],), fill_value=idx, device=x.device)
57
+ < n_quantizers
58
+ )
59
+
60
+ residual = residual - quantized
61
+
62
+ quantized_out = quantized_out + quantized * mask[:, None, None]
63
+
64
+ # loss
65
+ loss = (loss * mask).mean()
66
+
67
+ all_indices.append(indices)
68
+ all_losses.append(loss)
69
+ all_quantized.append(quantized)
70
+ all_losses, all_indices, all_quantized = map(
71
+ torch.stack, (all_losses, all_indices, all_quantized)
72
+ )
73
+ return quantized_out, all_indices, all_losses, all_quantized
74
+
75
+ def vq2emb(self, vq):
76
+ # vq: [n_quantizers, B, T]
77
+ quantized_out = 0.0
78
+ for idx, layer in enumerate(self.layers):
79
+ quantized = layer.vq2emb(vq[idx])
80
+ quantized_out += quantized
81
+ return quantized_out
82
+
83
+ def get_emb(self):
84
+ embs = []
85
+ for idx, layer in enumerate(self.layers):
86
+ embs.append(layer.get_emb())
87
+ return embs
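
A usage sketch for `ResidualVQ`. One gotcha worth noting: each entry of `codebook_size` is treated as an exponent (`2**size`), so passing `10` yields codebooks of 1024 entries. The remaining kwargs are forwarded to `FactorizedVectorQuantize`; the `dim` argument and the `(B, D, T)` input layout below are assumptions, so check `fvq.py` for the real signature:

```python
# Hedged usage sketch; kwargs beyond num_quantizers/codebook_size are
# assumptions forwarded to FactorizedVectorQuantize.
import torch

rvq = ResidualVQ(num_quantizers=4, codebook_size=10, dim=256)
rvq.eval()  # in eval mode, n_quantizers caps how many stages are applied

x = torch.randn(2, 256, 50)  # assumed (batch, dim, time) layout
quantized, indices, losses, per_stage = rvq(x, n_quantizers=2)
```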
Amphion/models/svc/transformer/transformer.py ADDED
@@ -0,0 +1,82 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import math
7
+ import torch
8
+ import torch.nn as nn
9
+ from torch.nn import TransformerEncoder, TransformerEncoderLayer
10
+
11
+
12
+ class Transformer(nn.Module):
13
+ def __init__(self, cfg):
14
+ super().__init__()
15
+ self.cfg = cfg
16
+
17
+ dropout = self.cfg.dropout
18
+ nhead = self.cfg.n_heads
19
+ nlayers = self.cfg.n_layers
20
+ input_dim = self.cfg.input_dim
21
+ output_dim = self.cfg.output_dim
22
+
23
+ d_model = input_dim
24
+ self.pos_encoder = PositionalEncoding(d_model, dropout)
25
+ encoder_layers = TransformerEncoderLayer(
26
+ d_model, nhead, dropout=dropout, batch_first=True
27
+ )
28
+ self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
29
+
30
+ self.output_mlp = nn.Linear(d_model, output_dim)
31
+
32
+ def forward(self, x, mask=None):
33
+ """
34
+ Args:
35
+ x: (N, seq_len, input_dim)
36
+ Returns:
37
+ output: (N, seq_len, output_dim)
38
+ """
39
+ # (N, seq_len, d_model)
40
+ src = self.pos_encoder(x)
41
+ # model_stats["pos_embedding"] = x
42
+ # (N, seq_len, d_model)
43
+ output = self.transformer_encoder(src)
44
+ # (N, seq_len, output_dim)
45
+ output = self.output_mlp(output)
46
+ return output
47
+
48
+
49
+ class PositionalEncoding(nn.Module):
50
+ def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
51
+ super().__init__()
52
+ self.dropout = nn.Dropout(p=dropout)
53
+
54
+ position = torch.arange(max_len).unsqueeze(1)
55
+ div_term = torch.exp(
56
+ torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
57
+ )
58
+
59
+ # Assume that x is (seq_len, N, d)
60
+ # pe = torch.zeros(max_len, 1, d_model)
61
+ # pe[:, 0, 0::2] = torch.sin(position * div_term)
62
+ # pe[:, 0, 1::2] = torch.cos(position * div_term)
63
+
64
+ # Assume that x in (N, seq_len, d)
65
+ pe = torch.zeros(1, max_len, d_model)
66
+ pe[0, :, 0::2] = torch.sin(position * div_term)
67
+ pe[0, :, 1::2] = torch.cos(position * div_term)
68
+
69
+ self.register_buffer("pe", pe)
70
+
71
+ def forward(self, x):
72
+ """
73
+ Args:
74
+ x: Tensor, shape [N, seq_len, d]
75
+ """
76
+ # Old: Assume that x is (seq_len, N, d), and self.pe is (max_len, 1, d_model)
77
+ # x = x + self.pe[: x.size(0)]
78
+
79
+ # Now: self.pe is (1, max_len, d)
80
+ x = x + self.pe[:, : x.size(1), :]
81
+
82
+ return self.dropout(x)
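
A quick shape check for the batch-first `PositionalEncoding` above: `pe` is built as `(1, max_len, d_model)` and sliced to the input's sequence length, so inputs must be `(N, seq_len, d)`, the layout the surrounding `Transformer` uses. A minimal sketch, assuming the class above is in scope:

```python
# Shape-check sketch for the batch-first PositionalEncoding defined above.
import torch

pos_enc = PositionalEncoding(d_model=64, dropout=0.0, max_len=128)
x = torch.zeros(8, 32, 64)         # (N, seq_len, d)
y = pos_enc(x)
assert y.shape == (8, 32, 64)
assert torch.allclose(y[0], y[1])  # the same encoding is added to every batch item
```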
Amphion/models/tts/fastspeech2/fs2.py ADDED
@@ -0,0 +1,548 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ # This code is modified from https://github.com/ming024/FastSpeech2/blob/master/model/fastspeech2.py
7
+ import torch
8
+ import torch.nn as nn
9
+ import numpy as np
10
+ import torch.nn.functional as F
11
+
12
+ from modules.transformer.Models import Encoder, Decoder
13
+ from modules.transformer.Layers import PostNet
14
+ from collections import OrderedDict
15
+
16
+ import os
17
+ import json
18
+
19
+
20
+ def get_mask_from_lengths(lengths, max_len=None):
21
+ device = lengths.device
22
+ batch_size = lengths.shape[0]
23
+ if max_len is None:
24
+ max_len = torch.max(lengths).item()
25
+
26
+ ids = torch.arange(0, max_len).unsqueeze(0).expand(batch_size, -1).to(device)
27
+ mask = ids >= lengths.unsqueeze(1).expand(-1, max_len)
28
+
29
+ return mask
30
+
31
+
32
+ def pad(input_ele, mel_max_length=None):
33
+ if mel_max_length:
34
+ max_len = mel_max_length
35
+ else:
36
+ max_len = max([input_ele[i].size(0) for i in range(len(input_ele))])
37
+
38
+ out_list = list()
39
+ for i, batch in enumerate(input_ele):
40
+ if len(batch.shape) == 1:
41
+ one_batch_padded = F.pad(
42
+ batch, (0, max_len - batch.size(0)), "constant", 0.0
43
+ )
44
+ elif len(batch.shape) == 2:
45
+ one_batch_padded = F.pad(
46
+ batch, (0, 0, 0, max_len - batch.size(0)), "constant", 0.0
47
+ )
48
+ out_list.append(one_batch_padded)
49
+ out_padded = torch.stack(out_list)
50
+ return out_padded
51
+
52
+
53
+ class VarianceAdaptor(nn.Module):
54
+ """Variance Adaptor"""
55
+
56
+ def __init__(self, cfg):
57
+ super(VarianceAdaptor, self).__init__()
58
+ self.duration_predictor = VariancePredictor(cfg)
59
+ self.length_regulator = LengthRegulator()
60
+ self.pitch_predictor = VariancePredictor(cfg)
61
+ self.energy_predictor = VariancePredictor(cfg)
62
+
63
+ # assign the pitch/energy feature level
64
+ if cfg.preprocess.use_frame_pitch:
65
+ self.pitch_feature_level = "frame_level"
66
+ self.pitch_dir = cfg.preprocess.pitch_dir
67
+ else:
68
+ self.pitch_feature_level = "phoneme_level"
69
+ self.pitch_dir = cfg.preprocess.phone_pitch_dir
70
+
71
+ if cfg.preprocess.use_frame_energy:
72
+ self.energy_feature_level = "frame_level"
73
+ self.energy_dir = cfg.preprocess.energy_dir
74
+ else:
75
+ self.energy_feature_level = "phoneme_level"
76
+ self.energy_dir = cfg.preprocess.phone_energy_dir
77
+
78
+ assert self.pitch_feature_level in ["phoneme_level", "frame_level"]
79
+ assert self.energy_feature_level in ["phoneme_level", "frame_level"]
80
+
81
+ pitch_quantization = cfg.model.variance_embedding.pitch_quantization
82
+ energy_quantization = cfg.model.variance_embedding.energy_quantization
83
+ n_bins = cfg.model.variance_embedding.n_bins
84
+ assert pitch_quantization in ["linear", "log"]
85
+ assert energy_quantization in ["linear", "log"]
86
+
87
+ with open(
88
+ os.path.join(
89
+ cfg.preprocess.processed_dir,
90
+ cfg.dataset[0],
91
+ self.energy_dir,
92
+ "statistics.json",
93
+ )
94
+ ) as f:
95
+ stats = json.load(f)
96
+ stats = stats[cfg.dataset[0] + "_" + cfg.dataset[0]]
97
+ mean, std = (
98
+ stats["voiced_positions"]["mean"],
99
+ stats["voiced_positions"]["std"],
100
+ )
101
+ energy_min = (stats["total_positions"]["min"] - mean) / std
102
+ energy_max = (stats["total_positions"]["max"] - mean) / std
103
+
104
+ with open(
105
+ os.path.join(
106
+ cfg.preprocess.processed_dir,
107
+ cfg.dataset[0],
108
+ self.pitch_dir,
109
+ "statistics.json",
110
+ )
111
+ ) as f:
112
+ stats = json.load(f)
113
+ stats = stats[cfg.dataset[0] + "_" + cfg.dataset[0]]
114
+ mean, std = (
115
+ stats["voiced_positions"]["mean"],
116
+ stats["voiced_positions"]["std"],
117
+ )
118
+ pitch_min = (stats["total_positions"]["min"] - mean) / std
119
+ pitch_max = (stats["total_positions"]["max"] - mean) / std
120
+
121
+ if pitch_quantization == "log":
122
+ self.pitch_bins = nn.Parameter(
123
+ torch.exp(
124
+ torch.linspace(np.log(pitch_min), np.log(pitch_max), n_bins - 1)
125
+ ),
126
+ requires_grad=False,
127
+ )
128
+ else:
129
+ self.pitch_bins = nn.Parameter(
130
+ torch.linspace(pitch_min, pitch_max, n_bins - 1),
131
+ requires_grad=False,
132
+ )
133
+ if energy_quantization == "log":
134
+ self.energy_bins = nn.Parameter(
135
+ torch.exp(
136
+ torch.linspace(np.log(energy_min), np.log(energy_max), n_bins - 1)
137
+ ),
138
+ requires_grad=False,
139
+ )
140
+ else:
141
+ self.energy_bins = nn.Parameter(
142
+ torch.linspace(energy_min, energy_max, n_bins - 1),
143
+ requires_grad=False,
144
+ )
145
+
146
+ self.pitch_embedding = nn.Embedding(
147
+ n_bins, cfg.model.transformer.encoder_hidden
148
+ )
149
+ self.energy_embedding = nn.Embedding(
150
+ n_bins, cfg.model.transformer.encoder_hidden
151
+ )
152
+
153
+ def get_pitch_embedding(self, x, target, mask, control):
154
+ prediction = self.pitch_predictor(x, mask)
155
+ if target is not None:
156
+ embedding = self.pitch_embedding(torch.bucketize(target, self.pitch_bins))
157
+ else:
158
+ prediction = prediction * control
159
+ embedding = self.pitch_embedding(
160
+ torch.bucketize(prediction, self.pitch_bins)
161
+ )
162
+ return prediction, embedding
163
+
164
+ def get_energy_embedding(self, x, target, mask, control):
165
+ prediction = self.energy_predictor(x, mask)
166
+ if target is not None:
167
+ embedding = self.energy_embedding(torch.bucketize(target, self.energy_bins))
168
+ else:
169
+ prediction = prediction * control
170
+ embedding = self.energy_embedding(
171
+ torch.bucketize(prediction, self.energy_bins)
172
+ )
173
+ return prediction, embedding
174
+
175
+ def forward(
176
+ self,
177
+ x,
178
+ src_mask,
179
+ mel_mask=None,
180
+ max_len=None,
181
+ pitch_target=None,
182
+ energy_target=None,
183
+ duration_target=None,
184
+ p_control=1.0,
185
+ e_control=1.0,
186
+ d_control=1.0,
187
+ ):
188
+ log_duration_prediction = self.duration_predictor(x, src_mask)
189
+ if self.pitch_feature_level == "phoneme_level":
190
+ pitch_prediction, pitch_embedding = self.get_pitch_embedding(
191
+ x, pitch_target, src_mask, p_control
192
+ )
193
+ x = x + pitch_embedding
194
+ if self.energy_feature_level == "phoneme_level":
195
+ energy_prediction, energy_embedding = self.get_energy_embedding(
196
+ x, energy_target, src_mask, e_control
197
+ )
198
+ x = x + energy_embedding
199
+
200
+ if duration_target is not None:
201
+ x, mel_len = self.length_regulator(x, duration_target, max_len)
202
+ duration_rounded = duration_target
203
+ else:
204
+ duration_rounded = torch.clamp(
205
+ (torch.round(torch.exp(log_duration_prediction) - 1) * d_control),
206
+ min=0,
207
+ )
208
+ x, mel_len = self.length_regulator(x, duration_rounded, max_len)
209
+ mel_mask = get_mask_from_lengths(mel_len)
210
+
211
+ if self.pitch_feature_level == "frame_level":
212
+ pitch_prediction, pitch_embedding = self.get_pitch_embedding(
213
+ x, pitch_target, mel_mask, p_control
214
+ )
215
+ x = x + pitch_embedding
216
+ if self.energy_feature_level == "frame_level":
217
+ energy_prediction, energy_embedding = self.get_energy_embedding(
218
+ x, energy_target, mel_mask, e_control
219
+ )
220
+ x = x + energy_embedding
221
+
222
+ return (
223
+ x,
224
+ pitch_prediction,
225
+ energy_prediction,
226
+ log_duration_prediction,
227
+ duration_rounded,
228
+ mel_len,
229
+ mel_mask,
230
+ )
231
+
232
+
233
+ class LengthRegulator(nn.Module):
234
+ """Length Regulator"""
235
+
236
+ def __init__(self):
237
+ super(LengthRegulator, self).__init__()
238
+
239
+ def LR(self, x, duration, max_len):
240
+ device = x.device
241
+ output = list()
242
+ mel_len = list()
243
+ for batch, expand_target in zip(x, duration):
244
+ expanded = self.expand(batch, expand_target)
245
+ output.append(expanded)
246
+ mel_len.append(expanded.shape[0])
247
+
248
+ if max_len is not None:
249
+ output = pad(output, max_len)
250
+ else:
251
+ output = pad(output)
252
+
253
+ return output, torch.LongTensor(mel_len).to(device)
254
+
255
+ def expand(self, batch, predicted):
256
+ out = list()
257
+
258
+ for i, vec in enumerate(batch):
259
+ expand_size = predicted[i].item()
260
+ out.append(vec.expand(max(int(expand_size), 0), -1))
261
+ out = torch.cat(out, 0)
262
+
263
+ return out
264
+
265
+ def forward(self, x, duration, max_len):
266
+ output, mel_len = self.LR(x, duration, max_len)
267
+ return output, mel_len
+
+
+class VariancePredictor(nn.Module):
+    """Duration, Pitch and Energy Predictor"""
+
+    def __init__(self, cfg):
+        super(VariancePredictor, self).__init__()
+
+        self.input_size = cfg.model.transformer.encoder_hidden
+        self.filter_size = cfg.model.variance_predictor.filter_size
+        self.kernel = cfg.model.variance_predictor.kernel_size
+        self.conv_output_size = cfg.model.variance_predictor.filter_size
+        self.dropout = cfg.model.variance_predictor.dropout
+
+        self.conv_layer = nn.Sequential(
+            OrderedDict(
+                [
+                    (
+                        "conv1d_1",
+                        Conv(
+                            self.input_size,
+                            self.filter_size,
+                            kernel_size=self.kernel,
+                            padding=(self.kernel - 1) // 2,
+                        ),
+                    ),
+                    ("relu_1", nn.ReLU()),
+                    ("layer_norm_1", nn.LayerNorm(self.filter_size)),
+                    ("dropout_1", nn.Dropout(self.dropout)),
+                    (
+                        "conv1d_2",
+                        Conv(
+                            self.filter_size,
+                            self.filter_size,
+                            kernel_size=self.kernel,
+                            padding=1,
+                        ),
+                    ),
+                    ("relu_2", nn.ReLU()),
+                    ("layer_norm_2", nn.LayerNorm(self.filter_size)),
+                    ("dropout_2", nn.Dropout(self.dropout)),
+                ]
+            )
+        )
+
+        self.linear_layer = nn.Linear(self.conv_output_size, 1)
+
+    def forward(self, encoder_output, mask):
+        out = self.conv_layer(encoder_output)
+        out = self.linear_layer(out)
+        out = out.squeeze(-1)
+
+        if mask is not None:
+            out = out.masked_fill(mask, 0.0)
+
+        return out
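Each variance predictor maps hidden states (B, T, H) to one scalar per position (B, T). A quick shape check under a hypothetical config (the field names mirror the `cfg.model.*` accesses above; the values are made up):

    import torch
    from types import SimpleNamespace

    # hypothetical config, only the fields VariancePredictor reads
    cfg = SimpleNamespace(
        model=SimpleNamespace(
            transformer=SimpleNamespace(encoder_hidden=256),
            variance_predictor=SimpleNamespace(
                filter_size=256, kernel_size=3, dropout=0.5
            ),
        )
    )
    predictor = VariancePredictor(cfg)
    hidden = torch.randn(2, 17, 256)             # (B, T, H)
    mask = torch.zeros(2, 17, dtype=torch.bool)  # True marks padded positions
    print(predictor(hidden, mask).shape)         # torch.Size([2, 17])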
+
+
+class Conv(nn.Module):
+    """
+    Convolution Module
+    """
+
+    def __init__(
+        self,
+        in_channels,
+        out_channels,
+        kernel_size=1,
+        stride=1,
+        padding=0,
+        dilation=1,
+        bias=True,
+        w_init="linear",
+    ):
+        """
+        :param in_channels: dimension of input
+        :param out_channels: dimension of output
+        :param kernel_size: size of kernel
+        :param stride: size of stride
+        :param padding: size of padding
+        :param dilation: dilation rate
+        :param bias: boolean. if True, bias is included.
+        :param w_init: str. weight inits with xavier initialization.
+        """
+        super(Conv, self).__init__()
+
+        self.conv = nn.Conv1d(
+            in_channels,
+            out_channels,
+            kernel_size=kernel_size,
+            stride=stride,
+            padding=padding,
+            dilation=dilation,
+            bias=bias,
+        )
+
+    def forward(self, x):
+        x = x.contiguous().transpose(1, 2)
+        x = self.conv(x)
+        x = x.contiguous().transpose(1, 2)
+
+        return x
+
+
+class FastSpeech2(nn.Module):
+    def __init__(self, cfg) -> None:
+        super(FastSpeech2, self).__init__()
+        self.cfg = cfg
+        self.encoder = Encoder(cfg.model)
+        self.variance_adaptor = VarianceAdaptor(cfg)
+        self.decoder = Decoder(cfg.model)
+        self.mel_linear = nn.Linear(
+            cfg.model.transformer.decoder_hidden,
+            cfg.preprocess.n_mel,
+        )
+        self.postnet = PostNet(n_mel_channels=cfg.preprocess.n_mel)
+
+        self.speaker_emb = None
+        if cfg.train.multi_speaker_training:
+            with open(
+                os.path.join(
+                    cfg.preprocess.processed_dir, cfg.dataset[0], "spk2id.json"
+                ),
+                "r",
+            ) as f:
+                n_speaker = len(json.load(f))
+            self.speaker_emb = nn.Embedding(
+                n_speaker,
+                cfg.model.transformer.encoder_hidden,
+            )
+
+    def forward(self, data, p_control=1.0, e_control=1.0, d_control=1.0):
+        speakers = data["spk_id"]
+        texts = data["texts"]
+        src_lens = data["text_len"]
+        max_src_len = max(src_lens)
+        mel_lens = data["target_len"] if "target_len" in data else None
+        max_mel_len = max(mel_lens) if "target_len" in data else None
+        p_targets = data["pitch"] if "pitch" in data else None
+        e_targets = data["energy"] if "energy" in data else None
+        d_targets = data["durations"] if "durations" in data else None
+        src_masks = get_mask_from_lengths(src_lens, max_src_len)
+        mel_masks = (
+            get_mask_from_lengths(mel_lens, max_mel_len)
+            if mel_lens is not None
+            else None
+        )
+
+        output = self.encoder(texts, src_masks)
+
+        if self.speaker_emb is not None:
+            output = output + self.speaker_emb(speakers).unsqueeze(1).expand(
+                -1, max_src_len, -1
+            )
+
+        (
+            output,
+            p_predictions,
+            e_predictions,
+            log_d_predictions,
+            d_rounded,
+            mel_lens,
+            mel_masks,
+        ) = self.variance_adaptor(
+            output,
+            src_masks,
+            mel_masks,
+            max_mel_len,
+            p_targets,
+            e_targets,
+            d_targets,
+            p_control,
+            e_control,
+            d_control,
+        )
+
+        output, mel_masks = self.decoder(output, mel_masks)
+        output = self.mel_linear(output)
+
+        postnet_output = self.postnet(output) + output
+
+        return {
+            "output": output,
+            "postnet_output": postnet_output,
+            "p_predictions": p_predictions,
+            "e_predictions": e_predictions,
+            "log_d_predictions": log_d_predictions,
+            "d_rounded": d_rounded,
+            "src_masks": src_masks,
+            "mel_masks": mel_masks,
+            "src_lens": src_lens,
+            "mel_lens": mel_lens,
+        }
+
+
+class FastSpeech2Loss(nn.Module):
+    """FastSpeech2 Loss"""
+
+    def __init__(self, cfg):
+        super(FastSpeech2Loss, self).__init__()
+        if cfg.preprocess.use_frame_pitch:
+            self.pitch_feature_level = "frame_level"
+        else:
+            self.pitch_feature_level = "phoneme_level"
+
+        if cfg.preprocess.use_frame_energy:
+            self.energy_feature_level = "frame_level"
+        else:
+            self.energy_feature_level = "phoneme_level"
+
+        self.mse_loss = nn.MSELoss()
+        self.mae_loss = nn.L1Loss()
+
+    def forward(self, data, predictions):
+        mel_targets = data["mel"]
+        pitch_targets = data["pitch"].float()
+        energy_targets = data["energy"].float()
+        duration_targets = data["durations"]
+
+        mel_predictions = predictions["output"]
+        postnet_mel_predictions = predictions["postnet_output"]
+        pitch_predictions = predictions["p_predictions"]
+        energy_predictions = predictions["e_predictions"]
+        log_duration_predictions = predictions["log_d_predictions"]
+        src_masks = predictions["src_masks"]
+        mel_masks = predictions["mel_masks"]
+
+        src_masks = ~src_masks
+        mel_masks = ~mel_masks
+
+        log_duration_targets = torch.log(duration_targets.float() + 1)
+        mel_targets = mel_targets[:, : mel_masks.shape[1], :]
+        mel_masks = mel_masks[:, : mel_masks.shape[1]]
+
+        log_duration_targets.requires_grad = False
+        pitch_targets.requires_grad = False
+        energy_targets.requires_grad = False
+        mel_targets.requires_grad = False
+
+        if self.pitch_feature_level == "phoneme_level":
+            pitch_predictions = pitch_predictions.masked_select(src_masks)
+            pitch_targets = pitch_targets.masked_select(src_masks)
+        elif self.pitch_feature_level == "frame_level":
+            pitch_predictions = pitch_predictions.masked_select(mel_masks)
+            pitch_targets = pitch_targets.masked_select(mel_masks)
+
+        if self.energy_feature_level == "phoneme_level":
+            energy_predictions = energy_predictions.masked_select(src_masks)
+            energy_targets = energy_targets.masked_select(src_masks)
+        elif self.energy_feature_level == "frame_level":
+            energy_predictions = energy_predictions.masked_select(mel_masks)
+            energy_targets = energy_targets.masked_select(mel_masks)
+
+        log_duration_predictions = log_duration_predictions.masked_select(src_masks)
+        log_duration_targets = log_duration_targets.masked_select(src_masks)
+
+        mel_predictions = mel_predictions.masked_select(mel_masks.unsqueeze(-1))
+        postnet_mel_predictions = postnet_mel_predictions.masked_select(
+            mel_masks.unsqueeze(-1)
+        )
+        mel_targets = mel_targets.masked_select(mel_masks.unsqueeze(-1))
+
+        mel_loss = self.mae_loss(mel_predictions, mel_targets)
+        postnet_mel_loss = self.mae_loss(postnet_mel_predictions, mel_targets)
+
+        pitch_loss = self.mse_loss(pitch_predictions, pitch_targets)
+        energy_loss = self.mse_loss(energy_predictions, energy_targets)
+        duration_loss = self.mse_loss(log_duration_predictions, log_duration_targets)
+
+        total_loss = (
+            mel_loss + postnet_mel_loss + duration_loss + pitch_loss + energy_loss
+        )
+
+        return {
+            "loss": total_loss,
+            "mel_loss": mel_loss,
+            "postnet_mel_loss": postnet_mel_loss,
+            "pitch_loss": pitch_loss,
+            "energy_loss": energy_loss,
+            "duration_loss": duration_loss,
+        }
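Note that `get_mask_from_lengths` marks padding as True, which is why the loss first inverts the masks (`~src_masks`); `masked_select` then flattens only the valid positions into 1-D vectors before the L1/MSE terms. A toy illustration of that flattening:

    import torch

    pred = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    lens = torch.tensor([3, 1])
    valid = torch.arange(3)[None, :] < lens[:, None]  # True on real frames
    print(pred.masked_select(valid))  # tensor([1., 2., 3., 4.]) — padding excluded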
Amphion/models/tts/fastspeech2/fs2_inference.py ADDED
@@ -0,0 +1,193 @@
+# Copyright (c) 2023 Amphion.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import os
+import torch
+from tqdm import tqdm
+from collections import OrderedDict
+
+from models.tts.base.tts_inferece import TTSInference
+from models.tts.fastspeech2.fs2_dataset import FS2TestDataset, FS2TestCollator
+from utils.util import load_config
+from utils.io import save_audio
+from models.tts.fastspeech2.fs2 import FastSpeech2
+from models.vocoders.vocoder_inference import synthesis
+from pathlib import Path
+from processors.phone_extractor import phoneExtractor
+from text.text_token_collation import phoneIDCollation
+import numpy as np
+import json
+
+
+class FastSpeech2Inference(TTSInference):
+    def __init__(self, args, cfg):
+        TTSInference.__init__(self, args, cfg)
+        self.args = args
+        self.cfg = cfg
+        self.infer_type = args.mode
+
+    def _build_model(self):
+        self.model = FastSpeech2(self.cfg)
+        return self.model
+
+    def load_model(self, state_dict):
+        raw_dict = state_dict["model"]
+        clean_dict = OrderedDict()
+        for k, v in raw_dict.items():
+            # strip the "module." prefix left by DistributedDataParallel wrappers
+            if k.startswith("module."):
+                clean_dict[k[7:]] = v
+            else:
+                clean_dict[k] = v
+
+        self.model.load_state_dict(clean_dict)
+
+    def _build_test_dataset(self):
+        return FS2TestDataset, FS2TestCollator
+
+    @staticmethod
+    def _parse_vocoder(vocoder_dir):
+        r"""Parse vocoder config"""
+        vocoder_dir = os.path.abspath(vocoder_dir)
+        ckpt_list = [ckpt for ckpt in Path(vocoder_dir).glob("*.pt")]
+        # sort by the last training step (the base class sorts by *int(x.stem)* instead)
+        ckpt_list.sort(
+            key=lambda x: int(x.stem.split("_")[-2].split("-")[-1]), reverse=True
+        )
+        ckpt_path = str(ckpt_list[0])
+        vocoder_cfg = load_config(
+            os.path.join(vocoder_dir, "args.json"), lowercase=True
+        )
+        return vocoder_cfg, ckpt_path
+
+    @torch.inference_mode()
+    def inference_for_batches(self):
+        y_pred = []
+        for i, batch in tqdm(enumerate(self.test_dataloader)):
+            y_pred, mel_lens, _ = self._inference_each_batch(batch)
+            y_ls = y_pred.chunk(self.test_batch_size)
+            tgt_ls = mel_lens.chunk(self.test_batch_size)
+            j = 0
+            for it, l in zip(y_ls, tgt_ls):
+                l = l.item()
+                it = it.squeeze(0)[:l].detach().cpu()
+
+                uid = self.test_dataset.metadata[i * self.test_batch_size + j]["Uid"]
+                torch.save(it, os.path.join(self.args.output_dir, f"{uid}.pt"))
+                j += 1
+
+        vocoder_cfg, vocoder_ckpt = self._parse_vocoder(self.args.vocoder_dir)
+        res = synthesis(
+            cfg=vocoder_cfg,
+            vocoder_weight_file=vocoder_ckpt,
+            n_samples=None,
+            pred=[
+                torch.load(
+                    os.path.join(self.args.output_dir, "{}.pt".format(item["Uid"]))
+                ).numpy()
+                for item in self.test_dataset.metadata
+            ],
+        )
+        for it, wav in zip(self.test_dataset.metadata, res):
+            uid = it["Uid"]
+            save_audio(
+                os.path.join(self.args.output_dir, f"{uid}.wav"),
+                wav.numpy(),
+                self.cfg.preprocess.sample_rate,
+                add_silence=True,
+                turn_up=True,
+            )
+            os.remove(os.path.join(self.args.output_dir, f"{uid}.pt"))
+
+    @torch.inference_mode()
+    def _inference_each_batch(self, batch_data):
+        device = self.accelerator.device
+        control_values = (
+            self.args.pitch_control,
+            self.args.energy_control,
+            self.args.duration_control,
+        )
+        for k, v in batch_data.items():
+            batch_data[k] = v.to(device)
+
+        pitch_control, energy_control, duration_control = control_values
+
+        output = self.model(
+            batch_data,
+            p_control=pitch_control,
+            e_control=energy_control,
+            d_control=duration_control,
+        )
+        pred_res = output["postnet_output"]
+        mel_lens = output["mel_lens"].cpu()
+        return pred_res, mel_lens, 0
+
+    def inference_for_single_utterance(self):
+        text = self.args.text
+        control_values = (
+            self.args.pitch_control,
+            self.args.energy_control,
+            self.args.duration_control,
+        )
+        pitch_control, energy_control, duration_control = control_values
+
+        # get phone symbol file
+        phone_symbol_file = None
+        if self.cfg.preprocess.phone_extractor != "lexicon":
+            phone_symbol_file = os.path.join(
+                self.exp_dir, self.cfg.preprocess.symbols_dict
+            )
+            assert os.path.exists(phone_symbol_file)
+        # convert text to phone sequence
+        phone_extractor = phoneExtractor(self.cfg)
+
+        phone_seq = phone_extractor.extract_phone(text)  # phone_seq: list
+        # convert phone sequence to phone id sequence
+        phon_id_collator = phoneIDCollation(
+            self.cfg, symbols_dict_file=phone_symbol_file
+        )
+        phone_seq = ["{"] + phone_seq + ["}"]
+        phone_id_seq = phon_id_collator.get_phone_id_sequence(self.cfg, phone_seq)
+
+        # convert phone sequence to phone id sequence
+        phone_id_seq = np.array(phone_id_seq)
+        phone_id_seq = torch.from_numpy(phone_id_seq)
+
+        # get speaker id if multi-speaker training and use speaker id
+        speaker_id = None
+        if self.cfg.preprocess.use_spkid and self.cfg.train.multi_speaker_training:
+            spk2id_file = os.path.join(self.exp_dir, self.cfg.preprocess.spk2id)
+            with open(spk2id_file, "r") as f:
+                spk2id = json.load(f)
+                speaker_id = spk2id[self.args.speaker_name]
+                speaker_id = torch.from_numpy(np.array([speaker_id], dtype=np.int32))
+        else:
+            speaker_id = torch.Tensor(0).view(-1)
+
+        with torch.no_grad():
+            x_tst = phone_id_seq.to(self.device).unsqueeze(0)
+            x_tst_lengths = torch.LongTensor([phone_id_seq.size(0)]).to(self.device)
+            if speaker_id is not None:
+                speaker_id = speaker_id.to(self.device)
+
+            data = {}
+            data["texts"] = x_tst
+            data["text_len"] = x_tst_lengths
+            data["spk_id"] = speaker_id
+
+            output = self.model(
+                data,
+                p_control=pitch_control,
+                e_control=energy_control,
+                d_control=duration_control,
+            )
+            pred_res = output["postnet_output"]
+            vocoder_cfg, vocoder_ckpt = self._parse_vocoder(self.args.vocoder_dir)
+            audio = synthesis(
+                cfg=vocoder_cfg,
+                vocoder_weight_file=vocoder_ckpt,
+                n_samples=None,
+                pred=pred_res,
+            )
+        return audio[0]
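The sort key in `_parse_vocoder` assumes checkpoint stems whose second-to-last underscore field carries the training step, e.g. `epoch-0012_step-0500000_loss-0.25` (a hypothetical name, shown only to illustrate the parsing):

    stems = ["epoch-0005_step-0200000_loss-0.31", "epoch-0012_step-0500000_loss-0.25"]
    step = lambda s: int(s.split("_")[-2].split("-")[-1])  # -> 200000, 500000
    print(sorted(stems, key=step, reverse=True)[0])  # the latest checkpoint first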
Amphion/models/tts/naturalspeech2/__init__.py ADDED
File without changes
Amphion/models/tts/naturalspeech2/wavenet.py ADDED
@@ -0,0 +1,206 @@
+# Copyright (c) 2023 Amphion.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import torch
+import torch.nn as nn
+import numpy as np
+import torch.nn.functional as F
+import math
+
+
+class FiLM(nn.Module):
+    def __init__(self, in_dim, cond_dim):
+        super().__init__()
+
+        self.gain = Linear(cond_dim, in_dim)
+        self.bias = Linear(cond_dim, in_dim)
+
+        nn.init.xavier_uniform_(self.gain.weight)
+        nn.init.constant_(self.gain.bias, 1)
+
+        nn.init.xavier_uniform_(self.bias.weight)
+        nn.init.constant_(self.bias.bias, 0)
+
+    def forward(self, x, condition):
+        gain = self.gain(condition)
+        bias = self.bias(condition)
+        if gain.dim() == 2:
+            gain = gain.unsqueeze(-1)
+        if bias.dim() == 2:
+            bias = bias.unsqueeze(-1)
+        return x * gain + bias
+
+
+class Mish(nn.Module):
+    def forward(self, x):
+        return x * torch.tanh(F.softplus(x))
+
+
+def Conv1d(*args, **kwargs):
+    layer = nn.Conv1d(*args, **kwargs)
+    nn.init.kaiming_normal_(layer.weight)
+    return layer
+
+
+def Linear(*args, **kwargs):
+    layer = nn.Linear(*args, **kwargs)
+    layer.weight.data.normal_(0.0, 0.02)
+    return layer
+
+
+class SinusoidalPosEmb(nn.Module):
+    def __init__(self, dim):
+        super().__init__()
+        self.dim = dim
+
+    def forward(self, x):
+        device = x.device
+        half_dim = self.dim // 2
+        emb = math.log(10000) / (half_dim - 1)
+        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
+        emb = x[:, None] * emb[None, :]
+        emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
+        return emb
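The embedding maps a batch of scalar diffusion steps (B,) to (B, dim), the first half sine and the second half cosine at geometrically spaced frequencies. A quick check, assuming the class above is in scope:

    import torch

    emb = SinusoidalPosEmb(128)
    t = torch.tensor([0.0, 10.0, 100.0])  # three diffusion timesteps
    print(emb(t).shape)  # torch.Size([3, 128])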
+
+
+class ResidualBlock(nn.Module):
+    def __init__(self, hidden_dim, attn_head, dilation, drop_out, has_cattn=False):
+        super().__init__()
+
+        self.hidden_dim = hidden_dim
+        self.dilation = dilation
+        self.has_cattn = has_cattn
+        self.attn_head = attn_head
+        self.drop_out = drop_out
+
+        self.dilated_conv = Conv1d(
+            hidden_dim, 2 * hidden_dim, 3, padding=dilation, dilation=dilation
+        )
+        self.diffusion_proj = Linear(hidden_dim, hidden_dim)
+
+        self.cond_proj = Conv1d(hidden_dim, hidden_dim * 2, 1)
+        self.out_proj = Conv1d(hidden_dim, hidden_dim * 2, 1)
+
+        if self.has_cattn:
+            self.attn = nn.MultiheadAttention(
+                hidden_dim, attn_head, 0.1, batch_first=True
+            )
+            self.film = FiLM(hidden_dim * 2, hidden_dim)
+
+            self.ln = nn.LayerNorm(hidden_dim)
+
+        self.dropout = nn.Dropout(self.drop_out)
+
+    def forward(self, x, x_mask, cond, diffusion_step, spk_query_emb):
+        diffusion_step = self.diffusion_proj(diffusion_step).unsqueeze(-1)  # (B, d, 1)
+        cond = self.cond_proj(cond)  # (B, 2*d, T)
+
+        y = x + diffusion_step
+        if x_mask is not None:
+            y = y * x_mask.to(y.dtype)[:, None, :]  # (B, d, T)
+
+        if self.has_cattn:
+            y_ = y.transpose(1, 2)
+            y_ = self.ln(y_)
+
+            y_, _ = self.attn(y_, spk_query_emb, spk_query_emb)  # (B, T, d)
+
+        y = self.dilated_conv(y) + cond  # (B, 2*d, T)
+
+        if self.has_cattn:
+            y = self.film(y.transpose(1, 2), y_)  # (B, T, 2*d)
+            y = y.transpose(1, 2)  # (B, 2*d, T)
+
+        # gated activation unit: split channels into gate and filter halves
+        gate, filter_ = torch.chunk(y, 2, dim=1)
+        y = torch.sigmoid(gate) * torch.tanh(filter_)
+
+        y = self.out_proj(y)
+
+        residual, skip = torch.chunk(y, 2, dim=1)
+
+        if x_mask is not None:
+            residual = residual * x_mask.to(y.dtype)[:, None, :]
+            skip = skip * x_mask.to(y.dtype)[:, None, :]
+
+        return (x + residual) / math.sqrt(2.0), skip
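The `sigmoid(gate) * tanh(filter)` line is the standard WaveNet gated activation: the sigmoid half decides how much of the tanh half passes through. In isolation:

    import torch

    y = torch.randn(2, 8, 50)                 # (B, 2*d, T) with d = 4
    gate, filter_ = torch.chunk(y, 2, dim=1)  # two (B, 4, T) halves
    out = torch.sigmoid(gate) * torch.tanh(filter_)
    print(out.shape)  # torch.Size([2, 4, 50]); values bounded in (-1, 1)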
+
+
+class WaveNet(nn.Module):
+    def __init__(self, cfg):
+        super().__init__()
+
+        self.cfg = cfg
+        self.in_dim = cfg.input_size
+        self.hidden_dim = cfg.hidden_size
+        self.out_dim = cfg.out_size
+        self.num_layers = cfg.num_layers
+        self.cross_attn_per_layer = cfg.cross_attn_per_layer
+        self.dilation_cycle = cfg.dilation_cycle
+        self.attn_head = cfg.attn_head
+        self.drop_out = cfg.drop_out
+
+        self.in_proj = Conv1d(self.in_dim, self.hidden_dim, 1)
+        self.diffusion_embedding = SinusoidalPosEmb(self.hidden_dim)
+
+        self.mlp = nn.Sequential(
+            Linear(self.hidden_dim, self.hidden_dim * 4),
+            Mish(),
+            Linear(self.hidden_dim * 4, self.hidden_dim),
+        )
+
+        self.cond_ln = nn.LayerNorm(self.hidden_dim)
+
+        self.layers = nn.ModuleList(
+            [
+                ResidualBlock(
+                    self.hidden_dim,
+                    self.attn_head,
+                    2 ** (i % self.dilation_cycle),
+                    self.drop_out,
+                    has_cattn=(i % self.cross_attn_per_layer == 0),
+                )
+                for i in range(self.num_layers)
+            ]
+        )
+
+        self.skip_proj = Conv1d(self.hidden_dim, self.hidden_dim, 1)
+        self.out_proj = Conv1d(self.hidden_dim, self.out_dim, 1)
+
+        nn.init.zeros_(self.out_proj.weight)
+
+    def forward(self, x, x_mask, cond, diffusion_step, spk_query_emb):
+        """
+        x: (B, 128, T)
+        x_mask: (B, T), mask is 0
+        cond: (B, T, 512)
+        diffusion_step: (B,)
+        spk_query_emb: (B, 32, 512)
+        """
+        cond = self.cond_ln(cond)
+        cond_input = cond.transpose(1, 2)
+
+        x_input = self.in_proj(x)
+
+        x_input = F.relu(x_input)
+
+        diffusion_step = self.diffusion_embedding(diffusion_step).to(x.dtype)
+        diffusion_step = self.mlp(diffusion_step)
+
+        skip = []
+        for _, layer in enumerate(self.layers):
+            x_input, skip_connection = layer(
+                x_input, x_mask, cond_input, diffusion_step, spk_query_emb
+            )
+            skip.append(skip_connection)
+
+        x_input = torch.sum(torch.stack(skip), dim=0) / math.sqrt(self.num_layers)
+
+        x_out = self.skip_proj(x_input)
+
+        x_out = F.relu(x_out)
+
+        x_out = self.out_proj(x_out)  # (B, 128, T)
+
+        return x_out
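Putting the pieces together, a smoke test with a hypothetical config (the field values below are illustrative, not the ones Amphion ships):

    import torch
    from types import SimpleNamespace

    cfg = SimpleNamespace(
        input_size=128, hidden_size=512, out_size=128, num_layers=8,
        cross_attn_per_layer=3, dilation_cycle=2, attn_head=8, drop_out=0.1,
    )
    net = WaveNet(cfg)
    B, T = 2, 100
    out = net(
        x=torch.randn(B, 128, T),
        x_mask=torch.ones(B, T),
        cond=torch.randn(B, T, 512),
        diffusion_step=torch.randint(0, 1000, (B,)).float(),
        spk_query_emb=torch.randn(B, 32, 512),
    )
    print(out.shape)  # torch.Size([2, 128, 100])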
Amphion/models/tts/valle/valle.py ADDED
@@ -0,0 +1,794 @@
+# Copyright (c) 2023 Amphion.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+# This code is modified from https://github.com/lifeiteng/vall-e/blob/main/valle/models/valle.py
+
+import random
+from typing import Dict, Iterator, List, Tuple, Union
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torchmetrics.classification import MulticlassAccuracy
+from utils.util import make_pad_mask
+from utils.topk_sampling import topk_sampling
+from modules.general import Transpose
+from modules.encoder import TokenEmbedding
+from modules.general import PromptedFeatures
+from modules.transformer import SinePositionalEmbedding
+from modules.norms import AdaptiveLayerNorm, LayerNorm
+from modules.transformer.transformer import TransformerEncoder, TransformerEncoderLayer
+
+
+class VALLE(nn.Module):
+    def __init__(
+        self,
+        cfg,
+        decoder_cls=TransformerEncoder,
+        decoder_layer_cls=TransformerEncoderLayer,
+    ):
+        super().__init__()
+        decoder_dim = cfg.decoder_dim
+        nhead = cfg.nhead
+        nar_scale_factor = cfg.nar_scale_factor
+        num_quantizers = cfg.num_quantizers
+        num_decoder_layers = cfg.num_decoder_layers
+        nar_decoder_dim = int(decoder_dim * nar_scale_factor)
+
+        self.ar_text_embedding = TokenEmbedding(decoder_dim, cfg.text_token_num)
+        self.nar_text_embedding = TokenEmbedding(nar_decoder_dim, cfg.text_token_num)
+
+        self.ar_audio_prepend_bos = cfg.prepend_bos
+        self.ar_audio_embedding = TokenEmbedding(
+            decoder_dim, cfg.audio_token_num + 1 + int(cfg.prepend_bos)
+        )
+        self.audio_token_num = cfg.audio_token_num
+
+        # PreNet of AR
+        if cfg.add_prenet:
+            self.ar_text_prenet = nn.Sequential(
+                Transpose(),
+                nn.Conv1d(decoder_dim, decoder_dim, kernel_size=5, padding="same"),
+                nn.BatchNorm1d(decoder_dim),
+                nn.ReLU(),
+                nn.Dropout(0.5),
+                nn.Conv1d(decoder_dim, decoder_dim, kernel_size=5, padding="same"),
+                nn.BatchNorm1d(decoder_dim),
+                nn.ReLU(),
+                nn.Dropout(0.5),
+                nn.Conv1d(decoder_dim, decoder_dim, kernel_size=5, padding="same"),
+                nn.BatchNorm1d(decoder_dim),
+                nn.ReLU(),
+                nn.Dropout(0.5),
+                Transpose(),
+                nn.Linear(decoder_dim, decoder_dim),
+            )
+
+            self.ar_audio_prenet = nn.Sequential(
+                nn.Linear(decoder_dim, 256),
+                nn.ReLU(),
+                nn.Dropout(0.25),
+                nn.Linear(256, 256),
+                nn.ReLU(),
+                nn.Dropout(0.25),
+                nn.Linear(256, decoder_dim),
+            )
+        else:
+            self.ar_text_prenet = nn.Identity()
+            self.ar_audio_prenet = nn.Identity()
+
+        self.ar_text_position = SinePositionalEmbedding(
+            decoder_dim,
+            dropout=0.1,
+            scale=False,
+            alpha=True,
+        )
+        self.ar_audio_position = SinePositionalEmbedding(
+            decoder_dim,
+            dropout=0.1,
+            scale=False,
+            alpha=True,
+        )
+
+        self.ar_decoder = decoder_cls(
+            decoder_layer_cls(
+                decoder_dim,
+                nhead,
+                dim_feedforward=decoder_dim * 4,
+                dropout=0.1,
+                batch_first=True,
+                norm_first=cfg.norm_first,
+            ),
+            num_layers=num_decoder_layers,
+            norm=LayerNorm(decoder_dim) if cfg.norm_first else None,
+        )
+        self.ar_predict_layer = nn.Linear(
+            decoder_dim, cfg.audio_token_num + 1, bias=False
+        )
+
+        self.ar_accuracy_metric = MulticlassAccuracy(
+            cfg.audio_token_num + 1,
+            top_k=10,
+            average="micro",
+            multidim_average="global",
+            ignore_index=cfg.audio_token_num,
+        )
+
+        self.rng = random.Random(0)
+        self.num_heads = nhead
+        self.prefix_mode = cfg.prefix_mode
+        self.num_quantizers = num_quantizers
+
+        assert num_quantizers >= 1
+        if num_quantizers > 1:
+            self.nar_audio_embeddings = nn.ModuleList(
+                [
+                    TokenEmbedding(nar_decoder_dim, cfg.audio_token_num + 1)
+                ]  # the first layer needs one extra id for the EOS token appended in pad_y_eos
+                + [
+                    TokenEmbedding(nar_decoder_dim, cfg.audio_token_num)
+                    for i in range(num_quantizers - 1)
+                ]
+            )
+
+            if cfg.add_prenet:
+                self.nar_text_prenet = nn.Sequential(
+                    Transpose(),
+                    nn.Conv1d(
+                        nar_decoder_dim, nar_decoder_dim, kernel_size=5, padding="same"
+                    ),
+                    nn.BatchNorm1d(nar_decoder_dim),
+                    nn.ReLU(),
+                    nn.Dropout(0.5),
+                    nn.Conv1d(
+                        nar_decoder_dim, nar_decoder_dim, kernel_size=5, padding="same"
+                    ),
+                    nn.BatchNorm1d(nar_decoder_dim),
+                    nn.ReLU(),
+                    nn.Dropout(0.5),
+                    nn.Conv1d(
+                        nar_decoder_dim, nar_decoder_dim, kernel_size=5, padding="same"
+                    ),
+                    nn.BatchNorm1d(nar_decoder_dim),
+                    nn.ReLU(),
+                    nn.Dropout(0.5),
+                    Transpose(),
+                    nn.Linear(nar_decoder_dim, nar_decoder_dim),
+                )
+                self.nar_audio_prenet = nn.Sequential(
+                    nn.Linear(nar_decoder_dim, 256),
+                    nn.ReLU(),
+                    nn.Dropout(0.25),
+                    nn.Linear(256, 256),
+                    nn.ReLU(),
+                    nn.Dropout(0.25),
+                    nn.Linear(256, nar_decoder_dim),
+                )
+            else:
+                self.nar_text_prenet = nn.Identity()
+                self.nar_audio_prenet = nn.Identity()
+
+            self.nar_text_position = SinePositionalEmbedding(
+                nar_decoder_dim,
+                dropout=0.0,
+                scale=False,
+                alpha=False,
+            )
+            self.nar_audio_position = SinePositionalEmbedding(
+                nar_decoder_dim,
+                dropout=0.1,
+                scale=False,
+                alpha=False,
+            )
+
+            self.nar_decoder = decoder_cls(
+                decoder_layer_cls(
+                    nar_decoder_dim,
+                    int(nhead * nar_scale_factor),
+                    dim_feedforward=nar_decoder_dim * 4,
+                    dropout=0.1,
+                    batch_first=True,
+                    norm_first=cfg.norm_first,
+                    adaptive_layer_norm=True,
+                ),
+                num_layers=int(num_decoder_layers * nar_scale_factor),
+                norm=(
+                    AdaptiveLayerNorm(
+                        nar_decoder_dim, norm=nn.LayerNorm(nar_decoder_dim)
+                    )
+                    if cfg.norm_first
+                    else None
+                ),
+            )
+            self.nar_predict_layers = nn.ModuleList(
+                [
+                    nn.Linear(nar_decoder_dim, cfg.audio_token_num, bias=False)
+                    for i in range(num_quantizers - 1)
+                ]
+            )
+            self.nar_stage_embeddings = nn.ModuleList(
+                [TokenEmbedding(nar_decoder_dim, 1) for i in range(num_quantizers - 1)]
+            )
+
+            if cfg.share_embedding:
+                for j in range(0, num_quantizers - 2):
+                    self.nar_predict_layers[j].weight = self.nar_audio_embeddings[
+                        j + 2
+                    ].weight
+
+            self.nar_accuracy_metric = MulticlassAccuracy(
+                cfg.audio_token_num + 1,
+                top_k=10,
+                average="micro",
+                multidim_average="global",
+                ignore_index=cfg.audio_token_num,
+            )
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        x_lens: torch.Tensor,
+        y: Union[torch.Tensor, PromptedFeatures],
+        y_lens: Union[torch.Tensor, PromptedFeatures],
+        reduction: str = "sum",
+        train_stage: int = 0,
+        **kwargs,
+    ) -> Tuple[torch.Tensor, Union[torch.Tensor, None]]:
+        """
+        Args:
+          x:
+            A 2-D tensor of shape (N, S).
+          x_lens:
+            A 1-D tensor of shape (N,). It contains the number of tokens in `x`
+            before padding.
+          y:
+            A 3-D tensor of shape (N, T, 8).
+          y_lens:
+            A 1-D tensor of shape (N,). It contains the number of tokens in `y`
+            before padding.
+          train_stage:
+            0: AR & NAR modules, 1: AR modules, 2: NAR modules
+        Returns:
+          Return the predicted audio code matrix, cross-entropy loss and Top-10 accuracy.
+        """
+        assert x.ndim == 2, x.shape
+        assert x_lens.ndim == 1, x_lens.shape
+
+        y_prompts_codes = None
+        if isinstance(y, PromptedFeatures):
+            y_prompts_codes, y = y.data
+            prompts_len, y_lens = y_lens.data
+            assert prompts_len.min() == prompts_len.max()
+            assert self.prefix_mode == 4
+            y_prompts_codes = y_prompts_codes.type(torch.int64)
+
+        assert y.ndim == 3, y.shape
+        assert y_lens.ndim == 1, y_lens.shape
+
+        x_mask = make_pad_mask(x_lens).to(x.device)
+        y_mask = make_pad_mask(y_lens).to(y.device)
+        y_mask_int = y_mask.type(torch.int64)
+
+        text = x
+        codes = y.type(torch.int64) * (1 - y_mask_int.unsqueeze(dim=-1))
+
+        y, targets = self.pad_y_eos(
+            codes[..., 0], y_mask_int, eos_id=self.audio_token_num
+        )
+        self.y_mask_int = y_mask_int
+
+        metrics = {}
+        total_loss = 0.0
+
+        xy_padding_mask = torch.concat([x_mask, y_mask], dim=1)
+        if self.ar_audio_prepend_bos:
+            ar_xy_padding_mask = torch.concat(
+                [x_mask, F.pad(y_mask, (1, 0), value=False)], dim=1
+            )
+        else:
+            ar_xy_padding_mask = xy_padding_mask
+        self.xy_padding_mask = xy_padding_mask
+        self.ar_xy_padding_mask = ar_xy_padding_mask
+
+        # AR Decoder
+        if train_stage in [0, 1]:
+            ar_loss, ar_metrics = self._forward_ar_decoder(
+                text, x_lens.max(), y, y_lens.max(), targets, x_mask, y_mask, reduction
+            )
+            total_loss += ar_loss
+            metrics["AR_Top100Acc"] = ar_metrics
+
+        # NAR Decoder
+        if self.ar_audio_prepend_bos:
+            y = y[:, 1:]
+
+        if self.num_quantizers > 1 and train_stage in [0, 2]:
+            nar_loss, nar_metrics = self._forward_nar_decoder(
+                text,
+                x_lens,
+                y,
+                y_lens,
+                codes,
+                y_prompts_codes,
+                x_mask,
+                y_mask,
+                reduction,
+            )
+            total_loss += nar_loss
+            metrics["NAR_Top100Acc"] = nar_metrics
+
+        if train_stage == 0:
+            total_loss = total_loss / 2.0
+
+        return total_loss, metrics
+
+    def _forward_ar_decoder(
+        self, x, x_len, y, y_lens, targets, x_mask, y_mask, reduction
+    ):
+        x = self.ar_text_embedding(x)
+        x = self.ar_text_prenet(x)
+        x = self.ar_text_position(x)
+
+        y_len = y_lens.max() + int(self.ar_audio_prepend_bos)
+
+        x_attn_mask = F.pad(
+            torch.zeros((x_len, x_len), dtype=torch.bool, device=x.device),
+            (0, y_len),
+            value=True,
+        )
+        y_attn_mask = F.pad(
+            torch.triu(
+                torch.ones(y_len, y_len, dtype=torch.bool, device=x.device),
+                diagonal=1,
+            ),
+            (x_len, 0),
+            value=False,
+        )
+        xy_attn_mask = torch.concat([x_attn_mask, y_attn_mask], dim=0)
+
+        bsz, src_len = x.shape[0], x_len + y_len
+        _xy_padding_mask = (
+            self.ar_xy_padding_mask.view(bsz, 1, 1, src_len)
+            .expand(-1, self.num_heads, -1, -1)
+            .reshape(bsz * self.num_heads, 1, src_len)
+        )
+        xy_attn_mask = xy_attn_mask.logical_or(_xy_padding_mask)
+
+        new_attn_mask = torch.zeros_like(xy_attn_mask, dtype=x.dtype)
+        new_attn_mask.masked_fill_(xy_attn_mask, float("-inf"))
+        xy_attn_mask = new_attn_mask
+
+        y_emb = self.ar_audio_embedding(y)
+        y_emb = self.ar_audio_prenet(y_emb)
+        y_pos = self.ar_audio_position(y_emb)
+
+        xy_pos = torch.concat([x, y_pos], dim=1)
+
+        xy_dec, _ = self.ar_decoder(
+            (xy_pos, None),
+            mask=xy_attn_mask,
+        )
+        logits = self.ar_predict_layer(xy_dec[:, x_len:]).permute(0, 2, 1)
+        ar_loss = F.cross_entropy(logits, targets, reduction=reduction)
+
+        ar_metrics = self.ar_accuracy_metric(
+            logits.detach(), targets
+        ).item() * y_lens.sum().type(torch.float32)
+
+        return ar_loss, ar_metrics
+
+    def _forward_nar_decoder(
+        self, x, x_lens, y, y_lens, codes, y_prompts_codes, x_mask, y_mask, reduction
+    ):
+        num_nar_layers = self.num_quantizers - 1
+        nar_stage = self.rng.choices(
+            [_k for _k in range(1, self.num_quantizers)],
+            weights=[1.0 / num_nar_layers] * num_nar_layers,
+            k=1,
+        )[0]
+
+        x = self.nar_text_embedding(x)
+        x = self.nar_text_prenet(x)
+        x = self.nar_text_position(x)
+
+        y_emb, prefix_len = self._prepare_prompts(
+            y, y_lens, codes, nar_stage, y_prompts_codes
+        )
+
+        y_len = y_lens.max()
+        targets = codes[..., nar_stage] + self.audio_token_num * self.y_mask_int
+        if self.prefix_mode in [2, 4]:
+            xy_padding_mask = torch.concat(
+                [
+                    x_mask,
+                    F.pad(y_mask, (y_emb.shape[1] - y_len, 0), value=False),
+                ],
+                dim=1,
+            )
+        elif self.prefix_mode == 1:
+            targets = targets[:, prefix_len:]
+
+        y_pos = self.nar_audio_prenet(y_emb)
+        y_pos = self.nar_audio_position(y_pos)
+        xy_pos = torch.concat([x, y_pos], dim=1)
+        xy_dec, _ = self.nar_decoder(
+            (xy_pos, self.nar_stage_embeddings[nar_stage - 1].weight),
+            src_key_padding_mask=self.xy_padding_mask,
+        )
+        xy_dec = xy_dec[:, x_lens.max() + prefix_len :]
+        if self.prefix_mode == 4:
+            prefix_len = 0
+        logits = self.nar_predict_layers[nar_stage - 1](xy_dec).permute(0, 2, 1)
+
+        total_length = (y_lens).sum().type(torch.float32)
+        nar_loss = F.cross_entropy(
+            logits,
+            targets,
+            ignore_index=self.audio_token_num,
+            reduction=reduction,
+        ) * (total_length / (total_length - prefix_len * x.shape[0]))
+        nar_metrics = (
+            self.nar_accuracy_metric(
+                F.pad(
+                    logits.detach(),
+                    (0, 0, 0, 1, 0, 0),
+                    value=logits.min().cpu().item(),
+                ),
+                targets,
+            ).item()
+            * total_length
+        )
+        return nar_loss, nar_metrics
+
+    def inference(
+        self,
+        x: torch.Tensor,
+        x_lens: torch.Tensor,
+        y: torch.Tensor,
+        enroll_x_lens: torch.Tensor,
+        top_k: int = -100,
+        temperature: float = 1.0,
+    ) -> torch.Tensor:
+        """
+        Args:
+          x:
+            A 2-D tensor of shape (1, S).
+          x_lens:
+            A 1-D tensor of shape (1,). It contains the number of tokens in `x`
+            before padding.
+          y:
+            A 3-D tensor of shape (1, T, 8).
+          top_k: (`optional`) int
+            The number of highest probability tokens to keep for top-k filtering. Defaults to -100.
+          temperature: (`optional`) float
+            The value used to modulate the next token probabilities. Must be strictly positive. Defaults to 1.0.
+        Returns:
+          Return the predicted audio code matrix.
+        """
+        assert x.ndim == 2, x.shape
+        assert x_lens.ndim == 1, x_lens.shape
+        assert y.ndim == 3, y.shape
+        assert y.shape[0] == 1, y.shape
+
+        assert torch.all(x_lens > 0)
+
+        text = x
+        x = self.ar_text_embedding(text)
+        x = self.ar_text_prenet(x)
+        x = self.ar_text_position(x)
+
+        text_len = x_lens.max()
+        prompts = y
+        prefix_len = y.shape[1]
+
+        # AR Decoder
+        y = prompts[..., 0]
+        if self.ar_audio_prepend_bos:
+            y = F.pad(y, (1, 0), value=self.audio_token_num + 1)
+
+        x_len = x_lens.max()
+        x_attn_mask = torch.zeros((x_len, x_len), dtype=torch.bool)
+
+        while True:
+            y_emb = self.ar_audio_embedding(y)
+            y_emb = self.ar_audio_prenet(y_emb)
+            y_pos = self.ar_audio_position(y_emb)
+            xy_pos = torch.concat([x, y_pos], dim=1)
+
+            y_len = y.shape[1]
+            x_attn_mask_pad = F.pad(
+                x_attn_mask,
+                (0, y_len),
+                value=True,
+            )
+            y_attn_mask = F.pad(
+                torch.triu(torch.ones(y_len, y_len, dtype=torch.bool), diagonal=1),
+                (x_len, 0),
+                value=False,
+            )
+            xy_attn_mask = torch.concat([x_attn_mask_pad, y_attn_mask], dim=0).to(
+                y.device
+            )
+
+            xy_dec, _ = self.ar_decoder(
+                (xy_pos, None),
+                mask=xy_attn_mask,
+            )
+            logits = self.ar_predict_layer(xy_dec[:, -1])
+            samples = topk_sampling(
+                logits, top_k=top_k, top_p=1.0, temperature=temperature
+            )
+
+            if (
+                torch.argmax(logits, dim=-1)[0] == self.audio_token_num
+                or samples[0, 0] == self.audio_token_num
+                or (y.shape[1] - prompts.shape[1]) > x_lens.max() * 16
+            ):
+                if prompts.shape[1] == y.shape[1]:
+                    raise RuntimeError("a well-trained model shouldn't reach here.")
+
+                break
+
+            y = torch.concat([y, samples], dim=1)
+
+        codes = [y[:, prefix_len + int(self.ar_audio_prepend_bos) :]]
+        if self.num_quantizers == 1:
+            return torch.stack(codes, dim=-1)
+
+        # Non-AR Decoders
+        y_emb = self.nar_audio_embeddings[0](y[:, int(self.ar_audio_prepend_bos) :])
+
+        if self.prefix_mode in [2, 4]:
+            enrolled_len = enroll_x_lens.max().item()
+            # SOS + Synthesis Text + EOS
+            text = torch.concat(
+                [
+                    text[:, :1],
+                    text[:, enrolled_len - 1 :],
+                ],
+                dim=1,
+            )
+            text_len = text_len - (enrolled_len - 2)
+            assert text.shape[0] == 1
+
+        x = self.nar_text_embedding(text)
+        x = self.nar_text_prenet(x)
+        x = self.nar_text_position(x)
+
+        if self.prefix_mode == 0:
+            for i, (predict_layer, embedding_layer) in enumerate(
+                zip(
+                    self.nar_predict_layers,
+                    self.nar_audio_embeddings[1:],
+                )
+            ):
+                y_pos = self.nar_audio_prenet(y_emb)
+                y_pos = self.nar_audio_position(y_pos)
+                xy_pos = torch.concat([x, y_pos], dim=1)
+
+                xy_dec, _ = self.nar_decoder(
+                    (xy_pos, self.nar_stage_embeddings[i].weight)
+                )
+                logits = predict_layer(xy_dec[:, text_len + prefix_len :])
+
+                samples = torch.argmax(logits, dim=-1)
+                codes.append(samples)
+
+                if i < self.num_quantizers - 2:
+                    y_emb[:, :prefix_len] += embedding_layer(prompts[..., i + 1])
+                    y_emb[:, prefix_len:] += embedding_layer(samples)
+        else:
+            for j in range(1, self.num_quantizers):
+                y_emb[:, :prefix_len] += self.nar_audio_embeddings[j](prompts[..., j])
+
+            for i, (predict_layer, embedding_layer) in enumerate(
+                zip(
+                    self.nar_predict_layers,
+                    self.nar_audio_embeddings[1:],
+                )
+            ):
+                y_pos = self.nar_audio_prenet(y_emb)
+                y_pos = self.nar_audio_position(y_pos)
+                xy_pos = torch.concat([x, y_pos], dim=1)
+
+                xy_dec, _ = self.nar_decoder(
+                    (xy_pos, self.nar_stage_embeddings[i].weight)
+                )
+                logits = predict_layer(xy_dec[:, text_len + prefix_len :])
+
+                samples = torch.argmax(logits, dim=-1)
+                codes.append(samples)
+
+                if i < self.num_quantizers - 2:
+                    y_emb[:, prefix_len:] += embedding_layer(samples)
+
+        assert len(codes) == self.num_quantizers
+        return torch.stack(codes, dim=-1)
+
+    def continual(
+        self,
+        x: torch.Tensor,
+        x_lens: torch.Tensor,
+        y: torch.Tensor,
+    ) -> torch.Tensor:
+        """
+        Args:
+          x:
+            A 2-D tensor of shape (1, S).
+          x_lens:
+            A 1-D tensor of shape (1,). It contains the number of tokens in `x`
+            before padding.
+          y:
+            A 3-D tensor of shape (1, T, 8).
+        Returns:
+          Return the predicted audio code matrix.
+        """
+        assert x.ndim == 2, x.shape
+        assert x_lens.ndim == 1, x_lens.shape
+        assert y.ndim == 3, y.shape
+        assert y.shape[0] == 1, y.shape
+
+        assert torch.all(x_lens > 0)
+        assert self.num_quantizers == 8
+
+        text = x
+        x = self.ar_text_embedding(text)
+        x = self.ar_text_prenet(x)
+        x = self.ar_text_position(x)
+
+        text_len = x_lens.max()
+
+        prefix_len = min(int(y.shape[1] * 0.5), 3 * 75)
+
+        # AR Decoder
+        prompts = y[:, :prefix_len]
+
+        codes = [y[:, prefix_len:, 0]]
+        # Non-AR Decoders
+        x = self.nar_text_embedding(text)
+        x = self.nar_text_prenet(x)
+        x = self.nar_text_position(x)
+
+        y_emb = self.nar_audio_embeddings[0](y[..., 0])
+
+        if self.prefix_mode == 0:
+            for i, (predict_layer, embedding_layer) in enumerate(
+                zip(
+                    self.nar_predict_layers,
+                    self.nar_audio_embeddings[1:],
+                )
+            ):
+                y_pos = self.nar_audio_position(y_emb)
+                y_pos = self.nar_audio_prenet(y_pos)
+                xy_pos = torch.concat([x, y_pos], dim=1)
+
+                xy_dec, _ = self.nar_decoder(
+                    (xy_pos, self.nar_stage_embeddings[i].weight)
+                )
+                logits = predict_layer(xy_dec[:, text_len + prefix_len :])
+
+                samples = torch.argmax(logits, dim=-1)
+                codes.append(samples)
+
+                if i < 6:
+                    y_emb[:, :prefix_len] += embedding_layer(prompts[..., i + 1])
+                    y_emb[:, prefix_len:] += embedding_layer(samples)
+        else:
+            for j in range(1, 8):
+                y_emb[:, :prefix_len] += self.nar_audio_embeddings[j](prompts[..., j])
+
+            for i, (predict_layer, embedding_layer) in enumerate(
+                zip(
+                    self.nar_predict_layers,
+                    self.nar_audio_embeddings[1:],
+                )
+            ):
+                y_pos = self.nar_audio_prenet(y_emb)
+                y_pos = self.nar_audio_position(y_pos)
+                xy_pos = torch.concat([x, y_pos], dim=1)
+
+                xy_dec, _ = self.nar_decoder(
+                    (xy_pos, self.nar_stage_embeddings[i].weight)
+                )
+                logits = predict_layer(xy_dec[:, text_len + prefix_len :])
+
+                samples = torch.argmax(logits, dim=-1)
+                codes.append(samples)
+
+                if i < 6:
+                    y_emb[:, prefix_len:] += embedding_layer(samples)
+
+        assert len(codes) == 8
+        return torch.stack(codes, dim=-1)
+
+    def stage_parameters(self, stage: int = 1) -> Iterator[nn.Parameter]:
+        assert stage > 0
+        if stage == 1:
+            for name, param in self.named_parameters():
+                if name.startswith("ar_"):
+                    yield param
+
+        if stage == 2:
+            for name, param in self.named_parameters():
+                if name.startswith("nar_"):
+                    yield param
+
+    def stage_named_parameters(
+        self, stage: int = 1
+    ) -> Iterator[Tuple[str, nn.Parameter]]:
+        assert stage > 0
+        if stage == 1:
+            for pair in self.named_parameters():
+                if pair[0].startswith("ar_"):
+                    yield pair
+
+        if stage == 2:
+            for pair in self.named_parameters():
+                if pair[0].startswith("nar_"):
+                    yield pair
+
+    def pad_y_eos(self, y, y_mask_int, eos_id):
+        targets = F.pad(y, (0, 1), value=0) + eos_id * F.pad(
+            y_mask_int, (0, 1), value=1
+        )
+        if self.ar_audio_prepend_bos:
+            return (
+                F.pad(targets[:, :-1], (1, 0), value=self.audio_token_num + 1),
+                targets,
+            )
+
+        return targets[:, :-1], targets[:, 1:]
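`pad_y_eos` turns a code sequence into (input, target) pairs for teacher forcing: EOS overwrites the padded positions plus the one appended slot, and with `prepend_bos` the input is the target shifted right behind a BOS id (audio_token_num + 1). A toy trace with audio_token_num = 1024:

    import torch
    import torch.nn.functional as F

    y = torch.tensor([[7, 8, 9, 0]])           # last position is padding (already zeroed)
    y_mask_int = torch.tensor([[0, 0, 0, 1]])  # 1 marks padding
    eos_id = 1024

    targets = F.pad(y, (0, 1), value=0) + eos_id * F.pad(y_mask_int, (0, 1), value=1)
    # targets = [[7, 8, 9, 1024, 1024]]
    inputs, shifted = targets[:, :-1], targets[:, 1:]
    # without BOS: input [[7, 8, 9, 1024]] predicts [[8, 9, 1024, 1024]]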
+
+    def _prepare_prompts(self, y, y_lens, codes, nar_stage, y_prompts_codes):
+        # 5.1 For the NAR acoustic prompt tokens, we select a random segment waveform of 3 seconds
+        # from the same utterance.
+        # We implement this differently.
+        if self.prefix_mode == 0:
+            # no prefix
+            prefix_len = 0
+            y_emb = self.nar_audio_embeddings[0](y)
+            for j in range(1, nar_stage):
+                # Formula (4) (5)
+                y_emb = y_emb + self.nar_audio_embeddings[j](codes[..., j])
+        elif self.prefix_mode == 1:
+            # prefix at beginning
+            int_low = (0.25 * y_lens.min()).type(torch.int64).item()
+            prefix_len = torch.randint(int_low, int_low * 2, size=()).item()
+            prefix_len = min(prefix_len, 225)  # 24000/320 * 3s = 225 frames
+
+            y_prompts = self.nar_audio_embeddings[0](y[:, :prefix_len])
+            y_emb = self.nar_audio_embeddings[0](y[:, prefix_len:])
+            for j in range(1, self.num_quantizers):
+                y_prompts += self.nar_audio_embeddings[j](codes[:, :prefix_len, j])
+                if j < nar_stage:
+                    y_emb += self.nar_audio_embeddings[j](codes[:, prefix_len:, j])
+            y_emb = torch.concat([y_prompts, y_emb], dim=1)
+        elif self.prefix_mode in [2, 4]:
+            if self.prefix_mode == 2:
+                # random prefix
+                prefix_len = min(225, int(0.25 * y_lens.min().item()))
+
+                y_prompts_codes = []
+                for b in range(codes.shape[0]):
+                    start = self.rng.randint(0, y_lens[b].item() - prefix_len)
+                    y_prompts_codes.append(
+                        torch.clone(codes[b, start : start + prefix_len])
+                    )
+                    codes[b, start : start + prefix_len, nar_stage] = (
+                        self.audio_token_num
+                    )
+                y_prompts_codes = torch.stack(y_prompts_codes, dim=0)
+            else:
+                prefix_len = y_prompts_codes.shape[1]
+
+            y_prompts = self.nar_audio_embeddings[0](y_prompts_codes[..., 0])
+            y_emb = self.nar_audio_embeddings[0](y)
+            for j in range(1, self.num_quantizers):
+                y_prompts += self.nar_audio_embeddings[j](y_prompts_codes[..., j])
+                if j < nar_stage:
+                    y_emb += self.nar_audio_embeddings[j](codes[..., j])
+            y_emb = torch.concat([y_prompts, y_emb], dim=1)
+        else:
+            raise ValueError
+
+        return y_emb, prefix_len
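For reference, the AR decoder samples from top-k-filtered logits; the actual helper lives in `utils/topk_sampling.py`. The sketch below is a common formulation of that technique, not a copy of the repo's implementation:

    import torch
    import torch.nn.functional as F

    def topk_sample(logits, top_k=10, temperature=1.0):
        # keep only the k largest logits, renormalize, and sample one token
        logits = logits / temperature
        if top_k > 0:
            kth = torch.topk(logits, top_k)[0][..., -1, None]
            logits = logits.masked_fill(logits < kth, float("-inf"))
        probs = F.softmax(logits, dim=-1)
        return torch.multinomial(probs, num_samples=1)

    sample = topk_sample(torch.randn(1, 1025), top_k=10)  # one id in [0, 1024]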
Amphion/models/tts/valle/valle_inference.py ADDED
@@ -0,0 +1,237 @@
+# Copyright (c) 2023 Amphion.
+#
+# This source code is licensed under the MIT license found in the
+# LICENSE file in the root directory of this source tree.
+
+import os
+import numpy as np
+import torch
+import torchaudio
+import argparse
+
+
+from text.g2p_module import G2PModule
+from utils.tokenizer import AudioTokenizer, tokenize_audio
+from models.tts.valle.valle import VALLE
+from models.tts.base.tts_inferece import TTSInference
+from models.tts.valle.valle_dataset import VALLETestDataset, VALLETestCollator
+from processors.phone_extractor import phoneExtractor
+from text.text_token_collation import phoneIDCollation
+
+
+class VALLEInference(TTSInference):
+    def __init__(self, args=None, cfg=None):
+        TTSInference.__init__(self, args, cfg)
+
+        self.g2p_module = G2PModule(backend=self.cfg.preprocess.phone_extractor)
+        text_token_path = os.path.join(
+            cfg.preprocess.processed_dir, cfg.dataset[0], cfg.preprocess.symbols_dict
+        )
+        self.audio_tokenizer = AudioTokenizer()
+
+    def _build_model(self):
+        model = VALLE(self.cfg.model)
+        return model
+
+    def _build_test_dataset(self):
+        return VALLETestDataset, VALLETestCollator
+
+    def inference_one_clip(self, text, text_prompt, audio_file, save_name="pred"):
+        # get phone symbol file
+        phone_symbol_file = None
+        if self.cfg.preprocess.phone_extractor != "lexicon":
+            phone_symbol_file = os.path.join(
+                self.exp_dir, self.cfg.preprocess.symbols_dict
+            )
+            assert os.path.exists(phone_symbol_file)
+        # convert text to phone sequence
+        phone_extractor = phoneExtractor(self.cfg)
+        # convert phone sequence to phone id sequence
+        phon_id_collator = phoneIDCollation(
+            self.cfg, symbols_dict_file=phone_symbol_file
+        )
+
+        text = f"{text_prompt} {text}".strip()
+        phone_seq = phone_extractor.extract_phone(text)  # phone_seq: list
+        phone_id_seq = phon_id_collator.get_phone_id_sequence(self.cfg, phone_seq)
+        phone_id_seq_len = torch.IntTensor([len(phone_id_seq)]).to(self.device)
+
+        # convert phone sequence to phone id sequence
+        phone_id_seq = np.array([phone_id_seq])
+        phone_id_seq = torch.from_numpy(phone_id_seq).to(self.device)
+
+        # extract acoustic token
+        encoded_frames = tokenize_audio(self.audio_tokenizer, audio_file)
+        audio_prompt_token = encoded_frames[0][0].transpose(2, 1).to(self.device)
+
+        # copysyn
+        if self.args.copysyn:
+            samples = self.audio_tokenizer.decode(encoded_frames)
+            audio_copysyn = samples[0].cpu().detach()
+
+            out_path = os.path.join(
+                self.args.output_dir, self.infer_type, f"{save_name}_copysyn.wav"
+            )
+            torchaudio.save(out_path, audio_copysyn, self.cfg.preprocess.sampling_rate)
+
+        if self.args.continual:
+            encoded_frames = self.model.continual(
+                phone_id_seq,
+                phone_id_seq_len,
+                audio_prompt_token,
+            )
+        else:
+            enroll_x_lens = None
+            if text_prompt:
+                # prompt_phone_seq = tokenize_text(self.g2p_module, text=f"{text_prompt}".strip())
+                # _, enroll_x_lens = self.text_tokenizer.get_token_id_seq(prompt_phone_seq)
+
+                text = f"{text_prompt}".strip()
+                prompt_phone_seq = phone_extractor.extract_phone(
+                    text
+                )  # phone_seq: list
+                prompt_phone_id_seq = phon_id_collator.get_phone_id_sequence(
+                    self.cfg, prompt_phone_seq
+                )
+                prompt_phone_id_seq_len = torch.IntTensor(
+                    [len(prompt_phone_id_seq)]
+                ).to(self.device)
+
+            encoded_frames = self.model.inference(
+                phone_id_seq,
+                phone_id_seq_len,
+                audio_prompt_token,
+                enroll_x_lens=prompt_phone_id_seq_len,
+                top_k=self.args.top_k,
+                temperature=self.args.temperature,
+            )
+
+        samples = self.audio_tokenizer.decode([(encoded_frames.transpose(2, 1), None)])
+
+        audio = samples[0].squeeze(0).cpu().detach()
+
+        return audio
+
+    def inference_for_single_utterance(self):
+        text = self.args.text
+        text_prompt = self.args.text_prompt
+        audio_file = self.args.audio_prompt
+
+        if not self.args.continual:
+            assert text != ""
+        else:
+            text = ""
+        assert text_prompt != ""
+        assert audio_file != ""
+
+        audio = self.inference_one_clip(text, text_prompt, audio_file)
+
+        return audio
+
+    def inference_for_batches(self):
+        test_list_file = self.args.test_list_file
+        assert test_list_file is not None
+
+        pred_res = []
+        with open(test_list_file, "r") as fin:
+            for idx, line in enumerate(fin.readlines()):
+                fields = line.strip().split("|")
+                if self.args.continual:
+                    assert len(fields) == 2
+                    text_prompt, audio_prompt_path = fields
+                    text = ""
+                else:
+                    assert len(fields) == 3
+                    text_prompt, audio_prompt_path, text = fields
+
+                audio = self.inference_one_clip(
+                    text, text_prompt, audio_prompt_path, str(idx)
+                )
+                pred_res.append(audio)
+
+        return pred_res
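Each line of `--test_list_file` is split on `|`: two fields in continual mode, three otherwise. An illustrative file (the paths are placeholders, not real repo assets):

    This is the transcript of the prompt.|prompts/speaker1.wav|And this is the text to synthesize.
    Another prompt transcript.|prompts/speaker2.wav|Another target sentence.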
+
+    """
+    TODO: batch inference
+    ###### Construct test_batch ######
+    n_batch = len(self.test_dataloader)
+    now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
+    print(
+        "Model eval time: {}, batch_size = {}, n_batch = {}".format(
+            now, self.test_batch_size, n_batch
+        )
+    )
+
+    ###### Inference for each batch ######
+    pred_res = []
+    with torch.no_grad():
+        for i, batch_data in enumerate(
+            self.test_dataloader if n_batch == 1 else tqdm(self.test_dataloader)
+        ):
+            if self.args.continual:
+                encoded_frames = self.model.continual(
+                    batch_data["phone_seq"],
+                    batch_data["phone_len"],
+                    batch_data["acoustic_token"],
+                )
+            else:
+                encoded_frames = self.model.inference(
+                    batch_data["phone_seq"],
+                    batch_data["phone_len"],
+                    batch_data["acoustic_token"],
+                    enroll_x_lens=batch_data["pmt_phone_len"],
+                    top_k=self.args.top_k,
+                    temperature=self.args.temperature,
+                )
+
+            samples = self.audio_tokenizer.decode(
+                [(encoded_frames.transpose(2, 1), None)]
+            )
+
+            for idx in range(samples.size(0)):
+                audio = samples[idx].cpu()
+                pred_res.append(audio)
+
+    return pred_res
+    """
+
+    def add_arguments(parser: argparse.ArgumentParser):
+        parser.add_argument(
+            "--text_prompt",
+            type=str,
+            default="",
+            help="Text prompt that should be aligned with --audio_prompt.",
+        )
+
+        parser.add_argument(
+            "--audio_prompt",
+            type=str,
+            default="",
+            help="Audio prompt that should be aligned with --text_prompt.",
+        )
+        parser.add_argument(
+            "--top-k",
+            type=int,
+            default=-100,
+            help="Whether the AR decoder does top_k (if > 0) sampling.",
+        )
+
+        parser.add_argument(
+            "--temperature",
+            type=float,
+            default=1.0,
+            help="The temperature of AR decoder top_k sampling.",
+        )
+
+        parser.add_argument(
+            "--continual",
+            action="store_true",
+            help="Inference for the continual task.",
+        )
+
+        parser.add_argument(
+            "--copysyn",
+            action="store_true",
+            help="Copysyn: generate audio with the decoder of the original audio tokenizer.",
+        )
Amphion/models/tts/valle/valle_trainer.py ADDED
@@ -0,0 +1,367 @@
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import argparse
7
+ from tqdm import tqdm
8
+ import torch
9
+ import numpy as np
10
+ from torch.utils.data import DataLoader
11
+ from torch.nn.parallel import DistributedDataParallel
12
+ from optimizer.optimizers import Eve, ScaledAdam
13
+ from schedulers.scheduler import NoamScheduler, Eden
14
+ from models.tts.valle.valle_dataset import (
15
+ VALLEDataset,
16
+ VALLECollator,
17
+ batch_by_size,
18
+ )
19
+ from models.base.base_sampler import VariableSampler
20
+ from models.tts.base import TTSTrainer
21
+ from models.tts.valle.valle import VALLE
22
+ import diffusers
23
+
24
+
25
+ class VALLETrainer(TTSTrainer):
26
+ def __init__(self, args, cfg):
27
+ TTSTrainer.__init__(self, args, cfg)
28
+
29
+ def _build_model(self):
30
+ model = VALLE(self.cfg.model)
31
+
32
+ return model
33
+
34
+ def _build_dataset(self):
35
+ return VALLEDataset, VALLECollator
36
+
37
+ def _build_optimizer(self):
38
+ if self.args.train_stage:
39
+ if isinstance(self.model, DistributedDataParallel):
40
+ model = self.model.module
41
+ else:
42
+ model = self.model
43
+ model_parameters = model.stage_parameters(self.args.train_stage)
44
+ else:
45
+ model_parameters = self.model.parameters()
46
+
47
+ if self.cfg.train.optimizer == "ScaledAdam":
48
+ parameters_names = []
49
+ if self.args.train_stage != 0:
50
+ parameters_names.append(
51
+ [
52
+ name_param_pair[0]
53
+ for name_param_pair in model.stage_named_parameters(
54
+ self.args.train_stage
55
+ )
56
+ ]
57
+ )
58
+ else:
59
+ parameters_names.append(
60
+ [name_param_pair[0] for name_param_pair in model.named_parameters()]
61
+ )
62
+
63
+ optimizer = ScaledAdam(
64
+ model_parameters,
65
+ lr=self.cfg.train.base_lr,
66
+ betas=(0.9, 0.95),
67
+ clipping_scale=2.0,
68
+ parameters_names=parameters_names,
69
+ show_dominant_parameters=False,
70
+ clipping_update_period=1000,
71
+ )
72
+ elif self.cfg.train.optimizer == "Eve":
73
+ optimizer = Eve(
74
+ model_parameters,
75
+ lr=self.cfg.train.base_lr,
76
+ betas=(0.9, 0.98),
77
+ target_rms=0.1,
78
+ )
79
+ elif self.cfg.train.optimizer == "AdamW":
80
+ optimizer = torch.optim.AdamW(
81
+ model_parameters,
82
+ lr=self.cfg.train.base_lr,
83
+ betas=(0.9, 0.95),
84
+ weight_decay=1e-2,
85
+ eps=1e-8,
86
+ )
87
+ elif self.cfg.train.optimizer == "Adam":
88
+ optimizer = torch.optim.Adam(
89
+ model_parameters,
90
+ lr=self.cfg.train.base_lr,
91
+ betas=(0.9, 0.95),
92
+ eps=1e-8,
93
+ )
94
+ else:
95
+ raise NotImplementedError()
96
+
97
+ return optimizer
98
+
99
+ def _build_scheduler(self):
100
+ if self.cfg.train.scheduler.lower() == "eden":
101
+ scheduler = Eden(
102
+ self.optimizer, 5000, 4, warmup_batches=self.cfg.train.warmup_steps
103
+ )
104
+ elif self.cfg.train.scheduler.lower() == "noam":
105
+ scheduler = NoamScheduler(
106
+ self.cfg.train.base_lr,
107
+ self.optimizer,
108
+ self.cfg.model.decoder_dim,
109
+ warmup_steps=self.cfg.train.warmup_steps,
110
+ )
111
+ elif self.cfg.train.scheduler.lower() == "cosine":
112
+ from diffusers.optimization import get_cosine_schedule_with_warmup
113
+
114
+ scheduler = get_cosine_schedule_with_warmup(
115
+ self.optimizer,
116
+ num_warmup_steps=self.cfg.train.warmup_steps
117
+ * self.accelerator.num_processes,
118
+ num_training_steps=self.cfg.train.total_training_steps
119
+ * self.accelerator.num_processes,
120
+ )
121
+ else:
122
+ raise NotImplementedError(f"{self.cfg.train.scheduler}")
123
+
124
+ return scheduler
125
+
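Assuming the imported NoamScheduler implements the standard Transformer schedule (an assumption; schedulers.scheduler is outside this diff, and decoder_dim plays the role of the model width), the learning-rate curve the "noam" branch above would produce is simple to sketch:

# Standard Noam schedule: linear warmup, then step**-0.5 decay,
# scaled by d_model**-0.5.
def noam_lr(step: int, base_lr: float, d_model: int, warmup_steps: int) -> float:
    return base_lr * d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)

for s in (100, 200, 2000):  # with warmup_steps=200: peak at s=200, decay after
    print(s, noam_lr(s, base_lr=1.0, d_model=1024, warmup_steps=200))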
126
+ def _train_epoch(self):
127
+ r"""Training epoch. Should return average loss of a batch (sample) over
128
+ one epoch. See ``train_loop`` for usage.
129
+ """
130
+ if isinstance(self.model, dict):
131
+ for key in self.model.keys():
132
+ self.model[key].train()
133
+ else:
134
+ self.model.train()
135
+
136
+ epoch_sum_loss: float = 0.0
137
+ epoch_losses: dict = {}
138
+ epoch_step: int = 0
139
+ for batch in tqdm(
140
+ self.train_dataloader,
141
+ desc=f"Training Epoch {self.epoch}",
142
+ unit="batch",
143
+ colour="GREEN",
144
+ leave=False,
145
+ dynamic_ncols=True,
146
+ smoothing=0.04,
147
+ disable=not self.accelerator.is_main_process,
148
+ ):
149
+ # Do training step and BP
150
+ with self.accelerator.accumulate(self.model):
151
+ total_loss, train_losses = self._train_step(batch)
152
+ self.accelerator.backward(total_loss)
153
+ self.optimizer.step()
154
+ self.optimizer.zero_grad()
155
+ self.batch_count += 1
156
+
157
+ if self.batch_count % self.cfg.train.gradient_accumulation_step == 0:
158
+ if self.cfg.train.optimizer not in ["ScaledAdam", "Eve"]:
159
+ torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
160
+
161
+ for k in range(self.cfg.train.gradient_accumulation_step):
162
+ if isinstance(self.scheduler, Eden):
163
+ self.scheduler.step_batch(self.step)
164
+ else:
165
+ self.scheduler.step()
166
+
167
+ epoch_sum_loss += total_loss.detach().cpu().item()
168
+
169
+ if isinstance(train_losses, dict):
170
+ for key, value in train_losses.items():
171
+ if key not in epoch_losses.keys():
172
+ epoch_losses[key] = value
173
+ else:
174
+ epoch_losses[key] += value
175
+
176
+ if isinstance(train_losses, dict):
177
+ for key, loss in train_losses.items():
178
+ self.accelerator.log(
179
+ {"Step/Train {}".format(key): "{:.6f}".format(loss)},
180
+ step=self.step,
181
+ )
182
+ else:
183
+ self.accelerator.log(
184
+ {"Step/Train Loss": loss},
185
+ step=self.step,
186
+ )
187
+
188
+ self.accelerator.log(
189
+ {"Step/lr": self.scheduler.get_last_lr()[0]},
190
+ step=self.step,
191
+ )
192
+
193
+ # print loss every log_epoch_step steps
194
+ # if epoch_step % self.cfg.train.log_epoch_step == 0:
195
+ # for key, loss in train_losses.items():
196
+ # self.logger.info("Step/Train {}: {:.6f}".format(key, loss))
197
+ # print("Step/Train {}: {:.6f}".format(key, loss))
198
+
199
+ self.step += 1
200
+ epoch_step += 1
201
+
202
+ self.accelerator.wait_for_everyone()
203
+
204
+ epoch_sum_loss = (
205
+ epoch_sum_loss
206
+ / len(self.train_dataloader)
207
+ * self.cfg.train.gradient_accumulation_step
208
+ )
209
+
210
+ for key in epoch_losses.keys():
211
+ epoch_losses[key] = (
212
+ epoch_losses[key]
213
+ / len(self.train_dataloader)
214
+ * self.cfg.train.gradient_accumulation_step
215
+ )
216
+
217
+ return epoch_sum_loss, epoch_losses
218
+
219
+ def _train_step(self, batch, is_training=True):
220
+ text_tokens = batch["phone_seq"].to(self.device)
221
+ text_tokens_lens = batch["phone_len"].to(self.device)
222
+ assert text_tokens.ndim == 2
223
+
224
+ audio_features = batch["acoustic_token"].to(self.device)
225
+ audio_features_lens = batch["target_len"].to(self.device)
226
+ assert audio_features.ndim == 3
227
+
228
+ with torch.set_grad_enabled(is_training):
229
+ loss, losses = self.model(
230
+ x=text_tokens,
231
+ x_lens=text_tokens_lens,
232
+ y=audio_features,
233
+ y_lens=audio_features_lens,
234
+ train_stage=self.args.train_stage,
235
+ )
236
+
237
+ assert loss.requires_grad == is_training
238
+
239
+ loss_dict = {}
240
+ frames_sum = (audio_features_lens).sum()
241
+
242
+ avg_loss = loss / frames_sum
243
+
244
+ loss_dict["loss"] = avg_loss.detach().cpu().item()
245
+ for l in losses:
246
+ loss_dict[l] = losses[l].detach().cpu().item() / frames_sum.item()
247
+
248
+ return avg_loss, loss_dict
249
+
250
+ def _valid_step(self, batch):
251
+ valid_losses = {}
252
+ total_loss = 0
253
+ valid_stats = {}
254
+
255
+ total_loss, valid_losses = self._train_step(
256
+ batch=batch,
257
+ is_training=False,
258
+ )
259
+ assert total_loss.requires_grad is False
260
+
261
+ total_loss = total_loss.detach().cpu().item()
262
+
263
+ return total_loss, valid_losses, valid_stats
264
+
265
+ def _build_dataloader(self):
266
+ if not self.cfg.train.use_dynamic_batchsize:
267
+ return super()._build_dataloader()
268
+ if len(self.cfg.dataset) > 1:
269
+ raise Exception("use_dynamic_batchsize only supports single dataset now.")
270
+ Dataset, Collator = self._build_dataset()
271
+ train_dataset = Dataset(
272
+ self.cfg, self.cfg.dataset[0], is_valid=False
273
+ ) # TODO: support use_dynamic_batchsize for more than one datasets.
274
+ train_collate = Collator(self.cfg)
275
+ batch_sampler = batch_by_size(
276
+ train_dataset.num_frame_indices,
277
+ train_dataset.get_num_frames,
278
+ max_tokens=self.cfg.train.max_tokens * self.accelerator.num_processes,
279
+ max_sentences=self.cfg.train.max_sentences * self.accelerator.num_processes,
280
+ required_batch_size_multiple=self.accelerator.num_processes,
281
+ )
282
+ np.random.seed(1234)
283
+ np.random.shuffle(batch_sampler)
284
+ print(batch_sampler[:1])
285
+ batches = [
286
+ x[self.accelerator.local_process_index :: self.accelerator.num_processes]
287
+ for x in batch_sampler
288
+ if len(x) % self.accelerator.num_processes == 0
289
+ ]
290
+
291
+ train_loader = DataLoader(
292
+ train_dataset,
293
+ collate_fn=train_collate,
294
+ num_workers=self.cfg.train.dataloader.num_worker,
295
+ batch_sampler=VariableSampler(
296
+ batches, drop_last=False, use_random_sampler=True
297
+ ),
298
+ pin_memory=False,
299
+ )
300
+ self.accelerator.wait_for_everyone()
301
+
302
+ valid_dataset = Dataset(self.cfg, self.cfg.dataset[0], is_valid=True)
303
+ valid_collate = Collator(self.cfg)
304
+ batch_sampler = batch_by_size(
305
+ valid_dataset.num_frame_indices,
306
+ valid_dataset.get_num_frames,
307
+ max_tokens=self.cfg.train.max_tokens * self.accelerator.num_processes,
308
+ max_sentences=self.cfg.train.max_sentences * self.accelerator.num_processes,
309
+ required_batch_size_multiple=self.accelerator.num_processes,
310
+ )
311
+ batches = [
312
+ x[self.accelerator.local_process_index :: self.accelerator.num_processes]
313
+ for x in batch_sampler
314
+ if len(x) % self.accelerator.num_processes == 0
315
+ ]
316
+ valid_loader = DataLoader(
317
+ valid_dataset,
318
+ collate_fn=valid_collate,
319
+ num_workers=self.cfg.train.dataloader.num_worker,
320
+ batch_sampler=VariableSampler(batches, drop_last=False),
321
+ pin_memory=False,
322
+ )
323
+ self.accelerator.wait_for_everyone()
324
+
325
+ return train_loader, valid_loader
326
+
327
+ def _accelerator_prepare(self):
328
+ if not self.cfg.train.use_dynamic_batchsize:
329
+ (
330
+ self.train_dataloader,
331
+ self.valid_dataloader,
332
+ ) = self.accelerator.prepare(
333
+ self.train_dataloader,
334
+ self.valid_dataloader,
335
+ )
336
+
337
+ if isinstance(self.model, dict):
338
+ for key in self.model.keys():
339
+ self.model[key] = self.accelerator.prepare(self.model[key])
340
+ else:
341
+ self.model = self.accelerator.prepare(self.model)
342
+
343
+ if isinstance(self.optimizer, dict):
344
+ for key in self.optimizer.keys():
345
+ self.optimizer[key] = self.accelerator.prepare(self.optimizer[key])
346
+ else:
347
+ self.optimizer = self.accelerator.prepare(self.optimizer)
348
+
349
+ if isinstance(self.scheduler, dict):
350
+ for key in self.scheduler.keys():
351
+ self.scheduler[key] = self.accelerator.prepare(self.scheduler[key])
352
+ else:
353
+ self.scheduler = self.accelerator.prepare(self.scheduler)
354
+
355
+ def add_arguments(parser: argparse.ArgumentParser):
356
+ parser.add_argument(
357
+ "--train_stage",
358
+ type=int,
359
+ default="1",
360
+ help="0: train all modules, 1: AR Decoder, 2: NAR Decoder",
361
+ )
362
+ parser.add_argument(
363
+ "--ar_model_ckpt_dir",
364
+ type=str,
365
+ default=None,
366
+ help="Checkpoint for ar model ckeckpoint in the first training stage.",
367
+ )
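A side note on `_build_dataloader` above: after `batch_by_size` pre-groups utterances into variable-size batches, every rank takes a strided slice of each batch, and batches whose length is not divisible by the world size are dropped so all ranks stay in lockstep. A toy sketch of that sharding (the index batches are made up):

# Round-robin sharding of variable-size batches, mirroring
# x[local_process_index::num_processes] in _build_dataloader above.
num_processes = 2
batch_sampler = [[0, 1, 2, 3], [4, 5, 6], [7, 8]]  # toy utterance-index batches

for rank in range(num_processes):
    shard = [
        x[rank::num_processes]
        for x in batch_sampler
        if len(x) % num_processes == 0  # uneven batches are dropped
    ]
    print(rank, shard)
# 0 [[0, 2], [7]]
# 1 [[1, 3], [8]]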
Amphion/models/tts/vits/__init__.py ADDED
File without changes
Amphion/models/tts/vits/vits.py ADDED
@@ -0,0 +1,379 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ # This code is modified from https://github.com/jaywalnut310/vits/blob/main/models.py
7
+ import math
8
+ import torch
9
+ from torch import nn
10
+ from torch.nn import functional as F
11
+
12
+ from utils.util import *
13
+ from modules.flow.modules import *
14
+ from modules.base.base_module import *
15
+ from modules.transformer.attentions import Encoder
16
+ from modules.duration_predictor.standard_duration_predictor import DurationPredictor
17
+ from modules.duration_predictor.stochastic_duration_predictor import (
18
+ StochasticDurationPredictor,
19
+ )
20
+ from models.vocoders.gan.generator.hifigan import HiFiGAN_vits as Generator
21
+
22
+ try:
23
+ from modules import monotonic_align
24
+ except ImportError:
25
+ print("Monotonic align not found. Please make sure you have compiled it.")
26
+
27
+
28
+ class TextEncoder(nn.Module):
29
+ def __init__(
30
+ self,
31
+ n_vocab,
32
+ out_channels,
33
+ hidden_channels,
34
+ filter_channels,
35
+ n_heads,
36
+ n_layers,
37
+ kernel_size,
38
+ p_dropout,
39
+ ):
40
+ super().__init__()
41
+ self.n_vocab = n_vocab
42
+ self.out_channels = out_channels
43
+ self.hidden_channels = hidden_channels
44
+ self.filter_channels = filter_channels
45
+ self.n_heads = n_heads
46
+ self.n_layers = n_layers
47
+ self.kernel_size = kernel_size
48
+ self.p_dropout = p_dropout
49
+
50
+ self.emb = nn.Embedding(n_vocab, hidden_channels)
51
+ nn.init.normal_(self.emb.weight, 0.0, hidden_channels**-0.5)
52
+
53
+ self.encoder = Encoder(
54
+ hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
55
+ )
56
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
57
+
58
+ def forward(self, x, x_lengths):
59
+ x = self.emb(x) * math.sqrt(self.hidden_channels) # [b, t, h]
60
+ x = torch.transpose(x, 1, -1) # [b, h, t]
61
+ x_mask = torch.unsqueeze(sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
62
+
63
+ x = self.encoder(x * x_mask, x_mask)
64
+ stats = self.proj(x) * x_mask
65
+
66
+ m, logs = torch.split(stats, self.out_channels, dim=1)
67
+ return x, m, logs, x_mask
68
+
69
+
70
+ class ResidualCouplingBlock(nn.Module):
71
+ def __init__(
72
+ self,
73
+ channels,
74
+ hidden_channels,
75
+ kernel_size,
76
+ dilation_rate,
77
+ n_layers,
78
+ n_flows=4,
79
+ gin_channels=0,
80
+ ):
81
+ super().__init__()
82
+ self.channels = channels
83
+ self.hidden_channels = hidden_channels
84
+ self.kernel_size = kernel_size
85
+ self.dilation_rate = dilation_rate
86
+ self.n_layers = n_layers
87
+ self.n_flows = n_flows
88
+ self.gin_channels = gin_channels
89
+
90
+ self.flows = nn.ModuleList()
91
+ for i in range(n_flows):
92
+ self.flows.append(
93
+ ResidualCouplingLayer(
94
+ channels,
95
+ hidden_channels,
96
+ kernel_size,
97
+ dilation_rate,
98
+ n_layers,
99
+ gin_channels=gin_channels,
100
+ mean_only=True,
101
+ )
102
+ )
103
+ self.flows.append(Flip())
104
+
105
+ def forward(self, x, x_mask, g=None, reverse=False):
106
+ if not reverse:
107
+ for flow in self.flows:
108
+ x, _ = flow(x, x_mask, g=g, reverse=reverse)
109
+ else:
110
+ for flow in reversed(self.flows):
111
+ x = flow(x, x_mask, g=g, reverse=reverse)
112
+ return x
113
+
114
+
115
+ class PosteriorEncoder(nn.Module):
116
+ def __init__(
117
+ self,
118
+ in_channels,
119
+ out_channels,
120
+ hidden_channels,
121
+ kernel_size,
122
+ dilation_rate,
123
+ n_layers,
124
+ gin_channels=0,
125
+ ):
126
+ super().__init__()
127
+ self.in_channels = in_channels
128
+ self.out_channels = out_channels
129
+ self.hidden_channels = hidden_channels
130
+ self.kernel_size = kernel_size
131
+ self.dilation_rate = dilation_rate
132
+ self.n_layers = n_layers
133
+ self.gin_channels = gin_channels
134
+
135
+ self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
136
+ self.enc = WN(
137
+ hidden_channels,
138
+ kernel_size,
139
+ dilation_rate,
140
+ n_layers,
141
+ gin_channels=gin_channels,
142
+ )
143
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
144
+
145
+ def forward(self, x, x_lengths, g=None):
146
+ x_mask = torch.unsqueeze(sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
147
+ x = self.pre(x) * x_mask
148
+ x = self.enc(x, x_mask, g=g)
149
+ stats = self.proj(x) * x_mask
150
+ m, logs = torch.split(stats, self.out_channels, dim=1)
151
+ z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
152
+ return z, m, logs, x_mask
153
+
154
+
155
+ class SynthesizerTrn(nn.Module):
156
+ """
157
+ Synthesizer for Training
158
+ """
159
+
160
+ def __init__(
161
+ self,
162
+ n_vocab,
163
+ spec_channels,
164
+ segment_size,
165
+ inter_channels,
166
+ hidden_channels,
167
+ filter_channels,
168
+ n_heads,
169
+ n_layers,
170
+ kernel_size,
171
+ p_dropout,
172
+ resblock,
173
+ resblock_kernel_sizes,
174
+ resblock_dilation_sizes,
175
+ upsample_rates,
176
+ upsample_initial_channel,
177
+ upsample_kernel_sizes,
178
+ n_speakers=0,
179
+ gin_channels=0,
180
+ use_sdp=True,
181
+ **kwargs,
182
+ ):
183
+ super().__init__()
184
+ self.n_vocab = n_vocab
185
+ self.spec_channels = spec_channels
186
+ self.inter_channels = inter_channels
187
+ self.hidden_channels = hidden_channels
188
+ self.filter_channels = filter_channels
189
+ self.n_heads = n_heads
190
+ self.n_layers = n_layers
191
+ self.kernel_size = kernel_size
192
+ self.p_dropout = p_dropout
193
+ self.resblock = resblock
194
+ self.resblock_kernel_sizes = resblock_kernel_sizes
195
+ self.resblock_dilation_sizes = resblock_dilation_sizes
196
+ self.upsample_rates = upsample_rates
197
+ self.upsample_initial_channel = upsample_initial_channel
198
+ self.upsample_kernel_sizes = upsample_kernel_sizes
199
+ self.segment_size = segment_size
200
+ self.n_speakers = n_speakers
201
+ self.gin_channels = gin_channels
202
+
203
+ self.use_sdp = use_sdp
204
+
205
+ self.enc_p = TextEncoder(
206
+ n_vocab,
207
+ inter_channels,
208
+ hidden_channels,
209
+ filter_channels,
210
+ n_heads,
211
+ n_layers,
212
+ kernel_size,
213
+ p_dropout,
214
+ )
215
+ self.dec = Generator(
216
+ inter_channels,
217
+ resblock,
218
+ resblock_kernel_sizes,
219
+ resblock_dilation_sizes,
220
+ upsample_rates,
221
+ upsample_initial_channel,
222
+ upsample_kernel_sizes,
223
+ gin_channels=gin_channels,
224
+ )
225
+ self.enc_q = PosteriorEncoder(
226
+ spec_channels,
227
+ inter_channels,
228
+ hidden_channels,
229
+ 5,
230
+ 1,
231
+ 16,
232
+ gin_channels=gin_channels,
233
+ )
234
+ self.flow = ResidualCouplingBlock(
235
+ inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels
236
+ )
237
+
238
+ if use_sdp:
239
+ self.dp = StochasticDurationPredictor(
240
+ hidden_channels, 192, 3, 0.5, 4, gin_channels=gin_channels
241
+ )
242
+ else:
243
+ self.dp = DurationPredictor(
244
+ hidden_channels, 256, 3, 0.5, gin_channels=gin_channels
245
+ )
246
+
247
+ if n_speakers >= 1:
248
+ self.emb_g = nn.Embedding(n_speakers, gin_channels)
249
+
250
+ def forward(self, data):
251
+ x = data["phone_seq"]
252
+ x_lengths = data["phone_len"]
253
+ y = data["linear"]
254
+ y_lengths = data["target_len"]
255
+
256
+ x, m_p, logs_p, x_mask = self.enc_p(x, x_lengths)
257
+ if self.n_speakers > 0:
258
+ g = self.emb_g(data["spk_id"].squeeze(-1)).unsqueeze(-1) # [b, h, 1]
259
+ else:
260
+ g = None
261
+
262
+ z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
263
+ z_p = self.flow(z, y_mask, g=g)
264
+
265
+ with torch.no_grad():
266
+ # negative cross-entropy
267
+ s_p_sq_r = torch.exp(-2 * logs_p) # [b, d, t]
268
+ neg_cent1 = torch.sum(
269
+ -0.5 * math.log(2 * math.pi) - logs_p, [1], keepdim=True
270
+ ) # [b, 1, t_s]
271
+ neg_cent2 = torch.matmul(
272
+ -0.5 * (z_p**2).transpose(1, 2), s_p_sq_r
273
+ ) # [b, t_t, d] x [b, d, t_s] = [b, t_t, t_s]
274
+ neg_cent3 = torch.matmul(
275
+ z_p.transpose(1, 2), (m_p * s_p_sq_r)
276
+ ) # [b, t_t, d] x [b, d, t_s] = [b, t_t, t_s]
277
+ neg_cent4 = torch.sum(
278
+ -0.5 * (m_p**2) * s_p_sq_r, [1], keepdim=True
279
+ ) # [b, 1, t_s]
280
+ neg_cent = neg_cent1 + neg_cent2 + neg_cent3 + neg_cent4
281
+
282
+ attn_mask = torch.unsqueeze(x_mask, 2) * torch.unsqueeze(y_mask, -1)
283
+ attn = (
284
+ monotonic_align.maximum_path(neg_cent, attn_mask.squeeze(1))
285
+ .unsqueeze(1)
286
+ .detach()
287
+ )
288
+
289
+ w = attn.sum(2)
290
+ if self.use_sdp:
291
+ l_length = self.dp(x, x_mask, w, g=g)
292
+ l_length = l_length / torch.sum(x_mask)
293
+ else:
294
+ logw_ = torch.log(w + 1e-6) * x_mask
295
+ logw = self.dp(x, x_mask, g=g)
296
+ l_length = torch.sum((logw - logw_) ** 2, [1, 2]) / torch.sum(x_mask)
297
+
298
+ # expand prior
299
+ m_p = torch.matmul(attn.squeeze(1), m_p.transpose(1, 2)).transpose(1, 2)
300
+ logs_p = torch.matmul(attn.squeeze(1), logs_p.transpose(1, 2)).transpose(1, 2)
301
+
302
+ z_slice, ids_slice = rand_slice_segments(z, y_lengths, self.segment_size)
303
+ o = self.dec(z_slice, g=g)
304
+ outputs = {
305
+ "y_hat": o,
306
+ "l_length": l_length,
307
+ "attn": attn,
308
+ "ids_slice": ids_slice,
309
+ "x_mask": x_mask,
310
+ "z_mask": y_mask,
311
+ "z": z,
312
+ "z_p": z_p,
313
+ "m_p": m_p,
314
+ "logs_p": logs_p,
315
+ "m_q": m_q,
316
+ "logs_q": logs_q,
317
+ }
318
+ return outputs
319
+
320
+ def infer(
321
+ self,
322
+ x,
323
+ x_lengths,
324
+ sid=None,
325
+ noise_scale=1,
326
+ length_scale=1,
327
+ noise_scale_w=1.0,
328
+ max_len=None,
329
+ ):
330
+ x, m_p, logs_p, x_mask = self.enc_p(x, x_lengths)
331
+ if self.n_speakers > 0:
332
+ sid = sid.squeeze(-1)
333
+ g = self.emb_g(sid).unsqueeze(-1) # [b, h, 1]
334
+ else:
335
+ g = None
336
+
337
+ if self.use_sdp:
338
+ logw = self.dp(x, x_mask, g=g, reverse=True, noise_scale=noise_scale_w)
339
+ else:
340
+ logw = self.dp(x, x_mask, g=g)
341
+ w = torch.exp(logw) * x_mask * length_scale
342
+ w_ceil = torch.ceil(w)
343
+ y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
344
+ y_mask = torch.unsqueeze(sequence_mask(y_lengths, None), 1).to(x_mask.dtype)
345
+ attn_mask = torch.unsqueeze(x_mask, 2) * torch.unsqueeze(y_mask, -1)
346
+ attn = generate_path(w_ceil, attn_mask)
347
+
348
+ m_p = torch.matmul(attn.squeeze(1), m_p.transpose(1, 2)).transpose(
349
+ 1, 2
350
+ ) # [b, t', t], [b, t, d] -> [b, d, t']
351
+ logs_p = torch.matmul(attn.squeeze(1), logs_p.transpose(1, 2)).transpose(
352
+ 1, 2
353
+ ) # [b, t', t], [b, t, d] -> [b, d, t']
354
+
355
+ z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale
356
+ z = self.flow(z_p, y_mask, g=g, reverse=True)
357
+ o = self.dec((z * y_mask)[:, :, :max_len], g=g)
358
+
359
+ outputs = {
360
+ "y_hat": o,
361
+ "attn": attn,
362
+ "mask": y_mask,
363
+ "z": z,
364
+ "z_p": z_p,
365
+ "m_p": m_p,
366
+ "logs_p": logs_p,
367
+ }
368
+
369
+ return outputs
370
+
371
+ def voice_conversion(self, y, y_lengths, sid_src, sid_tgt):
372
+ assert self.n_speakers > 0, "n_speakers has to be larger than 0."
373
+ g_src = self.emb_g(sid_src).unsqueeze(-1)
374
+ g_tgt = self.emb_g(sid_tgt).unsqueeze(-1)
375
+ z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g_src)
376
+ z_p = self.flow(z, y_mask, g=g_src)
377
+ z_hat = self.flow(z_p, y_mask, g=g_tgt, reverse=True)
378
+ o_hat = self.dec(z_hat * y_mask, g=g_tgt)
379
+ return o_hat, y_mask, (z, z_p, z_hat)
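One remark on the `torch.no_grad()` block in `SynthesizerTrn.forward`: the four `neg_cent` terms are an algebraic expansion of the diagonal-Gaussian log-density log N(z_p; m_p, exp(2 * logs_p)), split so the cross terms become two batched matmuls. A small numerical check (not part of the model, shapes made up):

import math
import torch

b, d, t_t, t_s = 2, 3, 5, 4
z_p = torch.randn(b, d, t_t)
m_p, logs_p = torch.randn(b, d, t_s), torch.randn(b, d, t_s)

s_p_sq_r = torch.exp(-2 * logs_p)
neg_cent = (
    torch.sum(-0.5 * math.log(2 * math.pi) - logs_p, dim=1, keepdim=True)
    + torch.matmul(-0.5 * (z_p**2).transpose(1, 2), s_p_sq_r)
    + torch.matmul(z_p.transpose(1, 2), m_p * s_p_sq_r)
    + torch.sum(-0.5 * (m_p**2) * s_p_sq_r, dim=1, keepdim=True)
)  # [b, t_t, t_s], same as neg_cent1 + ... + neg_cent4 in forward()

# Direct evaluation of the Gaussian log-density for one (t, s) pair:
t, s = 1, 2
direct = torch.sum(
    -0.5 * math.log(2 * math.pi)
    - logs_p[0, :, s]
    - 0.5 * (z_p[0, :, t] - m_p[0, :, s]) ** 2 * torch.exp(-2 * logs_p[0, :, s])
)
assert torch.allclose(neg_cent[0, t, s], direct, atol=1e-5)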
Amphion/models/tts/vits/vits_dataset.py ADDED
@@ -0,0 +1,140 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import os
7
+ import json
8
+ import numpy as np
9
+ from text import text_to_sequence
10
+ from text.text_token_collation import phoneIDCollation
11
+ from models.tts.base.tts_dataset import (
12
+ TTSDataset,
13
+ TTSCollator,
14
+ TTSTestDataset,
15
+ TTSTestCollator,
16
+ )
17
+
18
+
19
+ class VITSDataset(TTSDataset):
20
+ def __init__(self, cfg, dataset, is_valid):
21
+ super().__init__(cfg, dataset, is_valid=is_valid)
22
+
23
+ def __getitem__(self, index):
24
+ single_feature = super().__getitem__(index)
25
+ return single_feature
26
+
27
+ def __len__(self):
28
+ return super().__len__()
29
+
30
+ def get_metadata(self):
31
+ metadata_filter = []
32
+ with open(self.metafile_path, "r", encoding="utf-8") as f:
33
+ metadata = json.load(f)
34
+ for utt_info in metadata:
35
+ duration = utt_info["Duration"]
36
+ frame_len = (
37
+ duration
38
+ * self.cfg.preprocess.sample_rate
39
+ // self.cfg.preprocess.hop_size
40
+ )
41
+ if (
42
+ frame_len
43
+ < self.cfg.preprocess.segment_size // self.cfg.preprocess.hop_size
44
+ ):
45
+ continue
46
+ metadata_filter.append(utt_info)
47
+
48
+ return metadata_filter
49
+
50
+
51
+ class VITSCollator(TTSCollator):
52
+ """Zero-pads model inputs and targets based on number of frames per step"""
53
+
54
+ def __init__(self, cfg):
55
+ super().__init__(cfg)
56
+
57
+ def __call__(self, batch):
58
+ parsed_batch_features = super().__call__(batch)
59
+ return parsed_batch_features
60
+
61
+
62
+ class VITSTestDataset(TTSTestDataset):
63
+ def __init__(self, args, cfg):
64
+ super().__init__(args, cfg)
65
+ processed_data_dir = os.path.join(cfg.preprocess.processed_dir, args.dataset)
66
+ if cfg.preprocess.use_spkid:
67
+ spk2id_path = os.path.join(processed_data_dir, cfg.preprocess.spk2id)
68
+ with open(spk2id_path, "r") as f:
69
+ self.spk2id = json.load(f)
70
+
71
+ utt2spk_path = os.path.join(processed_data_dir, cfg.preprocess.utt2spk)
72
+ self.utt2spk = dict()
73
+ with open(utt2spk_path, "r") as f:
74
+ for line in f.readlines():
75
+ utt, spk = line.strip().split("\t")
76
+ self.utt2spk[utt] = spk
77
+
78
+ if cfg.preprocess.use_text or cfg.preprocess.use_phone:
79
+ self.utt2seq = {}
80
+ for utt_info in self.metadata:
81
+ dataset = utt_info["Dataset"]
82
+ uid = utt_info["Uid"]
83
+ utt = "{}_{}".format(dataset, uid)
84
+
85
+ if cfg.preprocess.use_text:
86
+ text = utt_info["Text"]
87
+ sequence = text_to_sequence(text, cfg.preprocess.text_cleaners)
88
+ elif cfg.preprocess.use_phone:
89
+ # load the phoneme sequence from the phone file
90
+ phone_path = os.path.join(
91
+ processed_data_dir, cfg.preprocess.phone_dir, uid + ".phone"
92
+ )
93
+ with open(phone_path, "r") as fin:
94
+ phones = fin.readlines()
95
+ assert len(phones) == 1
96
+ phones = phones[0].strip()
97
+ phones_seq = phones.split(" ")
98
+
99
+ phon_id_collator = phoneIDCollation(cfg, dataset=dataset)
100
+ sequence = phon_id_collator.get_phone_id_sequence(cfg, phones_seq)
101
+
102
+ self.utt2seq[utt] = sequence
103
+
104
+ def __getitem__(self, index):
105
+ utt_info = self.metadata[index]
106
+
107
+ dataset = utt_info["Dataset"]
108
+ uid = utt_info["Uid"]
109
+ utt = "{}_{}".format(dataset, uid)
110
+
111
+ single_feature = dict()
112
+
113
+ if self.cfg.preprocess.use_spkid:
114
+ single_feature["spk_id"] = np.array(
115
+ [self.spk2id[self.utt2spk[utt]]], dtype=np.int32
116
+ )
117
+
118
+ if self.cfg.preprocess.use_phone or self.cfg.preprocess.use_text:
119
+ single_feature["phone_seq"] = np.array(self.utt2seq[utt])
120
+ single_feature["phone_len"] = len(self.utt2seq[utt])
121
+
122
+ return single_feature
123
+
124
+ def get_metadata(self):
125
+ with open(self.metafile_path, "r", encoding="utf-8") as f:
126
+ metadata = json.load(f)
127
+ return metadata
128
+
129
+ def __len__(self):
130
+ return len(self.metadata)
131
+
132
+
133
+ class VITSTestCollator(TTSTestCollator):
134
+ """Zero-pads model inputs and targets based on number of frames per step"""
135
+
136
+ def __init__(self, cfg):
137
+ self.cfg = cfg
138
+
139
+ def __call__(self, batch):
140
+ return super().__call__(batch)
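The filter in `VITSDataset.get_metadata` above drops utterances shorter than one training segment. With made-up config values, the arithmetic looks like this:

# Hypothetical preprocess config: sample_rate=22050, hop_size=256, segment_size=8192.
sample_rate, hop_size, segment_size = 22050, 256, 8192
min_frames = segment_size // hop_size  # 32 frames per training segment

duration = 0.3  # seconds, a hypothetical short utterance
frame_len = duration * sample_rate // hop_size  # 6615 // 256 -> 25.0 frames
print(frame_len < min_frames)  # True: this utterance would be filtered out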
Amphion/models/tts/vits/vits_trainer.py ADDED
@@ -0,0 +1,439 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ from torch.optim.lr_scheduler import ExponentialLR
9
+
10
+ from tqdm import tqdm
11
+
12
+ from utils.util import *
13
+ from utils.mel import mel_spectrogram_torch
14
+ from models.tts.base import TTSTrainer
15
+ from models.tts.vits.vits import SynthesizerTrn
16
+ from models.tts.vits.vits_dataset import VITSDataset, VITSCollator
17
+ from models.vocoders.gan.discriminator.mpd import (
18
+ MultiPeriodDiscriminator_vits as MultiPeriodDiscriminator,
19
+ )
20
+
21
+
22
+ class VITSTrainer(TTSTrainer):
23
+ def __init__(self, args, cfg):
24
+ TTSTrainer.__init__(self, args, cfg)
25
+
26
+ if cfg.preprocess.use_spkid and cfg.train.multi_speaker_training:
27
+ if cfg.model.n_speakers == 0:
28
+ cfg.model.n_speakers = len(self.speakers)
29
+
30
+ def _build_model(self):
31
+ net_g = SynthesizerTrn(
32
+ self.cfg.model.text_token_num,
33
+ self.cfg.preprocess.n_fft // 2 + 1,
34
+ self.cfg.preprocess.segment_size // self.cfg.preprocess.hop_size,
35
+ **self.cfg.model,
36
+ )
37
+ net_d = MultiPeriodDiscriminator(self.cfg.model.use_spectral_norm)
38
+ model = {"generator": net_g, "discriminator": net_d}
39
+
40
+ return model
41
+
42
+ def _build_dataset(self):
43
+ return VITSDataset, VITSCollator
44
+
45
+ def _build_optimizer(self):
46
+ optimizer_g = torch.optim.AdamW(
47
+ self.model["generator"].parameters(),
48
+ self.cfg.train.learning_rate,
49
+ betas=self.cfg.train.AdamW.betas,
50
+ eps=self.cfg.train.AdamW.eps,
51
+ )
52
+ optimizer_d = torch.optim.AdamW(
53
+ self.model["discriminator"].parameters(),
54
+ self.cfg.train.learning_rate,
55
+ betas=self.cfg.train.AdamW.betas,
56
+ eps=self.cfg.train.AdamW.eps,
57
+ )
58
+ optimizer = {"optimizer_g": optimizer_g, "optimizer_d": optimizer_d}
59
+
60
+ return optimizer
61
+
62
+ def _build_scheduler(self):
63
+ scheduler_g = ExponentialLR(
64
+ self.optimizer["optimizer_g"],
65
+ gamma=self.cfg.train.lr_decay,
66
+ last_epoch=self.epoch - 1,
67
+ )
68
+ scheduler_d = ExponentialLR(
69
+ self.optimizer["optimizer_d"],
70
+ gamma=self.cfg.train.lr_decay,
71
+ last_epoch=self.epoch - 1,
72
+ )
73
+
74
+ scheduler = {"scheduler_g": scheduler_g, "scheduler_d": scheduler_d}
75
+ return scheduler
76
+
77
+ def _build_criterion(self):
78
+ class GeneratorLoss(nn.Module):
79
+ def __init__(self, cfg):
80
+ super(GeneratorLoss, self).__init__()
81
+ self.cfg = cfg
82
+ self.l1_loss = nn.L1Loss()
83
+
84
+ def generator_loss(self, disc_outputs):
85
+ loss = 0
86
+ gen_losses = []
87
+ for dg in disc_outputs:
88
+ dg = dg.float()
89
+ l = torch.mean((1 - dg) ** 2)
90
+ gen_losses.append(l)
91
+ loss += l
92
+
93
+ return loss, gen_losses
94
+
95
+ def feature_loss(self, fmap_r, fmap_g):
96
+ loss = 0
97
+ for dr, dg in zip(fmap_r, fmap_g):
98
+ for rl, gl in zip(dr, dg):
99
+ rl = rl.float().detach()
100
+ gl = gl.float()
101
+ loss += torch.mean(torch.abs(rl - gl))
102
+
103
+ return loss * 2
104
+
105
+ def kl_loss(self, z_p, logs_q, m_p, logs_p, z_mask):
106
+ """
107
+ z_p, logs_q: [b, h, t_t]
108
+ m_p, logs_p: [b, h, t_t]
109
+ """
110
+ z_p = z_p.float()
111
+ logs_q = logs_q.float()
112
+ m_p = m_p.float()
113
+ logs_p = logs_p.float()
114
+ z_mask = z_mask.float()
115
+
116
+ kl = logs_p - logs_q - 0.5
117
+ kl += 0.5 * ((z_p - m_p) ** 2) * torch.exp(-2.0 * logs_p)
118
+ kl = torch.sum(kl * z_mask)
119
+ l = kl / torch.sum(z_mask)
120
+ return l
121
+
122
+ def forward(
123
+ self,
124
+ outputs_g,
125
+ outputs_d,
126
+ y_mel,
127
+ y_hat_mel,
128
+ ):
129
+ loss_g = {}
130
+
131
+ # duration loss
132
+ loss_dur = torch.sum(outputs_g["l_length"].float())
133
+ loss_g["loss_dur"] = loss_dur
134
+
135
+ # mel loss
136
+ loss_mel = self.l1_loss(y_mel, y_hat_mel) * self.cfg.train.c_mel
137
+ loss_g["loss_mel"] = loss_mel
138
+
139
+ # kl loss
140
+ loss_kl = (
141
+ self.kl_loss(
142
+ outputs_g["z_p"],
143
+ outputs_g["logs_q"],
144
+ outputs_g["m_p"],
145
+ outputs_g["logs_p"],
146
+ outputs_g["z_mask"],
147
+ )
148
+ * self.cfg.train.c_kl
149
+ )
150
+ loss_g["loss_kl"] = loss_kl
151
+
152
+ # feature loss
153
+ loss_fm = self.feature_loss(outputs_d["fmap_rs"], outputs_d["fmap_gs"])
154
+ loss_g["loss_fm"] = loss_fm
155
+
156
+ # gan loss
157
+ loss_gen, losses_gen = self.generator_loss(outputs_d["y_d_hat_g"])
158
+ loss_g["loss_gen"] = loss_gen
159
+ loss_g["loss_gen_all"] = (
160
+ loss_dur + loss_mel + loss_kl + loss_fm + loss_gen
161
+ )
162
+
163
+ return loss_g
164
+
165
+ class DiscriminatorLoss(nn.Module):
166
+ def __init__(self, cfg):
167
+ super(DiscriminatorLoss, self).__init__()
168
+ self.cfg = cfg
169
+ self.l1Loss = torch.nn.L1Loss(reduction="mean")
170
+
171
+ def __call__(self, disc_real_outputs, disc_generated_outputs):
172
+ loss_d = {}
173
+
174
+ loss = 0
175
+ r_losses = []
176
+ g_losses = []
177
+ for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
178
+ dr = dr.float()
179
+ dg = dg.float()
180
+ r_loss = torch.mean((1 - dr) ** 2)
181
+ g_loss = torch.mean(dg**2)
182
+ loss += r_loss + g_loss
183
+ r_losses.append(r_loss.item())
184
+ g_losses.append(g_loss.item())
185
+
186
+ loss_d["loss_disc_all"] = loss
187
+
188
+ return loss_d
189
+
190
+ criterion = {
191
+ "generator": GeneratorLoss(self.cfg),
192
+ "discriminator": DiscriminatorLoss(self.cfg),
193
+ }
194
+ return criterion
195
+
196
+ def write_summary(
197
+ self,
198
+ losses,
199
+ stats,
200
+ images={},
201
+ audios={},
202
+ audio_sampling_rate=24000,
203
+ tag="train",
204
+ ):
205
+ for key, value in losses.items():
206
+ self.sw.add_scalar(tag + "/" + key, value, self.step)
207
+ self.sw.add_scalar(
208
+ "learning_rate",
209
+ self.optimizer["optimizer_g"].param_groups[0]["lr"],
210
+ self.step,
211
+ )
212
+
213
+ if len(images) != 0:
214
+ for key, value in images.items():
215
+ self.sw.add_image(key, value, self.global_step, batchformats="HWC")
216
+ if len(audios) != 0:
217
+ for key, value in audios.items():
218
+ self.sw.add_audio(key, value, self.global_step, audio_sampling_rate)
219
+
220
+ def write_valid_summary(
221
+ self, losses, stats, images={}, audios={}, audio_sampling_rate=24000, tag="val"
222
+ ):
223
+ for key, value in losses.items():
224
+ self.sw.add_scalar(tag + "/" + key, value, self.step)
225
+
226
+ if len(images) != 0:
227
+ for key, value in images.items():
228
+ self.sw.add_image(key, value, self.global_step, batchformats="HWC")
229
+ if len(audios) != 0:
230
+ for key, value in audios.items():
231
+ self.sw.add_audio(key, value, self.global_step, audio_sampling_rate)
232
+
233
+ def get_state_dict(self):
234
+ state_dict = {
235
+ "generator": self.model["generator"].state_dict(),
236
+ "discriminator": self.model["discriminator"].state_dict(),
237
+ "optimizer_g": self.optimizer["optimizer_g"].state_dict(),
238
+ "optimizer_d": self.optimizer["optimizer_d"].state_dict(),
239
+ "scheduler_g": self.scheduler["scheduler_g"].state_dict(),
240
+ "scheduler_d": self.scheduler["scheduler_d"].state_dict(),
241
+ "step": self.step,
242
+ "epoch": self.epoch,
243
+ "batch_size": self.cfg.train.batch_size,
244
+ }
245
+ return state_dict
246
+
247
+ def load_model(self, checkpoint):
248
+ self.step = checkpoint["step"]
249
+ self.epoch = checkpoint["epoch"]
250
+ self.model["generator"].load_state_dict(checkpoint["generator"])
251
+ self.model["discriminator"].load_state_dict(checkpoint["discriminator"])
252
+ self.optimizer["optimizer_g"].load_state_dict(checkpoint["optimizer_g"])
253
+ self.optimizer["optimizer_d"].load_state_dict(checkpoint["optimizer_d"])
254
+ self.scheduler["scheduler_g"].load_state_dict(checkpoint["scheduler_g"])
255
+ self.scheduler["scheduler_d"].load_state_dict(checkpoint["scheduler_d"])
256
+
257
+ @torch.inference_mode()
258
+ def _valid_step(self, batch):
259
+ r"""Testing forward step. Should return average loss of a sample over
260
+ one batch. Invoking ``_forward_step`` is recommended except for special cases.
261
+ See ``_test_epoch`` for usage.
262
+ """
263
+
264
+ valid_losses = {}
265
+ total_loss = 0
266
+ valid_stats = {}
267
+
268
+ batch["linear"] = batch["linear"].transpose(2, 1) # [b, d, t]
269
+ batch["mel"] = batch["mel"].transpose(2, 1) # [b, d, t]
270
+ batch["audio"] = batch["audio"].unsqueeze(1) # [b, d, t]
271
+
272
+ # Discriminator
273
+ # Generator output
274
+ outputs_g = self.model["generator"](batch)
275
+
276
+ y_mel = slice_segments(
277
+ batch["mel"],
278
+ outputs_g["ids_slice"],
279
+ self.cfg.preprocess.segment_size // self.cfg.preprocess.hop_size,
280
+ )
281
+ y_hat_mel = mel_spectrogram_torch(
282
+ outputs_g["y_hat"].squeeze(1), self.cfg.preprocess
283
+ )
284
+ y = slice_segments(
285
+ batch["audio"],
286
+ outputs_g["ids_slice"] * self.cfg.preprocess.hop_size,
287
+ self.cfg.preprocess.segment_size,
288
+ )
289
+
290
+ # Discriminator output
291
+ outputs_d = self.model["discriminator"](y, outputs_g["y_hat"].detach())
292
+ ## Discriminator loss
293
+ loss_d = self.criterion["discriminator"](
294
+ outputs_d["y_d_hat_r"], outputs_d["y_d_hat_g"]
295
+ )
296
+ valid_losses.update(loss_d)
297
+
298
+ ## Generator
299
+ outputs_d = self.model["discriminator"](y, outputs_g["y_hat"])
300
+ loss_g = self.criterion["generator"](outputs_g, outputs_d, y_mel, y_hat_mel)
301
+ valid_losses.update(loss_g)
302
+
303
+ for item in valid_losses:
304
+ valid_losses[item] = valid_losses[item].item()
305
+
306
+ total_loss = loss_g["loss_gen_all"] + loss_d["loss_disc_all"]
307
+
308
+ return (
309
+ total_loss.item(),
310
+ valid_losses,
311
+ valid_stats,
312
+ )
313
+
314
+ def _train_step(self, batch):
315
+ r"""Forward step for training and inference. This function is called
316
+ in ``_train_step`` & ``_test_step`` function.
317
+ """
318
+
319
+ train_losses = {}
320
+ total_loss = 0
321
+ training_stats = {}
322
+
323
+ batch["linear"] = batch["linear"].transpose(2, 1) # [b, d, t]
324
+ batch["mel"] = batch["mel"].transpose(2, 1) # [b, d, t]
325
+ batch["audio"] = batch["audio"].unsqueeze(1) # [b, d, t]
326
+
327
+ # Train Discriminator
328
+ # Generator output
329
+ outputs_g = self.model["generator"](batch)
330
+
331
+ y_mel = slice_segments(
332
+ batch["mel"],
333
+ outputs_g["ids_slice"],
334
+ self.cfg.preprocess.segment_size // self.cfg.preprocess.hop_size,
335
+ )
336
+ y_hat_mel = mel_spectrogram_torch(
337
+ outputs_g["y_hat"].squeeze(1), self.cfg.preprocess
338
+ )
339
+ y = slice_segments(
340
+ batch["audio"],
341
+ outputs_g["ids_slice"] * self.cfg.preprocess.hop_size,
342
+ self.cfg.preprocess.segment_size,
343
+ )
344
+
345
+ # Discriminator output
346
+ outputs_d = self.model["discriminator"](y, outputs_g["y_hat"].detach())
347
+ ## Discriminator loss
348
+ loss_d = self.criterion["discriminator"](
349
+ outputs_d["y_d_hat_r"], outputs_d["y_d_hat_g"]
350
+ )
351
+ train_losses.update(loss_d)
352
+
353
+ # BP and Grad Updated
354
+ self.optimizer["optimizer_d"].zero_grad()
355
+ self.accelerator.backward(loss_d["loss_disc_all"])
356
+ self.optimizer["optimizer_d"].step()
357
+
358
+ ## Train Generator
359
+ outputs_d = self.model["discriminator"](y, outputs_g["y_hat"])
360
+ loss_g = self.criterion["generator"](outputs_g, outputs_d, y_mel, y_hat_mel)
361
+ train_losses.update(loss_g)
362
+
363
+ # BP and Grad Updated
364
+ self.optimizer["optimizer_g"].zero_grad()
365
+ self.accelerator.backward(loss_g["loss_gen_all"])
366
+ self.optimizer["optimizer_g"].step()
367
+
368
+ for item in train_losses:
369
+ train_losses[item] = train_losses[item].item()
370
+
371
+ total_loss = loss_g["loss_gen_all"] + loss_d["loss_disc_all"]
372
+
373
+ return (
374
+ total_loss.item(),
375
+ train_losses,
376
+ training_stats,
377
+ )
378
+
379
+ def _train_epoch(self):
380
+ r"""Training epoch. Should return average loss of a batch (sample) over
381
+ one epoch. See ``train_loop`` for usage.
382
+ """
383
+ epoch_sum_loss: float = 0.0
384
+ epoch_losses: dict = {}
385
+ epoch_step: int = 0
386
+ for batch in tqdm(
387
+ self.train_dataloader,
388
+ desc=f"Training Epoch {self.epoch}",
389
+ unit="batch",
390
+ colour="GREEN",
391
+ leave=False,
392
+ dynamic_ncols=True,
393
+ smoothing=0.04,
394
+ disable=not self.accelerator.is_main_process,
395
+ ):
396
+ with self.accelerator.accumulate(self.model):
397
+ total_loss, train_losses, training_stats = self._train_step(batch)
398
+ self.batch_count += 1
399
+
400
+ if self.batch_count % self.cfg.train.gradient_accumulation_step == 0:
401
+ epoch_sum_loss += total_loss
402
+ for key, value in train_losses.items():
403
+ if key not in epoch_losses.keys():
404
+ epoch_losses[key] = value
405
+ else:
406
+ epoch_losses[key] += value
407
+
408
+ self.accelerator.log(
409
+ {
410
+ "Step/Generator Loss": train_losses["loss_gen_all"],
411
+ "Step/Discriminator Loss": train_losses["loss_disc_all"],
412
+ "Step/Generator Learning Rate": self.optimizer[
413
+ "optimizer_d"
414
+ ].param_groups[0]["lr"],
415
+ "Step/Discriminator Learning Rate": self.optimizer[
416
+ "optimizer_g"
417
+ ].param_groups[0]["lr"],
418
+ },
419
+ step=self.step,
420
+ )
421
+ self.step += 1
422
+ epoch_step += 1
423
+
424
+ self.accelerator.wait_for_everyone()
425
+
426
+ epoch_sum_loss = (
427
+ epoch_sum_loss
428
+ / len(self.train_dataloader)
429
+ * self.cfg.train.gradient_accumulation_step
430
+ )
431
+
432
+ for key in epoch_losses.keys():
433
+ epoch_losses[key] = (
434
+ epoch_losses[key]
435
+ / len(self.train_dataloader)
436
+ * self.cfg.train.gradient_accumulation_step
437
+ )
438
+
439
+ return epoch_sum_loss, epoch_losses
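A note on `kl_loss` inside `_build_criterion` above: it is a single-sample estimator of KL(q || p) for diagonal Gaussians, evaluated at the posterior sample z_p. Averaged over many samples it matches the closed form, which a short check (not part of the trainer, values made up) confirms:

import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
m_q, logs_q = torch.tensor(0.3), torch.tensor(-0.2)   # posterior q
m_p, logs_p = torch.tensor(-0.1), torch.tensor(0.4)   # prior p

z_p = m_q + torch.randn(200_000) * torch.exp(logs_q)  # z_p ~ q
kl_est = (
    logs_p - logs_q - 0.5
    + 0.5 * (z_p - m_p) ** 2 * torch.exp(-2.0 * logs_p)
).mean()  # the per-element term used in kl_loss, averaged over samples

kl_ref = kl_divergence(Normal(m_q, logs_q.exp()), Normal(m_p, logs_p.exp()))
print(kl_est.item(), kl_ref.item())  # agree to roughly three decimal places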
Amphion/models/vocoders/autoregressive/autoregressive_vocoder_trainer.py ADDED
File without changes
Amphion/modules/activation_functions/gated_activation_unit.py ADDED
@@ -0,0 +1,61 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+
9
+ from modules.general.utils import Conv1d
10
+
11
+
12
+ class GaU(nn.Module):
13
+ r"""Gated Activation Unit (GaU) proposed in `Gated Activation Units for Neural
14
+ Networks <https://arxiv.org/pdf/1606.05328.pdf>`_.
15
+
16
+ Args:
17
+ channels: number of input channels.
18
+ kernel_size: kernel size of the convolution.
19
+ dilation: dilation rate of the convolution.
20
+ d_context: dimension of context tensor, None if don't use context.
21
+ """
22
+
23
+ def __init__(
24
+ self,
25
+ channels: int,
26
+ kernel_size: int = 3,
27
+ dilation: int = 1,
28
+ d_context: int = None,
29
+ ):
30
+ super().__init__()
31
+
32
+ self.context = d_context
33
+
34
+ self.conv = Conv1d(
35
+ channels,
36
+ channels * 2,
37
+ kernel_size,
38
+ dilation=dilation,
39
+ padding=dilation * (kernel_size - 1) // 2,
40
+ )
41
+
42
+ if self.context:
43
+ self.context_proj = Conv1d(d_context, channels * 2, 1)
44
+
45
+ def forward(self, x: torch.Tensor, context: torch.Tensor = None):
46
+ r"""Calculate forward propagation.
47
+
48
+ Args:
49
+ x: input tensor with shape [B, C, T].
50
+ context: context tensor with shape [B, ``d_context``, T], default to None.
51
+ """
52
+
53
+ h = self.conv(x)
54
+
55
+ if self.context:
56
+ h = h + self.context_proj(context)
57
+
58
+ h1, h2 = h.chunk(2, 1)
59
+ h = torch.tanh(h1) * torch.sigmoid(h2)
60
+
61
+ return h
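Stripped of the convolutions, the gating in `GaU.forward` splits a 2C-channel feature map into a content half and a gate half:

import torch

B, C, T = 2, 4, 8
h = torch.randn(B, 2 * C, T)        # stand-in for self.conv(x) (+ context proj)
h1, h2 = h.chunk(2, 1)              # two [B, C, T] halves
out = torch.tanh(h1) * torch.sigmoid(h2)  # content gated by a (0, 1) mask
print(out.shape)                    # torch.Size([2, 4, 8])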
Amphion/modules/base/base_module.py ADDED
@@ -0,0 +1,75 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ import torch
7
+ from torch import nn
8
+ from torch.nn import functional as F
9
+
10
+
11
+ class LayerNorm(nn.Module):
12
+ def __init__(self, channels, eps=1e-5):
13
+ super().__init__()
14
+ self.channels = channels
15
+ self.eps = eps
16
+
17
+ self.gamma = nn.Parameter(torch.ones(channels))
18
+ self.beta = nn.Parameter(torch.zeros(channels))
19
+
20
+ def forward(self, x):
21
+ x = x.transpose(1, -1)
22
+ x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps)
23
+ return x.transpose(1, -1)
24
+
25
+
26
+ class ConvReluNorm(nn.Module):
27
+ def __init__(
28
+ self,
29
+ in_channels,
30
+ hidden_channels,
31
+ out_channels,
32
+ kernel_size,
33
+ n_layers,
34
+ p_dropout,
35
+ ):
36
+ super().__init__()
37
+ self.in_channels = in_channels
38
+ self.hidden_channels = hidden_channels
39
+ self.out_channels = out_channels
40
+ self.kernel_size = kernel_size
41
+ self.n_layers = n_layers
42
+ self.p_dropout = p_dropout
43
+ assert n_layers > 1, "Number of layers should be larger than 1."
44
+
45
+ self.conv_layers = nn.ModuleList()
46
+ self.norm_layers = nn.ModuleList()
47
+ self.conv_layers.append(
48
+ nn.Conv1d(
49
+ in_channels, hidden_channels, kernel_size, padding=kernel_size // 2
50
+ )
51
+ )
52
+ self.norm_layers.append(LayerNorm(hidden_channels))
53
+ self.relu_drop = nn.Sequential(nn.ReLU(), nn.Dropout(p_dropout))
54
+ for _ in range(n_layers - 1):
55
+ self.conv_layers.append(
56
+ nn.Conv1d(
57
+ hidden_channels,
58
+ hidden_channels,
59
+ kernel_size,
60
+ padding=kernel_size // 2,
61
+ )
62
+ )
63
+ self.norm_layers.append(LayerNorm(hidden_channels))
64
+ self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
65
+ self.proj.weight.data.zero_()
66
+ self.proj.bias.data.zero_()
67
+
68
+ def forward(self, x, x_mask):
69
+ x_org = x
70
+ for i in range(self.n_layers):
71
+ x = self.conv_layers[i](x * x_mask)
72
+ x = self.norm_layers[i](x)
73
+ x = self.relu_drop(x)
74
+ x = x_org + self.proj(x)
75
+ return x * x_mask
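The `LayerNorm` above normalizes over the channel axis of a [B, C, T] tensor by moving C to the last dimension for `F.layer_norm`. A quick equivalence check against a manual per-position normalization (with unit gamma and zero beta):

import torch
import torch.nn.functional as F

B, C, T = 2, 6, 5
x = torch.randn(B, C, T)
gamma, beta, eps = torch.ones(C), torch.zeros(C), 1e-5

y = F.layer_norm(x.transpose(1, -1), (C,), gamma, beta, eps).transpose(1, -1)

mean = x.mean(dim=1, keepdim=True)                 # statistics over channels
var = x.var(dim=1, unbiased=False, keepdim=True)   # biased, as layer_norm uses
assert torch.allclose(y, (x - mean) / torch.sqrt(var + eps), atol=1e-5)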
Amphion/modules/diffusion/__init__.py ADDED
@@ -0,0 +1,7 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ from .bidilconv.bidilated_conv import BiDilConv
7
+ from .unet.unet import UNet
Amphion/modules/duration_predictor/__init__.py ADDED
File without changes
Amphion/modules/duration_predictor/standard_duration_predictor.py ADDED
@@ -0,0 +1,53 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ # This code is modified from https://github.com/jaywalnut310/vits/blob/main/models.py
7
+
8
+ import torch
9
+ from torch import nn
10
+ from modules.base.base_module import LayerNorm
11
+
12
+
13
+ class DurationPredictor(nn.Module):
14
+ def __init__(
15
+ self, in_channels, filter_channels, kernel_size, p_dropout, gin_channels=0
16
+ ):
17
+ super().__init__()
18
+
19
+ self.in_channels = in_channels
20
+ self.filter_channels = filter_channels
21
+ self.kernel_size = kernel_size
22
+ self.p_dropout = p_dropout
23
+ self.gin_channels = gin_channels
24
+
25
+ self.drop = nn.Dropout(p_dropout)
26
+ self.conv_1 = nn.Conv1d(
27
+ in_channels, filter_channels, kernel_size, padding=kernel_size // 2
28
+ )
29
+ self.norm_1 = LayerNorm(filter_channels)
30
+ self.conv_2 = nn.Conv1d(
31
+ filter_channels, filter_channels, kernel_size, padding=kernel_size // 2
32
+ )
33
+ self.norm_2 = LayerNorm(filter_channels)
34
+ self.proj = nn.Conv1d(filter_channels, 1, 1)
35
+
36
+ if gin_channels != 0:
37
+ self.cond = nn.Conv1d(gin_channels, in_channels, 1)
38
+
39
+ def forward(self, x, x_mask, g=None):
40
+ x = torch.detach(x)
41
+ if g is not None:
42
+ g = torch.detach(g)
43
+ x = x + self.cond(g)
44
+ x = self.conv_1(x * x_mask)
45
+ x = torch.relu(x)
46
+ x = self.norm_1(x)
47
+ x = self.drop(x)
48
+ x = self.conv_2(x * x_mask)
49
+ x = torch.relu(x)
50
+ x = self.norm_2(x)
51
+ x = self.drop(x)
52
+ x = self.proj(x * x_mask)
53
+ return x * x_mask
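At inference, the log durations predicted by this module are turned into integer frame counts, as in `SynthesizerTrn.infer` earlier in this commit. A toy walk-through with made-up predictions:

import torch

logw = torch.tensor([[[0.0, 0.7, 1.2]]])  # hypothetical [b, 1, t_text] output
x_mask, length_scale = torch.ones_like(logw), 1.0

w = torch.exp(logw) * x_mask * length_scale  # [[1.00, 2.01, 3.32]] frames
w_ceil = torch.ceil(w)                       # [[1., 3., 4.]]
y_length = torch.clamp_min(torch.sum(w_ceil), 1).long()
print(y_length)  # tensor(8) total output frames for these three phones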
Amphion/modules/duration_predictor/stochastic_duration_predictor.py ADDED
@@ -0,0 +1,120 @@
1
+ # Copyright (c) 2023 Amphion.
2
+ #
3
+ # This source code is licensed under the MIT license found in the
4
+ # LICENSE file in the root directory of this source tree.
5
+
6
+ # This code is modified from https://github.com/jaywalnut310/vits/blob/main/models.py
+ import torch
7
+
8
+ from torch import nn
9
+ from torch.nn import functional as F
10
+ import math
11
+ from modules.flow.modules import *
12
+
13
+
14
+ class StochasticDurationPredictor(nn.Module):
15
+ def __init__(
16
+ self,
17
+ in_channels,
18
+ filter_channels,
19
+ kernel_size,
20
+ p_dropout,
21
+ n_flows=4,
22
+ gin_channels=0,
23
+ ):
24
+ super().__init__()
25
+ filter_channels = in_channels
26
+ self.in_channels = in_channels
27
+ self.filter_channels = filter_channels
28
+ self.kernel_size = kernel_size
29
+ self.p_dropout = p_dropout
30
+ self.n_flows = n_flows
31
+ self.gin_channels = gin_channels
32
+
33
+ self.log_flow = Log()
34
+ self.flows = nn.ModuleList()
35
+ self.flows.append(ElementwiseAffine(2))
36
+ for i in range(n_flows):
37
+ self.flows.append(ConvFlow(2, filter_channels, kernel_size, n_layers=3))
38
+ self.flows.append(Flip())
39
+
40
+ self.post_pre = nn.Conv1d(1, filter_channels, 1)
41
+ self.post_proj = nn.Conv1d(filter_channels, filter_channels, 1)
42
+ self.post_convs = DDSConv(
43
+ filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout
44
+ )
45
+ self.post_flows = nn.ModuleList()
46
+ self.post_flows.append(ElementwiseAffine(2))
47
+ for i in range(4):
48
+ self.post_flows.append(
49
+ ConvFlow(2, filter_channels, kernel_size, n_layers=3)
50
+ )
51
+ self.post_flows.append(Flip())
52
+
53
+ self.pre = nn.Conv1d(in_channels, filter_channels, 1)
54
+ self.proj = nn.Conv1d(filter_channels, filter_channels, 1)
55
+ self.convs = DDSConv(
56
+ filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout
57
+ )
58
+ if gin_channels != 0:
59
+ self.cond = nn.Conv1d(gin_channels, filter_channels, 1)
60
+
61
+ def forward(self, x, x_mask, w=None, g=None, reverse=False, noise_scale=1.0):
62
+ x = torch.detach(x)
63
+ x = self.pre(x)
64
+ if g is not None:
65
+ g = torch.detach(g)
66
+ x = x + self.cond(g)
67
+ x = self.convs(x, x_mask)
68
+ x = self.proj(x) * x_mask
69
+
70
+ if not reverse:
71
+ flows = self.flows
72
+ assert w is not None
73
+
74
+ logdet_tot_q = 0
75
+ h_w = self.post_pre(w)
76
+ h_w = self.post_convs(h_w, x_mask)
77
+ h_w = self.post_proj(h_w) * x_mask
78
+ e_q = (
79
+ torch.randn(w.size(0), 2, w.size(2)).to(device=x.device, dtype=x.dtype)
80
+ * x_mask
81
+ )
82
+ z_q = e_q
83
+ for flow in self.post_flows:
84
+ z_q, logdet_q = flow(z_q, x_mask, g=(x + h_w))
85
+ logdet_tot_q += logdet_q
86
+ z_u, z1 = torch.split(z_q, [1, 1], 1)
87
+ u = torch.sigmoid(z_u) * x_mask
88
+ z0 = (w - u) * x_mask
89
+ logdet_tot_q += torch.sum(
90
+ (F.logsigmoid(z_u) + F.logsigmoid(-z_u)) * x_mask, [1, 2]
91
+ )
92
+ logq = (
93
+ torch.sum(-0.5 * (math.log(2 * math.pi) + (e_q**2)) * x_mask, [1, 2])
94
+ - logdet_tot_q
95
+ )
96
+
97
+ logdet_tot = 0
98
+ z0, logdet = self.log_flow(z0, x_mask)
99
+ logdet_tot += logdet
100
+ z = torch.cat([z0, z1], 1)
101
+ for flow in flows:
102
+ z, logdet = flow(z, x_mask, g=x, reverse=reverse)
103
+ logdet_tot = logdet_tot + logdet
104
+ nll = (
105
+ torch.sum(0.5 * (math.log(2 * math.pi) + (z**2)) * x_mask, [1, 2])
106
+ - logdet_tot
107
+ )
108
+ return nll + logq
109
+ else:
110
+ flows = list(reversed(self.flows))
111
+ flows = flows[:-2] + [flows[-1]]
112
+ z = (
113
+ torch.randn(x.size(0), 2, x.size(2)).to(device=x.device, dtype=x.dtype)
114
+ * noise_scale
115
+ )
116
+ for flow in flows:
117
+ z = flow(z, x_mask, g=x, reverse=reverse)
118
+ z0, z1 = torch.split(z, [1, 1], 1)
119
+ logw = z0
120
+ return logw
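The bookkeeping in the forward (training) branch follows the change-of-variables formula: each flow contributes a log-determinant, and the NLL is the negative standard-normal log-density of the final z minus the accumulated log-determinant. A one-dimensional affine toy flow makes the accounting concrete:

import math

w = 2.5
a, b = 0.5, -1.0                  # toy flow z = a * w + b
z = a * w + b                     # z = 0.25
logdet = math.log(abs(a))         # log |dz/dw|

# -log p(w) = -log N(z; 0, 1) - logdet, mirroring
# nll = sum(0.5 * (log 2*pi + z**2) * mask) - logdet_tot above.
nll = 0.5 * (math.log(2 * math.pi) + z**2) - logdet
print(nll)  # ~1.643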
Amphion/modules/general/scaling.py ADDED
@@ -0,0 +1,1349 @@
1
+ # This module is modified from https://github.com/Plachtaa/VALL-E-X/blob/3faaf8ccadb154d63b38070caf518ce9309ea0f4/modules/scaling.py
2
+
3
+
4
+ import logging
5
+ import random
6
+ import math
7
+ from typing import Optional, Tuple, Union
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ from torch import Tensor
12
+
13
+
14
+ class Transpose(nn.Identity):
15
+ """(N, T, D) -> (N, D, T)"""
16
+
17
+ def forward(self, input: torch.Tensor) -> torch.Tensor:
18
+ return input.transpose(1, 2)
19
+
20
+
21
+ class ActivationBalancerFunction(torch.autograd.Function):
22
+ @staticmethod
23
+ def forward(
24
+ ctx,
25
+ x: Tensor,
26
+ scale_factor: Tensor,
27
+ sign_factor: Optional[Tensor],
28
+ channel_dim: int,
29
+ ) -> Tensor:
30
+ if channel_dim < 0:
31
+ channel_dim += x.ndim
32
+ ctx.channel_dim = channel_dim
33
+ xgt0 = x > 0
34
+ if sign_factor is None:
35
+ ctx.save_for_backward(xgt0, scale_factor)
36
+ else:
37
+ ctx.save_for_backward(xgt0, scale_factor, sign_factor)
38
+ return x
39
+
40
+ @staticmethod
41
+ def backward(ctx, x_grad: Tensor) -> Tuple[Tensor, None, None, None]:
42
+ if len(ctx.saved_tensors) == 3:
43
+ xgt0, scale_factor, sign_factor = ctx.saved_tensors
44
+ for _ in range(ctx.channel_dim, x_grad.ndim - 1):
45
+ scale_factor = scale_factor.unsqueeze(-1)
46
+ sign_factor = sign_factor.unsqueeze(-1)
47
+ factor = sign_factor + scale_factor * (xgt0.to(x_grad.dtype) - 0.5)
48
+ else:
49
+ xgt0, scale_factor = ctx.saved_tensors
50
+ for _ in range(ctx.channel_dim, x_grad.ndim - 1):
51
+ scale_factor = scale_factor.unsqueeze(-1)
52
+ factor = scale_factor * (xgt0.to(x_grad.dtype) - 0.5)
53
+ neg_delta_grad = x_grad.abs() * factor
54
+ return (
55
+ x_grad - neg_delta_grad,
56
+ None,
57
+ None,
58
+ None,
59
+ )
60
+
61
+
def _compute_scale_factor(
    x: Tensor,
    channel_dim: int,
    min_abs: float,
    max_abs: float,
    gain_factor: float,
    max_factor: float,
) -> Tensor:
    if channel_dim < 0:
        channel_dim += x.ndim
    sum_dims = [d for d in range(x.ndim) if d != channel_dim]
    x_abs_mean = torch.mean(x.abs(), dim=sum_dims).to(torch.float32)

    if min_abs == 0.0:
        below_threshold = 0.0
    else:
        # below_threshold is 0 if x_abs_mean > min_abs; it can be as large as
        # max_factor when x_abs_mean is far below min_abs.
        below_threshold = ((min_abs - x_abs_mean) * (gain_factor / min_abs)).clamp(
            min=0, max=max_factor
        )

    above_threshold = ((x_abs_mean - max_abs) * (gain_factor / max_abs)).clamp(
        min=0, max=max_factor
    )

    return below_threshold - above_threshold


def _compute_sign_factor(
    x: Tensor,
    channel_dim: int,
    min_positive: float,
    max_positive: float,
    gain_factor: float,
    max_factor: float,
) -> Tensor:
    if channel_dim < 0:
        channel_dim += x.ndim
    sum_dims = [d for d in range(x.ndim) if d != channel_dim]
    proportion_positive = torch.mean((x > 0).to(torch.float32), dim=sum_dims)
    if min_positive == 0.0:
        factor1 = 0.0
    else:
        # 0 if proportion_positive >= min_positive, else can be
        # as large as max_factor.
        factor1 = (
            (min_positive - proportion_positive) * (gain_factor / min_positive)
        ).clamp_(min=0, max=max_factor)

    if max_positive == 1.0:
        factor2 = 0.0
    else:
        # 0 if proportion_positive <= max_positive, else can be
        # as large as max_factor (it enters sign_factor with a minus sign).
        factor2 = (
            (proportion_positive - max_positive) * (gain_factor / (1.0 - max_positive))
        ).clamp_(min=0, max=max_factor)
    sign_factor = factor1 - factor2
    # require min_positive != 0 or max_positive != 1:
    assert not isinstance(sign_factor, float)
    return sign_factor


class ActivationScaleBalancerFunction(torch.autograd.Function):
    """
    This object is used in class ActivationBalancer when the user specified
    min_positive=0, max_positive=1, so there are no constraints on the signs
    of the activations and only the absolute value has a constraint.
    """

    @staticmethod
    def forward(
        ctx,
        x: Tensor,
        sign_factor: Tensor,
        scale_factor: Tensor,
        channel_dim: int,
    ) -> Tensor:
        if channel_dim < 0:
            channel_dim += x.ndim
        ctx.channel_dim = channel_dim
        xgt0 = x > 0
        ctx.save_for_backward(xgt0, sign_factor, scale_factor)
        return x

    @staticmethod
    def backward(ctx, x_grad: Tensor) -> Tuple[Tensor, None, None, None]:
        xgt0, sign_factor, scale_factor = ctx.saved_tensors
        for _ in range(ctx.channel_dim, x_grad.ndim - 1):
            sign_factor = sign_factor.unsqueeze(-1)
            scale_factor = scale_factor.unsqueeze(-1)

        factor = sign_factor + scale_factor * (xgt0.to(x_grad.dtype) - 0.5)
        neg_delta_grad = x_grad.abs() * factor
        return (
            x_grad - neg_delta_grad,
            None,
            None,
            None,
        )


class RandomClampFunction(torch.autograd.Function):
    @staticmethod
    def forward(
        ctx,
        x: Tensor,
        min: Optional[float],
        max: Optional[float],
        prob: float,
        reflect: float,
    ) -> Tensor:
        x_clamped = torch.clamp(x, min=min, max=max)
        mask = torch.rand_like(x) < prob
        ans = torch.where(mask, x_clamped, x)
        if x.requires_grad:
            ctx.save_for_backward(ans == x)
            ctx.reflect = reflect
        if reflect != 0.0:
            ans = ans * (1.0 + reflect) - (x * reflect)
        return ans

    @staticmethod
    def backward(ctx, ans_grad: Tensor) -> Tuple[Tensor, None, None, None, None]:
        (is_same,) = ctx.saved_tensors
        x_grad = ans_grad * is_same.to(ans_grad.dtype)
        reflect = ctx.reflect
        if reflect != 0.0:
            x_grad = x_grad * (1.0 + reflect) - (ans_grad * reflect)
        return x_grad, None, None, None, None


def random_clamp(
    x: Tensor,
    min: Optional[float] = None,
    max: Optional[float] = None,
    prob: float = 0.5,
    reflect: float = 0.0,
):
    return RandomClampFunction.apply(x, min, max, prob, reflect)


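# A minimal usage sketch of random_clamp (a hypothetical `_demo_*` helper added
# for illustration; it is not part of the upstream module): with prob=1.0 every
# element is clamped, so the result matches torch.clamp exactly; with prob=0.0
# the input passes through unchanged.
def _demo_random_clamp():
    x = torch.randn(4, 5)
    assert torch.equal(random_clamp(x, min=-1.0, max=1.0, prob=1.0), x.clamp(-1.0, 1.0))
    assert torch.equal(random_clamp(x, min=-1.0, max=1.0, prob=0.0), x)

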
def random_cast_to_half(x: Tensor, min_abs: float = 5.0e-06) -> Tensor:
    """
    A randomized way of casting a floating point value to half precision.
    """
    if x.dtype == torch.float16:
        return x
    x_abs = x.abs()
    is_too_small = x_abs < min_abs
    # for elements where is_too_small is true, random_val will contain +-min_abs with
    # probability (x.abs() / min_abs), and 0.0 otherwise. [so this preserves expectations,
    # for those elements].
    random_val = min_abs * x.sign() * (torch.rand_like(x) * min_abs < x_abs)
    return torch.where(is_too_small, random_val, x).to(torch.float16)


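# A small sketch of the expectation-preserving behaviour (a hypothetical
# `_demo_*` helper for illustration, not part of the upstream module): values
# below min_abs are rounded to +-min_abs or zero with probabilities chosen so
# that the mean over many draws approaches the original value.
def _demo_random_cast_to_half():
    x = torch.full((100000,), 2.5e-06)  # below the default min_abs of 5e-06
    mean = random_cast_to_half(x).to(torch.float32).mean().item()
    # the mean should be close to 2.5e-06, up to sampling noise and
    # float16 subnormal rounding
    assert abs(mean - 2.5e-06) < 1.0e-06

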
class RandomGradFunction(torch.autograd.Function):
    """
    Does nothing in the forward pass; in the backward pass, gets rid of very
    small grads using a randomized approach that preserves expectations
    (intended to reduce roundoff).
    """

    @staticmethod
    def forward(ctx, x: Tensor, min_abs: float) -> Tensor:
        ctx.min_abs = min_abs
        return x

    @staticmethod
    def backward(ctx, ans_grad: Tensor) -> Tuple[Tensor, None]:
        if ans_grad.dtype == torch.float16:
            return (
                random_cast_to_half(ans_grad.to(torch.float32), min_abs=ctx.min_abs),
                None,
            )
        else:
            return ans_grad, None


class RandomGrad(torch.nn.Module):
    """
    Gets rid of very small gradients using an expectation-preserving method,
    intended to increase accuracy of training when using amp (automatic mixed
    precision).
    """

    def __init__(self, min_abs: float = 5.0e-06):
        super(RandomGrad, self).__init__()
        self.min_abs = min_abs

    def forward(self, x: Tensor):
        if torch.jit.is_scripting() or not self.training or torch.jit.is_tracing():
            return x
        else:
            return RandomGradFunction.apply(x, self.min_abs)


class SoftmaxFunction(torch.autograd.Function):
    """
    Tries to handle half-precision derivatives in a randomized way that should
    be more accurate for training than the default behavior.
    """

    @staticmethod
    def forward(ctx, x: Tensor, dim: int):
        ans = x.softmax(dim=dim)
        # if x dtype is float16, x.softmax() returns a float32 because
        # (presumably) that op does not support float16, and autocast
        # is enabled.
        if torch.is_autocast_enabled():
            ans = ans.to(torch.float16)
        ctx.save_for_backward(ans)
        ctx.x_dtype = x.dtype
        ctx.dim = dim
        return ans

    @staticmethod
    def backward(ctx, ans_grad: Tensor):
        (ans,) = ctx.saved_tensors
        with torch.cuda.amp.autocast(enabled=False):
            ans_grad = ans_grad.to(torch.float32)
            ans = ans.to(torch.float32)
            x_grad = ans_grad * ans
            x_grad = x_grad - ans * x_grad.sum(dim=ctx.dim, keepdim=True)
            return x_grad, None


def softmax(x: Tensor, dim: int):
    if torch.jit.is_scripting() or torch.jit.is_tracing():
        return x.softmax(dim)

    return SoftmaxFunction.apply(x, dim)


class MaxEigLimiterFunction(torch.autograd.Function):
    @staticmethod
    def forward(
        ctx,
        x: Tensor,
        coeffs: Tensor,
        direction: Tensor,
        channel_dim: int,
        grad_scale: float,
    ) -> Tensor:
        ctx.channel_dim = channel_dim
        ctx.grad_scale = grad_scale
        ctx.save_for_backward(x.detach(), coeffs.detach(), direction.detach())
        return x

    @staticmethod
    def backward(ctx, x_grad, *args):
        with torch.enable_grad():
            (x_orig, coeffs, new_direction) = ctx.saved_tensors
            x_orig.requires_grad = True
            num_channels = x_orig.shape[ctx.channel_dim]
            x = x_orig.transpose(ctx.channel_dim, -1).reshape(-1, num_channels)
            new_direction.requires_grad = False
            x = x - x.mean(dim=0)
            x_var = (x**2).mean()
            x_residual = x - coeffs * new_direction
            x_residual_var = (x_residual**2).mean()
            # `variance_proportion` is the proportion of the variance accounted for
            # by the top eigen-direction. This is to be minimized.
            variance_proportion = (x_var - x_residual_var) / (x_var + 1.0e-20)
            variance_proportion.backward()
        x_orig_grad = x_orig.grad
        x_extra_grad = (
            x_orig.grad
            * ctx.grad_scale
            * x_grad.norm()
            / (x_orig_grad.norm() + 1.0e-20)
        )
        return x_grad + x_extra_grad.detach(), None, None, None, None


class BasicNorm(torch.nn.Module):
    """
    This is intended to be a simpler, and hopefully cheaper, replacement for
    LayerNorm. The observation this is based on, is that Transformer-type
    networks, especially with pre-norm, sometimes seem to set one of the
    feature dimensions to a large constant value (e.g. 50), which "defeats"
    the LayerNorm because the output magnitude is then not strongly dependent
    on the other (useful) features. Presumably the weight and bias of the
    LayerNorm are required to allow it to do this.

    So the idea is to introduce this large constant value as an explicit
    parameter, that takes the role of the "eps" in LayerNorm, so the network
    doesn't have to do this trick. We make the "eps" learnable.

    Args:
       num_channels: the number of channels, e.g. 512.
       channel_dim: the axis/dimension corresponding to the channel,
          interpreted as an offset from the input's ndim if negative.
          This is NOT the num_channels; it should typically be one of
          {-2, -1, 0, 1, 2, 3}.
       eps: the initial "epsilon" that we add as ballast in:
             scale = ((input_vec**2).mean() + epsilon)**-0.5
          Note: our epsilon is actually large, but we keep the name
          to indicate the connection with conventional LayerNorm.
       learn_eps: if true, we learn epsilon; if false, we keep it
          at the initial value.
       eps_min: float
       eps_max: float
    """

    def __init__(
        self,
        num_channels: int,
        channel_dim: int = -1,  # CAUTION: see documentation.
        eps: float = 0.25,
        learn_eps: bool = True,
        eps_min: float = -3.0,
        eps_max: float = 3.0,
    ) -> None:
        super(BasicNorm, self).__init__()
        self.num_channels = num_channels
        self.channel_dim = channel_dim
        if learn_eps:
            self.eps = nn.Parameter(torch.tensor(eps).log().detach())
        else:
            self.register_buffer("eps", torch.tensor(eps).log().detach())
        self.eps_min = eps_min
        self.eps_max = eps_max

    def forward(self, x: Tensor) -> Tensor:
        assert x.shape[self.channel_dim] == self.num_channels
        eps = self.eps
        if self.training and random.random() < 0.25:
            # with probability 0.25, in training mode, clamp eps between the min
            # and max; this will encourage it to learn parameters within the
            # allowed range by making parameters that are outside the allowed
            # range noisy, while the iterations without clamping still provide
            # gradients to allow the parameter to get back into the allowed
            # region if it happens to exit it.
            eps = eps.clamp(min=self.eps_min, max=self.eps_max)
        scales = (
            torch.mean(x**2, dim=self.channel_dim, keepdim=True) + eps.exp()
        ) ** -0.5
        return x * scales


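# A minimal usage sketch for BasicNorm (a hypothetical `_demo_*` helper for
# illustration, not part of the upstream module): the output is the input
# scaled per frame by (mean(x**2) + exp(eps)) ** -0.5; unlike LayerNorm there
# is no learnable weight/bias, only the log-epsilon.
def _demo_basic_norm():
    m = BasicNorm(num_channels=8, channel_dim=-1, eps=0.25)
    x = torch.randn(3, 8)
    y = m(x)
    # recompute the scale by hand for comparison
    scales = (torch.mean(x**2, dim=-1, keepdim=True) + m.eps.exp()) ** -0.5
    assert torch.allclose(y, x * scales)

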
def ScaledLinear(*args, initial_scale: float = 1.0, **kwargs) -> nn.Linear:
    """
    Behaves like a constructor of a modified version of nn.Linear
    that gives an easy way to set the default initial parameter scale.

    Args:
        Accepts the standard args and kwargs that nn.Linear accepts
        e.g. in_features, out_features, bias=False.

        initial_scale: you can override this if you want to increase
           or decrease the initial magnitude of the module's output
           (affects the initialization of weight_scale and bias_scale).
           Another option, if you want to do something like this, is
           to re-initialize the parameters.
    """
    ans = nn.Linear(*args, **kwargs)
    with torch.no_grad():
        ans.weight[:] *= initial_scale
        if ans.bias is not None:
            torch.nn.init.uniform_(ans.bias, -0.1 * initial_scale, 0.1 * initial_scale)
    return ans


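# A small sketch of the initial_scale behaviour (a hypothetical `_demo_*`
# helper for illustration, not part of the upstream module): halving
# initial_scale halves the initial weights relative to a default nn.Linear
# initialisation, shrinking the module's initial output magnitude.
def _demo_scaled_linear():
    torch.manual_seed(0)
    a = ScaledLinear(16, 4, initial_scale=1.0)
    torch.manual_seed(0)
    b = ScaledLinear(16, 4, initial_scale=0.5)
    assert torch.allclose(b.weight, 0.5 * a.weight)

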
def ScaledConv1d(
    *args,
    initial_scale: float = 1.0,
    kernel_size: int = 3,
    padding: str = "same",
    **kwargs,
) -> nn.Conv1d:
    """
    Behaves like a constructor of a modified version of nn.Conv1d
    that gives an easy way to set the default initial parameter scale.

    Args:
        Accepts the standard args and kwargs that nn.Conv1d accepts
        e.g. in_channels, out_channels, bias=False.

        initial_scale: you can override this if you want to increase
           or decrease the initial magnitude of the module's output
           (affects the initialization of weight_scale and bias_scale).
           Another option, if you want to do something like this, is
           to re-initialize the parameters.
    """
    ans = nn.Conv1d(*args, kernel_size=kernel_size, padding=padding, **kwargs)
    with torch.no_grad():
        ans.weight[:] *= initial_scale
        if ans.bias is not None:
            torch.nn.init.uniform_(ans.bias, -0.1 * initial_scale, 0.1 * initial_scale)
    return ans


def TransposeScaledConv1d(
    *args,
    initial_scale: float = 1.0,
    kernel_size: int = 3,
    padding: str = "same",
    **kwargs,
) -> nn.Sequential:
    """
    Transpose -> ScaledConv1d
    """
    return nn.Sequential(
        Transpose(),
        ScaledConv1d(
            *args,
            initial_scale=initial_scale,
            kernel_size=kernel_size,
            padding=padding,
            **kwargs,
        ),
    )


def ScaledConv1dTranspose(
    *args,
    initial_scale: float = 1.0,
    kernel_size: int = 3,
    padding: str = "same",
    **kwargs,
) -> nn.Sequential:
    """
    ScaledConv1d -> Transpose
    """
    return nn.Sequential(
        ScaledConv1d(
            *args,
            initial_scale=initial_scale,
            kernel_size=kernel_size,
            padding=padding,
            **kwargs,
        ),
        Transpose(),
    )


def TransposeConv1d(
    *args, kernel_size: int = 3, padding: str = "same", **kwargs
) -> nn.Sequential:
    """
    Transpose -> Conv1d
    """
    return nn.Sequential(
        Transpose(),
        nn.Conv1d(*args, kernel_size=kernel_size, padding=padding, **kwargs),
    )


def Conv1dTranspose(
    *args, kernel_size: int = 3, padding: str = "same", **kwargs
) -> nn.Sequential:
    """
    Conv1d -> Transpose
    """
    return nn.Sequential(
        nn.Conv1d(*args, kernel_size=kernel_size, padding=padding, **kwargs),
        Transpose(),
    )


class SRLinear(nn.Linear):
    """https://arxiv.org/abs/2303.06296
    Stabilizing Transformer Training by Preventing Attention Entropy Collapse
    """

    def __init__(self, in_features, out_features, bias=True, **kwargs):
        super().__init__(in_features, out_features, bias=bias, **kwargs)
        self.register_buffer(
            "u", nn.functional.normalize(torch.randn(in_features), dim=0)
        )
        with torch.no_grad():
            sigma = self.get_sigma()
            self.register_buffer("spectral_norm", sigma)
        self.sigma = nn.Parameter(torch.ones(1))

    def get_sigma(self):
        with torch.no_grad():
            u = self.u
            v = self.weight.mv(u)
            v = nn.functional.normalize(v, dim=0)
            u = self.weight.T.mv(v)
            u = nn.functional.normalize(u, dim=0)
            self.u.data.copy_(u)
        return torch.einsum("c,cd,d->", v, self.weight, u)

    def get_weight(self):
        sigma = self.get_sigma()
        if self.training:
            self.spectral_norm.data.copy_(sigma)
        weight = (self.sigma / sigma) * self.weight
        return weight

    def forward(self, x):
        return nn.functional.linear(x, self.get_weight(), self.bias)


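# A minimal sketch of the spectral reparameterisation (a hypothetical
# `_demo_*` helper for illustration, not part of the upstream module): each
# get_sigma() call runs one power-iteration step, so the estimate approaches
# the true largest singular value of the weight from below; since v and u are
# unit vectors, v @ W @ u can never exceed sigma_max.
def _demo_sr_linear():
    m = SRLinear(32, 32)
    for _ in range(100):  # extra power-iteration refinement steps
        m.get_sigma()
    est = m.get_sigma().item()
    with torch.no_grad():
        true_sigma = torch.linalg.svdvals(m.weight)[0].item()
    assert 0.0 < est <= true_sigma + 1.0e-5

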
class SRConv1d(SRLinear):
    def __init__(
        self,
        in_features,
        out_features,
        kernel_size,
        stride: int = 1,
        padding: str = "same",
        bias: bool = True,
        **kwargs,
    ):
        in_features = in_features * kernel_size
        super().__init__(in_features, out_features, bias=bias, **kwargs)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

    def forward(self, x):
        in_features = self.in_features // self.kernel_size
        weight = self.get_weight().view(
            self.out_features, in_features, self.kernel_size
        )
        return nn.functional.conv1d(
            x, weight, bias=self.bias, stride=self.stride, padding=self.padding
        )


def TransposeSRConv1d(
    *args, kernel_size: int = 3, padding: str = "same", **kwargs
) -> nn.Sequential:
    """
    Transpose -> SRConv1d
    """
    return nn.Sequential(
        Transpose(),
        SRConv1d(*args, kernel_size=kernel_size, padding=padding, **kwargs),
    )


def SRConv1dTranspose(
    *args, kernel_size: int = 3, padding: str = "same", **kwargs
) -> nn.Sequential:
    """
    SRConv1d -> Transpose
    """
    return nn.Sequential(
        SRConv1d(*args, kernel_size=kernel_size, padding=padding, **kwargs),
        Transpose(),
    )


class ActivationBalancer(torch.nn.Module):
    """
    Modifies the backpropped derivatives of a function to try to encourage, for
    each channel, that it is positive at least a proportion `threshold` (i.e.
    `min_positive`) of the time. It does this by multiplying negative derivative
    values by up to (1+max_factor), and positive derivative values by up to
    (1-max_factor), interpolated from 1 at the threshold to those extremal
    values when none of the inputs are positive.

    Args:
       num_channels: the number of channels
       channel_dim: the dimension/axis corresponding to the channel, e.g.
           -1, 0, 1, 2; will be interpreted as an offset from x.ndim if negative.
       min_positive: the minimum, per channel, of the proportion of the time
           that (x > 0), below which we start to modify the derivatives.
       max_positive: the maximum, per channel, of the proportion of the time
           that (x > 0), above which we start to modify the derivatives.
       max_factor: the maximum factor by which we modify the derivatives for
          either the sign constraint or the magnitude constraint;
          e.g. with max_factor=0.02, the derivatives would be multiplied by
          values in the range [0.98..1.02].
       sign_gain_factor: determines the 'gain' with which we increase the
          change in gradient once the constraints on min_positive and max_positive
          are violated.
       scale_gain_factor: determines the 'gain' with which we increase the
          change in gradient once the constraints on min_abs and max_abs
          are violated.
       min_abs: the minimum average-absolute-value difference from the mean
           value per channel, which we allow, before we start to modify
           the derivatives to prevent this.
       max_abs: the maximum average-absolute-value difference from the mean
           value per channel, which we allow, before we start to modify
           the derivatives to prevent this.
       min_prob: determines the minimum probability with which we modify the
          gradients for the {min,max}_positive and {min,max}_abs constraints,
          on each forward(). This is done randomly to prevent all layers
          from doing it at the same time. Early in training we may use
          higher probabilities than this; it will decay to this value.
    """

    def __init__(
        self,
        num_channels: int,
        channel_dim: int,
        min_positive: float = 0.05,
        max_positive: float = 0.95,
        max_factor: float = 0.04,
        sign_gain_factor: float = 0.01,
        scale_gain_factor: float = 0.02,
        min_abs: float = 0.2,
        max_abs: float = 100.0,
        min_prob: float = 0.1,
    ):
        super(ActivationBalancer, self).__init__()
        self.num_channels = num_channels
        self.channel_dim = channel_dim
        self.min_positive = min_positive
        self.max_positive = max_positive
        self.max_factor = max_factor
        self.min_abs = min_abs
        self.max_abs = max_abs
        self.min_prob = min_prob
        self.sign_gain_factor = sign_gain_factor
        self.scale_gain_factor = scale_gain_factor

        # cpu_count measures how many times the forward() function has been called.
        # We occasionally sync this to a tensor called `count`, that exists to
        # make sure it is synced to disk when we load and save the model.
        self.cpu_count = 0
        self.register_buffer("count", torch.tensor(0, dtype=torch.int64))

    def forward(self, x: Tensor) -> Tensor:
        if torch.jit.is_scripting() or not x.requires_grad or torch.jit.is_tracing():
            return _no_op(x)

        count = self.cpu_count
        self.cpu_count += 1

        if random.random() < 0.01:
            # Occasionally sync self.cpu_count with self.count.
            # count affects the decay of 'prob'. don't do this on every iter,
            # because syncing with the GPU is slow.
            self.cpu_count = max(self.cpu_count, self.count.item())
            self.count.fill_(self.cpu_count)

        # the prob of doing some work exponentially decreases from 0.5 till it hits
        # a floor at min_prob (==0.1, by default)
        prob = max(self.min_prob, 0.5 ** (1 + (count / 4000.0)))

        if random.random() < prob:
            if self.min_positive != 0.0 or self.max_positive != 1.0:
                sign_factor = _compute_sign_factor(
                    x,
                    self.channel_dim,
                    self.min_positive,
                    self.max_positive,
                    gain_factor=self.sign_gain_factor / prob,
                    max_factor=self.max_factor,
                )
            else:
                sign_factor = None

            scale_factor = _compute_scale_factor(
                x.detach(),
                self.channel_dim,
                min_abs=self.min_abs,
                max_abs=self.max_abs,
                gain_factor=self.scale_gain_factor / prob,
                max_factor=self.max_factor,
            )
            return ActivationBalancerFunction.apply(
                x,
                scale_factor,
                sign_factor,
                self.channel_dim,
            )
        else:
            return _no_op(x)


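# A minimal usage sketch (a hypothetical `_demo_*` helper for illustration,
# not part of the upstream module): the balancer is an identity in the forward
# direction and only nudges gradients, so it can be dropped between any two
# layers without changing the forward computation.
def _demo_activation_balancer():
    layer = nn.Sequential(
        nn.Linear(16, 32),
        ActivationBalancer(32, channel_dim=-1),
        nn.ReLU(),
    )
    x = torch.randn(4, 16)
    y = layer(x)
    assert y.shape == (4, 32)

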
def penalize_abs_values_gt(x: Tensor, limit: float, penalty: float) -> Tensor:
    """
    Returns x unmodified, but in backprop will put a penalty for the excess of
    the absolute values of elements of x over the limit "limit". E.g. if
    limit == 10.0, then if x has any values over 10 it will get a penalty.

    Caution: the value of this penalty will be affected by grad scaling used
    in automatic mixed precision training. For the purposes we use this for,
    that shouldn't really matter, or may even be helpful; we just use this
    to disallow really implausible values of scores to be given to softmax.
    """
    x_sign = x.sign()
    over_limit = (x.abs() - limit) > 0
    # The following is a memory efficient way to penalize the absolute values of
    # x that's over the limit. (The memory efficiency comes when you think
    # about which items torch needs to cache for the autograd, and which ones it
    # can throw away). The numerical value of aux_loss as computed here will
    # actually be larger than it should be, by limit * over_limit.sum(), but it
    # has the same derivative as the real aux_loss which is penalty * (x.abs() -
    # limit).relu().
    aux_loss = penalty * ((x_sign * over_limit).to(torch.int8) * x)
    # note: we don't do sum() here on aux_loss, but it's as if we had done
    # sum() due to how with_loss() works.
    x = with_loss(x, aux_loss)
    # you must use x for something, or this will be ineffective.
    return x


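# A small gradient sketch (a hypothetical `_demo_*` helper for illustration,
# not part of the upstream module): the forward value is unchanged, but
# elements beyond the limit receive an extra penalty term in backprop.
def _demo_penalize_abs_values_gt():
    x = torch.tensor([0.5, 20.0], requires_grad=True)
    y = penalize_abs_values_gt(x, limit=10.0, penalty=1.0)
    y.sum().backward()
    # grad is 1 from the sum, plus 1 extra on the element over the limit
    assert torch.allclose(x.grad, torch.tensor([1.0, 2.0]))

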
def _diag(x: Tensor):  # like .diag(), but works for tensors with 3 dims.
    if x.ndim == 2:
        return x.diag()
    else:
        (batch, dim, dim) = x.shape
        x = x.reshape(batch, dim * dim)
        x = x[:, :: dim + 1]
        assert x.shape == (batch, dim)
        return x


def _whitening_metric(x: Tensor, num_groups: int):
    """
    Computes the "whitening metric", a value which will be 1.0 if all the
    eigenvalues of the centered feature covariance are the same within each
    group's covariance matrix and also between groups.
    Args:
        x: a Tensor of shape (*, num_channels)
        num_groups: the number of groups of channels, a number >=1 that divides num_channels
    Returns:
        Returns a scalar Tensor that will be 1.0 if the data is "perfectly white" and
        greater than 1.0 otherwise.
    """
    assert x.dtype != torch.float16
    x = x.reshape(-1, x.shape[-1])
    (num_frames, num_channels) = x.shape
    assert num_channels % num_groups == 0
    channels_per_group = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, channels_per_group).transpose(0, 1)
    # x now has shape (num_groups, num_frames, channels_per_group)
    # subtract the mean so we use the centered, not uncentered, covariance.
    # My experience has been that when we "mess with the gradients" like this,
    # it's better not do anything that tries to move the mean around, because
    # that can easily cause instability.
    x = x - x.mean(dim=1, keepdim=True)
    # x_covar: (num_groups, channels_per_group, channels_per_group)
    x_covar = torch.matmul(x.transpose(1, 2), x)
    x_covar_mean_diag = _diag(x_covar).mean()
    # the following expression is what we'd get if we took the matrix product
    # of each covariance and measured the mean of its trace, i.e.
    # the same as _diag(torch.matmul(x_covar, x_covar)).mean().
    x_covarsq_mean_diag = (x_covar**2).sum() / (num_groups * channels_per_group)
    # this metric will be >= 1.0; the larger it is, the less 'white' the data was.
    metric = x_covarsq_mean_diag / (x_covar_mean_diag**2 + 1.0e-20)
    return metric


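# A small numeric sketch (a hypothetical `_demo_*` helper for illustration,
# not part of the upstream module): for roughly white Gaussian data the metric
# sits near 1.0, and it grows once one direction dominates the covariance.
def _demo_whitening_metric():
    x = torch.randn(10000, 16)
    white = _whitening_metric(x, num_groups=1)
    skewed = _whitening_metric(x * torch.tensor([10.0] + [1.0] * 15), num_groups=1)
    assert white < skewed
    assert abs(white.item() - 1.0) < 0.1

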
class WhiteningPenaltyFunction(torch.autograd.Function):
    @staticmethod
    def forward(
        ctx,
        x: Tensor,
        num_groups: int,
        whitening_limit: float,
        grad_scale: float,
    ) -> Tensor:
        ctx.save_for_backward(x)
        ctx.num_groups = num_groups
        ctx.whitening_limit = whitening_limit
        ctx.grad_scale = grad_scale
        return x

    @staticmethod
    def backward(ctx, x_grad: Tensor):
        (x_orig,) = ctx.saved_tensors
        with torch.enable_grad():
            with torch.cuda.amp.autocast(enabled=False):
                x_detached = x_orig.to(torch.float32).detach()
                x_detached.requires_grad = True

                metric = _whitening_metric(x_detached, ctx.num_groups)

                if random.random() < 0.005 or __name__ == "__main__":
                    logging.info(
                        f"Whitening: num_groups={ctx.num_groups}, num_channels={x_orig.shape[-1]}, "
                        f"metric={metric.item():.2f} vs. limit={ctx.whitening_limit}"
                    )

                (metric - ctx.whitening_limit).relu().backward()
                penalty_grad = x_detached.grad
                scale = ctx.grad_scale * (
                    x_grad.to(torch.float32).norm() / (penalty_grad.norm() + 1.0e-20)
                )
                penalty_grad = penalty_grad * scale
        return x_grad + penalty_grad.to(x_grad.dtype), None, None, None


class Whiten(nn.Module):
    def __init__(
        self,
        num_groups: int,
        whitening_limit: float,
        prob: Union[float, Tuple[float, float]],
        grad_scale: float,
    ):
        """
        Args:
          num_groups: the number of groups to divide the channel dim into before
            whitening. We will attempt to make the feature covariance
            within each group, after mean subtraction, as "white" as possible,
            while having the same trace across all groups.
          whitening_limit: a value greater than 1.0, that dictates how much
            freedom we have to violate the constraints. 1.0 would mean perfectly
            white, with exactly the same trace across groups; larger values
            give more freedom. E.g. 2.0.
          prob: the probability with which we apply the gradient modification
            (also affects the grad scale). May be supplied as a float,
            or as a pair (min_prob, max_prob)

          grad_scale: determines the scale on the gradient term from this object,
            relative to the rest of the gradient on the attention weights.
            E.g. 0.02 (you may want to use smaller values than this if prob is large)
        """
        super(Whiten, self).__init__()
        assert num_groups >= 1
        assert whitening_limit >= 1
        assert grad_scale >= 0
        self.num_groups = num_groups
        self.whitening_limit = whitening_limit
        if isinstance(prob, float):
            assert 0 < prob <= 1
            self.prob = prob
        else:
            (self.min_prob, self.max_prob) = prob
            assert 0 < self.min_prob < self.max_prob <= 1
            self.prob = self.max_prob

        self.grad_scale = grad_scale

    def forward(self, x: Tensor) -> Tensor:
        """
        In the forward pass, this function just returns the input unmodified.
        In the backward pass, it will modify the gradients to ensure that the
        distribution in each group has close to (lambda times I) as the covariance
        after mean subtraction, with the same lambda across groups.
        For whitening_limit > 1, there will be more freedom to violate this
        constraint.

        Args:
           x: the input of shape (*, num_channels)

        Returns:
            x, unmodified. You should make sure
            you use the returned value, or the graph will be freed
            and nothing will happen in backprop.
        """
        if not x.requires_grad or random.random() > self.prob or self.grad_scale == 0:
            return _no_op(x)
        else:
            if hasattr(self, "min_prob") and random.random() < 0.25:
                # occasionally switch between min_prob and max_prob, based on whether
                # we are above or below the threshold.
                if (
                    _whitening_metric(x.to(torch.float32), self.num_groups)
                    > self.whitening_limit
                ):
                    # there would be a change to the grad.
                    self.prob = self.max_prob
                else:
                    self.prob = self.min_prob

            return WhiteningPenaltyFunction.apply(
                x, self.num_groups, self.whitening_limit, self.grad_scale
            )


class WithLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: Tensor, y: Tensor):
        ctx.y_shape = y.shape
        return x

    @staticmethod
    def backward(ctx, ans_grad: Tensor):
        return ans_grad, torch.ones(
            ctx.y_shape, dtype=ans_grad.dtype, device=ans_grad.device
        )


def with_loss(x, y):
    if torch.jit.is_scripting() or torch.jit.is_tracing():
        return x
    # returns x but adds y.sum() to the loss function.
    return WithLoss.apply(x, y)


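# A minimal gradient sketch (a hypothetical `_demo_*` helper for illustration,
# not part of the upstream module): with_loss(x, y) forwards x unchanged but
# behaves as if y.sum() had been added to the final loss.
def _demo_with_loss():
    x = torch.ones(3, requires_grad=True)
    y = x * 2.0
    out = with_loss(x, y)
    out.sum().backward()
    # d/dx [x.sum() + (2x).sum()] = 1 + 2 = 3 per element
    assert torch.allclose(x.grad, torch.full((3,), 3.0))

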
def _no_op(x: Tensor) -> Tensor:
    if torch.jit.is_scripting() or torch.jit.is_tracing():
        return x
    else:
        # a no-op function that will have a node in the autograd graph,
        # to avoid certain bugs relating to backward hooks
        return x.chunk(1, dim=-1)[0]


class Identity(torch.nn.Module):
    def __init__(self):
        super(Identity, self).__init__()

    def forward(self, x):
        return _no_op(x)


class MaxEig(torch.nn.Module):
    """
    Modifies the backpropped derivatives of a function to try to discourage
    that any given direction in activation space accounts for more than
    a specified proportion of the covariance (e.g. 0.2).


    Args:
       num_channels: the number of channels
       channel_dim: the dimension/axis corresponding to the channel, e.g.
           -1, 0, 1, 2; will be interpreted as an offset from x.ndim if negative.
       max_var_per_eig: the maximum proportion of the variance of the
           features/channels, after mean subtraction, that can come from
           any given eigenvalue.
       min_prob: the minimum probability with which we apply this during any invocation
           of forward(), assuming last time we applied the constraint it was
           not active; supplied for speed.
       scale: determines the scale with which we modify the gradients, relative
           to the existing / unmodified gradients
    """

    def __init__(
        self,
        num_channels: int,
        channel_dim: int,
        max_var_per_eig: float = 0.2,
        min_prob: float = 0.01,
        scale: float = 0.01,
    ):
        super(MaxEig, self).__init__()
        self.num_channels = num_channels
        self.channel_dim = channel_dim
        self.scale = scale
        assert max_var_per_eig == 0.0 or max_var_per_eig > 1.0 / num_channels
        self.max_var_per_eig = max_var_per_eig

        # we figure out the dominant direction using the power method: starting with
        # a random vector, keep multiplying by the covariance and renormalizing.
        with torch.no_grad():
            # arbitrary.. would use randn() but want to leave the rest of the model's
            # random parameters unchanged for comparison
            direction = torch.arange(num_channels).to(torch.float)
            direction = direction / direction.norm()
            self.register_buffer("max_eig_direction", direction)

        self.min_prob = min_prob
        # cur_prob is the current probability we'll use to apply the ActivationBalancer.
        # We'll regress this towards min_prob each time we try to apply it and it is
        # not active.
        self.cur_prob = 1.0

    def forward(self, x: Tensor) -> Tensor:
        if (
            torch.jit.is_scripting()
            or self.max_var_per_eig <= 0
            or random.random() > self.cur_prob
            or torch.jit.is_tracing()
        ):
            return _no_op(x)

        with torch.cuda.amp.autocast(enabled=False):
            eps = 1.0e-20
            orig_x = x
            x = x.to(torch.float32)
            with torch.no_grad():
                x = x.transpose(self.channel_dim, -1).reshape(-1, self.num_channels)
                x = x - x.mean(dim=0)
                new_direction, coeffs = self._find_direction_coeffs(
                    x, self.max_eig_direction
                )
                x_var = (x**2).mean()
                x_residual = x - coeffs * new_direction
                x_residual_var = (x_residual**2).mean()

                # `variance_proportion` is the proportion of the variance accounted for
                # by the top eigen-direction.
                variance_proportion = (x_var - x_residual_var) / (x_var + 1.0e-20)

                # ensure new direction is nonzero even if x == 0, by including `direction`.
                self._set_direction(0.1 * self.max_eig_direction + new_direction)

            if random.random() < 0.01 or __name__ == "__main__":
                logging.info(
                    f"variance_proportion = {variance_proportion.item()}, shape={tuple(orig_x.shape)}, cur_prob={self.cur_prob}"
                )

            if variance_proportion >= self.max_var_per_eig:
                # The constraint is active. Note, we should quite rarely
                # reach here, only near the beginning of training if we are
                # starting to diverge, should this constraint be active.
                self.cur_prob = 1.0  # next time, do the update with probability 1.0.
                return MaxEigLimiterFunction.apply(
                    orig_x, coeffs, new_direction, self.channel_dim, self.scale
                )
            else:
                # let self.cur_prob exponentially approach self.min_prob, as
                # long as the constraint is inactive.
                self.cur_prob = 0.75 * self.cur_prob + 0.25 * self.min_prob
                return orig_x

    def _set_direction(self, direction: Tensor):
        """
        Sets self.max_eig_direction to a normalized version of `direction`
        """
        direction = direction.detach()
        direction = direction / direction.norm()
        direction_sum = direction.sum().item()
        if direction_sum - direction_sum == 0:  # no inf/nan
            self.max_eig_direction[:] = direction
        else:
            logging.info(
                f"Warning: sum of direction in MaxEig is {direction_sum}, "
                f"num_channels={self.num_channels}, channel_dim={self.channel_dim}"
            )

    def _find_direction_coeffs(
        self, x: Tensor, prev_direction: Tensor
    ) -> Tuple[Tensor, Tensor]:
        """
        Figure out (an approximation to) the proportion of the variance of a set of
        feature vectors that can be attributed to the top eigen-direction.
        Args:
         x: a Tensor of shape (num_frames, num_channels), with num_frames > 1.
         prev_direction: a Tensor of shape (num_channels,), that is our previous estimate
               of the top eigen-direction, or a random direction if this is the first
               iteration. Does not have to be normalized, but should be nonzero.

        Returns: (cur_direction, coeffs), where:
             cur_direction: a Tensor of shape (num_channels,) that is the current
                  estimate of the top eigen-direction.
             coeffs: a Tensor of shape (num_frames, 1) that minimizes, or
                  approximately minimizes, (x - coeffs * cur_direction).norm()
        """
        (num_frames, num_channels) = x.shape
        assert num_channels > 1 and num_frames > 1
        assert prev_direction.shape == (num_channels,)
        # `coeffs` are the coefficients of `prev_direction` in x;
        # they actually represent the coeffs up to a constant positive factor.
        coeffs = (x * prev_direction).sum(dim=1, keepdim=True) + 1.0e-10
        cur_direction = (x * coeffs).sum(dim=0) / ((coeffs**2).sum() + 1.0e-20)
        return cur_direction, coeffs


class DoubleSwishFunction(torch.autograd.Function):
    """
    double_swish(x) = x * torch.sigmoid(x-1)
    This is a definition, originally motivated by its close numerical
    similarity to swish(swish(x)), where swish(x) = x * sigmoid(x).

    Memory-efficient derivative computation:
     double_swish(x) = x * s, where s(x) = torch.sigmoid(x-1)
     double_swish'(x) = d/dx double_swish(x) = x * s'(x) + x' * s(x) = x * s'(x) + s(x).
     Now, s'(x) = s(x) * (1-s(x)).
     double_swish'(x) = x * s'(x) + s(x).
                      = x * s(x) * (1-s(x)) + s(x).
                      = double_swish(x) * (1-s(x)) + s(x)
     ... so we just need to remember s(x) but not x itself.
    """

    @staticmethod
    def forward(ctx, x: Tensor) -> Tensor:
        requires_grad = x.requires_grad
        x_dtype = x.dtype
        if x.dtype == torch.float16:
            x = x.to(torch.float32)

        s = torch.sigmoid(x - 1.0)
        y = x * s

        if requires_grad:
            deriv = y * (1 - s) + s
            # notes on derivative of x * sigmoid(x - 1):
            # https://www.wolframalpha.com/input?i=d%2Fdx+%28x+*+sigmoid%28x-1%29%29
            # min \simeq -0.043638. Take floor as -0.043637 so it's a lower bound.
            # max \simeq 1.1990. Take ceil to be 1.2 so it's an upper bound.
            # the combination of "+ torch.rand_like(deriv)" and casting to torch.uint8 (which
            # floors), should be expectation-preserving.
            floor = -0.043637
            ceil = 1.2
            d_scaled = (deriv - floor) * (255.0 / (ceil - floor)) + torch.rand_like(
                deriv
            )
            if __name__ == "__main__":
                # for self-testing only.
                assert d_scaled.min() >= 0.0
                assert d_scaled.max() < 256.0
            d_int = d_scaled.to(torch.uint8)
            ctx.save_for_backward(d_int)
        # cast back to half precision if the original input was float16 or we
        # are under autocast (x itself was promoted to float32 above, so we
        # check the saved x_dtype rather than x.dtype).
        if x_dtype == torch.float16 or torch.is_autocast_enabled():
            y = y.to(torch.float16)
        return y

    @staticmethod
    def backward(ctx, y_grad: Tensor) -> Tensor:
        (d,) = ctx.saved_tensors
        # the same constants as used in forward pass.
        floor = -0.043637
        ceil = 1.2
        d = d * ((ceil - floor) / 255.0) + floor
        return y_grad * d


class DoubleSwish(torch.nn.Module):
    def forward(self, x: Tensor) -> Tensor:
        """Return double-swish activation function which is an approximation to Swish(Swish(x)),
        that we approximate closely with x * sigmoid(x-1).
        """
        if torch.jit.is_scripting() or torch.jit.is_tracing():
            return x * torch.sigmoid(x - 1.0)
        return DoubleSwishFunction.apply(x)


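# A small numeric sketch (a hypothetical `_demo_*` helper for illustration,
# not part of the upstream module): on inputs that do not require grad, the
# custom autograd path computes x * sigmoid(x - 1) exactly; the quantised,
# expectation-preserving approximation only affects the backward pass.
def _demo_double_swish():
    x = torch.linspace(-4.0, 4.0, 9)
    y = DoubleSwish()(x)
    assert torch.allclose(y, x * torch.sigmoid(x - 1.0))

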
def BalancedDoubleSwish(
    d_model, channel_dim=-1, max_abs=10.0, min_prob=0.25
) -> nn.Sequential:
    """
    ActivationBalancer -> DoubleSwish
    """
    balancer = ActivationBalancer(
        d_model, channel_dim=channel_dim, max_abs=max_abs, min_prob=min_prob
    )
    return nn.Sequential(
        balancer,
        DoubleSwish(),
    )


def _test_max_eig():
    for proportion in [0.1, 0.5, 10.0]:
        logging.info(f"proportion = {proportion}")
        x = torch.randn(100, 128)
        direction = torch.randn(128)
        coeffs = torch.randn(100, 1)
        x += proportion * direction * coeffs

        x.requires_grad = True

        num_channels = 128
        m = MaxEig(
            num_channels,
            1,  # channel_dim
            0.5,  # max_var_per_eig
            scale=0.1,  # scale of the gradient modification
        )

        for _ in range(4):
            y = m(x)

        y_grad = torch.randn_like(x)
        y.backward(gradient=y_grad)

        if proportion < 0.2:
            assert torch.allclose(x.grad, y_grad, atol=1.0e-02)
        elif proportion > 1.0:
            assert not torch.allclose(x.grad, y_grad)


def _test_whiten():
    for proportion in [0.1, 0.5, 10.0]:
        logging.info(f"_test_whiten(): proportion = {proportion}")
        x = torch.randn(100, 128)
        direction = torch.randn(128)
        coeffs = torch.randn(100, 1)
        x += proportion * direction * coeffs

        x.requires_grad = True

        num_channels = 128
        m = Whiten(
            1,  # num_groups
            5.0,  # whitening_limit
            prob=1.0,
            grad_scale=0.1,
        )

        for _ in range(4):
            y = m(x)

        y_grad = torch.randn_like(x)
        y.backward(gradient=y_grad)

        if proportion < 0.2:
            assert torch.allclose(x.grad, y_grad)
        elif proportion > 1.0:
            assert not torch.allclose(x.grad, y_grad)


def _test_activation_balancer_sign():
    probs = torch.arange(0, 1, 0.01)
    N = 1000
    x = 1.0 * ((2.0 * (torch.rand(probs.numel(), N) < probs.unsqueeze(-1))) - 1.0)
    x = x.detach()
    x.requires_grad = True
    m = ActivationBalancer(
        probs.numel(),
        channel_dim=0,
        min_positive=0.05,
        max_positive=0.95,
        max_factor=0.2,
        min_abs=0.0,
    )

    y_grad = torch.sign(torch.randn(probs.numel(), N))

    y = m(x)
    y.backward(gradient=y_grad)
    print("_test_activation_balancer_sign: x = ", x)
    print("_test_activation_balancer_sign: y grad = ", y_grad)
    print("_test_activation_balancer_sign: x grad = ", x.grad)


def _test_activation_balancer_magnitude():
    magnitudes = torch.arange(0, 1, 0.01)
    N = 1000
    x = torch.sign(torch.randn(magnitudes.numel(), N)) * magnitudes.unsqueeze(-1)
    x = x.detach()
    x.requires_grad = True
    m = ActivationBalancer(
        magnitudes.numel(),
        channel_dim=0,
        min_positive=0.0,
        max_positive=1.0,
        max_factor=0.2,
        min_abs=0.2,
        max_abs=0.8,
        min_prob=1.0,
    )

    y_grad = torch.sign(torch.randn(magnitudes.numel(), N))

    y = m(x)
    y.backward(gradient=y_grad)
    print("_test_activation_balancer_magnitude: x = ", x)
    print("_test_activation_balancer_magnitude: y grad = ", y_grad)
    print("_test_activation_balancer_magnitude: x grad = ", x.grad)


def _test_basic_norm():
    num_channels = 128
    m = BasicNorm(num_channels=num_channels, channel_dim=1)

    x = torch.randn(500, num_channels)

    y = m(x)

    assert y.shape == x.shape
    x_rms = (x**2).mean().sqrt()
    y_rms = (y**2).mean().sqrt()
    print("x rms = ", x_rms)
    print("y rms = ", y_rms)
    assert y_rms < x_rms
    assert y_rms > 0.5 * x_rms


def _test_double_swish_deriv():
    x = torch.randn(10, 12, dtype=torch.double) * 3.0
    x.requires_grad = True
    m = DoubleSwish()

    tol = (1.2 - (-0.043637)) / 255.0
    torch.autograd.gradcheck(m, x, atol=tol)

    # for self-test.
    x = torch.randn(1000, 1000, dtype=torch.double) * 3.0
    x.requires_grad = True
    y = m(x)


def _test_softmax():
    a = torch.randn(2, 10, dtype=torch.float64)
    b = a.clone()
    a.requires_grad = True
    b.requires_grad = True
    a.softmax(dim=1)[:, 0].sum().backward()
    print("a grad = ", a.grad)
    softmax(b, dim=1)[:, 0].sum().backward()
    print("b grad = ", b.grad)
    assert torch.allclose(a.grad, b.grad)


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    torch.set_num_threads(1)
    torch.set_num_interop_threads(1)
    _test_softmax()
    _test_whiten()
    _test_max_eig()
    _test_activation_balancer_sign()
    _test_activation_balancer_magnitude()
    _test_basic_norm()
    _test_double_swish_deriv()
Amphion/modules/norms/norm.py ADDED
@@ -0,0 +1,173 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.


import copy
import numbers
from typing import Any, List, Tuple, Union

import torch
from torch import Tensor, nn
from torch.nn import functional as F

from modules.general.scaling import ActivationBalancer
from modules.general.scaling import BasicNorm as _BasicNorm


_shape_t = Union[int, List[int], torch.Size]


class LayerNorm(nn.Module):
    __constants__ = ["normalized_shape", "eps", "elementwise_affine"]
    normalized_shape: Tuple[int, ...]
    eps: float
    elementwise_affine: bool

    def __init__(
        self,
        normalized_shape: _shape_t,
        eps: float = 1e-5,
        elementwise_affine: bool = True,
        device=None,
        dtype=None,
    ) -> None:
        factory_kwargs = {"device": device, "dtype": dtype}
        super(LayerNorm, self).__init__()
        if isinstance(normalized_shape, numbers.Integral):
            normalized_shape = (normalized_shape,)
        self.normalized_shape = tuple(normalized_shape)
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if self.elementwise_affine:
            self.weight = nn.Parameter(
                torch.empty(self.normalized_shape, **factory_kwargs)
            )
            self.bias = nn.Parameter(
                torch.empty(self.normalized_shape, **factory_kwargs)
            )
        else:
            self.register_parameter("weight", None)
            self.register_parameter("bias", None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.elementwise_affine:
            nn.init.ones_(self.weight)
            nn.init.zeros_(self.bias)

    def forward(self, input: Tensor, embedding: Any = None) -> Tensor:
        if isinstance(input, tuple):
            input, embedding = input
            output = F.layer_norm(
                input, self.normalized_shape, self.weight, self.bias, self.eps
            )
            return output, embedding

        assert embedding is None
        return F.layer_norm(
            input, self.normalized_shape, self.weight, self.bias, self.eps
        )

    def extra_repr(self) -> str:
        return (
            "{normalized_shape}, eps={eps}, "
            "elementwise_affine={elementwise_affine}".format(**self.__dict__)
        )


class AdaptiveLayerNorm(nn.Module):
    r"""Adaptive Layer Normalization"""

    def __init__(self, d_model, norm) -> None:
        super(AdaptiveLayerNorm, self).__init__()
        self.project_layer = nn.Linear(d_model, 2 * d_model)
        self.norm = norm
        self.d_model = d_model
        self.eps = self.norm.eps

    def forward(self, input: Tensor, embedding: Tensor = None) -> Tensor:
        if isinstance(input, tuple):
            input, embedding = input
            weight, bias = torch.split(
                self.project_layer(embedding),
                split_size_or_sections=self.d_model,
                dim=-1,
            )
            return (weight * self.norm(input) + bias, embedding)

        weight, bias = torch.split(
            self.project_layer(embedding),
            split_size_or_sections=self.d_model,
            dim=-1,
        )
        return weight * self.norm(input) + bias


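# A minimal usage sketch (a hypothetical `_demo_*` helper for illustration,
# not part of the upstream module): the conditioning embedding is projected
# to a per-feature (weight, bias) pair that modulates the wrapped norm's
# output, which is how conditioning information enters the normalization.
def _demo_adaptive_layer_norm():
    d_model = 8
    ada = AdaptiveLayerNorm(d_model, LayerNorm(d_model))
    x = torch.randn(2, 5, d_model)
    emb = torch.randn(2, 5, d_model)
    y = ada(x, emb)
    assert y.shape == x.shape

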
class BasicNorm(_BasicNorm):
    def __init__(
        self,
        d_model: int,
        eps: float = 1e-5,
        device=None,
        dtype=None,
    ):
        super(BasicNorm, self).__init__(d_model, eps=eps)

    def forward(self, input: Tensor, embedding: Any = None) -> Tensor:
        if isinstance(input, tuple):
            input, embedding = input
            return (
                super(BasicNorm, self).forward(input),
                embedding,
            )

        assert embedding is None
        return super(BasicNorm, self).forward(input)


class BalancedBasicNorm(nn.Module):
    def __init__(
        self,
        d_model: int,
        eps: float = 1e-5,
        device=None,
        dtype=None,
    ):
        super(BalancedBasicNorm, self).__init__()
        self.balancer = ActivationBalancer(
            d_model,
            channel_dim=-1,
            min_positive=0.45,
            max_positive=0.55,
            max_abs=6.0,
        )
        self.norm = BasicNorm(d_model, eps, device=device, dtype=dtype)

    def forward(self, input: Tensor, embedding: Any = None) -> Tensor:
        if isinstance(input, tuple):
            input, embedding = input
            return self.norm((self.balancer(input), embedding))

        assert embedding is None
        return self.norm(self.balancer(input))


class IdentityNorm(nn.Module):
    def __init__(
        self,
        d_model: int,
        eps: float = 1e-5,
        device=None,
        dtype=None,
    ) -> None:
        super(IdentityNorm, self).__init__()

    def forward(self, input: Tensor, embedding: Any = None) -> Tensor:
        if isinstance(input, tuple):
            return input

        assert embedding is None
        return input