Spaces:
Running
Running
| # Amphion Evaluation Recipe | |
| ## Supported Evaluation Metrics | |
| Until now, Amphion Evaluation has supported the following objective metrics: | |
| - **F0 Modeling**: | |
| - F0 Pearson Coefficients (FPC) | |
| - F0 Periodicity Root Mean Square Error (PeriodicityRMSE) | |
| - F0 Root Mean Square Error (F0RMSE) | |
| - Voiced/Unvoiced F1 Score (V/UV F1) | |
| - **Energy Modeling**: | |
| - Energy Root Mean Square Error (EnergyRMSE) | |
| - Energy Pearson Coefficients (EnergyPC) | |
| - **Intelligibility**: | |
| - Character Error Rate (CER) based on [Whipser](https://github.com/openai/whisper) | |
| - Word Error Rate (WER) based on [Whipser](https://github.com/openai/whisper) | |
| - **Spectrogram Distortion**: | |
| - Frechet Audio Distance (FAD) | |
| - Mel Cepstral Distortion (MCD) | |
| - Multi-Resolution STFT Distance (MSTFT) | |
| - Perceptual Evaluation of Speech Quality (PESQ) | |
| - Short Time Objective Intelligibility (STOI) | |
| - Scale Invariant Signal to Distortion Ratio (SISDR) | |
| - Scale Invariant Signal to Noise Ratio (SISNR) | |
| - **Speaker Similarity**: | |
| - Cosine similarity based on [Rawnet3](https://github.com/Jungjee/RawNet) | |
| - Cosine similarity based on [WeSpeaker](https://github.com/wenet-e2e/wespeaker) (👨💻 developing) | |
| We provide a recipe to demonstrate how to objectively evaluate your generated audios. There are three steps in total: | |
| 1. Pretrained Models Preparation | |
| 2. Audio Data Preparation | |
| 3. Evaluation | |
| ## 1. Pretrained Models Preparation | |
| If you want to calculate `RawNet3` based speaker similarity, you need to download the pretrained model first, as illustrated [here](../../pretrained/README.md). | |
| ## 2. Aduio Data Preparation | |
| Prepare reference audios and generated audios in two folders, the `ref_dir` contains the reference audio and the `gen_dir` contains the generated audio. Here is an example. | |
| ```plaintext | |
| ┣ {ref_dir} | |
| ┃ ┣ sample1.wav | |
| ┃ ┣ sample2.wav | |
| ┣ {gen_dir} | |
| ┃ ┣ sample1.wav | |
| ┃ ┣ sample2.wav | |
| ``` | |
| You have to make sure that the pairwise **reference audio and generated audio are named the same**, as illustrated above (sample1 to sample1, sample2 to sample2). | |
| ## 3. Evaluation | |
| Run the `run.sh` with specified refenrece folder, generated folder, dump folder and metrics. | |
| ```bash | |
| cd Amphion | |
| sh egs/metrics/run.sh \ | |
| --reference_folder [Your path to the reference audios] \ | |
| --generated_folder [Your path to the generated audios] \ | |
| --dump_folder [Your path to dump the objective results] \ | |
| --metrics [The metrics you need] \ | |
| ``` | |
| As for the metrics, an example is provided below: | |
| ```bash | |
| --metrics "mcd pesq fad" | |
| ``` | |
| All currently available metrics keywords are listed below: | |
| | Keys | Description | | |
| | --------------------- | ------------------------------------------ | | |
| | `fpc` | F0 Pearson Coefficients | | |
| | `f0_periodicity_rmse` | F0 Periodicity Root Mean Square Error | | |
| | `f0rmse` | F0 Root Mean Square Error | | |
| | `v_uv_f1` | Voiced/Unvoiced F1 Score | | |
| | `energy_rmse` | Energy Root Mean Square Error | | |
| | `energy_pc` | Energy Pearson Coefficients | | |
| | `cer` | Character Error Rate | | |
| | `wer` | Word Error Rate | | |
| | `speaker_similarity` | Cos Similarity based on RawNet3 | | |
| | `fad` | Frechet Audio Distance | | |
| | `mcd` | Mel Cepstral Distortion | | |
| | `mstft` | Multi-Resolution STFT Distance | | |
| | `pesq` | Perceptual Evaluation of Speech Quality | | |
| | `si_sdr` | Scale Invariant Signal to Distortion Ratio | | |
| | `si_snr` | Scale Invariant Signal to Noise Ratio | | |
| | `stoi` | Short Time Objective Intelligibility | | |