IndexTTS-2-Demo


vasugo05 committed on Dec 27, 2025

Commit

24783d6

verified ·

1 Parent(s): 994ba2a

Upload 221 files

This view is limited to 50 files because it contains too many changes.

Files changed (50)

archive/README_INDEXTTS_1_5.md +247 -0
examples/cases.jsonl +12 -0
examples/emo_hate.wav +3 -0
examples/emo_sad.wav +3 -0
examples/voice_01.wav +3 -0
examples/voice_02.wav +3 -0
examples/voice_03.wav +3 -0
examples/voice_04.wav +3 -0
examples/voice_05.wav +3 -0
examples/voice_06.wav +3 -0
examples/voice_07.wav +3 -0
examples/voice_08.wav +3 -0
examples/voice_09.wav +3 -0
examples/voice_10.wav +3 -0
examples/voice_11.wav +3 -0
examples/voice_12.wav +3 -0
indextts/.DS_Store +0 -0
indextts/BigVGAN/.DS_Store +0 -0
indextts/BigVGAN/ECAPA_TDNN.py +656 -0
indextts/BigVGAN/__init__.py +0 -0
indextts/BigVGAN/activations.py +122 -0
indextts/BigVGAN/alias_free_activation/.DS_Store +0 -0
indextts/BigVGAN/alias_free_activation/__init__.py +0 -0
indextts/BigVGAN/alias_free_activation/cuda/.gitignore +1 -0
indextts/BigVGAN/alias_free_activation/cuda/__init__.py +0 -0
indextts/BigVGAN/alias_free_activation/cuda/activation1d.py +76 -0
indextts/BigVGAN/alias_free_activation/cuda/anti_alias_activation.cpp +23 -0
indextts/BigVGAN/alias_free_activation/cuda/anti_alias_activation_cuda.cu +256 -0
indextts/BigVGAN/alias_free_activation/cuda/compat.h +29 -0
indextts/BigVGAN/alias_free_activation/cuda/load.py +121 -0
indextts/BigVGAN/alias_free_activation/cuda/type_shim.h +92 -0
indextts/BigVGAN/alias_free_activation/torch/__init__.py +6 -0
indextts/BigVGAN/alias_free_activation/torch/act.py +31 -0
indextts/BigVGAN/alias_free_activation/torch/filter.py +102 -0
indextts/BigVGAN/alias_free_activation/torch/resample.py +58 -0
indextts/BigVGAN/alias_free_torch/__init__.py +6 -0
indextts/BigVGAN/alias_free_torch/act.py +29 -0
indextts/BigVGAN/alias_free_torch/filter.py +96 -0
indextts/BigVGAN/alias_free_torch/resample.py +49 -0
indextts/BigVGAN/bigvgan.py +534 -0
indextts/BigVGAN/models.py +451 -0
indextts/BigVGAN/nnet/CNN.py +546 -0
indextts/BigVGAN/nnet/__init__.py +0 -0
indextts/BigVGAN/nnet/linear.py +89 -0
indextts/BigVGAN/nnet/normalization.py +670 -0
indextts/BigVGAN/utils.py +101 -0
indextts/__init__.py +0 -0
indextts/cli.py +65 -0
indextts/gpt/__init__.py +0 -0
indextts/gpt/conformer/__init__.py +0 -0

archive/README_INDEXTTS_1_5.md ADDED Viewed

	@@ -0,0 +1,247 @@

+<div align="center">
+<img src='assets/index_icon.png' width="250"/>
+</div>
+<h2><center>IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System</center></h2>
+<p align="center">
+<a href='https://arxiv.org/abs/2502.05512'><img src='https://img.shields.io/badge/ArXiv-2502.05512-red'></a>
+## 👉🏻 IndexTTS 👈🏻
+[[HuggingFace Demo]](https://huggingface.co/spaces/IndexTeam/IndexTTS)   [[ModelScope Demo]](https://modelscope.cn/studios/IndexTeam/IndexTTS-Demo) \
+[[Paper]](https://arxiv.org/abs/2502.05512)  [[Demos]](https://index-tts.github.io)
+**IndexTTS** is a GPT-style text-to-speech (TTS) model mainly based on XTTS and Tortoise. It is capable of correcting the pronunciation of Chinese characters using pinyin and controlling pauses at any position through punctuation marks. We enhanced multiple modules of the system, including the improvement of speaker condition feature representation, and the integration of BigVGAN2 to optimize audio quality. Trained on tens of thousands of hours of data, our system achieves state-of-the-art performance, outperforming current popular TTS systems such as XTTS, CosyVoice2, Fish-Speech, and F5-TTS.
+<span style="font-size:16px;">
+Experience **IndexTTS**: Please contact <u>xuanwu@bilibili.com</u> for more detailed information. </span>
+### Contact
+QQ group (group 2): 1048202584 \
+Discord: https://discord.gg/uT32E7KDmy  \
+Résumés: indexspeech@bilibili.com  \
+Everyone is welcome to join the discussion!
+## 📣 Updates
+- `2025/05/14` 🔥🔥 We release **IndexTTS-1.5**, which significantly improves the model's stability and its English-language performance.
+- `2025/03/25` 🔥 We release IndexTTS-1.0 model parameters and inference code.
+- `2025/02/12` 🔥 We submitted our paper on arXiv, and released our demos and test sets.
+## 🖥️ Method
+The overview of IndexTTS is shown as follows.
+<picture>
+  <img src="assets/IndexTTS.png"  width="800"/>
+</picture>
+The main improvements and contributions are summarized as follows:
+ - In Chinese scenarios, we have introduced a character-pinyin hybrid modeling approach. This allows for quick correction of mispronounced characters.
+ - **IndexTTS** incorporates a Conformer conditioning encoder and a BigVGAN2-based speech code decoder. This improves training stability, voice timbre similarity, and sound quality.
+ - We release all test sets here, including those for polysyllabic words, subjective and objective test sets.
+## Model Download
+| 🤗**HuggingFace**                                          | **ModelScope** |
+|----------------------------------------------------------|----------------------------------------------------------|
+| [IndexTTS](https://huggingface.co/IndexTeam/Index-TTS) | [IndexTTS](https://modelscope.cn/models/IndexTeam/Index-TTS) |
+| [😁IndexTTS-1.5](https://huggingface.co/IndexTeam/IndexTTS-1.5) | [IndexTTS-1.5](https://modelscope.cn/models/IndexTeam/IndexTTS-1.5) |
+## 📑 Evaluation
+**Word Error Rate (WER) Results for IndexTTS and Baseline Models on the** [**seed-test**](https://github.com/BytedanceSpeech/seed-tts-eval)
+| **WER**                | **test_zh** | **test_en** | **test_hard** |
+|:----------------------:|:-----------:|:-----------:|:-------------:|
+| **Human**              | 1.26        | 2.14        | -             |
+| **SeedTTS**            | 1.002       | 1.945       | **6.243**     |
+| **CosyVoice 2**        | 1.45        | 2.57        | 6.83          |
+| **F5TTS**              | 1.56        | 1.83        | 8.67          |
+| **FireRedTTS**         | 1.51        | 3.82        | 17.45         |
+| **MaskGCT**            | 2.27        | 2.62        | 10.27         |
+| **Spark-TTS**          | 1.2         | 1.98        | -             |
+| **MegaTTS 3**          | 1.36        | 1.82        | -             |
+| **IndexTTS**           | 0.937       | 1.936       | 6.831         |
+| **IndexTTS-1.5**       | **0.821**   | **1.606**   | 6.565         |
+**Word Error Rate (WER) Results for IndexTTS and Baseline Models on Other Open-Source Test Sets**
+|    **Model**    | **aishell1_test** | **commonvoice_20_test_zh** | **commonvoice_20_test_en** | **librispeech_test_clean** |  **avg** |
+|:---------------:|:-----------------:|:--------------------------:|:--------------------------:|:--------------------------:|:--------:|
+|    **Human**    |        2.0        |            9.5             |            10.0            |            2.4             |   5.1    |
+| **CosyVoice 2** |        1.8        |            9.1             |            7.3             |            4.9             |   5.9    |
+|    **F5TTS**    |        3.9        |            11.7            |            5.4             |            7.8             |   8.2    |
+|  **Fishspeech** |        2.4        |            11.4            |            8.8             |            8.0             |   8.3    |
+|  **FireRedTTS** |        2.2        |            11.0            |            16.3            |            5.7             |   7.7    |
+|     **XTTS**    |        3.0        |            11.4            |            7.1             |            3.5             |   6.0    |
+|   **IndexTTS**  |      1.3          |          7.0               |            5.3             |          2.1             | 3.7       |
+|   **IndexTTS-1.5**  |      **1.2**     |          **6.8**          |          **3.9**          |          **1.7**          | **3.1** |
+**Speaker Similarity (SS) Results for IndexTTS and Baseline Models**
+|    **Model**    | **aishell1_test** | **commonvoice_20_test_zh** | **commonvoice_20_test_en** | **librispeech_test_clean** |  **avg**  |
+|:---------------:|:-----------------:|:--------------------------:|:--------------------------:|:--------------------------:|:---------:|
+|    **Human**    |       0.846       |            0.809           |            0.820           |            0.858           |   0.836   |
+| **CosyVoice 2** |     **0.796**     |            0.743           |            0.742           |          **0.837**         | **0.788** |
+|    **F5TTS**    |       0.743       |          **0.747**         |            0.746           |            0.828           |   0.779   |
+|  **Fishspeech** |       0.488       |            0.552           |            0.622           |            0.701           |   0.612   |
+|  **FireRedTTS** |       0.579       |            0.593           |            0.587           |            0.698           |   0.631   |
+|     **XTTS**    |       0.573       |            0.586           |            0.648           |            0.761           |   0.663   |
+|   **IndexTTS**  |       0.744       |            0.742           |          **0.758**         |            0.823           |   0.776   |
+|   **IndexTTS-1.5**  |       0.741       |            0.722           |          0.753         |            0.819           |   0.771   |
+**MOS Scores for Zero-Shot Cloned Voice**
+| **Model**       | **Prosody** | **Timbre** | **Quality** |  **AVG**  |
+|-----------------|:-----------:|:----------:|:-----------:|:---------:|
+| **CosyVoice 2** |    3.67     |    4.05    |    3.73     |   3.81    |
+| **F5TTS**       |    3.56     |    3.88    |    3.56     |   3.66    |
+| **Fishspeech**  |    3.40     |    3.63    |    3.69     |   3.57    |
+| **FireRedTTS**  |    3.79     |    3.72    |    3.60     |   3.70    |
+| **XTTS**        |    3.23     |    2.99    |    3.10     |   3.11    |
+| **IndexTTS**    |    **3.79**     |    **4.20**    |    **4.05**     |   **4.01**    |
+## Usage Instructions
+### Environment Setup
+1. Download this repository:
+```bash
+git clone https://github.com/index-tts/index-tts.git
+```
+2. Install dependencies:
+Create a new conda environment and install dependencies:
+```bash
+conda create -n index-tts python=3.10
+conda activate index-tts
+apt-get install ffmpeg
+# or use conda to install ffmpeg
+conda install -c conda-forge ffmpeg
+```
+Install [PyTorch](https://pytorch.org/get-started/locally/), e.g.:
+```bash
+pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
+```
+> [!NOTE]
+> If you are using Windows you may encounter [an error](https://github.com/index-tts/index-tts/issues/61) when installing `pynini`:
+> `ERROR: Failed building wheel for pynini`
+> In this case, please install `pynini` via `conda`:
+> ```bash
+> # after conda activate index-tts
+> conda install -c conda-forge pynini==2.1.6
+> pip install WeTextProcessing --no-deps
+> ```
+Install `IndexTTS` as a package:
+```bash
+cd index-tts
+pip install -e .
+```
+3. Download models:
+Download by `huggingface-cli`:
+```bash
+huggingface-cli download IndexTeam/IndexTTS-1.5 \
+  config.yaml bigvgan_discriminator.pth bigvgan_generator.pth bpe.model dvae.pth gpt.pth unigram_12000.vocab \
+  --local-dir checkpoints
+```
+Recommended for users in China. If downloads are slow, you can use a mirror:
+```bash
+export HF_ENDPOINT="https://hf-mirror.com"
+```
+Or by `wget`:
+```bash
+wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bigvgan_discriminator.pth -P checkpoints
+wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bigvgan_generator.pth -P checkpoints
+wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/bpe.model -P checkpoints
+wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/dvae.pth -P checkpoints
+wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/gpt.pth -P checkpoints
+wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/unigram_12000.vocab -P checkpoints
+wget https://huggingface.co/IndexTeam/IndexTTS-1.5/resolve/main/config.yaml -P checkpoints
+```
+> [!NOTE]
+> If you prefer to use the `IndexTTS-1.0` model, please replace `IndexTeam/IndexTTS-1.5` with `IndexTeam/IndexTTS` in the above commands.
+4. Run test script:
+```bash
+# Please put your prompt audio in 'test_data' and rename it to 'input.wav'
+python indextts/infer.py
+```
+5. Use as command line tool:
+```bash
+# Make sure pytorch has been installed before running this command
+indextts "大家好，我现在正在bilibili 体验 ai 科技，说实话，来之前我绝对想不到！AI技术已经发展到这样匪夷所思的地步了！" \
+  --voice reference_voice.wav \
+  --model_dir checkpoints \
+  --config checkpoints/config.yaml \
+  --output output.wav
+```
+Use `--help` to see more options.
+```bash
+indextts --help
+```
+#### Web Demo
+```bash
+pip install -e ".[webui]" --no-build-isolation
+python webui.py
+# use another model version:
+python webui.py --model_dir IndexTTS-1.5
+```
+Open your browser and visit `http://127.0.0.1:7860` to see the demo.
+#### Sample Code
+```python
+from indextts.infer import IndexTTS
+tts = IndexTTS(model_dir="checkpoints", cfg_path="checkpoints/config.yaml")
+voice = "reference_voice.wav"  # reference (prompt) audio whose voice is cloned
+text = "大家好，我现在正在bilibili 体验 ai 科技，说实话，来之前我绝对想不到！AI技术已经发展到这样匪夷所思的地步了！比如说，现在正在说话的其实是B站为我现场复刻的数字分身，简直就是平行宇宙的另一个我了。如果大家也想体验更多深入的AIGC功能，可以访问 bilibili studio，相信我，你们也会吃惊的。"
+output_path = "output.wav"  # destination file; this name was undefined in the original snippet
+tts.infer(voice, text, output_path)
+```
+## Acknowledgements
+1. [tortoise-tts](https://github.com/neonbjb/tortoise-tts)
+2. [XTTSv2](https://github.com/coqui-ai/TTS)
+3. [BigVGAN](https://github.com/NVIDIA/BigVGAN)
+4. [wenet](https://github.com/wenet-e2e/wenet/tree/main)
+5. [icefall](https://github.com/k2-fsa/icefall)
+## 📚 Citation
+🌟 If you find our work helpful, please leave us a star and cite our paper.
+```
+@article{deng2025indextts,
+  title={IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System},
+  author={Deng, Wei and Zhou, Siyi and Shu, Jingchen and Wang, Jinchao and Wang, Lu},
+  journal={arXiv preprint arXiv:2502.05512},
+  year={2025}
+}
+```
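
The Evaluation tables in the README above report Word Error Rate (WER). As a reminder of what that metric measures, here is a minimal sketch that computes WER as the word-level edit distance divided by the reference length. It is not the scoring script behind the published numbers: the function name `word_error_rate` is hypothetical, tokenization is plain whitespace splitting, and the result is a fraction rather than the percentage the tables presumably show. Chinese test sets are usually scored per character rather than per whitespace-separated word, which this sketch does not cover.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One word of the six-word reference is missing from the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In a TTS evaluation the hypothesis would be an ASR transcript of the synthesized audio and the reference would be the input text, so a lower value means the speech was more intelligible to the recognizer.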

examples/cases.jsonl ADDED Viewed

	@@ -0,0 +1,12 @@

+{"prompt_audio":"voice_01.wav","text":"Translate for me, what is a surprise!","emo_mode":0}
+{"prompt_audio":"voice_02.wav","text":"The palace is strict, no false rumors, Lady Qi!","emo_mode":0}
+{"prompt_audio":"voice_03.wav","text":"这个呀，就是我们精心制作准备的纪念品，大家可以看到这个色泽和这个材质啊，哎呀多么的光彩照人。","emo_mode":0}
+{"prompt_audio":"voice_04.wav","text":"你就需要我这种专业人士的帮助，就像手无缚鸡之力的人进入雪山狩猎，一定需要最老练的猎人指导。","emo_mode":0}
+{"prompt_audio":"voice_05.wav","text":"在真正的日本剑道中，格斗过程极其短暂，常常短至半秒，最长也不超过两秒，利剑相击的转瞬间，已有一方倒在血泊中。但在这电光石火的对决之前，双方都要以一个石雕般凝固的姿势站定，长时间的逼视对方，这一过程可能长达十分钟！","emo_mode":0}
+{"prompt_audio":"voice_06.wav","text":"今天呢，咱们开一部新书，叫《赛博朋克二零七七》。这词儿我听着都新鲜。这赛博朋克啊，简单理解就是“高科技，低生活”。这一听，我就明白了，于老师就爱用那高科技的东西，手机都得拿脚纹开，大冬天为了解锁脱得一丝不挂，冻得跟王八蛋似的。","emo_mode":0}
+{"prompt_audio":"voice_07.wav","emo_audio":"emo_sad.wav","emo_weight": 1.0, "emo_mode":1,"text":"酒楼丧尽天良，开始借机竞拍房间，哎，一群蠢货。"}
+{"prompt_audio":"voice_08.wav","emo_audio":"emo_hate.wav","emo_weight": 1.0, "emo_mode":1,"text":"你看看你，对我还有没有一点父子之间的信任了。"}
+{"prompt_audio":"voice_09.wav","emo_vec_3":0.8,"emo_mode":2,"text":"对不起嘛！我的记性真的不太好，但是和你在一起的事情，我都会努力记住的~"}
+{"prompt_audio":"voice_10.wav","emo_vec_7":1.0,"emo_mode":2,"text":"哇塞！这个爆率也太高了！欧皇附体了！"}
+{"prompt_audio":"voice_11.wav","emo_mode":3,"emo_text":"极度悲伤","text":"这些年的时光终究是错付了... "}
+{"prompt_audio":"voice_12.wav","emo_mode":3,"emo_text":"You scared me to death! What are you, a ghost?","text":"快躲起来！是他要来了！他要来抓我们了！"}
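
The records above differ in which emotion-related keys accompany each `emo_mode` value. The sketch below loads JSONL text and lists the `emo_*` keys seen per mode. The sample records are abbreviated stand-ins modeled on the file (texts shortened), the helper names `load_cases` and `emotion_fields_by_mode` are hypothetical, and the reading that each mode selects a different emotion source is inferred from the keys present, not from the demo's code.

```python
import json
from collections import defaultdict

# Abbreviated records modeled on examples/cases.jsonl; the texts are placeholders.
SAMPLE_JSONL = """\
{"prompt_audio": "voice_01.wav", "text": "sample text", "emo_mode": 0}
{"prompt_audio": "voice_07.wav", "emo_audio": "emo_sad.wav", "emo_weight": 1.0, "emo_mode": 1, "text": "sample text"}
{"prompt_audio": "voice_09.wav", "emo_vec_3": 0.8, "emo_mode": 2, "text": "sample text"}
{"prompt_audio": "voice_11.wav", "emo_mode": 3, "emo_text": "sample emotion", "text": "sample text"}
"""


def load_cases(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def emotion_fields_by_mode(cases):
    """Map each emo_mode value to the sorted emo_* keys that appear with it."""
    fields = defaultdict(set)
    for case in cases:
        extra = {key for key in case if key.startswith("emo_") and key != "emo_mode"}
        fields[case["emo_mode"]] |= extra
    return {mode: sorted(names) for mode, names in sorted(fields.items())}


if __name__ == "__main__":
    for mode, names in emotion_fields_by_mode(load_cases(SAMPLE_JSONL)).items():
        print(mode, names)
```

On these samples, mode 0 carries no emotion keys, mode 1 carries a reference audio file and a weight, mode 2 carries a numbered vector component, and mode 3 carries a text description.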

examples/emo_hate.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:89e6e7eee1a28303776e9cf43971e9505529bd0e669f5fcf47f4d1370f9187c4
+size 145368

examples/emo_sad.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f7d3e5bf2b7bca6458f9e6d7a5ce073c41eb4418895e7df2f994e5a0c96c064a
+size 842016

examples/voice_01.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e33e6ee0107a1dd58e1d66dd90c13df3d55a8683047cc3d7ea206dad84ed3fc8
+size 478050

examples/voice_02.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8fe2dd1dbd54ef85a073fbc4c8fc0198f8d4523cc3320a600de0e347a3d8b491
+size 574074

examples/voice_03.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:50e8b632efd794418919e2d33c8c2aab9189a57f4d21ef55020413be9f2b292a
+size 616814

examples/voice_04.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2a3d2536245f45fd5e1eef046dd768ae7b72a0dba3ec3f370f145862fe64b3b2
+size 681084

examples/voice_05.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eefb7f4a29a8b36f08d5cc1014ea947dbe9f7bef348f07c40263058e604a98eb
+size 1482796

examples/voice_06.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2d85800fe261d106c3274fa792cbb952458c4b0b2e1b908340a8cd0d63c73a30
+size 299052

examples/voice_07.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bcb10f84e63c3fdbfe99ac4184ca403b46a6d20b50540732713d48c4c95375ce
+size 591894

examples/voice_08.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2e2c5f4859999b1ada95ee801d50c3c72879147269a4ed99e385fd917dae5c6f
+size 426812

examples/voice_09.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8702467b9b3c83a16bead578e131c4388b3ef82aeff861bd336e622a9ae8a511
+size 1798188

examples/voice_10.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:39c2db8b395e4c6ea1122ec7463b5f7bd7dd7d7302f3255780e4c529a9ae9985
+size 1942242

examples/voice_11.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:82730e38498413d4371a76e841cd91fa2f74843b79ad3b606d45ad8a7b7a736c
+size 1520734

examples/voice_12.wav ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d67bd4f51773677d5902409813b9bb4c1d59b8243c74fc104553b80b49edd22b
+size 778626
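
Each `.wav` entry above is a Git LFS pointer, not audio data: three `key value` lines giving the spec version, the object ID (hash algorithm and digest), and the size in bytes. Below is a minimal sketch of reading such a pointer. The function name `parse_lfs_pointer` is hypothetical, and stripping a leading `+` is only there so that lines copied from this diff view can be pasted in directly.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("+").strip()  # tolerate diff-style "+" prefixes
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value

    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


if __name__ == "__main__":
    pointer = """\
+version https://git-lfs.github.com/spec/v1
+oid sha256:89e6e7eee1a28303776e9cf43971e9505529bd0e669f5fcf47f4d1370f9187c4
+size 145368
"""
    print(parse_lfs_pointer(pointer))
```

The `size` field is a quick way to compare the example clips without downloading them: `emo_hate.wav` is about 145 kB, while `voice_10.wav` is close to 1.9 MB.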

indextts/.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

indextts/BigVGAN/.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

indextts/BigVGAN/ECAPA_TDNN.py ADDED Viewed

	@@ -0,0 +1,656 @@

+"""A popular speaker recognition and diarization model.
+Authors
+ * Hwidong Na 2020
+"""
+import torch  # noqa: F401
+import torch.nn as nn
+import torch.nn.functional as F
+from indextts.BigVGAN.nnet.CNN import Conv1d as _Conv1d
+from indextts.BigVGAN.nnet.linear import Linear
+from indextts.BigVGAN.nnet.normalization import BatchNorm1d as _BatchNorm1d
+def length_to_mask(length, max_len=None, dtype=None, device=None):
+    """Creates a binary mask for each sequence.
+    Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3
+    Arguments
+    ---------
+    length : torch.LongTensor
+        Containing the length of each sequence in the batch. Must be 1D.
+    max_len : int
+        Max length for the mask, also the size of the second dimension.
+    dtype : torch.dtype, default: None
+        The dtype of the generated mask.
+    device: torch.device, default: None
+        The device to put the mask variable.
+    Returns
+    -------
+    mask : tensor
+        The binary mask.
+    Example
+    -------
+    >>> length=torch.Tensor([1,2,3])
+    >>> mask=length_to_mask(length)
+    >>> mask
+    tensor([[1., 0., 0.],
+            [1., 1., 0.],
+            [1., 1., 1.]])
+    """
+    assert len(length.shape) == 1
+    if max_len is None:
+        max_len = length.max().long().item()  # using arange to generate mask
+    mask = torch.arange(
+        max_len, device=length.device, dtype=length.dtype
+    ).expand(len(length), max_len) < length.unsqueeze(1)
+    if dtype is None:
+        dtype = length.dtype
+    if device is None:
+        device = length.device
+    mask = torch.as_tensor(mask, dtype=dtype, device=device)
+    return mask
+# Skip transpose as much as possible for efficiency
+class Conv1d(_Conv1d):
+    """1D convolution. Skip transpose is used to improve efficiency."""
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, skip_transpose=True, **kwargs)
+class BatchNorm1d(_BatchNorm1d):
+    """1D batch normalization. Skip transpose is used to improve efficiency."""
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, skip_transpose=True, **kwargs)
+class TDNNBlock(nn.Module):
+    """An implementation of TDNN.
+    Arguments
+    ---------
+    in_channels : int
+        Number of input channels.
+    out_channels : int
+        The number of output channels.
+    kernel_size : int
+        The kernel size of the TDNN blocks.
+    dilation : int
+        The dilation of the TDNN block.
+    activation : torch class
+        A class for constructing the activation layers.
+    groups : int
+        The groups size of the TDNN blocks.
+    Example
+    -------
+    >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
+    >>> layer = TDNNBlock(64, 64, kernel_size=3, dilation=1)
+    >>> out_tensor = layer(inp_tensor).transpose(1, 2)
+    >>> out_tensor.shape
+    torch.Size([8, 120, 64])
+    """
+    def __init__(
+        self,
+        in_channels,
+        out_channels,
+        kernel_size,
+        dilation,
+        activation=nn.ReLU,
+        groups=1,
+    ):
+        super().__init__()
+        self.conv = Conv1d(
+            in_channels=in_channels,
+            out_channels=out_channels,
+            kernel_size=kernel_size,
+            dilation=dilation,
+            groups=groups,
+        )
+        self.activation = activation()
+        self.norm = BatchNorm1d(input_size=out_channels)
+    def forward(self, x):
+        """Processes the input tensor x and returns an output tensor."""
+        return self.norm(self.activation(self.conv(x)))
+class Res2NetBlock(torch.nn.Module):
+    """An implementation of Res2NetBlock w/ dilation.
+    Arguments
+    ---------
+    in_channels : int
+        The number of channels expected in the input.
+    out_channels : int
+        The number of output channels.
+    scale : int
+        The scale of the Res2Net block.
+    kernel_size: int
+        The kernel size of the Res2Net block.
+    dilation : int
+        The dilation of the Res2Net block.
+    Example
+    -------
+    >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
+    >>> layer = Res2NetBlock(64, 64, scale=4, dilation=3)
+    >>> out_tensor = layer(inp_tensor).transpose(1, 2)
+    >>> out_tensor.shape
+    torch.Size([8, 120, 64])
+    """
+    def __init__(
+        self, in_channels, out_channels, scale=8, kernel_size=3, dilation=1
+    ):
+        super().__init__()
+        assert in_channels % scale == 0
+        assert out_channels % scale == 0
+        in_channel = in_channels // scale
+        hidden_channel = out_channels // scale
+        self.blocks = nn.ModuleList(
+            [
+                TDNNBlock(
+                    in_channel,
+                    hidden_channel,
+                    kernel_size=kernel_size,
+                    dilation=dilation,
+                )
+                for i in range(scale - 1)
+            ]
+        )
+        self.scale = scale
+    def forward(self, x):
+        """Processes the input tensor x and returns an output tensor."""
+        y = []
+        for i, x_i in enumerate(torch.chunk(x, self.scale, dim=1)):
+            if i == 0:
+                y_i = x_i
+            elif i == 1:
+                y_i = self.blocks[i - 1](x_i)
+            else:
+                y_i = self.blocks[i - 1](x_i + y_i)
+            y.append(y_i)
+        y = torch.cat(y, dim=1)
+        return y
+class SEBlock(nn.Module):
+    """An implementation of squeeze-and-excitation block.
+    Arguments
+    ---------
+    in_channels : int
+        The number of input channels.
+    se_channels : int
+        The number of output channels after squeeze.
+    out_channels : int
+        The number of output channels.
+    Example
+    -------
+    >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
+    >>> se_layer = SEBlock(64, 16, 64)
+    >>> lengths = torch.rand((8,))
+    >>> out_tensor = se_layer(inp_tensor, lengths).transpose(1, 2)
+    >>> out_tensor.shape
+    torch.Size([8, 120, 64])
+    """
+    def __init__(self, in_channels, se_channels, out_channels):
+        super().__init__()
+        self.conv1 = Conv1d(
+            in_channels=in_channels, out_channels=se_channels, kernel_size=1
+        )
+        self.relu = torch.nn.ReLU(inplace=True)
+        self.conv2 = Conv1d(
+            in_channels=se_channels, out_channels=out_channels, kernel_size=1
+        )
+        self.sigmoid = torch.nn.Sigmoid()
+    def forward(self, x, lengths=None):
+        """Processes the input tensor x and returns an output tensor."""
+        L = x.shape[-1]
+        if lengths is not None:
+            mask = length_to_mask(lengths * L, max_len=L, device=x.device)
+            mask = mask.unsqueeze(1)
+            total = mask.sum(dim=2, keepdim=True)
+            s = (x * mask).sum(dim=2, keepdim=True) / total
+        else:
+            s = x.mean(dim=2, keepdim=True)
+        s = self.relu(self.conv1(s))
+        s = self.sigmoid(self.conv2(s))
+        return s * x
+class AttentiveStatisticsPooling(nn.Module):
+    """This class implements an attentive statistic pooling layer for each channel.
+    It returns the concatenated mean and std of the input tensor.
+    Arguments
+    ---------
+    channels: int
+        The number of input channels.
+    attention_channels: int
+        The number of attention channels.
+    global_context: bool
+        Whether to use global context.
+    Example
+    -------
+    >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
+    >>> asp_layer = AttentiveStatisticsPooling(64)
+    >>> lengths = torch.rand((8,))
+    >>> out_tensor = asp_layer(inp_tensor, lengths).transpose(1, 2)
+    >>> out_tensor.shape
+    torch.Size([8, 1, 128])
+    """
+    def __init__(self, channels, attention_channels=128, global_context=True):
+        super().__init__()
+        self.eps = 1e-12
+        self.global_context = global_context
+        if global_context:
+            self.tdnn = TDNNBlock(channels * 3, attention_channels, 1, 1)
+        else:
+            self.tdnn = TDNNBlock(channels, attention_channels, 1, 1)
+        self.tanh = nn.Tanh()
+        self.conv = Conv1d(
+            in_channels=attention_channels, out_channels=channels, kernel_size=1
+        )
+    def forward(self, x, lengths=None):
+        """Calculates mean and std for a batch (input tensor).
+        Arguments
+        ---------
+        x : torch.Tensor
+            Tensor of shape [N, C, L].
+        lengths : torch.Tensor
+            The corresponding relative lengths of the inputs.
+        Returns
+        -------
+        pooled_stats : torch.Tensor
+            mean and std of batch
+        """
+        L = x.shape[-1]
+        def _compute_statistics(x, m, dim=2, eps=self.eps):
+            mean = (m * x).sum(dim)
+            std = torch.sqrt(
+                (m * (x - mean.unsqueeze(dim)).pow(2)).sum(dim).clamp(eps)
+            )
+            return mean, std
+        if lengths is None:
+            lengths = torch.ones(x.shape[0], device=x.device)
+        # Make binary mask of shape [N, 1, L]
+        mask = length_to_mask(lengths * L, max_len=L, device=x.device)
+        mask = mask.unsqueeze(1)
+        # Expand the temporal context of the pooling layer by allowing the
+        # self-attention to look at global properties of the utterance.
+        if self.global_context:
+            # torch.std is unstable for backward computation
+            # https://github.com/pytorch/pytorch/issues/4320
+            total = mask.sum(dim=2, keepdim=True).float()
+            mean, std = _compute_statistics(x, mask / total)
+            mean = mean.unsqueeze(2).repeat(1, 1, L)
+            std = std.unsqueeze(2).repeat(1, 1, L)
+            attn = torch.cat([x, mean, std], dim=1)
+        else:
+            attn = x
+        # Apply layers
+        attn = self.conv(self.tanh(self.tdnn(attn)))
+        # Filter out zero-paddings
+        attn = attn.masked_fill(mask == 0, float("-inf"))
+        attn = F.softmax(attn, dim=2)
+        mean, std = _compute_statistics(x, attn)
+        # Append mean and std of the batch
+        pooled_stats = torch.cat((mean, std), dim=1)
+        pooled_stats = pooled_stats.unsqueeze(2)
+        return pooled_stats
+class SERes2NetBlock(nn.Module):
+    """An implementation of building block in ECAPA-TDNN, i.e.,
+    TDNN-Res2Net-TDNN-SEBlock.
+    Arguments
+    ---------
+    in_channels: int
+        Expected size of input channels.
+    out_channels: int
+        The number of output channels.
+    res2net_scale: int
+        The scale of the Res2Net block.
+    se_channels : int
+        The number of output channels after squeeze.
+    kernel_size: int
+        The kernel size of the TDNN blocks.
+    dilation: int
+        The dilation of the Res2Net block.
+    activation : torch class
+        A class for constructing the activation layers.
+    groups: int
+        Number of blocked connections from input channels to output channels.
+    Example
+    -------
+    >>> x = torch.rand(8, 120, 64).transpose(1, 2)
+    >>> conv = SERes2NetBlock(64, 64, res2net_scale=4)
+    >>> out = conv(x).transpose(1, 2)
+    >>> out.shape
+    torch.Size([8, 120, 64])
+    """
+    def __init__(
+        self,
+        in_channels,
+        out_channels,
+        res2net_scale=8,
+        se_channels=128,
+        kernel_size=1,
+        dilation=1,
+        activation=torch.nn.ReLU,
+        groups=1,
+    ):
+        super().__init__()
+        self.out_channels = out_channels
+        self.tdnn1 = TDNNBlock(
+            in_channels,
+            out_channels,
+            kernel_size=1,
+            dilation=1,
+            activation=activation,
+            groups=groups,
+        )
+        self.res2net_block = Res2NetBlock(
+            out_channels, out_channels, res2net_scale, kernel_size, dilation
+        )
+        self.tdnn2 = TDNNBlock(
+            out_channels,
+            out_channels,
+            kernel_size=1,
+            dilation=1,
+            activation=activation,
+            groups=groups,
+        )
+        self.se_block = SEBlock(out_channels, se_channels, out_channels)
+        self.shortcut = None
+        if in_channels != out_channels:
+            self.shortcut = Conv1d(
+                in_channels=in_channels,
+                out_channels=out_channels,
+                kernel_size=1,
+            )
+    def forward(self, x, lengths=None):
+        """Processes the input tensor x and returns an output tensor."""
+        residual = x
+        if self.shortcut:
+            residual = self.shortcut(x)
+        x = self.tdnn1(x)
+        x = self.res2net_block(x)
+        x = self.tdnn2(x)
+        x = self.se_block(x, lengths)
+        return x + residual
+class ECAPA_TDNN(torch.nn.Module):
+    """An implementation of the speaker embedding model in a paper.
+    "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in
+    TDNN Based Speaker Verification" (https://arxiv.org/abs/2005.07143).
+    Arguments
+    ---------
+    input_size : int
+        Expected size of the input dimension.
+    device : str
+        Device used, e.g., "cpu" or "cuda".
+    lin_neurons : int
+        Number of neurons in linear layers.
+    activation : torch class
+        A class for constructing the activation layers.
+    channels : list of ints
+        Output channels for TDNN/SERes2Net layer.
+    kernel_sizes : list of ints
+        List of kernel sizes for each layer.
+    dilations : list of ints
+        List of dilations for kernels in each layer.
+    attention_channels: int
+        The number of attention channels.
+    res2net_scale : int
+        The scale of the Res2Net block.
+    se_channels : int
+        The number of output channels after squeeze.
+    global_context: bool
+        Whether to use global context.
+    groups : list of ints
+        List of groups for kernels in each layer.
+    Example
+    -------
+    >>> input_feats = torch.rand([5, 120, 80])
+    >>> compute_embedding = ECAPA_TDNN(80, lin_neurons=192)
+    >>> outputs = compute_embedding(input_feats)
+    >>> outputs.shape
+    torch.Size([5, 1, 192])
+    """
+    def __init__(
+        self,
+        input_size,
+        device="cpu",
+        lin_neurons=192,
+        activation=torch.nn.ReLU,
+        channels=[512, 512, 512, 512, 1536],
+        kernel_sizes=[5, 3, 3, 3, 1],
+        dilations=[1, 2, 3, 4, 1],
+        attention_channels=128,
+        res2net_scale=8,
+        se_channels=128,
+        global_context=True,
+        groups=[1, 1, 1, 1, 1],
+    ):
+        super().__init__()
+        assert len(channels) == len(kernel_sizes)
+        assert len(channels) == len(dilations)
+        self.channels = channels
+        self.blocks = nn.ModuleList()
+        # The initial TDNN layer
+        self.blocks.append(
+            TDNNBlock(
+                input_size,
+                channels[0],
+                kernel_sizes[0],
+                dilations[0],
+                activation,
+                groups[0],
+            )
+        )
+        # SE-Res2Net layers
+        for i in range(1, len(channels) - 1):
+            self.blocks.append(
+                SERes2NetBlock(
+                    channels[i - 1],
+                    channels[i],
+                    res2net_scale=res2net_scale,
+                    se_channels=se_channels,
+                    kernel_size=kernel_sizes[i],
+                    dilation=dilations[i],
+                    activation=activation,
+                    groups=groups[i],
+                )
+            )
+        # Multi-layer feature aggregation
+        self.mfa = TDNNBlock(
+            channels[-2] * (len(channels) - 2),
+            channels[-1],
+            kernel_sizes[-1],
+            dilations[-1],
+            activation,
+            groups=groups[-1],
+        )
+        # Attentive Statistical Pooling
+        self.asp = AttentiveStatisticsPooling(
+            channels[-1],
+            attention_channels=attention_channels,
+            global_context=global_context,
+        )
+        self.asp_bn = BatchNorm1d(input_size=channels[-1] * 2)
+        # Final linear transformation
+        self.fc = Conv1d(
+            in_channels=channels[-1] * 2,
+            out_channels=lin_neurons,
+            kernel_size=1,
+        )
+    def forward(self, x, lengths=None):
+        """Returns the embedding vector.
+        Arguments
+        ---------
+        x : torch.Tensor
+            Tensor of shape (batch, time, channel).
+        lengths : torch.Tensor
+            Corresponding relative lengths of inputs.
+        Returns
+        -------
+        x : torch.Tensor
+            Embedding vector.
+        """
+        # Minimize transpose for efficiency
+        x = x.transpose(1, 2)
+        xl = []
+        for layer in self.blocks:
+            try:
+                x = layer(x, lengths=lengths)
+            except TypeError:
+                x = layer(x)
+            xl.append(x)
+        # Multi-layer feature aggregation
+        x = torch.cat(xl[1:], dim=1)
+        x = self.mfa(x)
+        # Attentive Statistical Pooling
+        x = self.asp(x, lengths=lengths)
+        x = self.asp_bn(x)
+        # Final linear transformation
+        x = self.fc(x)
+        x = x.transpose(1, 2)
+        return x
+class Classifier(torch.nn.Module):
+    """This class implements the cosine similarity on the top of features.
+    Arguments
+    ---------
+    input_size : int
+        Expected size of input dimension.
+    device : str
+        Device used, e.g., "cpu" or "cuda".
+    lin_blocks : int
+        Number of linear layers.
+    lin_neurons : int
+        Number of neurons in linear layers.
+    out_neurons : int
+        Number of classes.
+    Example
+    -------
+    >>> classify = Classifier(input_size=2, lin_neurons=2, out_neurons=2)
+    >>> outputs = torch.tensor([ [1., -1.], [-9., 1.], [0.9, 0.1], [0.1, 0.9] ])
+    >>> outputs = outputs.unsqueeze(1)
+    >>> cos = classify(outputs)
+    >>> (cos < -1.0).long().sum()
+    tensor(0)
+    >>> (cos > 1.0).long().sum()
+    tensor(0)
+    """
+    def __init__(
+        self,
+        input_size,
+        device="cpu",
+        lin_blocks=0,
+        lin_neurons=192,
+        out_neurons=1211,
+    ):
+        super().__init__()
+        self.blocks = nn.ModuleList()
+        for block_index in range(lin_blocks):
+            self.blocks.extend(
+                [
+                    _BatchNorm1d(input_size=input_size),
+                    Linear(input_size=input_size, n_neurons=lin_neurons),
+                ]
+            )
+            input_size = lin_neurons
+        # Final Layer
+        self.weight = nn.Parameter(
+            torch.FloatTensor(out_neurons, input_size, device=device)
+        )
+        nn.init.xavier_uniform_(self.weight)
+    def forward(self, x):
+        """Returns the output probabilities over speakers.
+        Arguments
+        ---------
+        x : torch.Tensor
+            Torch tensor.
+        Returns
+        -------
+        out : torch.Tensor
+            Output probabilities over speakers.
+        """
+        for layer in self.blocks:
+            x = layer(x)
+        # Need to be normalized
+        x = F.linear(F.normalize(x.squeeze(1)), F.normalize(self.weight))
+        return x.unsqueeze(1)
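
Note on `Classifier.forward`: the last step L2-normalizes both the embedding and each weight row before `F.linear`, so every output is a cosine similarity bounded in [-1, 1], which is what the docstring example checks. A minimal pure-Python sketch of that computation follows. The helper names `l2_normalize` and `cosine_scores` and the sample vectors are assumptions for illustration, not part of this commit.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale vec to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def cosine_scores(embedding, weight_rows):
    """Cosine similarity between one embedding and each class weight row."""
    e = l2_normalize(embedding)
    return [sum(a * b for a, b in zip(e, l2_normalize(w))) for w in weight_rows]

# Hypothetical 2-D embedding and two class weight vectors.
scores = cosine_scores([1.0, -1.0], [[2.0, -2.0], [1.0, 1.0]])
```

The first weight row points in the same direction as the embedding and the second is orthogonal to it, so the scores come out close to 1 and 0 regardless of vector magnitude.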

indextts/BigVGAN/__init__.py ADDED

File without changes

indextts/BigVGAN/activations.py ADDED

	@@ -0,0 +1,122 @@

+# Implementation adapted from https://github.com/EdwardDixon/snake under the MIT license.
+#   LICENSE is in incl_licenses directory.
+import torch
+from torch import nn, pow, sin
+from torch.nn import Parameter
+class Snake(nn.Module):
+    '''
+    Implementation of a sine-based periodic activation function
+    Shape:
+        - Input: (B, C, T)
+        - Output: (B, C, T), same shape as the input
+    Parameters:
+        - alpha - trainable parameter
+    References:
+        - This activation function is from this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
+        https://arxiv.org/abs/2006.08195
+    Examples:
+        >>> a1 = Snake(256)
+        >>> x = torch.randn(1, 256, 100)  # (B, C, T)
+        >>> x = a1(x)
+    '''
+    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
+        '''
+        Initialization.
+        INPUT:
+            - in_features: shape of the input
+            - alpha: trainable parameter
+            alpha is initialized to 1 by default, higher values = higher-frequency.
+            alpha will be trained along with the rest of your model.
+        '''
+        super(Snake, self).__init__()
+        self.in_features = in_features
+        # initialize alpha
+        self.alpha_logscale = alpha_logscale
+        if self.alpha_logscale:  # log scale alphas initialized to zeros
+            self.alpha = Parameter(torch.zeros(in_features) * alpha)
+        else:  # linear scale alphas initialized to ones
+            self.alpha = Parameter(torch.ones(in_features) * alpha)
+        self.alpha.requires_grad = alpha_trainable
+        self.no_div_by_zero = 0.000000001
+    def forward(self, x):
+        '''
+        Forward pass of the function.
+        Applies the function to the input elementwise.
+        Snake ∶= x + 1/a * sin^2 (xa)
+        '''
+        alpha = self.alpha.unsqueeze(0).unsqueeze(-1)  # line up with x to [B, C, T]
+        if self.alpha_logscale:
+            alpha = torch.exp(alpha)
+        x = x + (1.0 / (alpha + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
+        return x
+class SnakeBeta(nn.Module):
+    '''
+    A modified Snake function which uses separate parameters for the magnitude of the periodic components
+    Shape:
+        - Input: (B, C, T)
+        - Output: (B, C, T), same shape as the input
+    Parameters:
+        - alpha - trainable parameter that controls frequency
+        - beta - trainable parameter that controls magnitude
+    References:
+        - This activation function is a modified version based on this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
+        https://arxiv.org/abs/2006.08195
+    Examples:
+        >>> a1 = SnakeBeta(256)
+        >>> x = torch.randn(1, 256, 100)  # (B, C, T)
+        >>> x = a1(x)
+    '''
+    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
+        '''
+        Initialization.
+        INPUT:
+            - in_features: shape of the input
+            - alpha - trainable parameter that controls frequency
+            - beta - trainable parameter that controls magnitude
+            alpha is initialized to 1 by default, higher values = higher-frequency.
+            beta is initialized to 1 by default, higher values = higher-magnitude.
+            alpha will be trained along with the rest of your model.
+        '''
+        super(SnakeBeta, self).__init__()
+        self.in_features = in_features
+        # initialize alpha
+        self.alpha_logscale = alpha_logscale
+        if self.alpha_logscale:  # log scale alphas initialized to zeros
+            self.alpha = Parameter(torch.zeros(in_features) * alpha)
+            self.beta = Parameter(torch.zeros(in_features) * alpha)
+        else:  # linear scale alphas initialized to ones
+            self.alpha = Parameter(torch.ones(in_features) * alpha)
+            self.beta = Parameter(torch.ones(in_features) * alpha)
+        self.alpha.requires_grad = alpha_trainable
+        self.beta.requires_grad = alpha_trainable
+        self.no_div_by_zero = 0.000000001
+    def forward(self, x):
+        '''
+        Forward pass of the function.
+        Applies the function to the input elementwise.
+        SnakeBeta ∶= x + 1/b * sin^2 (xa)
+        '''
+        alpha = self.alpha.unsqueeze(0).unsqueeze(-1)  # line up with x to [B, C, T]
+        beta = self.beta.unsqueeze(0).unsqueeze(-1)
+        if self.alpha_logscale:
+            alpha = torch.exp(alpha)
+            beta = torch.exp(beta)
+        x = x + (1.0 / (beta + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
+        return x
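
Note on the two activations: both add a squared-sine term to the identity. `Snake` divides by `alpha`, while `SnakeBeta` divides by a separate `beta`, so frequency and magnitude can be tuned independently. A scalar sketch of the two formulas, using only the standard library (the function names `snake` and `snake_beta` are assumptions for illustration; the classes above operate on `(B, C, T)` tensors with per-channel parameters):

```python
import math

EPS = 1e-9  # mirrors no_div_by_zero in the classes above

def snake(x, alpha=1.0):
    """Scalar Snake: x + sin^2(alpha * x) / alpha."""
    return x + (1.0 / (alpha + EPS)) * math.sin(x * alpha) ** 2

def snake_beta(x, alpha=1.0, beta=1.0):
    """Scalar SnakeBeta: alpha sets the frequency, beta the magnitude."""
    return x + (1.0 / (beta + EPS)) * math.sin(x * alpha) ** 2

y0 = snake(0.0)                                    # sine term vanishes at zero
y1 = snake_beta(math.pi / 2, alpha=1.0, beta=2.0)  # sin^2 = 1, scaled by 1/beta
```

With `beta` equal to `alpha`, `snake_beta` reduces to `snake`, which is why the fused wrapper can reuse `alpha` as `beta` for plain `Snake` modules.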

indextts/BigVGAN/alias_free_activation/.DS_Store ADDED

Binary file (6.15 kB).

indextts/BigVGAN/alias_free_activation/__init__.py ADDED

File without changes

indextts/BigVGAN/alias_free_activation/cuda/.gitignore ADDED

	@@ -0,0 +1 @@


+/build

indextts/BigVGAN/alias_free_activation/cuda/__init__.py ADDED

File without changes

indextts/BigVGAN/alias_free_activation/cuda/activation1d.py ADDED

	@@ -0,0 +1,76 @@

+# Copyright (c) 2024 NVIDIA CORPORATION.
+#   Licensed under the MIT license.
+import torch
+import torch.nn as nn
+# load fused CUDA kernel: this enables importing anti_alias_activation_cuda
+from indextts.BigVGAN.alias_free_activation.cuda import load
+from indextts.BigVGAN.alias_free_activation.torch.resample import DownSample1d, UpSample1d
+anti_alias_activation_cuda = load.load()
+class FusedAntiAliasActivation(torch.autograd.Function):
+    """
+    Assumes filter size 12, replication padding on upsampling/downsampling, and logscale alpha/beta parameters as inputs.
+    The hyperparameters are hard-coded in the kernel to maximize speed.
+    NOTE: The fused kernel is incorrect for Activation1d with different hyperparameters.
+    """
+    @staticmethod
+    def forward(ctx, inputs, up_ftr, down_ftr, alpha, beta):
+        activation_results = anti_alias_activation_cuda.forward(
+            inputs, up_ftr, down_ftr, alpha, beta
+        )
+        return activation_results
+    @staticmethod
+    def backward(ctx, output_grads):
+        # The fused kernel is forward-only (inference); no backward pass is implemented.
+        raise NotImplementedError
+class Activation1d(nn.Module):
+    def __init__(
+        self,
+        activation,
+        up_ratio: int = 2,
+        down_ratio: int = 2,
+        up_kernel_size: int = 12,
+        down_kernel_size: int = 12,
+        fused: bool = True,
+    ):
+        super().__init__()
+        self.up_ratio = up_ratio
+        self.down_ratio = down_ratio
+        self.act = activation
+        self.upsample = UpSample1d(up_ratio, up_kernel_size)
+        self.downsample = DownSample1d(down_ratio, down_kernel_size)
+        self.fused = fused  # Whether to use fused CUDA kernel or not
+    def forward(self, x):
+        if not self.fused:
+            x = self.upsample(x)
+            x = self.act(x)
+            x = self.downsample(x)
+            return x
+        else:
+            if self.act.__class__.__name__ == "Snake":
+                beta = self.act.alpha.data  # Snake uses same params for alpha and beta
+            else:
+                beta = (
+                    self.act.beta.data
+                )  # Snakebeta uses different params for alpha and beta
+            alpha = self.act.alpha.data
+            if (
+                not self.act.alpha_logscale
+            ):  # Exp baked into cuda kernel, cancel it out with a log
+                alpha = torch.log(alpha)
+                beta = torch.log(beta)
+            x = FusedAntiAliasActivation.apply(
+                x, self.upsample.filter, self.downsample.lowpass.filter, alpha, beta
+            )
+            return x
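
Note on the fused path: the CUDA kernel always applies `exp()` to the `alpha` and `beta` it receives, so `Activation1d.forward` passes `log(alpha)` and `log(beta)` when the activation stores them in linear scale. A small standard-library sketch of that cancellation, with hypothetical function names `kernel_params` and `kernel_effective`:

```python
import math

def kernel_params(alpha, beta, alpha_logscale):
    """Return the values handed to the kernel, which applies exp() itself."""
    if alpha_logscale:
        return alpha, beta  # already stored in log space
    return math.log(alpha), math.log(beta)  # pre-cancel the kernel's exp()

def kernel_effective(log_alpha, log_beta):
    """What the kernel ends up using after its built-in exp()."""
    return math.exp(log_alpha), math.exp(log_beta)

# Linear-scale parameters survive the round trip unchanged.
a, b = kernel_effective(*kernel_params(2.0, 0.5, alpha_logscale=False))
# Log-scale parameters initialized to zero behave like alpha = beta = 1.
a_log, b_log = kernel_effective(*kernel_params(0.0, 0.0, alpha_logscale=True))
```

One consequence worth keeping in mind: in linear scale the logarithm is only defined for strictly positive values, so a non-positive `alpha` or `beta` would not survive this conversion.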

indextts/BigVGAN/alias_free_activation/cuda/anti_alias_activation.cpp ADDED

	@@ -0,0 +1,23 @@

+/* coding=utf-8
+ * Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+ #include <torch/extension.h>
+extern "C" torch::Tensor fwd_cuda(torch::Tensor const &input, torch::Tensor const &up_filter, torch::Tensor const &down_filter, torch::Tensor const &alpha, torch::Tensor const &beta);
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
+    m.def("forward", &fwd_cuda, "Anti-Alias Activation forward (CUDA)");
+}

indextts/BigVGAN/alias_free_activation/cuda/anti_alias_activation_cuda.cu ADDED

	@@ -0,0 +1,256 @@

+/* coding=utf-8
+ * Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include <ATen/ATen.h>
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <cuda_fp16.h>
+#include <cuda_profiler_api.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <torch/extension.h>
+#include "type_shim.h"
+#include <assert.h>
+#include <cfloat>
+#include <limits>
+#include <stdint.h>
+#include <c10/macros/Macros.h>
+namespace
+{
+    // Hard-coded hyperparameters
+    // WARP_SIZE and WARP_BATCH must match the return values batches_per_warp and
+    constexpr int ELEMENTS_PER_LDG_STG = 1; //(WARP_ITERATIONS < 4) ? 1 : 4;
+    constexpr int BUFFER_SIZE = 32;
+    constexpr int FILTER_SIZE = 12;
+    constexpr int HALF_FILTER_SIZE = 6;
+    constexpr int UPSAMPLE_REPLICATION_PAD = 5; // 5 on each side, matching torch impl
+    constexpr int DOWNSAMPLE_REPLICATION_PAD_LEFT = 5; // matching torch impl
+    constexpr int DOWNSAMPLE_REPLICATION_PAD_RIGHT = 6; // matching torch impl
+    template <typename input_t, typename output_t, typename acc_t>
+    __global__ void anti_alias_activation_forward(
+        output_t *dst,
+        const input_t *src,
+        const acc_t *up_ftr,
+        const acc_t *down_ftr,
+        const acc_t *alpha,
+        const acc_t *beta,
+        int batch_size,
+        int channels,
+        int seq_len)
+    {
+        // Up and downsample filters
+        input_t up_filter[FILTER_SIZE];
+        input_t down_filter[FILTER_SIZE];
+        // Load data from global memory including extra indices reserved for replication paddings
+        input_t elements[2 * FILTER_SIZE + 2 * BUFFER_SIZE + 2 * UPSAMPLE_REPLICATION_PAD] = {0};
+        input_t intermediates[2 * FILTER_SIZE + 2 * BUFFER_SIZE + DOWNSAMPLE_REPLICATION_PAD_LEFT + DOWNSAMPLE_REPLICATION_PAD_RIGHT] = {0};
+        // Output stores downsampled output before writing to dst
+        output_t output[BUFFER_SIZE];
+        // blockDim/threadIdx = (128, 1, 1)
+        // gridDim/blockIdx = (seq_blocks, channels, batches)
+        int block_offset = (blockIdx.x * 128 * BUFFER_SIZE + seq_len * (blockIdx.y + gridDim.y * blockIdx.z));
+        int local_offset = threadIdx.x * BUFFER_SIZE;
+        int seq_offset = blockIdx.x * 128 * BUFFER_SIZE + local_offset;
+        // intermediates have double the seq_len
+        int intermediate_local_offset = threadIdx.x * BUFFER_SIZE * 2;
+        int intermediate_seq_offset = blockIdx.x * 128 * BUFFER_SIZE * 2 + intermediate_local_offset;
+        // Get values needed for replication padding before moving pointer
+        const input_t *right_most_pntr = src + (seq_len * (blockIdx.y + gridDim.y * blockIdx.z));
+        input_t seq_left_most_value = right_most_pntr[0];
+        input_t seq_right_most_value = right_most_pntr[seq_len - 1];
+        // Move src and dst pointers
+        src += block_offset + local_offset;
+        dst += block_offset + local_offset;
+        // Alpha and beta values for snake activations. Applies exp by default
+        alpha = alpha + blockIdx.y;
+        beta = beta + blockIdx.y;
+        acc_t alpha_val = expf(alpha[0]);
+        acc_t beta_val = expf(beta[0]);
+        #pragma unroll
+        for (int it = 0; it < FILTER_SIZE; it += 1)
+        {
+            up_filter[it] = up_ftr[it];
+            down_filter[it] = down_ftr[it];
+        }
+        // Apply replication padding for upsampling, matching torch impl
+        #pragma unroll
+        for (int it = -HALF_FILTER_SIZE; it < BUFFER_SIZE + HALF_FILTER_SIZE; it += 1)
+        {
+            int element_index = seq_offset + it; // index for element
+            if ((element_index < 0) && (element_index >= -UPSAMPLE_REPLICATION_PAD))
+            {
+                elements[2 * (HALF_FILTER_SIZE + it)] = 2 * seq_left_most_value;
+            }
+            if ((element_index >= seq_len) && (element_index < seq_len + UPSAMPLE_REPLICATION_PAD))
+            {
+                elements[2 * (HALF_FILTER_SIZE + it)] = 2 * seq_right_most_value;
+            }
+            if ((element_index >= 0) && (element_index < seq_len))
+            {
+                elements[2 * (HALF_FILTER_SIZE + it)] = 2 * src[it];
+            }
+        }
+        // Apply upsampling strided convolution and write to intermediates. It reserves DOWNSAMPLE_REPLICATION_PAD_LEFT for replication padding of the downsampling conv later
+        #pragma unroll
+        for (int it = 0; it < (2 * BUFFER_SIZE + 2 * FILTER_SIZE); it += 1)
+        {
+            acc_t acc = 0.0;
+            int element_index = intermediate_seq_offset + it; // index for intermediate
+            #pragma unroll
+            for (int f_idx = 0; f_idx < FILTER_SIZE; f_idx += 1)
+            {
+                if ((element_index + f_idx) >= 0)
+                {
+                    acc += up_filter[f_idx] * elements[it + f_idx];
+                }
+            }
+            intermediates[it + DOWNSAMPLE_REPLICATION_PAD_LEFT] = acc;
+        }
+        // Apply activation function. It reserves DOWNSAMPLE_REPLICATION_PAD_LEFT and DOWNSAMPLE_REPLICATION_PAD_RIGHT for replication padding of the downsampling conv later
+        double no_div_by_zero = 0.000000001;
+        #pragma unroll
+        for (int it = 0; it < 2 * BUFFER_SIZE + 2 * FILTER_SIZE; it += 1)
+        {
+            acc_t a = sinf(intermediates[it + DOWNSAMPLE_REPLICATION_PAD_LEFT] * alpha_val);
+            intermediates[it + DOWNSAMPLE_REPLICATION_PAD_LEFT] += (1.0 / (beta_val + no_div_by_zero)) * a * a;
+        }
+        // Apply replication padding before downsampling conv from intermediates
+        #pragma unroll
+        for (int it = 0; it < DOWNSAMPLE_REPLICATION_PAD_LEFT; it += 1)
+        {
+            intermediates[it] = intermediates[DOWNSAMPLE_REPLICATION_PAD_LEFT];
+        }
+        #pragma unroll
+        for (int it = DOWNSAMPLE_REPLICATION_PAD_LEFT + 2 * BUFFER_SIZE + 2 * FILTER_SIZE; it < DOWNSAMPLE_REPLICATION_PAD_LEFT + 2 * BUFFER_SIZE + 2 * FILTER_SIZE + DOWNSAMPLE_REPLICATION_PAD_RIGHT; it += 1)
+        {
+            intermediates[it] = intermediates[DOWNSAMPLE_REPLICATION_PAD_LEFT + 2 * BUFFER_SIZE + 2 * FILTER_SIZE - 1];
+        }
+        // Apply downsample strided convolution (assuming stride=2) from intermediates
+        #pragma unroll
+        for (int it = 0; it < BUFFER_SIZE; it += 1)
+        {
+            acc_t acc = 0.0;
+            #pragma unroll
+            for (int f_idx = 0; f_idx < FILTER_SIZE; f_idx += 1)
+            {
+                // Add constant DOWNSAMPLE_REPLICATION_PAD_RIGHT to match torch implementation
+                acc += down_filter[f_idx] * intermediates[it * 2 + f_idx + DOWNSAMPLE_REPLICATION_PAD_RIGHT];
+            }
+            output[it] = acc;
+        }
+        // Write output to dst
+        #pragma unroll
+        for (int it = 0;  it < BUFFER_SIZE;  it += ELEMENTS_PER_LDG_STG)
+        {
+            int element_index = seq_offset + it;
+            if (element_index < seq_len)
+            {
+                dst[it] = output[it];
+            }
+        }
+    }
+    template <typename input_t, typename output_t, typename acc_t>
+    void dispatch_anti_alias_activation_forward(
+        output_t *dst,
+        const input_t *src,
+        const acc_t *up_ftr,
+        const acc_t *down_ftr,
+        const acc_t *alpha,
+        const acc_t *beta,
+        int batch_size,
+        int channels,
+        int seq_len)
+    {
+        if (seq_len == 0)
+        {
+            return;
+        }
+        else
+        {
+            // Use 128 threads per block to maximize gpu utilization
+            constexpr int threads_per_block = 128;
+            constexpr int seq_len_per_block = 4096;
+            int blocks_per_seq_len = (seq_len + seq_len_per_block - 1) / seq_len_per_block;
+            dim3 blocks(blocks_per_seq_len, channels, batch_size);
+            dim3 threads(threads_per_block, 1, 1);
+            anti_alias_activation_forward<input_t, output_t, acc_t>
+                <<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(dst, src, up_ftr, down_ftr, alpha, beta, batch_size, channels, seq_len);
+        }
+    }
+}
+extern "C" torch::Tensor fwd_cuda(torch::Tensor const &input, torch::Tensor const &up_filter, torch::Tensor const &down_filter, torch::Tensor const &alpha, torch::Tensor const &beta)
+{
+    // Input is a 3d tensor with dimensions [batches, channels, seq_len]
+    const int batches = input.size(0);
+    const int channels = input.size(1);
+    const int seq_len = input.size(2);
+    // Output
+    auto act_options = input.options().requires_grad(false);
+    torch::Tensor anti_alias_activation_results =
+        torch::empty({batches, channels, seq_len}, act_options);
+    using float32 = float;
+    // The dtype of input is float16, bfloat16, or float32
+    // The dtype of up_filter, down_filter, alpha, and beta is float32
+    // printf("input scalar type: %d\n", input.scalar_type());
+    // printf("up_filter scalar type: %d\n", up_filter.scalar_type());
+    // printf("down_filter scalar type: %d\n", down_filter.scalar_type());
+    // printf("alpha scalar type: %d\n", alpha.scalar_type());
+    // printf("beta scalar type: %d\n", beta.scalar_type());
+    void *input_ptr = static_cast<void *>(input.data_ptr());
+    float32 *up_filter_ptr = static_cast<float32 *>(up_filter.data_ptr());
+    float32 *down_filter_ptr = static_cast<float32 *>(down_filter.data_ptr());
+    float32 *alpha_ptr = static_cast<float32 *>(alpha.data_ptr());
+    float32 *beta_ptr = static_cast<float32 *>(beta.data_ptr());
+    void *anti_alias_activation_results_ptr = static_cast<void *>(anti_alias_activation_results.data_ptr());
+    DISPATCH_FLOAT_HALF_AND_BFLOAT(
+        input.scalar_type(),
+        "dispatch anti alias activation_forward",
+        dispatch_anti_alias_activation_forward<scalar_t, scalar_t, float32>(
+            reinterpret_cast<scalar_t *>(anti_alias_activation_results_ptr),
+            reinterpret_cast<const scalar_t *>(input_ptr),
+            reinterpret_cast<const float32 *>(up_filter_ptr),
+            reinterpret_cast<const float32 *>(down_filter_ptr),
+            reinterpret_cast<const float32 *>(alpha_ptr),
+            reinterpret_cast<const float32 *>(beta_ptr),
+            batches,
+            channels,
+            seq_len););
+    return anti_alias_activation_results;
+}
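
Note on the launch geometry: each block runs 128 threads and each thread handles `BUFFER_SIZE = 32` output samples, which is where `seq_len_per_block = 4096` comes from. The grid is `(seq_blocks, channels, batches)`. A Python sketch of the same index arithmetic (the function names `launch_grid` and `thread_seq_offset` are assumptions for illustration, and the early return for `seq_len == 0` is not modeled):

```python
THREADS_PER_BLOCK = 128
BUFFER_SIZE = 32
SEQ_LEN_PER_BLOCK = THREADS_PER_BLOCK * BUFFER_SIZE  # samples covered by one block

def launch_grid(batch_size, channels, seq_len):
    """Grid dimensions in the order used by the dispatcher: (x, y, z)."""
    blocks_per_seq_len = (seq_len + SEQ_LEN_PER_BLOCK - 1) // SEQ_LEN_PER_BLOCK
    return (blocks_per_seq_len, channels, batch_size)

def thread_seq_offset(block_x, thread_x):
    """First sequence index handled by a thread, as in seq_offset in the kernel."""
    return block_x * THREADS_PER_BLOCK * BUFFER_SIZE + thread_x * BUFFER_SIZE

grid_exact = launch_grid(batch_size=2, channels=512, seq_len=4096)
grid_spill = launch_grid(batch_size=2, channels=512, seq_len=4097)
offset = thread_seq_offset(block_x=1, thread_x=3)
```

A sequence one sample longer than 4096 needs a second block along x; the `element_index < seq_len` check in the write-back loop keeps the surplus threads from writing past the end.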

indextts/BigVGAN/alias_free_activation/cuda/compat.h ADDED

	@@ -0,0 +1,29 @@

+/* coding=utf-8
+ * Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+/* This code is copied from NVIDIA apex:
+ *     https://github.com/NVIDIA/apex
+ *     with minor changes. */
+#ifndef TORCH_CHECK
+#define TORCH_CHECK AT_CHECK
+#endif
+#ifdef VERSION_GE_1_3
+#define DATA_PTR data_ptr
+#else
+#define DATA_PTR data
+#endif

indextts/BigVGAN/alias_free_activation/cuda/load.py ADDED

	@@ -0,0 +1,121 @@

+# Copyright (c) 2024 NVIDIA CORPORATION.
+#   Licensed under the MIT license.
+import os
+import pathlib
+import subprocess
+from torch.utils import cpp_extension
+"""
+Setting this param to a list has a problem of generating different compilation commands (with different order of architectures) and leading to recompilation of fused kernels.
+Set it to an empty string to avoid recompilation and assign arch flags explicitly in extra_cuda_cflags below.
+"""
+os.environ["TORCH_CUDA_ARCH_LIST"] = ""
+import re
+import shutil
+import tempfile
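+# Patch: when the sources path contains Chinese characters, the generated build.ninja is garbled and compilation fails.
+# Use a temporary directory to work around the ninja build failure (e.g. for Chinese paths).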
+def chinese_path_compile_support(sources, buildpath):
+    pattern = re.compile(r'[\u4e00-\u9fff]')
+    if not bool(pattern.search(str(sources[0].resolve()))):
+        return buildpath  # No Chinese characters in the path: skip the workaround
+    # Create build directory
+    resolves = [ item.name for item in sources]
+    ninja_compile_dir = os.path.join(tempfile.gettempdir(), "BigVGAN", "cuda")
+    os.makedirs(ninja_compile_dir, exist_ok=True)
+    new_buildpath = os.path.join(ninja_compile_dir, "build")
+    os.makedirs(new_buildpath, exist_ok=True)
+    print(f"ninja_buildpath: {new_buildpath}")
+    # Copy files to directory
+    sources.clear()
+    current_dir = os.path.dirname(__file__)
+    ALLOWED_EXTENSIONS = {'.py', '.cu', '.cpp', '.h'}
+    for filename in os.listdir(current_dir):
+        item = pathlib.Path(current_dir).joinpath(filename)
+        tar_path = pathlib.Path(ninja_compile_dir).joinpath(item.name)
+        if item.suffix.lower() not in ALLOWED_EXTENSIONS:
+            continue
+        shutil.copy2(item, tar_path)
+        if tar_path.name in resolves:
+            sources.append(tar_path)
+    return new_buildpath
+def load():
+    # Check if cuda 11 is installed for compute capability 8.0
+    cc_flag = []
+    _, bare_metal_major, _ = _get_cuda_bare_metal_version(cpp_extension.CUDA_HOME)
+    if int(bare_metal_major) >= 11:
+        cc_flag.append("-gencode")
+        cc_flag.append("arch=compute_80,code=sm_80")
+    # Build path
+    srcpath = pathlib.Path(__file__).parent.absolute()
+    buildpath = srcpath / "build"
+    _create_build_dir(buildpath)
+    # Helper function to build the kernels.
+    def _cpp_extention_load_helper(name, sources, extra_cuda_flags):
+        return cpp_extension.load(
+            name=name,
+            sources=sources,
+            build_directory=buildpath,
+            extra_cflags=[
+                "-O3",
+            ],
+            extra_cuda_cflags=[
+                "-O3",
+                "-gencode",
+                "arch=compute_70,code=sm_70",
+                "--use_fast_math",
+            ]
+            + extra_cuda_flags
+            + cc_flag,
+            verbose=True,
+        )
+    extra_cuda_flags = [
+        "-U__CUDA_NO_HALF_OPERATORS__",
+        "-U__CUDA_NO_HALF_CONVERSIONS__",
+        "--expt-relaxed-constexpr",
+        "--expt-extended-lambda",
+    ]
+    sources = [
+        srcpath / "anti_alias_activation.cpp",
+        srcpath / "anti_alias_activation_cuda.cu",
+    ]
+    # Compatibility workaround: support ninja builds from paths with special characters (e.g. Chinese paths)
+    buildpath = chinese_path_compile_support(sources, buildpath)
+    anti_alias_activation_cuda = _cpp_extention_load_helper(
+        "anti_alias_activation_cuda", sources, extra_cuda_flags
+    )
+    return anti_alias_activation_cuda
+def _get_cuda_bare_metal_version(cuda_dir):
+    raw_output = subprocess.check_output(
+        [cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True
+    )
+    output = raw_output.split()
+    release_idx = output.index("release") + 1
+    release = output[release_idx].split(".")
+    bare_metal_major = release[0]
+    bare_metal_minor = release[1][0]
+    return raw_output, bare_metal_major, bare_metal_minor
+def _create_build_dir(buildpath):
+    try:
+        os.mkdir(buildpath)
+    except OSError:
+        if not os.path.isdir(buildpath):
+            print(f"Creation of the build directory {buildpath} failed")
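
Note on `_get_cuda_bare_metal_version`: the function splits the `nvcc -V` output on whitespace, takes the token after `release`, and keeps only the first character of the minor part, which also drops the trailing comma. A sketch of the same parsing on a sample string follows. The sample line is an assumption about what `nvcc -V` prints and `parse_nvcc_release` is a hypothetical name; the real function shells out to `nvcc` under `CUDA_HOME`.

```python
def parse_nvcc_release(raw_output):
    """Extract (major, minor) as strings from nvcc -V style output."""
    tokens = raw_output.split()
    release = tokens[tokens.index("release") + 1].split(".")
    major = release[0]
    minor = release[1][0]  # first character only, so "8," becomes "8"
    return major, minor

# Hypothetical sample of the final line printed by `nvcc -V`.
sample = "Cuda compilation tools, release 11.8, V11.8.89"
major, minor = parse_nvcc_release(sample)

# load() adds the sm_80 gencode flags only when the major version is at least 11.
use_sm80 = int(major) >= 11
```

Because only one character of the minor version is kept, a two-digit minor would be truncated; `load()` only inspects the major version, so this does not affect the flag selection shown here.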

indextts/BigVGAN/alias_free_activation/cuda/type_shim.h ADDED

	@@ -0,0 +1,92 @@

+/* coding=utf-8
+ * Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include <ATen/ATen.h>
+#include "compat.h"
+#define DISPATCH_FLOAT_HALF_AND_BFLOAT(TYPE, NAME, ...)                 \
+	switch (TYPE)                                                       \
+	{                                                                   \
+	case at::ScalarType::Float:                                         \
+	{                                                                   \
+		using scalar_t = float;                                         \
+		__VA_ARGS__;                                                    \
+		break;                                                          \
+	}                                                                   \
+	case at::ScalarType::Half:                                          \
+	{                                                                   \
+		using scalar_t = at::Half;                                      \
+		__VA_ARGS__;                                                    \
+		break;                                                          \
+	}                                                                   \
+	case at::ScalarType::BFloat16:                                      \
+	{                                                                   \
+		using scalar_t = at::BFloat16;                                  \
+		__VA_ARGS__;                                                    \
+		break;                                                          \
+	}                                                                   \
+	default:                                                            \
+		AT_ERROR(#NAME, " not implemented for '", toString(TYPE), "'"); \
+	}
+#define DISPATCH_FLOAT_HALF_AND_BFLOAT_INOUT_TYPES(TYPEIN, TYPEOUT, NAME, ...) \
+	switch (TYPEIN)                                                            \
+	{                                                                          \
+	case at::ScalarType::Float:                                                \
+	{                                                                          \
+		using scalar_t_in = float;                                             \
+		switch (TYPEOUT)                                                       \
+		{                                                                      \
+		case at::ScalarType::Float:                                            \
+		{                                                                      \
+			using scalar_t_out = float;                                        \
+			__VA_ARGS__;                                                       \
+			break;                                                             \
+		}                                                                      \
+		case at::ScalarType::Half:                                             \
+		{                                                                      \
+			using scalar_t_out = at::Half;                                     \
+			__VA_ARGS__;                                                       \
+			break;                                                             \
+		}                                                                      \
+		case at::ScalarType::BFloat16:                                         \
+		{                                                                      \
+			using scalar_t_out = at::BFloat16;                                 \
+			__VA_ARGS__;                                                       \
+			break;                                                             \
+		}                                                                      \
+		default:                                                               \
+			AT_ERROR(#NAME, " not implemented for '", toString(TYPEOUT), "'"); \
+		}                                                                      \
+		break;                                                                 \
+	}                                                                          \
+	case at::ScalarType::Half:                                                 \
+	{                                                                          \
+		using scalar_t_in = at::Half;                                          \
+		using scalar_t_out = at::Half;                                         \
+		__VA_ARGS__;                                                           \
+		break;                                                                 \
+	}                                                                          \
+	case at::ScalarType::BFloat16:                                             \
+	{                                                                          \
+		using scalar_t_in = at::BFloat16;                                      \
+		using scalar_t_out = at::BFloat16;                                     \
+		__VA_ARGS__;                                                           \
+		break;                                                                 \
+	}                                                                          \
+	default:                                                                   \
+		AT_ERROR(#NAME, " not implemented for '", toString(TYPEIN), "'");      \
+	}
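The nested switch in `DISPATCH_FLOAT_HALF_AND_BFLOAT_INOUT_TYPES` is easy to misread, so the following Python sketch restates which input/output type pair the macro selects. It is an illustration only: the function name `resolve_inout_types` and the string labels standing in for `at::ScalarType` values are assumptions, not part of the header.

```python
SUPPORTED = ("float", "half", "bfloat16")


def resolve_inout_types(type_in: str, type_out: str, name: str = "op"):
    """Return the (scalar_t_in, scalar_t_out) pair the INOUT macro would pick."""
    if type_in == "float":
        # Float input: the output type is dispatched by the inner switch.
        if type_out not in SUPPORTED:
            raise TypeError(f"{name} not implemented for '{type_out}'")
        return "float", type_out
    if type_in in ("half", "bfloat16"):
        # Half / BFloat16 input: the output type is set equal to the input
        # type and TYPEOUT is never consulted.
        return type_in, type_in
    raise TypeError(f"{name} not implemented for '{type_in}'")


if __name__ == "__main__":
    print(resolve_inout_types("float", "half"))
    print(resolve_inout_types("half", "float"))
```

The second call shows the asymmetry: with a Half input, the requested output type is ignored and the body runs with `scalar_t_out` equal to `at::Half`.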

indextts/BigVGAN/alias_free_activation/torch/__init__.py ADDED

	@@ -0,0 +1,6 @@

+# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
+#   LICENSE is in incl_licenses directory.
+from .act import *
+from .filter import *
+from .resample import *

indextts/BigVGAN/alias_free_activation/torch/act.py ADDED

	@@ -0,0 +1,31 @@

+# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
+#   LICENSE is in incl_licenses directory.
+import torch.nn as nn
+from .resample import DownSample1d, UpSample1d
+class Activation1d(nn.Module):
+    def __init__(
+        self,
+        activation,
+        up_ratio: int = 2,
+        down_ratio: int = 2,
+        up_kernel_size: int = 12,
+        down_kernel_size: int = 12,
+    ):
+        super().__init__()
+        self.up_ratio = up_ratio
+        self.down_ratio = down_ratio
+        self.act = activation
+        self.upsample = UpSample1d(up_ratio, up_kernel_size)
+        self.downsample = DownSample1d(down_ratio, down_kernel_size)
+    # x: [B,C,T]
+    def forward(self, x):
+        x = self.upsample(x)
+        x = self.act(x)
+        x = self.downsample(x)
+        return x

indextts/BigVGAN/alias_free_activation/torch/filter.py ADDED

	@@ -0,0 +1,102 @@

+# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
+#   LICENSE is in incl_licenses directory.
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+if "sinc" in dir(torch):
+    sinc = torch.sinc
+else:
+    # This code is adapted from adefossez's julius.core.sinc under the MIT License
+    # https://adefossez.github.io/julius/julius/core.html
+    #   LICENSE is in incl_licenses directory.
+    def sinc(x: torch.Tensor):
+        """
+        Implementation of sinc, i.e. sin(pi * x) / (pi * x)
+        __Warning__: Different to julius.sinc, the input is multiplied by `pi`!
+        """
+        return torch.where(
+            x == 0,
+            torch.tensor(1.0, device=x.device, dtype=x.dtype),
+            torch.sin(math.pi * x) / math.pi / x,
+        )
+# This code is adapted from adefossez's julius.lowpass.LowPassFilters under the MIT License
+# https://adefossez.github.io/julius/julius/lowpass.html
+#   LICENSE is in incl_licenses directory.
+def kaiser_sinc_filter1d(
+    cutoff, half_width, kernel_size
+):  # return filter [1,1,kernel_size]
+    even = kernel_size % 2 == 0
+    half_size = kernel_size // 2
+    # For kaiser window
+    delta_f = 4 * half_width
+    A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
+    if A > 50.0:
+        beta = 0.1102 * (A - 8.7)
+    elif A >= 21.0:
+        beta = 0.5842 * (A - 21) ** 0.4 + 0.07886 * (A - 21.0)
+    else:
+        beta = 0.0
+    window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)
+    # ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
+    if even:
+        time = torch.arange(-half_size, half_size) + 0.5
+    else:
+        time = torch.arange(kernel_size) - half_size
+    if cutoff == 0:
+        filter_ = torch.zeros_like(time)
+    else:
+        filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
+        # Normalize the filter so its taps sum to 1; otherwise a small amount
+        # of the input's constant (DC) component leaks through.
+        filter_ /= filter_.sum()
+    # Reshape outside the branch so `filter` is also defined when cutoff == 0.
+    filter = filter_.view(1, 1, kernel_size)
+    return filter
+class LowPassFilter1d(nn.Module):
+    def __init__(
+        self,
+        cutoff=0.5,
+        half_width=0.6,
+        stride: int = 1,
+        padding: bool = True,
+        padding_mode: str = "replicate",
+        kernel_size: int = 12,
+    ):
+        """
+        kernel_size should be an even number for the StyleGAN3 setup; this implementation also accepts odd values.
+        """
+        super().__init__()
+        if cutoff < 0.0:
+            raise ValueError("Cutoff must be non-negative.")
+        if cutoff > 0.5:
+            raise ValueError("A cutoff above 0.5 does not make sense.")
+        self.kernel_size = kernel_size
+        self.even = kernel_size % 2 == 0
+        self.pad_left = kernel_size // 2 - int(self.even)
+        self.pad_right = kernel_size // 2
+        self.stride = stride
+        self.padding = padding
+        self.padding_mode = padding_mode
+        filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
+        self.register_buffer("filter", filter)
+    # Input [B, C, T]
+    def forward(self, x):
+        _, C, _ = x.shape
+        if self.padding:
+            x = F.pad(x, (self.pad_left, self.pad_right), mode=self.padding_mode)
+        out = F.conv1d(x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C)
+        return out
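The Kaiser window shape in `kaiser_sinc_filter1d` depends only on `half_width` and `kernel_size`. The sketch below restates that arithmetic, plus the tap time grid, in plain Python so the numbers can be checked without PyTorch. The helper names `kaiser_beta` and `tap_times` are assumptions for illustration; the constants are copied from the function above.

```python
import math


def kaiser_beta(half_width: float, kernel_size: int) -> float:
    """Recompute the Kaiser beta exactly as kaiser_sinc_filter1d does."""
    half_size = kernel_size // 2
    delta_f = 4 * half_width
    # Attenuation estimate `A` from the function above.
    attenuation = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
    if attenuation > 50.0:
        return 0.1102 * (attenuation - 8.7)
    if attenuation >= 21.0:
        return 0.5842 * (attenuation - 21.0) ** 0.4 + 0.07886 * (attenuation - 21.0)
    return 0.0


def tap_times(kernel_size: int) -> list:
    """Sample positions of the taps, centred on zero."""
    half_size = kernel_size // 2
    if kernel_size % 2 == 0:
        # Even kernels sit on half-integer positions, so no tap lands on t = 0.
        return [n + 0.5 for n in range(-half_size, half_size)]
    return [n - half_size for n in range(kernel_size)]


if __name__ == "__main__":
    # half_width = 0.6 / ratio with ratio = 2, kernel_size = 12
    print(round(kaiser_beta(0.3, 12), 4))
    print(tap_times(12))
```

With `half_width=0.3` and `kernel_size=12` the attenuation estimate lands just above 50, so the first branch is taken.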

indextts/BigVGAN/alias_free_activation/torch/resample.py ADDED

	@@ -0,0 +1,58 @@

+# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
+#   LICENSE is in incl_licenses directory.
+import torch.nn as nn
+from torch.nn import functional as F
+from .filter import LowPassFilter1d, kaiser_sinc_filter1d
+class UpSample1d(nn.Module):
+    def __init__(self, ratio=2, kernel_size=None):
+        super().__init__()
+        self.ratio = ratio
+        self.kernel_size = (
+            int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
+        )
+        self.stride = ratio
+        self.pad = self.kernel_size // ratio - 1
+        self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
+        self.pad_right = (
+            self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
+        )
+        filter = kaiser_sinc_filter1d(
+            cutoff=0.5 / ratio, half_width=0.6 / ratio, kernel_size=self.kernel_size
+        )
+        self.register_buffer("filter", filter)
+    # x: [B, C, T]
+    def forward(self, x):
+        _, C, _ = x.shape
+        x = F.pad(x, (self.pad, self.pad), mode="replicate")
+        x = self.ratio * F.conv_transpose1d(
+            x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C
+        )
+        x = x[..., self.pad_left : -self.pad_right]
+        return x
+class DownSample1d(nn.Module):
+    def __init__(self, ratio=2, kernel_size=None):
+        super().__init__()
+        self.ratio = ratio
+        self.kernel_size = (
+            int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
+        )
+        self.lowpass = LowPassFilter1d(
+            cutoff=0.5 / ratio,
+            half_width=0.6 / ratio,
+            stride=ratio,
+            kernel_size=self.kernel_size,
+        )
+    def forward(self, x):
+        xx = self.lowpass(x)
+        return xx
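The padding and cropping constants in `UpSample1d` and `DownSample1d` are chosen so that the time axis scales by exactly `ratio`. The sketch below reproduces only the length bookkeeping in plain Python, using the standard output-length formulas for an unpadded transposed convolution and an unpadded strided convolution. The function names are assumptions for illustration.

```python
def upsample_len(t: int, ratio: int = 2, kernel_size: int = 12) -> int:
    """Output length of UpSample1d.forward for an input of length t."""
    pad = kernel_size // ratio - 1
    pad_left = pad * ratio + (kernel_size - ratio) // 2
    pad_right = pad * ratio + (kernel_size - ratio + 1) // 2
    padded = t + 2 * pad  # replicate padding on both sides
    # Unpadded transposed conv: (L_in - 1) * stride + kernel_size
    full = (padded - 1) * ratio + kernel_size
    # The slice x[..., pad_left:-pad_right] trims both ends.
    return full - pad_left - pad_right


def downsample_len(t: int, ratio: int = 2, kernel_size: int = 12) -> int:
    """Output length of DownSample1d.forward for an input of length t."""
    even = kernel_size % 2 == 0
    pad_left = kernel_size // 2 - int(even)
    pad_right = kernel_size // 2
    padded = t + pad_left + pad_right
    # Unpadded strided conv: floor((L_in - kernel_size) / stride) + 1
    return (padded - kernel_size) // ratio + 1


if __name__ == "__main__":
    up = upsample_len(100)
    print(up, downsample_len(up))
```

Composed with the default `ratio=2` and `kernel_size=12`, the two steps return the original length, which is why `Activation1d` can wrap an activation without changing the shape of `[B, C, T]` inputs.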

indextts/BigVGAN/alias_free_torch/__init__.py ADDED

	@@ -0,0 +1,6 @@

+# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
+#   LICENSE is in incl_licenses directory.
+from .act import *
+from .filter import *
+from .resample import *

indextts/BigVGAN/alias_free_torch/act.py ADDED

	@@ -0,0 +1,29 @@

+# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
+#   LICENSE is in incl_licenses directory.
+import torch.nn as nn
+from .resample import DownSample1d, UpSample1d
+class Activation1d(nn.Module):
+    def __init__(self,
+                 activation,
+                 up_ratio: int = 2,
+                 down_ratio: int = 2,
+                 up_kernel_size: int = 12,
+                 down_kernel_size: int = 12):
+        super().__init__()
+        self.up_ratio = up_ratio
+        self.down_ratio = down_ratio
+        self.act = activation
+        self.upsample = UpSample1d(up_ratio, up_kernel_size)
+        self.downsample = DownSample1d(down_ratio, down_kernel_size)
+    # x: [B,C,T]
+    def forward(self, x):
+        x = self.upsample(x)
+        x = self.act(x)
+        x = self.downsample(x)
+        return x

indextts/BigVGAN/alias_free_torch/filter.py ADDED

	@@ -0,0 +1,96 @@

+# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
+#   LICENSE is in incl_licenses directory.
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+if 'sinc' in dir(torch):
+    sinc = torch.sinc
+else:
+    # This code is adapted from adefossez's julius.core.sinc under the MIT License
+    # https://adefossez.github.io/julius/julius/core.html
+    #   LICENSE is in incl_licenses directory.
+    def sinc(x: torch.Tensor):
+        """
+        Implementation of sinc, i.e. sin(pi * x) / (pi * x)
+        __Warning__: Different to julius.sinc, the input is multiplied by `pi`!
+        """
+        return torch.where(x == 0,
+                           torch.tensor(1., device=x.device, dtype=x.dtype),
+                           torch.sin(math.pi * x) / math.pi / x)
+# This code is adapted from adefossez's julius.lowpass.LowPassFilters under the MIT License
+# https://adefossez.github.io/julius/julius/lowpass.html
+#   LICENSE is in incl_licenses directory.
+def kaiser_sinc_filter1d(cutoff, half_width, kernel_size): # return filter [1,1,kernel_size]
+    even = (kernel_size % 2 == 0)
+    half_size = kernel_size // 2
+    #For kaiser window
+    delta_f = 4 * half_width
+    A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
+    if A > 50.:
+        beta = 0.1102 * (A - 8.7)
+    elif A >= 21.:
+        beta = 0.5842 * (A - 21)**0.4 + 0.07886 * (A - 21.)
+    else:
+        beta = 0.
+    window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)
+    # ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
+    if even:
+        time = (torch.arange(-half_size, half_size) + 0.5)
+    else:
+        time = torch.arange(kernel_size) - half_size
+    if cutoff == 0:
+        filter_ = torch.zeros_like(time)
+    else:
+        filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
+        # Normalize the filter so its taps sum to 1; otherwise a small amount
+        # of the input's constant (DC) component leaks through.
+        filter_ /= filter_.sum()
+    # Reshape outside the branch so `filter` is also defined when cutoff == 0.
+    filter = filter_.view(1, 1, kernel_size)
+    return filter
+class LowPassFilter1d(nn.Module):
+    def __init__(self,
+                 cutoff=0.5,
+                 half_width=0.6,
+                 stride: int = 1,
+                 padding: bool = True,
+                 padding_mode: str = 'replicate',
+                 kernel_size: int = 12):
+        # kernel_size should be an even number for the StyleGAN3 setup;
+        # this implementation also accepts odd values.
+        super().__init__()
+        if cutoff < 0.:
+            raise ValueError("Cutoff must be non-negative.")
+        if cutoff > 0.5:
+            raise ValueError("A cutoff above 0.5 does not make sense.")
+        self.kernel_size = kernel_size
+        self.even = (kernel_size % 2 == 0)
+        self.pad_left = kernel_size // 2 - int(self.even)
+        self.pad_right = kernel_size // 2
+        self.stride = stride
+        self.padding = padding
+        self.padding_mode = padding_mode
+        filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
+        self.register_buffer("filter", filter)
+    #input [B, C, T]
+    def forward(self, x):
+        _, C, _ = x.shape
+        if self.padding:
+            x = F.pad(x, (self.pad_left, self.pad_right),
+                      mode=self.padding_mode)
+        out = F.conv1d(x, self.filter.expand(C, -1, -1),
+                       stride=self.stride, groups=C)
+        return out

indextts/BigVGAN/alias_free_torch/resample.py ADDED

	@@ -0,0 +1,49 @@

+# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
+#   LICENSE is in incl_licenses directory.
+import torch.nn as nn
+from torch.nn import functional as F
+from .filter import LowPassFilter1d, kaiser_sinc_filter1d
+class UpSample1d(nn.Module):
+    def __init__(self, ratio=2, kernel_size=None):
+        super().__init__()
+        self.ratio = ratio
+        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
+        self.stride = ratio
+        self.pad = self.kernel_size // ratio - 1
+        self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
+        self.pad_right = self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
+        filter = kaiser_sinc_filter1d(cutoff=0.5 / ratio,
+                                      half_width=0.6 / ratio,
+                                      kernel_size=self.kernel_size)
+        self.register_buffer("filter", filter)
+    # x: [B, C, T]
+    def forward(self, x):
+        _, C, _ = x.shape
+        x = F.pad(x, (self.pad, self.pad), mode='replicate')
+        x = self.ratio * F.conv_transpose1d(
+            x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C)
+        x = x[..., self.pad_left:-self.pad_right]
+        return x
+class DownSample1d(nn.Module):
+    def __init__(self, ratio=2, kernel_size=None):
+        super().__init__()
+        self.ratio = ratio
+        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
+        self.lowpass = LowPassFilter1d(cutoff=0.5 / ratio,
+                                       half_width=0.6 / ratio,
+                                       stride=ratio,
+                                       kernel_size=self.kernel_size)
+    def forward(self, x):
+        xx = self.lowpass(x)
+        return xx

indextts/BigVGAN/bigvgan.py ADDED

	@@ -0,0 +1,534 @@

+# Copyright (c) 2024 NVIDIA CORPORATION.
+#   Licensed under the MIT license.
+# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
+#   LICENSE is in incl_licenses directory.
+import json
+import os
+from pathlib import Path
+from typing import Dict, Optional, Union
+import torch
+import torch.nn as nn
+from huggingface_hub import PyTorchModelHubMixin, hf_hub_download
+from torch.nn import Conv1d, ConvTranspose1d
+from torch.nn.utils import remove_weight_norm, weight_norm
+import indextts.BigVGAN.activations as activations
+from indextts.BigVGAN.alias_free_activation.torch.act import \
+    Activation1d as TorchActivation1d
+from indextts.BigVGAN.ECAPA_TDNN import ECAPA_TDNN
+from indextts.BigVGAN.env import AttrDict
+from indextts.BigVGAN.utils import get_padding, init_weights
+def load_hparams_from_json(path) -> AttrDict:
+    with open(path) as f:
+        data = f.read()
+    return AttrDict(json.loads(data))
+class AMPBlock1(torch.nn.Module):
+    """
+    AMPBlock applies Snake / SnakeBeta activation functions with trainable parameters that control periodicity, defined for each layer.
+    AMPBlock1 has an additional self.convs2 containing Conv1d layers with a fixed dilation=1, one following each layer in self.convs1.
+    Args:
+        h (AttrDict): Hyperparameters.
+        channels (int): Number of convolution channels.
+        kernel_size (int): Size of the convolution kernel. Default is 3.
+        dilation (tuple): Dilation rates for the convolutions. Each dilation layer has two convolutions. Default is (1, 3, 5).
+        activation (str): Activation function type. Should be either 'snake' or 'snakebeta'. Default is None.
+    """
+    def __init__(
+        self,
+        h: AttrDict,
+        channels: int,
+        kernel_size: int = 3,
+        dilation: tuple = (1, 3, 5),
+        activation: str = None,
+    ):
+        super().__init__()
+        self.h = h
+        self.convs1 = nn.ModuleList(
+            [
+                weight_norm(
+                    Conv1d(
+                        channels,
+                        channels,
+                        kernel_size,
+                        stride=1,
+                        dilation=d,
+                        padding=get_padding(kernel_size, d),
+                    )
+                )
+                for d in dilation
+            ]
+        )
+        self.convs1.apply(init_weights)
+        self.convs2 = nn.ModuleList(
+            [
+                weight_norm(
+                    Conv1d(
+                        channels,
+                        channels,
+                        kernel_size,
+                        stride=1,
+                        dilation=1,
+                        padding=get_padding(kernel_size, 1),
+                    )
+                )
+                for _ in range(len(dilation))
+            ]
+        )
+        self.convs2.apply(init_weights)
+        self.num_layers = len(self.convs1) + len(
+            self.convs2
+        )  # Total number of conv layers
+        # Select which Activation1d, lazy-load cuda version to ensure backward compatibility
+        if self.h.get("use_cuda_kernel", False):
+            from indextts.BigVGAN.alias_free_activation.cuda.activation1d import \
+                Activation1d as CudaActivation1d
+            Activation1d = CudaActivation1d
+        else:
+            Activation1d = TorchActivation1d
+        # Activation functions
+        if activation == "snake":
+            self.activations = nn.ModuleList(
+                [
+                    Activation1d(
+                        activation=activations.Snake(
+                            channels, alpha_logscale=h.snake_logscale
+                        )
+                    )
+                    for _ in range(self.num_layers)
+                ]
+            )
+        elif activation == "snakebeta":
+            self.activations = nn.ModuleList(
+                [
+                    Activation1d(
+                        activation=activations.SnakeBeta(
+                            channels, alpha_logscale=h.snake_logscale
+                        )
+                    )
+                    for _ in range(self.num_layers)
+                ]
+            )
+        else:
+            raise NotImplementedError(
+                "activation incorrectly specified. check the config file and look for 'activation'."
+            )
+    def forward(self, x):
+        acts1, acts2 = self.activations[::2], self.activations[1::2]
+        for c1, c2, a1, a2 in zip(self.convs1, self.convs2, acts1, acts2):
+            xt = a1(x)
+            xt = c1(xt)
+            xt = a2(xt)
+            xt = c2(xt)
+            x = xt + x
+        return x
+    def remove_weight_norm(self):
+        for l in self.convs1:
+            remove_weight_norm(l)
+        for l in self.convs2:
+            remove_weight_norm(l)
+class AMPBlock2(torch.nn.Module):
+    """
+    AMPBlock applies Snake / SnakeBeta activation functions with trainable parameters that control periodicity, defined for each layer.
+    Unlike AMPBlock1, AMPBlock2 does not contain extra Conv1d layers with fixed dilation=1
+    Args:
+        h (AttrDict): Hyperparameters.
+        channels (int): Number of convolution channels.
+        kernel_size (int): Size of the convolution kernel. Default is 3.
+        dilation (tuple): Dilation rates for the convolutions. Each dilation rate has one convolution. Default is (1, 3, 5).
+        activation (str): Activation function type. Should be either 'snake' or 'snakebeta'. Default is None.
+    """
+    def __init__(
+        self,
+        h: AttrDict,
+        channels: int,
+        kernel_size: int = 3,
+        dilation: tuple = (1, 3, 5),
+        activation: str = None,
+    ):
+        super().__init__()
+        self.h = h
+        self.convs = nn.ModuleList(
+            [
+                weight_norm(
+                    Conv1d(
+                        channels,
+                        channels,
+                        kernel_size,
+                        stride=1,
+                        dilation=d,
+                        padding=get_padding(kernel_size, d),
+                    )
+                )
+                for d in dilation
+            ]
+        )
+        self.convs.apply(init_weights)
+        self.num_layers = len(self.convs)  # Total number of conv layers
+        # Select which Activation1d, lazy-load cuda version to ensure backward compatibility
+        if self.h.get("use_cuda_kernel", False):
+            from indextts.BigVGAN.alias_free_activation.cuda.activation1d import \
+                Activation1d as CudaActivation1d
+            Activation1d = CudaActivation1d
+        else:
+            Activation1d = TorchActivation1d
+        # Activation functions
+        if activation == "snake":
+            self.activations = nn.ModuleList(
+                [
+                    Activation1d(
+                        activation=activations.Snake(
+                            channels, alpha_logscale=h.snake_logscale
+                        )
+                    )
+                    for _ in range(self.num_layers)
+                ]
+            )
+        elif activation == "snakebeta":
+            self.activations = nn.ModuleList(
+                [
+                    Activation1d(
+                        activation=activations.SnakeBeta(
+                            channels, alpha_logscale=h.snake_logscale
+                        )
+                    )
+                    for _ in range(self.num_layers)
+                ]
+            )
+        else:
+            raise NotImplementedError(
+                "activation incorrectly specified. check the config file and look for 'activation'."
+            )
+    def forward(self, x):
+        for c, a in zip(self.convs, self.activations):
+            xt = a(x)
+            xt = c(xt)
+            x = xt + x
+        return x
+    def remove_weight_norm(self):
+        for l in self.convs:
+            remove_weight_norm(l)
+'''
+    PyTorchModelHubMixin,
+    library_name="bigvgan",
+    repo_url="https://github.com/NVIDIA/BigVGAN",
+    docs_url="https://github.com/NVIDIA/BigVGAN/blob/main/README.md",
+    pipeline_tag="audio-to-audio",
+    license="mit",
+    tags=["neural-vocoder", "audio-generation", "arxiv:2206.04658"],
+'''
+class BigVGAN(
+    torch.nn.Module,
+):
+    """
+    BigVGAN is a neural vocoder model that applies anti-aliased periodic activation for residual blocks (resblocks).
+    New in BigVGAN-v2: it can optionally use optimized CUDA kernels for AMP (anti-aliased multi-periodicity) blocks.
+    Args:
+        h (AttrDict): Hyperparameters.
+        use_cuda_kernel (bool): If True, loads optimized CUDA kernels for AMP. Inference only; training with the CUDA kernels is not supported.
+    Note:
+        - Ensure that the activation function is correctly specified in the hyperparameters (h.activation).
+    """
+    def __init__(self, h: AttrDict, use_cuda_kernel: bool = False):
+        super().__init__()
+        self.h = h
+        self.h["use_cuda_kernel"] = use_cuda_kernel
+        # Select which Activation1d, lazy-load cuda version to ensure backward compatibility
+        if self.h.get("use_cuda_kernel", False):
+            from indextts.BigVGAN.alias_free_activation.cuda.activation1d import \
+                Activation1d as CudaActivation1d
+            Activation1d = CudaActivation1d
+        else:
+            Activation1d = TorchActivation1d
+        self.num_kernels = len(h.resblock_kernel_sizes)
+        self.num_upsamples = len(h.upsample_rates)
+        self.feat_upsample = h.feat_upsample
+        self.cond_in_each_up_layer = h.cond_d_vector_in_each_upsampling_layer
+        # Pre-conv
+        self.conv_pre = weight_norm(
+            Conv1d(h.gpt_dim, h.upsample_initial_channel, 7, 1, padding=3)
+        )
+        # Define which AMPBlock to use. BigVGAN uses AMPBlock1 as default
+        if h.resblock == "1":
+            resblock_class = AMPBlock1
+        elif h.resblock == "2":
+            resblock_class = AMPBlock2
+        else:
+            raise ValueError(
+                f"Incorrect resblock class specified in hyperparameters. Got {h.resblock}"
+            )
+        # Transposed-conv upsamplers; these do not apply anti-aliasing
+        self.ups = nn.ModuleList()
+        for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
+            self.ups.append(
+                nn.ModuleList(
+                    [
+                        weight_norm(
+                            ConvTranspose1d(
+                                h.upsample_initial_channel // (2**i),
+                                h.upsample_initial_channel // (2 ** (i + 1)),
+                                k,
+                                u,
+                                padding=(k - u) // 2,
+                            )
+                        )
+                    ]
+                )
+            )
+        # Residual blocks using anti-aliased multi-periodicity composition modules (AMP)
+        self.resblocks = nn.ModuleList()
+        for i in range(len(self.ups)):
+            ch = h.upsample_initial_channel // (2 ** (i + 1))
+            for j, (k, d) in enumerate(
+                zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)
+            ):
+                self.resblocks.append(
+                    resblock_class(h, ch, k, d, activation=h.activation)
+                )
+        # Post-conv
+        activation_post = (
+            activations.Snake(ch, alpha_logscale=h.snake_logscale)
+            if h.activation == "snake"
+            else (
+                activations.SnakeBeta(ch, alpha_logscale=h.snake_logscale)
+                if h.activation == "snakebeta"
+                else None
+            )
+        )
+        if activation_post is None:
+            raise NotImplementedError(
+                "activation incorrectly specified. check the config file and look for 'activation'."
+            )
+        self.activation_post = Activation1d(activation=activation_post)
+        # Whether to use bias for the final conv_post. Default to True for backward compatibility
+        self.use_bias_at_final = h.get("use_bias_at_final", True)
+        self.conv_post = weight_norm(
+            Conv1d(ch, 1, 7, 1, padding=3, bias=self.use_bias_at_final)
+        )
+        # Weight initialization
+        for i in range(len(self.ups)):
+            self.ups[i].apply(init_weights)
+        self.conv_post.apply(init_weights)
+        # Final tanh activation. Defaults to True for backward compatibility
+        self.use_tanh_at_final = h.get("use_tanh_at_final", True)
+        self.speaker_encoder = ECAPA_TDNN(h.num_mels, lin_neurons=h.speaker_embedding_dim)
+        self.cond_layer = nn.Conv1d(h.speaker_embedding_dim, h.upsample_initial_channel, 1)
+        if self.cond_in_each_up_layer:
+            self.conds = nn.ModuleList()
+            for i in range(len(self.ups)):
+                ch = h.upsample_initial_channel // (2 ** (i + 1))
+                self.conds.append(nn.Conv1d(h.speaker_embedding_dim, ch, 1))
+    def forward(self, x, mel_refer, lens=None):
+        # Speaker reference
+        speaker_embedding = self.speaker_encoder(mel_refer, lens)
+        n_batch = x.size(0)
+        contrastive_loss = None
+        if n_batch * 2 == speaker_embedding.size(0):
+            spe_emb_chunk1, spe_emb_chunk2 = speaker_embedding[:n_batch, :, :], speaker_embedding[n_batch:, :, :]
+            contrastive_loss = self.cal_clip_loss(spe_emb_chunk1.squeeze(1), spe_emb_chunk2.squeeze(1),
+                                                  self.logit_scale.exp())
+            speaker_embedding = speaker_embedding[:n_batch, :, :]
+        speaker_embedding = speaker_embedding.transpose(1, 2)
+        # upsample feat
+        if self.feat_upsample:
+            x = torch.nn.functional.interpolate(
+                x.transpose(1, 2),
+                scale_factor=[4],
+                mode="linear",
+            ).squeeze(1)
+        else:
+            x = x.transpose(1, 2)
+        # BigVGAN
+        # Pre-conv
+        x = self.conv_pre(x)
+        x = x + self.cond_layer(speaker_embedding)
+        for i in range(self.num_upsamples):
+            # Upsampling
+            for i_up in range(len(self.ups[i])):
+                x = self.ups[i][i_up](x)
+            if self.cond_in_each_up_layer:
+                x = x + self.conds[i](speaker_embedding)
+            # AMP blocks
+            xs = None
+            for j in range(self.num_kernels):
+                if xs is None:
+                    xs = self.resblocks[i * self.num_kernels + j](x)
+                else:
+                    xs += self.resblocks[i * self.num_kernels + j](x)
+            x = xs / self.num_kernels
+        # Post-conv
+        x = self.activation_post(x)
+        x = self.conv_post(x)
+        # Final tanh activation
+        if self.use_tanh_at_final:
+            x = torch.tanh(x)
+        else:
+            x = torch.clamp(x, min=-1.0, max=1.0)  # Bound the output to [-1, 1]
+        return x, contrastive_loss
+    def remove_weight_norm(self):
+        try:
+            print("Removing weight norm...")
+            for l in self.ups:
+                for l_i in l:
+                    remove_weight_norm(l_i)
+            for l in self.resblocks:
+                l.remove_weight_norm()
+            remove_weight_norm(self.conv_pre)
+            remove_weight_norm(self.conv_post)
+        except ValueError:
+            print("[INFO] Weight norm has already been removed from the model. Skipping!")
+    # Additional methods for huggingface_hub support
+    def _save_pretrained(self, save_directory: Path) -> None:
+        """Save weights and config.json from a PyTorch model to a local directory."""
+        model_path = save_directory / "bigvgan_generator.pt"
+        torch.save({"generator": self.state_dict()}, model_path)
+        config_path = save_directory / "config.json"
+        with open(config_path, "w") as config_file:
+            json.dump(self.h, config_file, indent=4)
+    @classmethod
+    def _from_pretrained(
+        cls,
+        *,
+        model_id: str,
+        revision: str,
+        cache_dir: str,
+        force_download: bool,
+        proxies: Optional[Dict],
+        resume_download: bool,
+        local_files_only: bool,
+        token: Union[str, bool, None],
+        map_location: str = "cpu",  # Additional argument
+        strict: bool = False,  # Additional argument
+        use_cuda_kernel: bool = False,
+        **model_kwargs,
+    ):
+        """Load PyTorch pretrained weights and return the loaded model."""
+        # Download and load hyperparameters (h) used by BigVGAN
+        if os.path.isdir(model_id):
+            print("Loading config.json from local directory")
+            config_file = os.path.join(model_id, "config.json")
+        else:
+            config_file = hf_hub_download(
+                repo_id=model_id,
+                filename="config.json",
+                revision=revision,
+                cache_dir=cache_dir,
+                force_download=force_download,
+                proxies=proxies,
+                resume_download=resume_download,
+                token=token,
+                local_files_only=local_files_only,
+            )
+        h = load_hparams_from_json(config_file)
+        # instantiate BigVGAN using h
+        if use_cuda_kernel:
+            print(
+                "[WARNING] You have specified use_cuda_kernel=True during BigVGAN.from_pretrained(). Only inference is supported (training is not implemented)!"
+            )
+            print(
+                "[WARNING] You need nvcc and ninja installed on your system, in a version that matches your PyTorch build, to build the kernel. Otherwise, the model will fail to initialize or generate an incorrect waveform!"
+            )
+            print(
+                "[WARNING] For details, see the official GitHub repository: https://github.com/NVIDIA/BigVGAN?tab=readme-ov-file#using-custom-cuda-kernel-for-synthesis"
+            )
+        model = cls(h, use_cuda_kernel=use_cuda_kernel)
+        # Download and load pretrained generator weight
+        if os.path.isdir(model_id):
+            print("Loading weights from local directory")
+            model_file = os.path.join(model_id, "bigvgan_generator.pt")
+        else:
+            print(f"Loading weights from {model_id}")
+            model_file = hf_hub_download(
+                repo_id=model_id,
+                filename="bigvgan_generator.pt",
+                revision=revision,
+                cache_dir=cache_dir,
+                force_download=force_download,
+                proxies=proxies,
+                resume_download=resume_download,
+                token=token,
+                local_files_only=local_files_only,
+            )
+        checkpoint_dict = torch.load(model_file, map_location=map_location)
+        try:
+            model.load_state_dict(checkpoint_dict["generator"])
+        except RuntimeError:
+            print(
+                "[INFO] The pretrained checkpoint does not contain weight norm. Loading the checkpoint after removing weight norm!"
+            )
+            model.remove_weight_norm()
+            model.load_state_dict(checkpoint_dict["generator"])
+        return model
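
The generator above halves the channel count at every upsampling stage (`h.upsample_initial_channel // (2 ** (i + 1))`) while each transposed convolution stretches the time axis. The sketch below works through that arithmetic in plain Python, without PyTorch. It assumes the transposed convolutions use `padding=(k - u) // 2`, `dilation=1` and `output_padding=0`; the function names and the rates, kernel sizes and channel count in the usage note are hypothetical values chosen for illustration, not taken from any config in this commit.

```python
def conv_transpose1d_out_len(length, kernel_size, stride, padding):
    """Output length of a 1-D transposed convolution (dilation=1, output_padding=0)."""
    return (length - 1) * stride - 2 * padding + kernel_size


def upsample_schedule(n_frames, initial_channels, rates, kernel_sizes):
    """Return (channels, length) after each upsampling stage.

    Hypothetical helper: mirrors the channel halving and the
    padding=(k - u) // 2 convention assumed in the lead-in.
    """
    stages = []
    length = n_frames
    for i, (u, k) in enumerate(zip(rates, kernel_sizes)):
        padding = (k - u) // 2
        length = conv_transpose1d_out_len(length, k, u, padding)
        channels = initial_channels // (2 ** (i + 1))
        stages.append((channels, length))
    return stages


# Illustrative values only: 10 input frames, 512 initial channels.
for channels, length in upsample_schedule(10, 512, [4, 4, 2, 2], [8, 8, 4, 4]):
    print(channels, length)
```

When `k - u` is even, each stage multiplies the length by exactly `u`, so the total upsampling factor is the product of the rates. An odd `k - u` would leave the output one sample longer per stage under these assumptions.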

indextts/BigVGAN/models.py ADDED

	@@ -0,0 +1,451 @@

+# Copyright (c) 2022 NVIDIA CORPORATION.
+#   Licensed under the MIT license.
+# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
+#   LICENSE is in incl_licenses directory.
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.nn import Conv1d, Conv2d, ConvTranspose1d
+from torch.nn.utils import remove_weight_norm, spectral_norm, weight_norm
+import indextts.BigVGAN.activations as activations
+from indextts.BigVGAN.ECAPA_TDNN import ECAPA_TDNN
+from indextts.BigVGAN.utils import get_padding, init_weights
+LRELU_SLOPE = 0.1
+class AMPBlock1(torch.nn.Module):
+    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5), activation=None):
+        super(AMPBlock1, self).__init__()
+        self.h = h
+        self.convs1 = nn.ModuleList([
+            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+                               padding=get_padding(kernel_size, dilation[0]))),
+            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+                               padding=get_padding(kernel_size, dilation[1]))),
+            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
+                               padding=get_padding(kernel_size, dilation[2])))
+        ])
+        self.convs1.apply(init_weights)
+        self.convs2 = nn.ModuleList([
+            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+                               padding=get_padding(kernel_size, 1))),
+            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+                               padding=get_padding(kernel_size, 1))),
+            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1,
+                               padding=get_padding(kernel_size, 1)))
+        ])
+        self.convs2.apply(init_weights)
+        self.num_layers = len(self.convs1) + len(self.convs2)  # total number of conv layers
+        if self.h.get("use_cuda_kernel", False):
+            from indextts.BigVGAN.alias_free_activation.cuda.activation1d import Activation1d
+        else:
+            from indextts.BigVGAN.alias_free_torch import Activation1d
+        if activation == 'snake':  # periodic nonlinearity with snake function and anti-aliasing
+            self.activations = nn.ModuleList([
+                Activation1d(
+                    activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
+                for _ in range(self.num_layers)
+            ])
+        elif activation == 'snakebeta':  # periodic nonlinearity with snakebeta function and anti-aliasing
+            self.activations = nn.ModuleList([
+                Activation1d(
+                    activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
+                for _ in range(self.num_layers)
+            ])
+        else:
+            raise NotImplementedError("activation incorrectly specified. check the config file and look for 'activation'.")
+    def forward(self, x):
+        acts1, acts2 = self.activations[::2], self.activations[1::2]
+        for c1, c2, a1, a2 in zip(self.convs1, self.convs2, acts1, acts2):
+            xt = a1(x)
+            xt = c1(xt)
+            xt = a2(xt)
+            xt = c2(xt)
+            x = xt + x
+        return x
+    def remove_weight_norm(self):
+        for l in self.convs1:
+            remove_weight_norm(l)
+        for l in self.convs2:
+            remove_weight_norm(l)
+class AMPBlock2(torch.nn.Module):
+    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3), activation=None):
+        super(AMPBlock2, self).__init__()
+        self.h = h
+        self.convs = nn.ModuleList([
+            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+                               padding=get_padding(kernel_size, dilation[0]))),
+            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+                               padding=get_padding(kernel_size, dilation[1])))
+        ])
+        self.convs.apply(init_weights)
+        self.num_layers = len(self.convs)  # total number of conv layers
+        if self.h.get("use_cuda_kernel", False):
+            from indextts.BigVGAN.alias_free_activation.cuda.activation1d import Activation1d
+        else:
+            from indextts.BigVGAN.alias_free_torch import Activation1d
+        if activation == 'snake':  # periodic nonlinearity with snake function and anti-aliasing
+            self.activations = nn.ModuleList([
+                Activation1d(
+                    activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
+                for _ in range(self.num_layers)
+            ])
+        elif activation == 'snakebeta':  # periodic nonlinearity with snakebeta function and anti-aliasing
+            self.activations = nn.ModuleList([
+                Activation1d(
+                    activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
+                for _ in range(self.num_layers)
+            ])
+        else:
+            raise NotImplementedError("activation incorrectly specified. check the config file and look for 'activation'.")
+    def forward(self, x):
+        for c, a in zip(self.convs, self.activations):
+            xt = a(x)
+            xt = c(xt)
+            x = xt + x
+        return x
+    def remove_weight_norm(self):
+        for l in self.convs:
+            remove_weight_norm(l)
+class BigVGAN(torch.nn.Module):
+    # this is our main BigVGAN model. Applies anti-aliased periodic activation for resblocks.
+    def __init__(self, h, use_cuda_kernel=False):
+        """
+        Args:
+            h (dict)
+            use_cuda_kernel (bool): whether to use custom cuda kernel for anti-aliased activation
+        """
+        super(BigVGAN, self).__init__()
+        self.h = h
+        self.h["use_cuda_kernel"] = use_cuda_kernel
+        self.num_kernels = len(h.resblock_kernel_sizes)
+        self.num_upsamples = len(h.upsample_rates)
+        self.feat_upsample = h.feat_upsample
+        self.cond_in_each_up_layer = h.cond_d_vector_in_each_upsampling_layer
+        # pre conv
+        self.conv_pre = weight_norm(Conv1d(h.gpt_dim, h.upsample_initial_channel, 7, 1, padding=3))
+        # define which AMPBlock to use. BigVGAN uses AMPBlock1 as default
+        resblock = AMPBlock1 if h.resblock == "1" else AMPBlock2
+        # transposed conv-based upsamplers. does not apply anti-aliasing
+        self.ups = nn.ModuleList()
+        for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
+            self.ups.append(nn.ModuleList([
+                weight_norm(ConvTranspose1d(h.upsample_initial_channel // (2 ** i),
+                                            h.upsample_initial_channel // (2 ** (i + 1)),
+                                            k, u, padding=(k - u) // 2))
+            ]))
+        # residual blocks using anti-aliased multi-periodicity composition modules (AMP)
+        self.resblocks = nn.ModuleList()
+        for i in range(len(self.ups)):
+            ch = h.upsample_initial_channel // (2 ** (i + 1))
+            for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
+                self.resblocks.append(resblock(self.h, ch, k, d, activation=h.activation))
+        if use_cuda_kernel:
+            from indextts.BigVGAN.alias_free_activation.cuda.activation1d import Activation1d
+        else:
+            from indextts.BigVGAN.alias_free_torch import Activation1d
+        # post conv
+        if h.activation == "snake":  # periodic nonlinearity with snake function and anti-aliasing
+            activation_post = activations.Snake(ch, alpha_logscale=h.snake_logscale)
+            self.activation_post = Activation1d(activation=activation_post)
+        elif h.activation == "snakebeta":  # periodic nonlinearity with snakebeta function and anti-aliasing
+            activation_post = activations.SnakeBeta(ch, alpha_logscale=h.snake_logscale)
+            self.activation_post = Activation1d(activation=activation_post)
+        else:
+            raise NotImplementedError("activation incorrectly specified. check the config file and look for 'activation'.")
+        self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))
+        # weight initialization
+        for i in range(len(self.ups)):
+            self.ups[i].apply(init_weights)
+        self.conv_post.apply(init_weights)
+        self.speaker_encoder = ECAPA_TDNN(h.num_mels, lin_neurons=h.speaker_embedding_dim)
+        self.cond_layer = nn.Conv1d(h.speaker_embedding_dim, h.upsample_initial_channel, 1)
+        if self.cond_in_each_up_layer:
+            self.conds = nn.ModuleList()
+            for i in range(len(self.ups)):
+                ch = h.upsample_initial_channel // (2 ** (i + 1))
+                self.conds.append(nn.Conv1d(h.speaker_embedding_dim, ch, 1))
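+        # NOTE: logit_scale is left disabled here, but forward() still reads self.logit_scale on the
+        # contrastive branch (mel_ref carrying 2 * batch references), so that branch would raise
+        # AttributeError. Re-enabling the line below also requires `import numpy as np`.
+        # self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))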
+    def forward(self, x, mel_ref, lens=None):
+        speaker_embedding = self.speaker_encoder(mel_ref, lens)
+        n_batch = x.size(0)
+        contrastive_loss = None
+        if n_batch * 2 == speaker_embedding.size(0):
+            spe_emb_chunk1, spe_emb_chunk2 = speaker_embedding[:n_batch, :, :], speaker_embedding[n_batch:, :, :]
+            contrastive_loss = self.cal_clip_loss(spe_emb_chunk1.squeeze(1), spe_emb_chunk2.squeeze(1), self.logit_scale.exp())
+            speaker_embedding = speaker_embedding[:n_batch, :, :]
+        speaker_embedding = speaker_embedding.transpose(1, 2)
+        # upsample feat
+        if self.feat_upsample:
+            x = torch.nn.functional.interpolate(
+                x.transpose(1, 2),
+                scale_factor=[4],
+                mode="linear",
+            ).squeeze(1)
+        else:
+            x = x.transpose(1, 2)
+        ### bigVGAN ###
+        # pre conv
+        x = self.conv_pre(x)
+        x = x + self.cond_layer(speaker_embedding)
+        for i in range(self.num_upsamples):
+            # upsampling
+            for i_up in range(len(self.ups[i])):
+                x = self.ups[i][i_up](x)
+            if self.cond_in_each_up_layer:
+                x = x + self.conds[i](speaker_embedding)
+            # AMP blocks
+            xs = None
+            for j in range(self.num_kernels):
+                if xs is None:
+                    xs = self.resblocks[i * self.num_kernels + j](x)
+                else:
+                    xs += self.resblocks[i * self.num_kernels + j](x)
+            x = xs / self.num_kernels
+        # post conv
+        x = self.activation_post(x)
+        x = self.conv_post(x)
+        x = torch.tanh(x)
+        return x, contrastive_loss
+    def remove_weight_norm(self):
+        print('Removing weight norm...')
+        for l in self.ups:
+            for l_i in l:
+                remove_weight_norm(l_i)
+        for l in self.resblocks:
+            l.remove_weight_norm()
+        remove_weight_norm(self.conv_pre)
+        remove_weight_norm(self.conv_post)
+    def cal_clip_loss(self, image_features, text_features, logit_scale):
+        device = image_features.device
+        logits_per_image, logits_per_text = self.get_logits(image_features, text_features, logit_scale)
+        labels = torch.arange(logits_per_image.shape[0], device=device, dtype=torch.long)
+        total_loss = (
+            F.cross_entropy(logits_per_image, labels) +
+            F.cross_entropy(logits_per_text, labels)
+        ) / 2
+        return total_loss
+    def get_logits(self, image_features, text_features, logit_scale):
+        logits_per_image = logit_scale * image_features @ text_features.T
+        logits_per_text = logit_scale * text_features @ image_features.T
+        return logits_per_image, logits_per_text
+class DiscriminatorP(torch.nn.Module):
+    def __init__(self, h, period, kernel_size=5, stride=3, use_spectral_norm=False):
+        super(DiscriminatorP, self).__init__()
+        self.period = period
+        self.d_mult = h.discriminator_channel_mult
+        norm_f = weight_norm if use_spectral_norm == False else spectral_norm
+        self.convs = nn.ModuleList([
+            norm_f(Conv2d(1, int(32 * self.d_mult), (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+            norm_f(Conv2d(int(32 * self.d_mult), int(128 * self.d_mult), (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+            norm_f(Conv2d(int(128 * self.d_mult), int(512 * self.d_mult), (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+            norm_f(Conv2d(int(512 * self.d_mult), int(1024 * self.d_mult), (kernel_size, 1), (stride, 1), padding=(get_padding(5, 1), 0))),
+            norm_f(Conv2d(int(1024 * self.d_mult), int(1024 * self.d_mult), (kernel_size, 1), 1, padding=(2, 0))),
+        ])
+        self.conv_post = norm_f(Conv2d(int(1024 * self.d_mult), 1, (3, 1), 1, padding=(1, 0)))
+    def forward(self, x):
+        fmap = []
+        # 1d to 2d
+        b, c, t = x.shape
+        if t % self.period != 0:  # pad first
+            n_pad = self.period - (t % self.period)
+            x = F.pad(x, (0, n_pad), "reflect")
+            t = t + n_pad
+        x = x.view(b, c, t // self.period, self.period)
+        for l in self.convs:
+            x = l(x)
+            x = F.leaky_relu(x, LRELU_SLOPE)
+            fmap.append(x)
+        x = self.conv_post(x)
+        fmap.append(x)
+        x = torch.flatten(x, 1, -1)
+        return x, fmap
+class MultiPeriodDiscriminator(torch.nn.Module):
+    def __init__(self, h):
+        super(MultiPeriodDiscriminator, self).__init__()
+        self.mpd_reshapes = h.mpd_reshapes
+        print("mpd_reshapes: {}".format(self.mpd_reshapes))
+        discriminators = [DiscriminatorP(h, rs, use_spectral_norm=h.use_spectral_norm) for rs in self.mpd_reshapes]
+        self.discriminators = nn.ModuleList(discriminators)
+    def forward(self, y, y_hat):
+        y_d_rs = []
+        y_d_gs = []
+        fmap_rs = []
+        fmap_gs = []
+        for i, d in enumerate(self.discriminators):
+            y_d_r, fmap_r = d(y)
+            y_d_g, fmap_g = d(y_hat)
+            y_d_rs.append(y_d_r)
+            fmap_rs.append(fmap_r)
+            y_d_gs.append(y_d_g)
+            fmap_gs.append(fmap_g)
+        return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+class DiscriminatorR(nn.Module):
+    def __init__(self, cfg, resolution):
+        super().__init__()
+        self.resolution = resolution
+        assert len(self.resolution) == 3, \
+            "MRD layer requires list with len=3, got {}".format(self.resolution)
+        self.lrelu_slope = LRELU_SLOPE
+        norm_f = weight_norm if cfg.use_spectral_norm == False else spectral_norm
+        if hasattr(cfg, "mrd_use_spectral_norm"):
+            print("INFO: overriding MRD use_spectral_norm as {}".format(cfg.mrd_use_spectral_norm))
+            norm_f = weight_norm if cfg.mrd_use_spectral_norm == False else spectral_norm
+        self.d_mult = cfg.discriminator_channel_mult
+        if hasattr(cfg, "mrd_channel_mult"):
+            print("INFO: overriding mrd channel multiplier as {}".format(cfg.mrd_channel_mult))
+            self.d_mult = cfg.mrd_channel_mult
+        self.convs = nn.ModuleList([
+            norm_f(nn.Conv2d(1, int(32 * self.d_mult), (3, 9), padding=(1, 4))),
+            norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 9), stride=(1, 2), padding=(1, 4))),
+            norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 9), stride=(1, 2), padding=(1, 4))),
+            norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 9), stride=(1, 2), padding=(1, 4))),
+            norm_f(nn.Conv2d(int(32 * self.d_mult), int(32 * self.d_mult), (3, 3), padding=(1, 1))),
+        ])
+        self.conv_post = norm_f(nn.Conv2d(int(32 * self.d_mult), 1, (3, 3), padding=(1, 1)))
+    def forward(self, x):
+        fmap = []
+        x = self.spectrogram(x)
+        x = x.unsqueeze(1)
+        for l in self.convs:
+            x = l(x)
+            x = F.leaky_relu(x, self.lrelu_slope)
+            fmap.append(x)
+        x = self.conv_post(x)
+        fmap.append(x)
+        x = torch.flatten(x, 1, -1)
+        return x, fmap
+    def spectrogram(self, x):
+        n_fft, hop_length, win_length = self.resolution
+        x = F.pad(x, (int((n_fft - hop_length) / 2), int((n_fft - hop_length) / 2)), mode='reflect')
+        x = x.squeeze(1)
+        x = torch.stft(x, n_fft=n_fft, hop_length=hop_length, win_length=win_length, center=False, return_complex=True)
+        x = torch.view_as_real(x)  # [B, F, TT, 2]
+        mag = torch.norm(x, p=2, dim=-1)  # [B, F, TT]
+        return mag
+class MultiResolutionDiscriminator(nn.Module):
+    def __init__(self, cfg, debug=False):
+        super().__init__()
+        self.resolutions = cfg.resolutions
+        assert len(self.resolutions) == 3, \
+            "MRD requires list of list with len=3, each element having a list with len=3. got {}".\
+            format(self.resolutions)
+        self.discriminators = nn.ModuleList(
+            [DiscriminatorR(cfg, resolution) for resolution in self.resolutions]
+        )
+    def forward(self, y, y_hat):
+        y_d_rs = []
+        y_d_gs = []
+        fmap_rs = []
+        fmap_gs = []
+        for i, d in enumerate(self.discriminators):
+            y_d_r, fmap_r = d(x=y)
+            y_d_g, fmap_g = d(x=y_hat)
+            y_d_rs.append(y_d_r)
+            fmap_rs.append(fmap_r)
+            y_d_gs.append(y_d_g)
+            fmap_gs.append(fmap_g)
+        return y_d_rs, y_d_gs, fmap_rs, fmap_gs
+def feature_loss(fmap_r, fmap_g):
+    loss = 0
+    for dr, dg in zip(fmap_r, fmap_g):
+        for rl, gl in zip(dr, dg):
+            loss += torch.mean(torch.abs(rl - gl))
+    return loss * 2
+def discriminator_loss(disc_real_outputs, disc_generated_outputs):
+    loss = 0
+    r_losses = []
+    g_losses = []
+    for dr, dg in zip(disc_real_outputs, disc_generated_outputs):
+        r_loss = torch.mean((1 - dr)**2)
+        g_loss = torch.mean(dg**2)
+        loss += (r_loss + g_loss)
+        r_losses.append(r_loss.item())
+        g_losses.append(g_loss.item())
+    return loss, r_losses, g_losses
+def generator_loss(disc_outputs):
+    loss = 0
+    gen_losses = []
+    for dg in disc_outputs:
+        l = torch.mean((1 - dg)**2)
+        gen_losses.append(l)
+        loss += l
+    return loss, gen_losses
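
The `discriminator_loss` and `generator_loss` functions above are least-squares GAN objectives: the discriminator is pushed toward 1 on real audio and 0 on generated audio, and the generator is pushed to make the discriminator output 1. The sketch below restates the same formulas on plain Python lists so the arithmetic can be checked by hand. The `lsgan_*` names and the list-based inputs are assumptions made for illustration; the real functions operate on tensors.

```python
def _mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def lsgan_discriminator_loss(real_outputs, generated_outputs):
    """Sum over sub-discriminators of mean((1 - D(y))^2) + mean(D(y_hat)^2)."""
    total = 0.0
    r_losses, g_losses = [], []
    for dr, dg in zip(real_outputs, generated_outputs):
        r_loss = _mean([(1 - v) ** 2 for v in dr])  # real scores should be 1
        g_loss = _mean([v ** 2 for v in dg])        # generated scores should be 0
        total += r_loss + g_loss
        r_losses.append(r_loss)
        g_losses.append(g_loss)
    return total, r_losses, g_losses


def lsgan_generator_loss(generated_outputs):
    """Sum over sub-discriminators of mean((1 - D(y_hat))^2)."""
    total = 0.0
    per_disc = []
    for dg in generated_outputs:
        loss = _mean([(1 - v) ** 2 for v in dg])    # generator wants scores of 1
        per_disc.append(loss)
        total += loss
    return total, per_disc


# A perfect discriminator (1 on real, 0 on generated) has zero loss,
# while the generator facing it pays the maximum per-sample penalty of 1.
print(lsgan_discriminator_loss([[1.0, 1.0]], [[0.0, 0.0]])[0])
print(lsgan_generator_loss([[0.0, 0.0]])[0])
```

Each inner list stands for the flattened output of one sub-discriminator, matching how the multi-period and multi-resolution discriminators return one score map per sub-discriminator.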

indextts/BigVGAN/nnet/CNN.py ADDED

	@@ -0,0 +1,546 @@

+"""Library implementing convolutional neural networks.
+Authors
+ * Mirco Ravanelli 2020
+ * Jianyuan Zhong 2020
+ * Cem Subakan 2021
+ * Davide Borra 2021
+ * Andreas Nautsch 2022
+ * Sarthak Yadav 2022
+"""
+import logging
+import math
+from typing import Tuple
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torchaudio
+class SincConv(nn.Module):
+    """This class implements SincConv (SincNet).
+    M. Ravanelli, Y. Bengio, "Speaker Recognition from raw waveform with
+    SincNet", in Proc. of SLT 2018 (https://arxiv.org/abs/1808.00158)
+    Arguments
+    ---------
+    out_channels : int
+        It is the number of output channels.
+    kernel_size: int
+        Kernel size of the convolutional filters.
+    input_shape : tuple
+        The shape of the input. Alternatively use ``in_channels``.
+    in_channels : int
+        The number of input channels. Alternatively use ``input_shape``.
+    stride : int
+        Stride factor of the convolutional filters. When the stride factor > 1,
+        a decimation in time is performed.
+    dilation : int
+        Dilation factor of the convolutional filters.
+    padding : str
+        (same, valid, causal). If "valid", no padding is performed.
+        If "same" and stride is 1, output shape is the same as the input shape.
+        "causal" results in causal (dilated) convolutions.
+    padding_mode : str
+        This flag specifies the type of padding. See torch.nn documentation
+        for more information.
+    sample_rate : int
+        Sampling rate of the input signals. It is only used for sinc_conv.
+    min_low_hz : float
+        Lowest possible frequency (in Hz) for a filter. It is only used for
+        sinc_conv.
+    min_band_hz : float
+        Lowest possible value (in Hz) for a filter bandwidth.
+    Example
+    -------
+    >>> inp_tensor = torch.rand([10, 16000])
+    >>> conv = SincConv(input_shape=inp_tensor.shape, out_channels=25, kernel_size=11)
+    >>> out_tensor = conv(inp_tensor)
+    >>> out_tensor.shape
+    torch.Size([10, 16000, 25])
+    """
+    def __init__(
+        self,
+        out_channels,
+        kernel_size,
+        input_shape=None,
+        in_channels=None,
+        stride=1,
+        dilation=1,
+        padding="same",
+        padding_mode="reflect",
+        sample_rate=16000,
+        min_low_hz=50,
+        min_band_hz=50,
+    ):
+        super().__init__()
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.kernel_size = kernel_size
+        self.stride = stride
+        self.dilation = dilation
+        self.padding = padding
+        self.padding_mode = padding_mode
+        self.sample_rate = sample_rate
+        self.min_low_hz = min_low_hz
+        self.min_band_hz = min_band_hz
+        # input shape inference
+        if input_shape is None and self.in_channels is None:
+            raise ValueError("Must provide one of input_shape or in_channels")
+        if self.in_channels is None:
+            self.in_channels = self._check_input_shape(input_shape)
+        if self.out_channels % self.in_channels != 0:
+            raise ValueError(
+                "Number of output channels must be divisible by in_channels"
+            )
+        # Initialize Sinc filters
+        self._init_sinc_conv()
+    def forward(self, x):
+        """Returns the output of the convolution.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, channel)
+            input to convolve. 2d or 3d tensors are expected.
+        Returns
+        -------
+        wx : torch.Tensor
+            The convolved outputs.
+        """
+        x = x.transpose(1, -1)
+        self.device = x.device
+        unsqueeze = x.ndim == 2
+        if unsqueeze:
+            x = x.unsqueeze(1)
+        if self.padding == "same":
+            x = self._manage_padding(
+                x, self.kernel_size, self.dilation, self.stride
+            )
+        elif self.padding == "causal":
+            num_pad = (self.kernel_size - 1) * self.dilation
+            x = F.pad(x, (num_pad, 0))
+        elif self.padding == "valid":
+            pass
+        else:
+            raise ValueError(
+                "Padding must be 'same', 'valid' or 'causal'. Got %s."
+                % (self.padding)
+            )
+        sinc_filters = self._get_sinc_filters()
+        wx = F.conv1d(
+            x,
+            sinc_filters,
+            stride=self.stride,
+            padding=0,
+            dilation=self.dilation,
+            groups=self.in_channels,
+        )
+        if unsqueeze:
+            wx = wx.squeeze(1)
+        wx = wx.transpose(1, -1)
+        return wx
+    def _check_input_shape(self, shape):
+        """Checks the input shape and returns the number of input channels."""
+        if len(shape) == 2:
+            in_channels = 1
+        elif len(shape) == 3:
+            in_channels = shape[-1]
+        else:
+            raise ValueError(
+                "sincconv expects 2d or 3d inputs. Got " + str(len(shape))
+            )
+        # Kernel size must be odd
+        if self.kernel_size % 2 == 0:
+            raise ValueError(
+                "The field kernel size must be an odd number. Got %s."
+                % (self.kernel_size)
+            )
+        return in_channels
+    def _get_sinc_filters(self):
+        """This function creates the sinc filters used for sinc-conv."""
+        # Computing the low frequencies of the filters
+        low = self.min_low_hz + torch.abs(self.low_hz_)
+        # Setting minimum band and minimum freq
+        high = torch.clamp(
+            low + self.min_band_hz + torch.abs(self.band_hz_),
+            self.min_low_hz,
+            self.sample_rate / 2,
+        )
+        band = (high - low)[:, 0]
+        # Passing from n_ to the corresponding f_times_t domain
+        self.n_ = self.n_.to(self.device)
+        self.window_ = self.window_.to(self.device)
+        f_times_t_low = torch.matmul(low, self.n_)
+        f_times_t_high = torch.matmul(high, self.n_)
+        # Left part of the filters.
+        band_pass_left = (
+            (torch.sin(f_times_t_high) - torch.sin(f_times_t_low))
+            / (self.n_ / 2)
+        ) * self.window_
+        # Central element of the filter
+        band_pass_center = 2 * band.view(-1, 1)
+        # Right part of the filter (sinc filters are symmetric)
+        band_pass_right = torch.flip(band_pass_left, dims=[1])
+        # Combining left, central, and right part of the filter
+        band_pass = torch.cat(
+            [band_pass_left, band_pass_center, band_pass_right], dim=1
+        )
+        # Amplitude normalization
+        band_pass = band_pass / (2 * band[:, None])
+        # Setting up the filter coefficients
+        filters = band_pass.view(self.out_channels, 1, self.kernel_size)
+        return filters
+    def _init_sinc_conv(self):
+        """Initializes the parameters of the sinc_conv layer."""
+        # Initialize filterbanks such that they are equally spaced in Mel scale
+        high_hz = self.sample_rate / 2 - (self.min_low_hz + self.min_band_hz)
+        mel = torch.linspace(
+            self._to_mel(self.min_low_hz),
+            self._to_mel(high_hz),
+            self.out_channels + 1,
+        )
+        hz = self._to_hz(mel)
+        # Filter lower frequency and bands
+        self.low_hz_ = hz[:-1].unsqueeze(1)
+        self.band_hz_ = (hz[1:] - hz[:-1]).unsqueeze(1)
+        # Making freq and bands learnable
+        self.low_hz_ = nn.Parameter(self.low_hz_)
+        self.band_hz_ = nn.Parameter(self.band_hz_)
+        # Hamming window
+        n_lin = torch.linspace(
+            0, (self.kernel_size / 2) - 1, steps=int((self.kernel_size / 2))
+        )
+        self.window_ = 0.54 - 0.46 * torch.cos(
+            2 * math.pi * n_lin / self.kernel_size
+        )
+        # Time axis  (only half is needed due to symmetry)
+        n = (self.kernel_size - 1) / 2.0
+        self.n_ = (
+            2 * math.pi * torch.arange(-n, 0).view(1, -1) / self.sample_rate
+        )
+    def _to_mel(self, hz):
+        """Converts frequency in Hz to the mel scale."""
+        return 2595 * np.log10(1 + hz / 700)
+    def _to_hz(self, mel):
+        """Converts frequency in the mel scale to Hz."""
+        return 700 * (10 ** (mel / 2595) - 1)
+    def _manage_padding(self, x, kernel_size: int, dilation: int, stride: int):
+        """This function pads the time axis (using ``self.padding_mode``)
+        such that its length is unchanged after the convolution.
+        Arguments
+        ---------
+        x : torch.Tensor
+            Input tensor.
+        kernel_size : int
+            Size of kernel.
+        dilation : int
+            Dilation used.
+        stride : int
+            Stride.
+        Returns
+        -------
+        x : torch.Tensor
+        """
+        # Detecting input shape
+        L_in = self.in_channels
+        # Time padding
+        padding = get_padding_elem(L_in, stride, kernel_size, dilation)
+        # Applying padding
+        x = F.pad(x, padding, mode=self.padding_mode)
+        return x
+class Conv1d(nn.Module):
+    """This class implements 1d convolution.
+    Arguments
+    ---------
+    out_channels : int
+        It is the number of output channels.
+    kernel_size : int
+        Kernel size of the convolutional filters.
+    input_shape : tuple
+        The shape of the input. Alternatively use ``in_channels``.
+    in_channels : int
+        The number of input channels. Alternatively use ``input_shape``.
+    stride : int
+        Stride factor of the convolutional filters. When the stride factor > 1,
+        a decimation in time is performed.
+    dilation : int
+        Dilation factor of the convolutional filters.
+    padding : str
+        (same, valid, causal). If "valid", no padding is performed.
+        If "same" and stride is 1, output shape is the same as the input shape.
+        "causal" results in causal (dilated) convolutions.
+    groups : int
+        Number of blocked connections from input channels to output channels.
+    bias : bool
+        Whether to add a bias term to convolution operation.
+    padding_mode : str
+        This flag specifies the type of padding. See torch.nn documentation
+        for more information.
+    skip_transpose : bool
+        If False, uses batch x time x channel convention of speechbrain.
+        If True, uses batch x channel x time convention.
+    weight_norm : bool
+        If True, use weight normalization,
+        to be removed with self.remove_weight_norm() at inference
+    conv_init : str
+        Weight initialization for the convolution network
+    default_padding: str or int
+        This sets the default padding mode that will be used by the pytorch Conv1d backend.
+    Example
+    -------
+    >>> inp_tensor = torch.rand([10, 40, 16])
+    >>> cnn_1d = Conv1d(
+    ...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=5
+    ... )
+    >>> out_tensor = cnn_1d(inp_tensor)
+    >>> out_tensor.shape
+    torch.Size([10, 40, 8])
+    """
+    def __init__(
+        self,
+        out_channels,
+        kernel_size,
+        input_shape=None,
+        in_channels=None,
+        stride=1,
+        dilation=1,
+        padding="same",
+        groups=1,
+        bias=True,
+        padding_mode="reflect",
+        skip_transpose=False,
+        weight_norm=False,
+        conv_init=None,
+        default_padding=0,
+    ):
+        super().__init__()
+        self.kernel_size = kernel_size
+        self.stride = stride
+        self.dilation = dilation
+        self.padding = padding
+        self.padding_mode = padding_mode
+        self.unsqueeze = False
+        self.skip_transpose = skip_transpose
+        if input_shape is None and in_channels is None:
+            raise ValueError("Must provide one of input_shape or in_channels")
+        if in_channels is None:
+            in_channels = self._check_input_shape(input_shape)
+        self.in_channels = in_channels
+        self.conv = nn.Conv1d(
+            in_channels,
+            out_channels,
+            self.kernel_size,
+            stride=self.stride,
+            dilation=self.dilation,
+            padding=default_padding,
+            groups=groups,
+            bias=bias,
+        )
+        if conv_init == "kaiming":
+            nn.init.kaiming_normal_(self.conv.weight)
+        elif conv_init == "zero":
+            nn.init.zeros_(self.conv.weight)
+        elif conv_init == "normal":
+            nn.init.normal_(self.conv.weight, std=1e-6)
+        if weight_norm:
+            self.conv = nn.utils.weight_norm(self.conv)
+    def forward(self, x):
+        """Returns the output of the convolution.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, channel)
+            input to convolve. 2d or 3d tensors are expected.
+        Returns
+        -------
+        wx : torch.Tensor
+            The convolved outputs.
+        """
+        if not self.skip_transpose:
+            x = x.transpose(1, -1)
+        if self.unsqueeze:
+            x = x.unsqueeze(1)
+        if self.padding == "same":
+            x = self._manage_padding(
+                x, self.kernel_size, self.dilation, self.stride
+            )
+        elif self.padding == "causal":
+            num_pad = (self.kernel_size - 1) * self.dilation
+            x = F.pad(x, (num_pad, 0))
+        elif self.padding == "valid":
+            pass
+        else:
+            raise ValueError(
+                "Padding must be 'same', 'valid' or 'causal'. Got "
+                + self.padding
+            )
+        wx = self.conv(x)
+        if self.unsqueeze:
+            wx = wx.squeeze(1)
+        if not self.skip_transpose:
+            wx = wx.transpose(1, -1)
+        return wx
+    def _manage_padding(self, x, kernel_size: int, dilation: int, stride: int):
+        """This function pads the time axis (using ``self.padding_mode``)
+        such that its length is unchanged after the convolution.
+        Arguments
+        ---------
+        x : torch.Tensor
+            Input tensor.
+        kernel_size : int
+            Size of kernel.
+        dilation : int
+            Dilation used.
+        stride : int
+            Stride.
+        Returns
+        -------
+        x : torch.Tensor
+            The padded outputs.
+        """
+        # Detecting input shape
+        L_in = self.in_channels
+        # Time padding
+        padding = get_padding_elem(L_in, stride, kernel_size, dilation)
+        # Applying padding
+        x = F.pad(x, padding, mode=self.padding_mode)
+        return x
+    def _check_input_shape(self, shape):
+        """Checks the input shape and returns the number of input channels."""
+        if len(shape) == 2:
+            self.unsqueeze = True
+            in_channels = 1
+        elif self.skip_transpose:
+            in_channels = shape[1]
+        elif len(shape) == 3:
+            in_channels = shape[2]
+        else:
+            raise ValueError(
+                "conv1d expects 2d, 3d inputs. Got " + str(len(shape))
+            )
+        # Kernel size must be odd
+        if not self.padding == "valid" and self.kernel_size % 2 == 0:
+            raise ValueError(
+                "The field kernel size must be an odd number. Got %s."
+                % (self.kernel_size)
+            )
+        return in_channels
+    def remove_weight_norm(self):
+        """Removes weight normalization at inference if used during training."""
+        self.conv = nn.utils.remove_weight_norm(self.conv)
+def get_padding_elem(L_in: int, stride: int, kernel_size: int, dilation: int):
+    """This function computes the number of elements to add for zero-padding.
+    Arguments
+    ---------
+    L_in : int
+    stride: int
+    kernel_size : int
+    dilation : int
+    Returns
+    -------
+    padding : list of int
+        The number of elements to pad on the left and on the right.
+    """
+    if stride > 1:
+        padding = [math.floor(kernel_size / 2), math.floor(kernel_size / 2)]
+    else:
+        L_out = (
+            math.floor((L_in - dilation * (kernel_size - 1) - 1) / stride) + 1
+        )
+        padding = [
+            math.floor((L_in - L_out) / 2),
+            math.floor((L_in - L_out) / 2),
+        ]
+    return padding
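
Reviewer note: a worked example of the `same` padding arithmetic in `get_padding_elem` above. This is a standalone sketch that copies the function body; the lengths, kernel sizes, and dilations are illustrative values, not taken from the repository. With stride 1, `L_in - L_out` reduces to `dilation * (kernel_size - 1)`, so the result does not depend on `L_in`. That appears to be why passing the channel count as `L_in` in `_manage_padding` still yields the expected padding.

```python
import math


def get_padding_elem(L_in, stride, kernel_size, dilation):
    """Same arithmetic as the function in the diff, reproduced for illustration."""
    if stride > 1:
        return [math.floor(kernel_size / 2), math.floor(kernel_size / 2)]
    L_out = math.floor((L_in - dilation * (kernel_size - 1) - 1) / stride) + 1
    return [math.floor((L_in - L_out) / 2), math.floor((L_in - L_out) / 2)]


# Stride 1: the padding depends only on kernel_size and dilation.
pad_a = get_padding_elem(L_in=40, stride=1, kernel_size=5, dilation=1)
pad_b = get_padding_elem(L_in=1000, stride=1, kernel_size=5, dilation=1)
# Doubling the dilation doubles the padding on each side.
pad_c = get_padding_elem(L_in=40, stride=1, kernel_size=5, dilation=2)
# Stride > 1: half the kernel size on each side, regardless of dilation.
pad_d = get_padding_elem(L_in=40, stride=2, kernel_size=5, dilation=1)

print(pad_a, pad_b, pad_c, pad_d)
```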

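Reviewer note: `_init_sinc_conv` above places the initial filter edges at equal steps on the mel scale, using the `_to_mel` / `_to_hz` formulas. The sketch below reproduces that spacing in plain Python. The helper name `mel_spaced_edges` is hypothetical, and the arguments (50 Hz minimums, 16 kHz sample rate, 4 channels) are assumed values for illustration only.

```python
import math


def to_mel(hz):
    """Hz to mel, same formula as _to_mel in the diff."""
    return 2595 * math.log10(1 + hz / 700)


def to_hz(mel):
    """Mel to Hz, same formula as _to_hz in the diff."""
    return 700 * (10 ** (mel / 2595) - 1)


def mel_spaced_edges(min_low_hz, min_band_hz, sample_rate, out_channels):
    """Return out_channels + 1 band edges in Hz, equally spaced in mel."""
    high_hz = sample_rate / 2 - (min_low_hz + min_band_hz)
    lo, hi = to_mel(min_low_hz), to_mel(high_hz)
    step = (hi - lo) / out_channels
    return [to_hz(lo + i * step) for i in range(out_channels + 1)]


edges = mel_spaced_edges(
    min_low_hz=50, min_band_hz=50, sample_rate=16000, out_channels=4
)
low_hz = edges[:-1]                                  # initial low cut-offs
band_hz = [b - a for a, b in zip(edges[:-1], edges[1:])]  # initial bandwidths
print(low_hz, band_hz)
```

Because the mel scale is logarithmic in frequency, the resulting bandwidths grow from the lowest filter to the highest.
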
indextts/BigVGAN/nnet/__init__.py ADDED

File without changes

indextts/BigVGAN/nnet/linear.py ADDED

	@@ -0,0 +1,89 @@

+"""Library implementing linear transformation.
+Authors
+ * Mirco Ravanelli 2020
+ * Davide Borra 2021
+"""
+import logging
+import torch
+import torch.nn as nn
+class Linear(torch.nn.Module):
+    """Computes a linear transformation y = wx + b.
+    Arguments
+    ---------
+    n_neurons : int
+        It is the number of output neurons (i.e., the dimensionality of the
+        output).
+    input_shape : tuple
+        It is the shape of the input tensor.
+    input_size : int
+        Size of the input tensor.
+    bias : bool
+        If True, the additive bias b is adopted.
+    max_norm : float
+        weight max-norm.
+    combine_dims : bool
+        If True and the input is 4D, combine 3rd and 4th dimensions of input.
+    Example
+    -------
+    >>> inputs = torch.rand(10, 50, 40)
+    >>> lin_t = Linear(input_shape=(10, 50, 40), n_neurons=100)
+    >>> output = lin_t(inputs)
+    >>> output.shape
+    torch.Size([10, 50, 100])
+    """
+    def __init__(
+        self,
+        n_neurons,
+        input_shape=None,
+        input_size=None,
+        bias=True,
+        max_norm=None,
+        combine_dims=False,
+    ):
+        super().__init__()
+        self.max_norm = max_norm
+        self.combine_dims = combine_dims
+        if input_shape is None and input_size is None:
+            raise ValueError("Expected one of input_shape or input_size")
+        if input_size is None:
+            input_size = input_shape[-1]
+            if len(input_shape) == 4 and self.combine_dims:
+                input_size = input_shape[2] * input_shape[3]
+        # Weights are initialized following pytorch approach
+        self.w = nn.Linear(input_size, n_neurons, bias=bias)
+    def forward(self, x):
+        """Returns the linear transformation of input tensor.
+        Arguments
+        ---------
+        x : torch.Tensor
+            Input to transform linearly.
+        Returns
+        -------
+        wx : torch.Tensor
+            The linearly transformed outputs.
+        """
+        if x.ndim == 4 and self.combine_dims:
+            x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3])
+        if self.max_norm is not None:
+            self.w.weight.data = torch.renorm(
+                self.w.weight.data, p=2, dim=0, maxnorm=self.max_norm
+            )
+        wx = self.w(x)
+        return wx
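
Reviewer note: the `max_norm` branch of `Linear.forward` calls `torch.renorm(..., p=2, dim=0, maxnorm=...)`, which constrains each row of the weight matrix (one row per output neuron) to an L2 norm of at most `max_norm`. Below is a plain-Python sketch of that constraint. The helper `renorm_rows` and the weights are made up for illustration, and the real PyTorch call may differ in small numerical details.

```python
import math


def renorm_rows(weight, maxnorm):
    """Rescale each row so its L2 norm is at most maxnorm (idealized sketch)."""
    out = []
    for row in weight:
        norm = math.sqrt(sum(v * v for v in row))
        # Rows already within the limit are left untouched.
        scale = maxnorm / norm if norm > maxnorm else 1.0
        out.append([v * scale for v in row])
    return out


weight = [[3.0, 4.0], [0.3, 0.4]]  # row norms: 5.0 and 0.5
clipped = renorm_rows(weight, maxnorm=1.0)
print(clipped)
```

Only the first row is rescaled; its direction is preserved and its norm becomes 1.0.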

indextts/BigVGAN/nnet/normalization.py ADDED

	@@ -0,0 +1,670 @@

+"""Library implementing normalization.
+Authors
+ * Mirco Ravanelli 2020
+ * Guillermo Cámbara 2021
+ * Sarthak Yadav 2022
+"""
+import torch
+import torch.nn as nn
+class BatchNorm1d(nn.Module):
+    """Applies 1d batch normalization to the input tensor.
+    Arguments
+    ---------
+    input_shape : tuple
+        The expected shape of the input. Alternatively, use ``input_size``.
+    input_size : int
+        The expected size of the input. Alternatively, use ``input_shape``.
+    eps : float
+        This value is added to std deviation estimation to improve the numerical
+        stability.
+    momentum : float
+        It is a value used for the running_mean and running_var computation.
+    affine : bool
+        When set to True, the affine parameters are learned.
+    track_running_stats : bool
+        When set to True, this module tracks the running mean and variance,
+        and when set to False, this module does not track such statistics.
+    combine_batch_time : bool
+        When True, it combines the batch and time axes.
+    skip_transpose : bool
+        Whether to skip the transposition.
+    Example
+    -------
+    >>> input = torch.randn(100, 10)
+    >>> norm = BatchNorm1d(input_shape=input.shape)
+    >>> output = norm(input)
+    >>> output.shape
+    torch.Size([100, 10])
+    """
+    def __init__(
+        self,
+        input_shape=None,
+        input_size=None,
+        eps=1e-05,
+        momentum=0.1,
+        affine=True,
+        track_running_stats=True,
+        combine_batch_time=False,
+        skip_transpose=False,
+    ):
+        super().__init__()
+        self.combine_batch_time = combine_batch_time
+        self.skip_transpose = skip_transpose
+        if input_size is None and skip_transpose:
+            input_size = input_shape[1]
+        elif input_size is None:
+            input_size = input_shape[-1]
+        self.norm = nn.BatchNorm1d(
+            input_size,
+            eps=eps,
+            momentum=momentum,
+            affine=affine,
+            track_running_stats=track_running_stats,
+        )
+    def forward(self, x):
+        """Returns the normalized input tensor.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, [channels])
+            input to normalize. 2d or 3d tensors are expected in input.
+            4d tensors can be used when combine_batch_time=True.
+        Returns
+        -------
+        x_n : torch.Tensor
+            The normalized outputs.
+        """
+        shape_or = x.shape
+        if self.combine_batch_time:
+            if x.ndim == 3:
+                x = x.reshape(shape_or[0] * shape_or[1], shape_or[2])
+            else:
+                x = x.reshape(
+                    shape_or[0] * shape_or[1], shape_or[3], shape_or[2]
+                )
+        elif not self.skip_transpose:
+            x = x.transpose(-1, 1)
+        x_n = self.norm(x)
+        if self.combine_batch_time:
+            x_n = x_n.reshape(shape_or)
+        elif not self.skip_transpose:
+            x_n = x_n.transpose(1, -1)
+        return x_n
+class BatchNorm2d(nn.Module):
+    """Applies 2d batch normalization to the input tensor.
+    Arguments
+    ---------
+    input_shape : tuple
+        The expected shape of the input. Alternatively, use ``input_size``.
+    input_size : int
+        The expected size of the input. Alternatively, use ``input_shape``.
+    eps : float
+        This value is added to std deviation estimation to improve the numerical
+        stability.
+    momentum : float
+        It is a value used for the running_mean and running_var computation.
+    affine : bool
+        When set to True, the affine parameters are learned.
+    track_running_stats : bool
+        When set to True, this module tracks the running mean and variance,
+        and when set to False, this module does not track such statistics.
+    Example
+    -------
+    >>> input = torch.randn(100, 10, 5, 20)
+    >>> norm = BatchNorm2d(input_shape=input.shape)
+    >>> output = norm(input)
+    >>> output.shape
+    torch.Size([100, 10, 5, 20])
+    """
+    def __init__(
+        self,
+        input_shape=None,
+        input_size=None,
+        eps=1e-05,
+        momentum=0.1,
+        affine=True,
+        track_running_stats=True,
+    ):
+        super().__init__()
+        if input_shape is None and input_size is None:
+            raise ValueError("Expected input_shape or input_size as input")
+        if input_size is None:
+            input_size = input_shape[-1]
+        self.norm = nn.BatchNorm2d(
+            input_size,
+            eps=eps,
+            momentum=momentum,
+            affine=affine,
+            track_running_stats=track_running_stats,
+        )
+    def forward(self, x):
+        """Returns the normalized input tensor.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, channel1, channel2)
+            input to normalize. 4d tensors are expected.
+        Returns
+        -------
+        x_n : torch.Tensor
+            The normalized outputs.
+        """
+        x = x.transpose(-1, 1)
+        x_n = self.norm(x)
+        x_n = x_n.transpose(1, -1)
+        return x_n
+class LayerNorm(nn.Module):
+    """Applies layer normalization to the input tensor.
+    Arguments
+    ---------
+    input_size : int
+        The expected size of the dimension to be normalized.
+    input_shape : tuple
+        The expected shape of the input.
+    eps : float
+        This value is added to std deviation estimation to improve the numerical
+        stability.
+    elementwise_affine : bool
+        If True, this module has learnable per-element affine parameters
+        initialized to ones (for weights) and zeros (for biases).
+    Example
+    -------
+    >>> input = torch.randn(100, 101, 128)
+    >>> norm = LayerNorm(input_shape=input.shape)
+    >>> output = norm(input)
+    >>> output.shape
+    torch.Size([100, 101, 128])
+    """
+    def __init__(
+        self,
+        input_size=None,
+        input_shape=None,
+        eps=1e-05,
+        elementwise_affine=True,
+    ):
+        super().__init__()
+        self.eps = eps
+        self.elementwise_affine = elementwise_affine
+        if input_shape is not None:
+            input_size = input_shape[2:]
+        self.norm = torch.nn.LayerNorm(
+            input_size,
+            eps=self.eps,
+            elementwise_affine=self.elementwise_affine,
+        )
+    def forward(self, x):
+        """Returns the normalized input tensor.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, channels)
+            input to normalize. 3d or 4d tensors are expected.
+        Returns
+        -------
+        The normalized outputs.
+        """
+        return self.norm(x)
+class InstanceNorm1d(nn.Module):
+    """Applies 1d instance normalization to the input tensor.
+    Arguments
+    ---------
+    input_shape : tuple
+        The expected shape of the input. Alternatively, use ``input_size``.
+    input_size : int
+        The expected size of the input. Alternatively, use ``input_shape``.
+    eps : float
+        This value is added to std deviation estimation to improve the numerical
+        stability.
+    momentum : float
+        It is a value used for the running_mean and running_var computation.
+    track_running_stats : bool
+        When set to True, this module tracks the running mean and variance,
+        and when set to False, this module does not track such statistics.
+    affine : bool
+        A boolean value that when set to True, this module has learnable
+        affine parameters, initialized the same way as done for
+        batch normalization. Default: False.
+    Example
+    -------
+    >>> input = torch.randn(100, 10, 20)
+    >>> norm = InstanceNorm1d(input_shape=input.shape)
+    >>> output = norm(input)
+    >>> output.shape
+    torch.Size([100, 10, 20])
+    """
+    def __init__(
+        self,
+        input_shape=None,
+        input_size=None,
+        eps=1e-05,
+        momentum=0.1,
+        track_running_stats=True,
+        affine=False,
+    ):
+        super().__init__()
+        if input_shape is None and input_size is None:
+            raise ValueError("Expected input_shape or input_size as input")
+        if input_size is None:
+            input_size = input_shape[-1]
+        self.norm = nn.InstanceNorm1d(
+            input_size,
+            eps=eps,
+            momentum=momentum,
+            track_running_stats=track_running_stats,
+            affine=affine,
+        )
+    def forward(self, x):
+        """Returns the normalized input tensor.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, channels)
+            input to normalize. 3d tensors are expected.
+        Returns
+        -------
+        x_n : torch.Tensor
+            The normalized outputs.
+        """
+        x = x.transpose(-1, 1)
+        x_n = self.norm(x)
+        x_n = x_n.transpose(1, -1)
+        return x_n
+class InstanceNorm2d(nn.Module):
+    """Applies 2d instance normalization to the input tensor.
+    Arguments
+    ---------
+    input_shape : tuple
+        The expected shape of the input. Alternatively, use ``input_size``.
+    input_size : int
+        The expected size of the input. Alternatively, use ``input_shape``.
+    eps : float
+        This value is added to std deviation estimation to improve the numerical
+        stability.
+    momentum : float
+        It is a value used for the running_mean and running_var computation.
+    track_running_stats : bool
+        When set to True, this module tracks the running mean and variance,
+        and when set to False, this module does not track such statistics.
+    affine : bool
+        A boolean value that when set to True, this module has learnable
+        affine parameters, initialized the same way as done for
+        batch normalization. Default: False.
+    Example
+    -------
+    >>> input = torch.randn(100, 10, 20, 2)
+    >>> norm = InstanceNorm2d(input_shape=input.shape)
+    >>> output = norm(input)
+    >>> output.shape
+    torch.Size([100, 10, 20, 2])
+    """
+    def __init__(
+        self,
+        input_shape=None,
+        input_size=None,
+        eps=1e-05,
+        momentum=0.1,
+        track_running_stats=True,
+        affine=False,
+    ):
+        super().__init__()
+        if input_shape is None and input_size is None:
+            raise ValueError("Expected input_shape or input_size as input")
+        if input_size is None:
+            input_size = input_shape[-1]
+        self.norm = nn.InstanceNorm2d(
+            input_size,
+            eps=eps,
+            momentum=momentum,
+            track_running_stats=track_running_stats,
+            affine=affine,
+        )
+    def forward(self, x):
+        """Returns the normalized input tensor.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, channel1, channel2)
+            input to normalize. 4d tensors are expected.
+        Returns
+        -------
+        x_n : torch.Tensor
+            The normalized outputs.
+        """
+        x = x.transpose(-1, 1)
+        x_n = self.norm(x)
+        x_n = x_n.transpose(1, -1)
+        return x_n
+class GroupNorm(nn.Module):
+    """Applies group normalization to the input tensor.
+    Arguments
+    ---------
+    input_shape : tuple
+        The expected shape of the input. Alternatively, use ``input_size``.
+    input_size : int
+        The expected size of the input. Alternatively, use ``input_shape``.
+    num_groups : int
+        Number of groups to separate the channels into.
+    eps : float
+        This value is added to std deviation estimation to improve the numerical
+        stability.
+    affine : bool
+        A boolean value that when set to True, this module has learnable per-channel
+        affine parameters initialized to ones (for weights) and zeros (for biases).
+    Example
+    -------
+    >>> input = torch.randn(100, 101, 128)
+    >>> norm = GroupNorm(input_size=128, num_groups=128)
+    >>> output = norm(input)
+    >>> output.shape
+    torch.Size([100, 101, 128])
+    """
+    def __init__(
+        self,
+        input_shape=None,
+        input_size=None,
+        num_groups=None,
+        eps=1e-05,
+        affine=True,
+    ):
+        super().__init__()
+        self.eps = eps
+        self.affine = affine
+        if input_shape is None and input_size is None:
+            raise ValueError("Expected input_shape or input_size as input")
+        if num_groups is None:
+            raise ValueError("Expected num_groups as input")
+        if input_shape is not None:
+            input_size = input_shape[-1]
+        self.norm = torch.nn.GroupNorm(
+            num_groups,
+            input_size,
+            eps=self.eps,
+            affine=self.affine,
+        )
+    def forward(self, x):
+        """Returns the normalized input tensor.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, channels)
+            input to normalize. 3d or 4d tensors are expected.
+        Returns
+        -------
+        x_n : torch.Tensor
+            The normalized outputs.
+        """
+        x = x.transpose(-1, 1)
+        x_n = self.norm(x)
+        x_n = x_n.transpose(1, -1)
+        return x_n
+class ExponentialMovingAverage(nn.Module):
+    """
+    Applies learnable exponential moving average, as required by learnable PCEN layer
+    Arguments
+    ---------
+    input_size : int
+        The expected size of the input.
+    coeff_init: float
+        Initial smoothing coefficient value
+    per_channel: bool
+        Controls whether the smoothing coefficients are learned
+        independently for each input channel
+    trainable: bool
+        whether to learn the smoothing coefficient or keep it fixed
+    skip_transpose : bool
+        If False, uses batch x time x channel convention of speechbrain.
+        If True, uses batch x channel x time convention.
+    Example
+    -------
+    >>> inp_tensor = torch.rand([10, 50, 40])
+    >>> pcen = ExponentialMovingAverage(40)
+    >>> out_tensor = pcen(inp_tensor)
+    >>> out_tensor.shape
+    torch.Size([10, 50, 40])
+    """
+    def __init__(
+        self,
+        input_size: int,
+        coeff_init: float = 0.04,
+        per_channel: bool = False,
+        trainable: bool = True,
+        skip_transpose: bool = False,
+    ):
+        super().__init__()
+        self._coeff_init = coeff_init
+        self._per_channel = per_channel
+        self.skip_transpose = skip_transpose
+        self.trainable = trainable
+        weights = (
+            torch.ones(
+                input_size,
+            )
+            if self._per_channel
+            else torch.ones(
+                1,
+            )
+        )
+        self._weights = nn.Parameter(
+            weights * self._coeff_init, requires_grad=trainable
+        )
+    def forward(self, x):
+        """Returns the normalized input tensor.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, channels)
+            input to normalize.
+        """
+        if not self.skip_transpose:
+            x = x.transpose(1, -1)
+        w = torch.clamp(self._weights, min=0.0, max=1.0)
+        initial_state = x[:, :, 0]
+        def scan(init_state, x, w):
+            """Loops and accumulates."""
+            x = x.permute(2, 0, 1)
+            acc = init_state
+            results = []
+            for ix in range(x.shape[0]):
+                acc = (w * x[ix]) + ((1.0 - w) * acc)
+                results.append(acc.unsqueeze(0))
+            results = torch.cat(results, dim=0)
+            results = results.permute(1, 2, 0)
+            return results
+        output = scan(initial_state, x, w)
+        if not self.skip_transpose:
+            output = output.transpose(1, -1)
+        return output
+class PCEN(nn.Module):
+    """
+    This class implements a learnable Per-channel energy normalization (PCEN) layer, supporting both
+    original PCEN as specified in [1] as well as sPCEN as specified in [2]
+    [1] Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous, "Trainable Frontend For
+    Robust and Far-Field Keyword Spotting", in Proc of ICASSP 2017 (https://arxiv.org/abs/1607.05666)
+    [2] Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, "LEAF: A LEARNABLE FRONTEND
+    FOR AUDIO CLASSIFICATION", in Proc of ICLR 2021 (https://arxiv.org/abs/2101.08596)
+    The default argument values correspond with those used by [2].
+    Arguments
+    ---------
+    input_size : int
+        The expected size of the input.
+    alpha: float
+        specifies alpha coefficient for PCEN
+    smooth_coef: float
+        specifies smooth coefficient for PCEN
+    delta: float
+        specifies delta coefficient for PCEN
+    root: float
+        specifies root coefficient for PCEN
+    floor: float
+        specifies floor coefficient for PCEN
+    trainable: bool
+        whether to learn the PCEN parameters or keep them fixed
+    per_channel_smooth_coef: bool
+        whether to learn independent smooth coefficients for every channel.
+        when True, essentially using sPCEN from [2]
+    skip_transpose : bool
+        If False, uses batch x time x channel convention of speechbrain.
+        If True, uses batch x channel x time convention.
+    Example
+    -------
+    >>> inp_tensor = torch.rand([10, 50, 40])
+    >>> pcen = PCEN(40, alpha=0.96)         # sPCEN
+    >>> out_tensor = pcen(inp_tensor)
+    >>> out_tensor.shape
+    torch.Size([10, 50, 40])
+    """
+    def __init__(
+        self,
+        input_size,
+        alpha: float = 0.96,
+        smooth_coef: float = 0.04,
+        delta: float = 2.0,
+        root: float = 2.0,
+        floor: float = 1e-12,
+        trainable: bool = True,
+        per_channel_smooth_coef: bool = True,
+        skip_transpose: bool = False,
+    ):
+        super().__init__()
+        self._smooth_coef = smooth_coef
+        self._floor = floor
+        self._per_channel_smooth_coef = per_channel_smooth_coef
+        self.skip_transpose = skip_transpose
+        self.alpha = nn.Parameter(
+            torch.ones(input_size) * alpha, requires_grad=trainable
+        )
+        self.delta = nn.Parameter(
+            torch.ones(input_size) * delta, requires_grad=trainable
+        )
+        self.root = nn.Parameter(
+            torch.ones(input_size) * root, requires_grad=trainable
+        )
+        self.ema = ExponentialMovingAverage(
+            input_size,
+            coeff_init=self._smooth_coef,
+            per_channel=self._per_channel_smooth_coef,
+            skip_transpose=True,
+            trainable=trainable,
+        )
+    def forward(self, x):
+        """Returns the normalized input tensor.
+        Arguments
+        ---------
+        x : torch.Tensor (batch, time, channels)
+            input to normalize.
+        Returns
+        -------
+        output : torch.Tensor
+            The normalized outputs.
+        """
+        if not self.skip_transpose:
+            x = x.transpose(1, -1)
+        alpha = torch.min(
+            self.alpha, torch.tensor(1.0, dtype=x.dtype, device=x.device)
+        )
+        root = torch.max(
+            self.root, torch.tensor(1.0, dtype=x.dtype, device=x.device)
+        )
+        ema_smoother = self.ema(x)
+        one_over_root = 1.0 / root
+        output = (
+            x / (self._floor + ema_smoother) ** alpha.view(1, -1, 1)
+            + self.delta.view(1, -1, 1)
+        ) ** one_over_root.view(1, -1, 1) - self.delta.view(
+            1, -1, 1
+        ) ** one_over_root.view(
+            1, -1, 1
+        )
+        if not self.skip_transpose:
+            output = output.transpose(1, -1)
+        return output
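
Reviewer note: a scalar, single-channel sketch of what `ExponentialMovingAverage` and `PCEN` above compute, written in plain Python. The smoother follows `M[t] = w * x[t] + (1 - w) * M[t-1]`, seeded with the first sample, and the output is `(x / (floor + M) ** alpha + delta) ** (1 / root) - delta ** (1 / root)`. The function names `ema` and `pcen` and the input `energies` are hypothetical; the default values mirror the constructor defaults shown in the diff.

```python
def ema(x, w):
    """M[t] = w * x[t] + (1 - w) * M[t-1], seeded with the first sample."""
    w = min(max(w, 0.0), 1.0)  # same clamp to [0, 1] as in the layer
    acc = x[0]
    out = []
    for v in x:
        acc = w * v + (1.0 - w) * acc
        out.append(acc)
    return out


def pcen(x, alpha=0.96, smooth_coef=0.04, delta=2.0, root=2.0, floor=1e-12):
    """Per-channel energy normalization for one channel of non-negative energies."""
    alpha = min(alpha, 1.0)  # the layer caps alpha at 1
    root = max(root, 1.0)    # and keeps root at 1 or above
    smoothed = ema(x, smooth_coef)
    return [
        (v / (floor + m) ** alpha + delta) ** (1.0 / root) - delta ** (1.0 / root)
        for v, m in zip(x, smoothed)
    ]


energies = [1.0, 1.0, 4.0, 1.0]  # a short burst at index 2
smoothed = ema(energies, 0.5)
normalized = pcen(energies)
print(smoothed, normalized)
```

The burst at index 2 stands out in `normalized` because the slow-moving smoother in the denominator has not yet caught up with it.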

indextts/BigVGAN/utils.py ADDED

	@@ -0,0 +1,101 @@

+# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
+#   LICENSE is in incl_licenses directory.
+import glob
+import os
+import matplotlib
+import matplotlib.pylab as plt
+import torch
+from scipy.io.wavfile import write
+from torch.nn.utils import weight_norm
+matplotlib.use("Agg")
+MAX_WAV_VALUE = 32768.0
+def plot_spectrogram(spectrogram):
+    fig, ax = plt.subplots(figsize=(10, 2))
+    im = ax.imshow(spectrogram, aspect="auto", origin="lower", interpolation="none")
+    plt.colorbar(im, ax=ax)
+    fig.canvas.draw()
+    plt.close()
+    return fig
+def plot_spectrogram_clipped(spectrogram, clip_max=2.0):
+    fig, ax = plt.subplots(figsize=(10, 2))
+    im = ax.imshow(
+        spectrogram,
+        aspect="auto",
+        origin="lower",
+        interpolation="none",
+        vmin=1e-6,
+        vmax=clip_max,
+    )
+    plt.colorbar(im, ax=ax)
+    fig.canvas.draw()
+    plt.close()
+    return fig
+def init_weights(m, mean=0.0, std=0.01):
+    classname = m.__class__.__name__
+    if classname.find("Conv") != -1:
+        m.weight.data.normal_(mean, std)
+def apply_weight_norm(m):
+    classname = m.__class__.__name__
+    if classname.find("Conv") != -1:
+        weight_norm(m)
+def get_padding(kernel_size, dilation=1):
+    return int((kernel_size * dilation - dilation) / 2)
+def load_checkpoint(filepath, device):
+    assert os.path.isfile(filepath)
+    print(f"Loading '{filepath}'")
+    checkpoint_dict = torch.load(filepath, map_location=device)
+    print("Complete.")
+    return checkpoint_dict
+def save_checkpoint(filepath, obj):
+    print(f"Saving checkpoint to {filepath}")
+    torch.save(obj, filepath)
+    print("Complete.")
+def scan_checkpoint(cp_dir, prefix, renamed_file=None):
+    # Try the original pattern-based scan first (prefix followed by 8 characters)
+    pattern = os.path.join(cp_dir, prefix + "????????")
+    cp_list = glob.glob(pattern)
+    if len(cp_list) > 0:
+        last_checkpoint_path = sorted(cp_list)[-1]
+        print(f"[INFO] Resuming from checkpoint: '{last_checkpoint_path}'")
+        return last_checkpoint_path
+    # If no pattern-based checkpoints are found, check for renamed file
+    if renamed_file:
+        renamed_path = os.path.join(cp_dir, renamed_file)
+        if os.path.isfile(renamed_path):
+            print(f"[INFO] Resuming from renamed checkpoint: '{renamed_file}'")
+            return renamed_path
+    return None
+def save_audio(audio, path, sr):
+    # audio: 1-D torch tensor; scaled by MAX_WAV_VALUE and cast to int16 without clipping
+    audio = audio * MAX_WAV_VALUE
+    audio = audio.cpu().numpy().astype("int16")
+    write(path, sr, audio)
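`save_audio` above multiplies by `MAX_WAV_VALUE` (32768.0) and casts straight to `int16`. A sample of exactly 1.0 therefore scales to 32768, one past the largest `int16` value (32767). The sketch below shows a clipping variant in pure Python. The helper name `to_int16_samples` is hypothetical and is not part of this commit.

```python
MAX_WAV_VALUE = 32768.0  # same constant as in utils.py
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_samples(samples):
    """Scale floats to the int16 range, clipping instead of overflowing."""
    out = []
    for s in samples:
        scaled = int(s * MAX_WAV_VALUE)  # truncates toward zero
        out.append(max(INT16_MIN, min(INT16_MAX, scaled)))
    return out


if __name__ == "__main__":
    print(to_int16_samples([0.0, 0.5, 1.0, -1.0, 1.5]))
```

In a tensor-based version, the same effect could be had by clamping before the cast. Whether the vocoder output ever reaches 1.0 in practice is not something this diff establishes.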

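`get_padding` in `utils.py` returns `dilation * (kernel_size - 1) / 2`, truncated to an integer. For odd kernel sizes this is the padding that keeps a stride-1 dilated convolution's output the same length as its input. The sketch below checks that with the standard 1-D convolution length formula. `conv_output_length` is a hypothetical helper added for illustration only.

```python
def get_padding(kernel_size, dilation=1):
    # Copied from utils.py in this commit.
    return int((kernel_size * dilation - dilation) / 2)


def conv_output_length(length, kernel_size, dilation=1, stride=1):
    """Output length of a 1-D convolution padded with get_padding()."""
    padding = get_padding(kernel_size, dilation)
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (length + 2 * padding - effective_kernel) // stride + 1


if __name__ == "__main__":
    for k, d in [(3, 1), (7, 3), (11, 5), (4, 1)]:
        print(k, d, get_padding(k, d), conv_output_length(100, k, d))
```

For an even kernel size the division truncates, so the output comes out one sample shorter than the input at dilation 1.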
indextts/__init__.py ADDED

File without changes

indextts/cli.py ADDED

	@@ -0,0 +1,65 @@

+import os
+import sys
+import warnings
+# Suppress warnings from tensorflow and other libraries
+warnings.filterwarnings("ignore", category=UserWarning)
+warnings.filterwarnings("ignore", category=FutureWarning)
+def main():
+    import argparse
+    parser = argparse.ArgumentParser(description="IndexTTS Command Line")
+    parser.add_argument("text", type=str, help="Text to be synthesized")
+    parser.add_argument("-v", "--voice", type=str, required=True, help="Path to the audio prompt file (wav format)")
+    parser.add_argument("-o", "--output_path", type=str, default="gen.wav", help="Path to the output wav file")
+    parser.add_argument("-c", "--config", type=str, default="checkpoints/config.yaml", help="Path to the config file. Default is 'checkpoints/config.yaml'")
+    parser.add_argument("--model_dir", type=str, default="checkpoints", help="Path to the model directory. Default is 'checkpoints'")
+    parser.add_argument("--fp16", action="store_true", default=False, help="Use FP16 for inference if available")
+    parser.add_argument("-f", "--force", action="store_true", default=False, help="Overwrite the output file if it already exists")
+    parser.add_argument("-d", "--device", type=str, default=None, help="Device to run the model on (cpu, cuda, mps, xpu)." )
+    args = parser.parse_args()
+    if len(args.text.strip()) == 0:
+        print("ERROR: Text is empty.")
+        parser.print_help()
+        sys.exit(1)
+    if not os.path.exists(args.voice):
+        print(f"Audio prompt file {args.voice} does not exist.")
+        parser.print_help()
+        sys.exit(1)
+    if not os.path.exists(args.config):
+        print(f"Config file {args.config} does not exist.")
+        parser.print_help()
+        sys.exit(1)
+    output_path = args.output_path
+    if os.path.exists(output_path):
+        if not args.force:
+            print(f"ERROR: Output file {output_path} already exists. Use --force to overwrite.")
+            parser.print_help()
+            sys.exit(1)
+        else:
+            os.remove(output_path)
+    try:
+        import torch
+    except ImportError:
+        print("ERROR: PyTorch is not installed. Please install it first.")
+        sys.exit(1)
+    if args.device is None:
+        if torch.cuda.is_available():
+            args.device = "cuda:0"
+        elif hasattr(torch, "xpu") and torch.xpu.is_available():
+            args.device = "xpu"
+        elif hasattr(torch, "mps") and torch.mps.is_available():
+            args.device = "mps"
+        else:
+            args.device = "cpu"
+            args.fp16 = False # Disable FP16 on CPU
+            print("WARNING: Running on CPU may be slow.")
+    # TODO: Add CLI support for IndexTTS2.
+    from indextts.infer import IndexTTS
+    tts = IndexTTS(cfg_path=args.config, model_dir=args.model_dir, use_fp16=args.fp16, device=args.device)
+    tts.infer(audio_prompt=args.voice, text=args.text.strip(), output_path=output_path)
+if __name__ == "__main__":
+    main()
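The device fallback in `main()` above tries CUDA, then XPU, then MPS, and finally CPU, disabling FP16 only when it falls through to CPU. The sketch below restates that order as a pure function so it can be checked without PyTorch installed. `resolve_device` and its boolean availability flags are hypothetical stand-ins for the `torch.*.is_available()` calls.

```python
def resolve_device(requested, fp16, cuda=False, xpu=False, mps=False):
    """Return (device, fp16) following the fallback order used in cli.py."""
    if requested is not None:
        # An explicit --device is used as-is; fp16 is left untouched.
        return requested, fp16
    if cuda:
        return "cuda:0", fp16
    if xpu:
        return "xpu", fp16
    if mps:
        return "mps", fp16
    # CPU fallback: the CLI disables FP16 here.
    return "cpu", False


if __name__ == "__main__":
    print(resolve_device(None, True, cuda=True))
    print(resolve_device(None, True))
    print(resolve_device("cpu", True))
```

Note that, as in the CLI, passing `--device cpu` together with `--fp16` leaves FP16 enabled, because the flag is only cleared on the automatic fallback path.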

indextts/gpt/__init__.py ADDED

File without changes

indextts/gpt/conformer/__init__.py ADDED

File without changes