Xinsheng-Wang committed on
Commit c7f3ffb · verified · 1 Parent(s): a81bc3b

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +6 -0
  2. .gitignore +38 -0
  3. DEPLOY.md +201 -0
  4. LICENSE +201 -0
  5. README.md +226 -9
  6. app.py +63 -0
  7. assets/performance_radar.png +3 -0
  8. assets/soul_wechat01.jpg +3 -0
  9. assets/soulx-logo.png +3 -0
  10. assets/technical-report.pdf +3 -0
  11. cli/inference.py +147 -0
  12. deploy_to_hf.sh +70 -0
  13. example/audio/en_prompt.json +16 -0
  14. example/audio/en_prompt.mp3 +0 -0
  15. example/audio/en_target.json +16 -0
  16. example/audio/en_target.mp3 +0 -0
  17. example/audio/music.json +16 -0
  18. example/audio/music.mp3 +3 -0
  19. example/audio/yue_target.json +16 -0
  20. example/audio/yue_target.mp3 +3 -0
  21. example/audio/zh_prompt.json +16 -0
  22. example/audio/zh_prompt.mp3 +0 -0
  23. example/audio/zh_target.json +16 -0
  24. example/audio/zh_target.mp3 +0 -0
  25. example/infer.sh +28 -0
  26. example/preprocess.sh +41 -0
  27. preprocess/README.md +155 -0
  28. preprocess/pipeline.py +146 -0
  29. preprocess/requirements.txt +33 -0
  30. preprocess/tools/__init__.py +53 -0
  31. preprocess/tools/f0_extraction.py +527 -0
  32. preprocess/tools/g2p.py +72 -0
  33. preprocess/tools/lyric_transcription.py +279 -0
  34. preprocess/tools/midi_parser.py +669 -0
  35. preprocess/tools/note_transcription/__init__.py +0 -0
  36. preprocess/tools/note_transcription/model.py +522 -0
  37. preprocess/tools/note_transcription/modules/__init__.py +1 -0
  38. preprocess/tools/note_transcription/modules/commons/__init__.py +1 -0
  39. preprocess/tools/note_transcription/modules/commons/conformer/__init__.py +1 -0
  40. preprocess/tools/note_transcription/modules/commons/conformer/conformer.py +96 -0
  41. preprocess/tools/note_transcription/modules/commons/conformer/espnet_positional_embedding.py +113 -0
  42. preprocess/tools/note_transcription/modules/commons/conformer/espnet_transformer_attn.py +198 -0
  43. preprocess/tools/note_transcription/modules/commons/conformer/layers.py +260 -0
  44. preprocess/tools/note_transcription/modules/commons/conv.py +175 -0
  45. preprocess/tools/note_transcription/modules/commons/layers.py +85 -0
  46. preprocess/tools/note_transcription/modules/commons/rel_transformer.py +378 -0
  47. preprocess/tools/note_transcription/modules/commons/rnn.py +261 -0
  48. preprocess/tools/note_transcription/modules/commons/transformer.py +751 -0
  49. preprocess/tools/note_transcription/modules/commons/wavenet.py +109 -0
  50. preprocess/tools/note_transcription/modules/pe/__init__.py +1 -0
.gitattributes CHANGED
@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/performance_radar.png filter=lfs diff=lfs merge=lfs -text
+ assets/soul_wechat01.jpg filter=lfs diff=lfs merge=lfs -text
+ assets/soulx-logo.png filter=lfs diff=lfs merge=lfs -text
+ assets/technical-report.pdf filter=lfs diff=lfs merge=lfs -text
+ example/audio/music.mp3 filter=lfs diff=lfs merge=lfs -text
+ example/audio/yue_target.mp3 filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,38 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+
+ dev/
+ results/
+ wandb/
+ .ipynb_checkpoints/
+ .vscode/
+ .cache
+ local/
+ outputs/
+
+ *.pt
+ *.ckpt
+
+ # Logs
+ logs/
+ *.log
+ results/
+ runs/
+ dev*
+ local/
+ generated/
+
+ .DS_Store
+ pretrained_models/
+
+ *.err
+ *.out
+
+ # Dev
+ dev/
+
+ # Data
+ data/
+ outputs/
+ deploy/
+ .gradio/
DEPLOY.md ADDED
@@ -0,0 +1,201 @@
+ # 🚀 Deploying to a Hugging Face Space
+
+ This guide walks you through deploying SoulX-Singer to a Hugging Face Space.
+
+ ## 📋 Prerequisites
+
+ 1. **Hugging Face account**: if you don't have one, register at [huggingface.co](https://huggingface.co/join)
+ 2. **Git**: make sure Git is installed
+ 3. **Hugging Face CLI** (optional but recommended): `pip install huggingface_hub`
+
+ ## 🎯 Deployment Steps
+
+ ### Method 1: Create via the web UI (recommended)
+
+ #### Step 1: Prepare the code repository
+
+ Make sure your code is ready:
+ - ✅ `app.py` - Space entry point
+ - ✅ `webui.py` - Gradio interface code
+ - ✅ `requirements.txt` - Python dependencies
+ - ✅ `README.md` - includes the YAML header with the Space configuration
+
+ #### Step 2: Create the Space
+
+ 1. Visit [huggingface.co/spaces](https://huggingface.co/spaces)
+ 2. Click the **"Create new Space"** button
+ 3. Fill in the Space details:
+    - **Space name**: e.g. `SoulX-Singer` or `soulx-singer-demo`
+    - **SDK**: choose **Gradio**
+    - **Hardware**: **GPU T4 small** is recommended (faster inference; models are cached after the first download)
+    - **Visibility**: Public or Private
+ 4. Click **"Create Space"**
+
+ #### Step 3: Upload the code
+
+ **Option A: Push with Git (recommended)**
+
+ ```bash
+ # 1. Initialize Git in your local code directory (if not done already)
+ git init
+ git add .
+ git commit -m "Initial commit for HF Space"
+
+ # 2. Add the Hugging Face remote
+ # Replace YOUR_USERNAME and YOUR_SPACE_NAME
+ git remote add origin https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
+
+ # 3. Push the code
+ git push -u origin main
+ ```
+
+ **Option B: Upload via the web UI**
+
+ 1. On the Space page, open the **"Files and versions"** tab
+ 2. Click **"Add file"** → **"Upload files"**
+ 3. Drag and drop or select the required files:
+    - `app.py`
+    - `webui.py`
+    - `requirements.txt`
+    - `README.md`
+    - the `soulxsinger/` directory (entire folder)
+    - the `preprocess/` directory (entire folder)
+    - the `cli/` directory (entire folder)
+    - the `example/` directory (entire folder)
+    - the `assets/` directory (entire folder)
+    - other configuration files (e.g. `LICENSE`, `.gitignore`)
+
+ #### Step 4: Wait for the build and the first run
+
+ 1. The Space detects the code automatically and starts building
+ 2. Monitor the build progress in the **"Logs"** tab
+ 3. The first run will:
+    - install the dependencies in `requirements.txt`
+    - execute `app.py`
+    - **automatically download** the `Soul-AILab/SoulX-Singer` and `Soul-AILab/SoulX-Singer-Preprocess` models (this can take 5-15 minutes, depending on network speed)
+ 4. Once the build finishes, the Space starts automatically and the interface appears in the **"App"** tab
+
+ ### Method 2: Using the Hugging Face CLI
+
+ ```bash
+ # 1. Install the Hugging Face Hub CLI
+ pip install huggingface_hub
+
+ # 2. Log in (opens a browser)
+ huggingface-cli login
+
+ # 3. Create the Space (replace YOUR_USERNAME and YOUR_SPACE_NAME)
+ huggingface-cli repo create YOUR_SPACE_NAME --type space --sdk gradio
+
+ # 4. Clone the Space repository
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
+ cd YOUR_SPACE_NAME
+
+ # 5. Copy the code files into the Space directory
+ # (copy everything from your current code directory)
+
+ # 6. Commit and push
+ git add .
+ git commit -m "Deploy SoulX-Singer to HF Space"
+ git push
+ ```
+
+ ## ⚙️ Space Configuration
+
+ The Space is configured in the YAML header of `README.md`:
+
+ ```yaml
+ ---
+ title: SoulX-Singer
+ emoji: 🎤
+ sdk: gradio
+ sdk_version: "6.3.0"
+ app_file: app.py
+ python_version: "3.10"
+ suggested_hardware: t4-small # uncomment this line in README.md to enable GPU
+ ---
+ ```
+
+ ### Hardware Recommendations
+
+ - **CPU Basic**: free, but inference is slow; fine for testing
+ - **GPU T4 Small**: recommended; fast inference, models cached after the first download
+ - **GPU T4 Medium/Large**: for high concurrency or heavier inference
+
+ ### Changing the Hardware
+
+ 1. Open the Space page
+ 2. Click the **"Settings"** tab
+ 3. Choose the desired hardware under **"Hardware"**
+ 4. After saving, the Space restarts
+
+ ## 🔍 Troubleshooting
+
+ ### Issue 1: Build fails
+
+ **Checklist:**
+ - ✅ all dependency versions in `requirements.txt` are compatible
+ - ✅ `app.py` exists and runs
+ - ✅ the YAML configuration in `README.md` is correct
+
+ **Check the logs:**
+ - detailed error messages appear in the **"Logs"** tab on the Space page
+
+ ### Issue 2: Model download fails
+
+ **Possible causes:**
+ - network connectivity problems
+ - Hugging Face Hub authentication problems
+
+ **Fixes:**
+ - make sure the Space has network access (it does by default)
+ - if you use private models, add an HF token in the Space Settings
+
+ ### Issue 3: App is unreachable after startup
+
+ **Checklist:**
+ - ✅ `server_name="0.0.0.0"` is set in `app.py`
+ - ✅ the port comes from the `PORT` environment variable (the Space injects it automatically)
+ - ✅ the **"Logs"** confirm the app actually started
+
+ ### Issue 4: Out of memory
+
+ **Fixes:**
+ - upgrade to larger hardware (T4 Medium/Large)
+ - or optimize the code to reduce memory usage
+
+ ## 📝 Important Notes
+
+ 1. **First-run time**: on the first deployment, the model download can take 5-15 minutes; please be patient
+ 2. **Model cache**: downloaded models are cached in the Space's storage, so restarts don't re-download them
+ 3. **Storage limits**: free Spaces have storage limits; make sure the model files fit within them
+ 4. **Automatic restart**: the Space restarts automatically after code updates
+ 5. **Logs**: when something goes wrong, check the **"Logs"** tab first
+
+ ## 🔗 Links
+
+ - [Hugging Face Spaces docs](https://huggingface.co/docs/hub/spaces)
+ - [Gradio docs](https://gradio.app/docs/)
+ - [SoulX-Singer model page](https://huggingface.co/Soul-AILab/SoulX-Singer)
+ - [SoulX-Singer-Preprocess model page](https://huggingface.co/Soul-AILab/SoulX-Singer-Preprocess)
+
+ ## ✅ Deployment Checklist
+
+ Before deploying:
+ - [ ] `app.py` exists and is correct
+ - [ ] `requirements.txt` lists all dependencies (including `huggingface_hub`)
+ - [ ] `README.md` contains the correct YAML configuration
+ - [ ] all required code files are uploaded
+ - [ ] `.gitignore` is configured correctly (excludes `pretrained_models/` and `outputs/`)
+ - [ ] the Space hardware is appropriate (GPU T4 Small recommended)
+
+ After deploying:
+ - [ ] the Space builds successfully (no errors in the logs)
+ - [ ] the models download automatically
+ - [ ] the web interface is reachable
+ - [ ] audio files can be uploaded for testing
+ - [ ] inference works
+
+ ---
+
+ **Happy deploying!** 🎉
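
For the checklist under "Issue 3: App is unreachable after startup" in DEPLOY.md above, the essential pattern is a Gradio launch bound to all interfaces on the port the Space injects. A minimal sketch, not the project's actual code: the one-line `demo` interface is a placeholder, while the real app builds its UI via `webui.render_interface()` (see `app.py` further down in this commit):

```python
import os

import gradio as gr

# Placeholder interface; the actual Space builds its UI in webui.render_interface().
demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")

demo.launch(
    server_name="0.0.0.0",  # bind to all interfaces so the Space proxy can reach the app
    server_port=int(os.environ.get("PORT", "7860")),  # HF Spaces injects PORT; 7860 is Gradio's default
)
```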
LICENSE ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md CHANGED
@@ -1,14 +1,231 @@
  ---
- title: SoulX Singer
- emoji: 👁
- colorFrom: purple
- colorTo: yellow
+ title: SoulX-Singer
+ emoji: 🎤
  sdk: gradio
- sdk_version: 6.5.1
+ sdk_version: "6.3.0"
  app_file: app.py
- pinned: false
- license: apache-2.0
- short_description: Zero-shot Singing Voice Synthesis
+ python_version: "3.10"
+ # GPU recommended for inference speed (optional: use CPU for light usage)
+ # suggested_hardware: t4-small
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ <div align="center">
+ <h1>🎤 SoulX-Singer</h1>
+ <p>
+ Official inference code for<br>
+ <b><em>SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis</em></b>
+ </p>
+ <p>
+ <img src="assets/soulx-logo.png" alt="SoulX-Logo" style="height:80px;">
+ </p>
+ <p>
+ <a href="https://soul-ailab.github.io/soulx-singer/"><img src="https://img.shields.io/badge/Demo-Page-lightgrey" alt="Demo Page"></a>
+ <a href="https://huggingface.co/Soul-AILab/SoulX-Singer"><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue' alt="HF-model"></a>
+ <a href="assets/technical-report.pdf"><img src="https://img.shields.io/badge/Report-Github-red" alt="Technical Report"></a>
+ <a href="https://github.com/Soul-AILab/SoulX-Singer"><img src="https://img.shields.io/badge/License-Apache%202.0-blue" alt="License"></a>
+ </p>
+ </div>
+
+ ---
+
+ ## 🎵 Overview
+
+ **SoulX-Singer** is a high-fidelity, zero-shot singing voice synthesis model that enables users to generate realistic singing voices for unseen singers.
+ It supports **melody-conditioned (F0 contour)** and **score-conditioned (MIDI notes)** control for precise pitch, rhythm, and expression.
+
+ ---
+
+ ## ✨ Key Features
+
+ - **🎤 Zero-Shot Singing** – Generate high-fidelity voices for unseen singers, no fine-tuning needed.
+ - **🎵 Flexible Control Modes** – Melody (F0) and Score (MIDI) conditioning.
+ - **📚 Large-Scale Dataset** – 42,000+ hours of aligned vocals, lyrics, and notes across Mandarin, English, and Cantonese.
+ - **🧑‍🎤 Timbre Cloning** – Preserve singer identity across languages, styles, and edited lyrics.
+ - **✏️ Singing Voice Editing** – Modify lyrics while keeping natural prosody.
+ - **🌐 Cross-Lingual Synthesis** – High-fidelity synthesis by disentangling timbre from content.
+
+ ---
+
+ <p align="center">
+ <img src="assets/performance_radar.png" width="80%" alt="Performance Radar"/>
+ </p>
+
+ ---
+
+ ## 🎬 Demo Examples
+
+ <div align="center">
+
+ <https://github.com/user-attachments/assets/13306f10-3a29-46ba-bcef-d6308d05cbcc>
+
+ </div>
+ <div align="center">
+
+ <https://github.com/user-attachments/assets/2eb260fe-6f0b-408c-aab8-5b81ddddb284>
+
+ </div>
+
+ ---
+
+ ## 📰 News
+
+ - **[2026-02-06]** SoulX-Singer inference code and models released.
+
+ ---
+
+ ## 🚀 Quick Start
+
+ **Note:** This repo does not ship pretrained weights. The SVS and preprocessing models must be downloaded from Hugging Face (see step 3).
+
+ ### 1. Clone Repository
+
+ ```bash
+ git clone https://github.com/Soul-AILab/SoulX-Singer.git
+ cd SoulX-Singer
+ ```
+
+ ### 2. Set Up Environment
+
+ **1. Install Conda** (if not already installed): https://docs.conda.io/en/latest/miniconda.html
+
+ **2. Create and activate a Conda environment:**
+ ```
+ conda create -n soulxsinger -y python=3.10
+ conda activate soulxsinger
+ ```
+ **3. Install dependencies:**
+ ```
+ pip install -r requirements.txt
+ ```
+ ⚠️ If you are in mainland China, use a PyPI mirror:
+ ```
+ pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
+ ```
+
+ ---
+
+ ### 3. Download Pretrained Models
+
+ **This repository does not include pretrained models.** You must download them from Hugging Face:
+
+ - [Soul-AILab/SoulX-Singer](https://huggingface.co/Soul-AILab/SoulX-Singer) (SVS model)
+ - [Soul-AILab/SoulX-Singer-Preprocess](https://huggingface.co/Soul-AILab/SoulX-Singer-Preprocess) (preprocessing models)
+
+ Install Hugging Face Hub and download:
+
+ ```sh
+ pip install -U huggingface_hub
+
+ # SoulX-Singer SVS model
+ huggingface-cli download Soul-AILab/SoulX-Singer --local-dir pretrained_models/SoulX-Singer
+
+ # Preprocessing models (vocal separation, F0, ASR, etc.)
+ huggingface-cli download Soul-AILab/SoulX-Singer-Preprocess --local-dir pretrained_models/SoulX-Singer-Preprocess
+ ```
+
+ ### 4. Run the Demo
+
+ Run the inference demo:
+ ```sh
+ bash example/infer.sh
+ ```
+
+ This script relies on metadata generated by the preprocessing pipeline, including vocal separation and transcription. Follow the steps in [preprocess](preprocess/README.md) to prepare the necessary metadata before running the demo on your own data.
+
+ **⚠️ Important Note**
+ The metadata produced by the automatic preprocessing pipeline may not perfectly align the singing audio with the corresponding lyrics and musical notes. For best synthesis quality, we strongly recommend manually correcting the alignment using the 🎼 [Midi-Editor](https://huggingface.co/spaces/Soul-AILab/SoulX-Singer-Midi-Editor).
+
+ How to use the Midi-Editor:
+ - [Editing Metadata with Midi-Editor](preprocess/README.md#L104-L105)
+
+ ### 🌐 WebUI
+
+ You can launch the interactive interface with:
+ ```
+ python webui.py
+ ```
+
+ ### 🚀 Deploy as a Hugging Face Space
+
+ This repo is ready to deploy as a [Hugging Face Space](https://huggingface.co/spaces). **Pretrained models are not included;** `app.py` downloads them from the Hub on first run.
+
+ **📖 For the detailed deployment guide, see [DEPLOY.md](DEPLOY.md).**
+
+ **Quick steps:**
+
+ 1. **Create a Space**: visit [huggingface.co/spaces](https://huggingface.co/spaces), click "Create new Space", and choose the **Gradio** SDK
+ 2. **Upload the code**: push with Git or upload the files through the web UI
+ 3. **Configure hardware**: choose **GPU T4 Small** (recommended) in the Space Settings for faster inference
+ 4. **Wait for startup**: the Space automatically installs dependencies, downloads the models, and starts the app (the first run can take 5-15 minutes)
+
+ The models are downloaded automatically from:
+ - [Soul-AILab/SoulX-Singer](https://huggingface.co/Soul-AILab/SoulX-Singer) (SVS model)
+ - [Soul-AILab/SoulX-Singer-Preprocess](https://huggingface.co/Soul-AILab/SoulX-Singer-Preprocess) (preprocessing models)
+
+ ## 🚧 Roadmap
+
+ - [ ] 🖥️ Web-based UI for easy and interactive inference
+ - [ ] 🌐 Online demo deployment on Hugging Face Spaces
+ - [ ] 📊 Release the SoulX-Singer-Eval benchmark
+ - [ ] 📚 Comprehensive tutorials and usage documentation
+
+ ## 🙏 Acknowledgements
+
+ Special thanks to the following open-source projects:
+
+ - [F5-TTS](https://github.com/SWivid/F5-TTS)
+ - [Amphion](https://github.com/open-mmlab/Amphion/tree/main)
+ - [Music Source Separation Training](https://github.com/ZFTurbo/Music-Source-Separation-Training)
+ - [Lead Vocal Separation](https://huggingface.co/becruily/mel-band-roformer-karaoke)
+ - [Vocal Dereverberation](https://huggingface.co/anvuew/dereverb_mel_band_roformer)
+ - [RMVPE](https://github.com/Dream-High/RMVPE)
+ - [Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch)
+ - [Parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)
+ - [ROSVOT](https://github.com/RickyL-2000/ROSVOT)
+
+ ## 📄 License
+
+ We use the Apache 2.0 license. Researchers and developers are free to use the code and model weights of SoulX-Singer. See [LICENSE](LICENSE) for details.
+
+ ## ⚠️ Usage Disclaimer
+
+ SoulX-Singer is intended for academic research, educational purposes, and legitimate applications such as personalized singing synthesis and assistive technologies.
+
+ Please note:
+
+ - 🎤 Respect intellectual property, privacy, and personal consent when generating singing content.
+ - 🚫 Do not use the model to impersonate individuals without authorization or to create deceptive audio.
+ - ⚠️ The developers assume no liability for any misuse of this model.
+
+ We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles. For ethics or misuse concerns, please contact us.
+
+ ## 📬 Contact Us
+
+ We welcome your feedback, questions, and collaboration:
+
+ - **Email**: qianjiale@soulapp.cn | menghao@soulapp.cn | wangxinsheng@soulapp.cn
+
+ - **Join discussions**: WeChat or Soul APP groups for technical discussions and updates:
+
+ <p align="center">
+ <!-- <em>Due to group limits, if you can't scan the QR code, please add my WeChat for group access: <strong>Tiamo James</strong></em> -->
+ <br>
+ <span style="display: inline-block; margin-right: 10px;">
+ <img src="assets/soul_wechat01.jpg" width="500" alt="WeChat Group QR Code"/>
+ </span>
+ <!-- <span style="display: inline-block;">
+ <img src="assets/wechat_tiamo.jpg" width="300" alt="WeChat QR Code"/>
+ </span> -->
+ </p>
app.py ADDED
@@ -0,0 +1,63 @@
+ """
+ Hugging Face Space entry point for SoulX-Singer.
+ Downloads pretrained models from the Hub if needed, then launches the Gradio app.
+ """
+ import os
+ import sys
+ from pathlib import Path
+
+ ROOT = Path(__file__).resolve().parent
+ PRETRAINED_DIR = ROOT / "pretrained_models"
+ MODEL_DIR_SVS = PRETRAINED_DIR / "SoulX-Singer"
+ MODEL_DIR_PREPROCESS = PRETRAINED_DIR / "SoulX-Singer-Preprocess"
+
+
+ def ensure_pretrained_models():
+     """Download SoulX-Singer and Preprocess models from Hugging Face Hub if not present."""
+     if (MODEL_DIR_SVS / "model.pt").exists() and MODEL_DIR_PREPROCESS.exists():
+         print("Pretrained models already present, skipping download.", flush=True)
+         return
+
+     try:
+         from huggingface_hub import snapshot_download
+     except ImportError:
+         print(
+             "huggingface_hub not installed. Install with: pip install huggingface_hub",
+             file=sys.stderr,
+             flush=True,
+         )
+         raise
+
+     PRETRAINED_DIR.mkdir(parents=True, exist_ok=True)
+
+     if not (MODEL_DIR_SVS / "model.pt").exists():
+         print("Downloading SoulX-Singer model...", flush=True)
+         snapshot_download(
+             repo_id="Soul-AILab/SoulX-Singer",
+             local_dir=str(MODEL_DIR_SVS),
+             local_dir_use_symlinks=False,  # ignored by recent huggingface_hub; kept for older versions
+         )
+         print("SoulX-Singer model ready.", flush=True)
+
+     if not MODEL_DIR_PREPROCESS.exists():
+         print("Downloading SoulX-Singer-Preprocess models...", flush=True)
+         snapshot_download(
+             repo_id="Soul-AILab/SoulX-Singer-Preprocess",
+             local_dir=str(MODEL_DIR_PREPROCESS),
+             local_dir_use_symlinks=False,  # ignored by recent huggingface_hub; kept for older versions
+         )
+         print("SoulX-Singer-Preprocess models ready.", flush=True)
+
+
+ if __name__ == "__main__":
+     os.chdir(ROOT)
+     ensure_pretrained_models()
+
+     from webui import render_interface
+
+     page = render_interface()
+     page.queue()
+     page.launch(
+         server_name="0.0.0.0",
+         server_port=int(os.environ.get("PORT", "7860")),
+     )
assets/performance_radar.png ADDED

Git LFS Details

  • SHA256: 8a5fe64523e65072d7c8014e4584b9f20b5e4f43bbd54edee9f2a068ef174162
  • Pointer size: 131 Bytes
  • Size of remote file: 137 kB
assets/soul_wechat01.jpg ADDED

Git LFS Details

  • SHA256: b452c23c33f4d0771f922aed4ceb92c0d6e893e74061f78b69a222f94bbd3c4a
  • Pointer size: 131 Bytes
  • Size of remote file: 835 kB
assets/soulx-logo.png ADDED

Git LFS Details

  • SHA256: 4fe6c191a71be0323d52b236d8ed57f346821ee66c4a9bd8b6232cbca9bf3daf
  • Pointer size: 131 Bytes
  • Size of remote file: 636 kB
assets/technical-report.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ab2876f8850ce09e2b8ce7e929f8b9adf7de10f13900cb013f548f9707b80061
+ size 7927691
cli/inference.py ADDED
@@ -0,0 +1,147 @@
+ import os
+ import torch
+ import json
+ import argparse
+ from tqdm import tqdm
+ import numpy as np
+ import soundfile as sf
+ from omegaconf import DictConfig
+
+ from soulxsinger.utils.file_utils import load_config
+ from soulxsinger.models.soulxsinger import SoulXSinger
+ from soulxsinger.utils.data_processor import DataProcessor
+
+
+ def build_model(
+     model_path: str,
+     config: DictConfig,
+     device: str = "cuda",
+ ):
+     """
+     Build the model from the pre-trained model path and model configuration.
+
+     Args:
+         model_path (str): Path to the checkpoint file.
+         config (DictConfig): Model configuration.
+         device (str, optional): Device to use. Defaults to "cuda".
+
+     Returns:
+         torch.nn.Module: The initialized model with the checkpoint loaded.
+     """
+     if not os.path.isfile(model_path):
+         raise FileNotFoundError(
+             f"Model checkpoint not found: {model_path}. "
+             "Please download the pretrained model and place it at this path, or set --model_path."
+         )
+     model = SoulXSinger(config).to(device)
+     print("Model initialized.")
+     print("Model parameters:", sum(p.numel() for p in model.parameters()) / 1e6, "M")
+
+     checkpoint = torch.load(model_path, weights_only=False, map_location=device)
+     if "state_dict" not in checkpoint:
+         raise KeyError(
+             f"Checkpoint at {model_path} has no 'state_dict' key. "
+             "Expected a checkpoint saved with model.state_dict()."
+         )
+     model.load_state_dict(checkpoint["state_dict"], strict=True)
+
+     model.eval()
+     print("Model checkpoint loaded.")
+
+     return model
+
+
+ def process(args, config, model: torch.nn.Module):
+     """Run the full inference pipeline: build a DataProcessor, load the prompt and
+     target metadata, synthesize each target segment, and stitch the results."""
+     if args.control not in ("melody", "score"):
+         raise ValueError(f"control must be 'melody' or 'score', got: {args.control}")
+
+     print(f"prompt_metadata_path: {args.prompt_metadata_path}")
+     print(f"target_metadata_path: {args.target_metadata_path}")
+
+     os.makedirs(args.save_dir, exist_ok=True)
+     data_processor = DataProcessor(
+         hop_size=config.audio.hop_size,
+         sample_rate=config.audio.sample_rate,
+         phoneset_path=args.phoneset_path,
+         device=args.device,
+     )
+
+     with open(args.prompt_metadata_path, "r", encoding="utf-8") as f:
+         prompt_meta_list = json.load(f)
+     if not prompt_meta_list:
+         raise ValueError("Prompt metadata is empty. Please run preprocess on the prompt audio first.")
+     prompt_meta = prompt_meta_list[0]  # load the first segment as the prompt
+     with open(args.target_metadata_path, "r", encoding="utf-8") as f:
+         target_meta_list = json.load(f)
+     infer_prompt_data = data_processor.process(prompt_meta, args.prompt_wav_path)
+
+     assert len(target_meta_list) > 0, "No target segments found in the target metadata."
+     # Segment times are in milliseconds; allocate the output buffer up to the last segment's end.
+     generated_len = int(target_meta_list[-1]["time"][1] / 1000 * config.audio.sample_rate)
+     generated_merged = np.zeros(generated_len, dtype=np.float32)
+
+     for idx, target_meta in enumerate(
+         tqdm(target_meta_list, total=len(target_meta_list), desc="Inferring segments"),
+     ):
+         start_sample_idx = int(target_meta["time"][0] / 1000 * config.audio.sample_rate)
+         end_sample_idx = int(target_meta["time"][1] / 1000 * config.audio.sample_rate)
+         infer_target_data = data_processor.process(target_meta, None)
+
+         infer_data = {
+             "prompt": infer_prompt_data,
+             "target": infer_target_data,
+         }
+
+         with torch.no_grad():
+             generated_audio = model.infer(
+                 infer_data,
+                 auto_shift=args.auto_shift,
+                 pitch_shift=args.pitch_shift,
+                 n_steps=config.infer.n_steps,
+                 cfg=config.infer.cfg,
+                 control=args.control,
+             )
+
+         generated_audio = generated_audio.squeeze().cpu().numpy()
+         # Place the synthesized segment at its original position in the full track.
+         generated_merged[start_sample_idx : start_sample_idx + generated_audio.shape[0]] = generated_audio
+
+     merged_path = os.path.join(args.save_dir, "generated.wav")
+     sf.write(merged_path, generated_merged, config.audio.sample_rate)  # was hardcoded to 24000
+     print(f"Generated audio saved to {merged_path}")
+
+
+ def main(args, config):
+     model = build_model(
+         model_path=args.model_path,
+         config=config,
+         device=args.device,
+     )
+     process(args, config, model)
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--device", type=str, default="cuda")
+     parser.add_argument("--model_path", type=str, default="pretrained_models/SoulX-Singer/model.pt")
+     parser.add_argument("--config", type=str, default="soulxsinger/config/soulxsinger.yaml")
+     parser.add_argument("--prompt_wav_path", type=str, default="example/audio/zh_prompt.wav")
+     parser.add_argument("--prompt_metadata_path", type=str, default="example/metadata/zh_prompt.json")
+     parser.add_argument("--target_metadata_path", type=str, default="example/metadata/zh_target.json")
+     parser.add_argument("--phoneset_path", type=str, default="soulxsinger/utils/phoneme/phone_set.json")
+     parser.add_argument("--save_dir", type=str, default="outputs")
+     parser.add_argument("--auto_shift", action="store_true")
+     parser.add_argument("--pitch_shift", type=int, default=0)
+     parser.add_argument(
+         "--control",
+         type=str,
+         default="melody",
+         choices=["melody", "score"],
+         help="Control mode: 'melody' (F0 contour) or 'score' (MIDI notes)",
+     )
+     args = parser.parse_args()
+
+     config = load_config(args.config)
+     main(args, config)
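
The segment stitching in `process` above relies on converting the metadata's millisecond timestamps into sample indices. A small worked sketch of that conversion, assuming a 24 kHz output sample rate (an assumption; the actual value comes from `config.audio.sample_rate` in `soulxsinger/config/soulxsinger.yaml`):

```python
SAMPLE_RATE = 24_000  # assumed; the real value is read from config.audio.sample_rate

def ms_to_samples(ms: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a metadata timestamp in milliseconds to a sample index."""
    return int(ms / 1000 * sample_rate)

# The en_prompt example segment spans [5220, 10280] ms:
start, end = ms_to_samples(5220), ms_to_samples(10280)
print(start, end)                    # 125280 246720
print((end - start) / SAMPLE_RATE)   # 5.06 -> about five seconds of audio
```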
deploy_to_hf.sh ADDED
@@ -0,0 +1,70 @@
+ #!/bin/bash
+ # Quick deployment script: deploy SoulX-Singer to a Hugging Face Space
+ # Usage: ./deploy_to_hf.sh YOUR_USERNAME YOUR_SPACE_NAME
+
+ set -e
+
+ if [ $# -lt 2 ]; then
+     echo "Usage: $0 <YOUR_USERNAME> <YOUR_SPACE_NAME>"
+     echo "Example: $0 myusername soulx-singer-demo"
+     exit 1
+ fi
+
+ USERNAME=$1
+ SPACE_NAME=$2
+ SPACE_REPO="https://huggingface.co/spaces/${USERNAME}/${SPACE_NAME}"
+
+ echo "🚀 Deploying to Hugging Face Space..."
+ echo "Space: ${USERNAME}/${SPACE_NAME}"
+ echo ""
+
+ # Check that huggingface_hub is installed
+ if ! command -v huggingface-cli &> /dev/null; then
+     echo "⚠️ huggingface-cli not found, installing..."
+     pip install -U huggingface_hub
+ fi
+
+ # Check that the user is logged in
+ if ! huggingface-cli whoami &> /dev/null; then
+     echo "🔐 Please log in to Hugging Face first..."
+     huggingface-cli login
+ fi
+
+ # Create the Space (if it does not exist)
+ echo "📦 Checking whether the Space exists..."
+ if ! huggingface-cli repo info "${USERNAME}/${SPACE_NAME}" --repo-type space &> /dev/null; then
+     echo "✨ Creating a new Space..."
+     huggingface-cli repo create "${SPACE_NAME}" --type space --sdk gradio
+ else
+     echo "✅ Space already exists"
+ fi
+
+ # Initialize Git if needed
+ if [ ! -d ".git" ]; then
+     echo "📝 Initializing Git repository..."
+     git init
+     git add .
+     git commit -m "Initial commit for HF Space deployment" || echo "⚠️ Nothing new to commit"
+ fi
+
+ # Configure the remote
+ if git remote | grep -q "^origin$"; then
+     echo "🔄 Updating the remote URL..."
+     git remote set-url origin "${SPACE_REPO}"
+ else
+     echo "➕ Adding the remote..."
+     git remote add origin "${SPACE_REPO}"
+ fi
+
+ # Push the code
+ echo "📤 Pushing code to Hugging Face..."
+ git push -u origin main || git push -u origin master
+
+ echo ""
+ echo "✅ Deployment complete!"
+ echo "🌐 Space URL: ${SPACE_REPO}"
+ echo ""
+ echo "💡 Tips:"
+ echo "  - The Space starts building automatically; check the Logs tab"
+ echo "  - The first run downloads the models and can take 5-15 minutes"
+ echo "  - Selecting GPU T4 Small hardware in the Space Settings is recommended"
example/audio/en_prompt.json ADDED
@@ -0,0 +1,16 @@
+ [
+ {
+ "index": "vocal_5220_10280",
+ "language": "English",
+ "time": [
+ 5220,
+ 10280
+ ],
+ "duration": "0.24 0.36 0.30 0.78 0.24 0.56 0.19 0.53 0.36 0.20 0.32 0.57 0.19 0.22",
+ "text": "<SP> Ooh Ooh <SP> I wish nothing nothing more more the best best <SP>",
+ "phoneme": "<SP> en_UW1 en_UW1 <SP> en_AY1 en_W-IH1-SH en_N-AH1-TH-IH0-NG en_N-AH1-TH-IH0-NG en_M-AO1-R en_M-AO1-R en_DH-AH0 en_B-EH1-S-T en_B-EH1-S-T <SP>",
+ "note_pitch": "0 63 65 0 65 67 68 62 62 64 67 67 65 0",
+ "note_type": "1 2 3 1 2 2 2 3 2 3 2 2 3 1",
+ "f0": "0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 345.2 343.1 341.6 339.8 337.8 331.9 319.5 312.1 310.8 312.6 315.1 316.1 315.3 314.6 315.3 317.9 322.0 329.6 337.5 344.7 347.5 347.2 344.3 339.5 338.2 341.7 342.8 342.2 340.7 343.0 342.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 347.0 345.3 348.7 350.2 350.9 350.3 344.7 340.3 338.3 338.0 342.8 347.4 348.3 346.7 343.4 339.5 340.5 345.2 350.4 357.7 367.3 376.6 385.9 392.6 393.6 389.9 384.7 381.8 382.0 383.0 380.6 373.5 367.9 377.0 385.4 391.4 393.8 395.6 396.1 397.2 399.8 406.0 413.5 416.1 416.0 414.4 413.5 412.9 415.5 418.9 417.5 408.8 389.2 373.9 0.0 0.0 0.0 288.5 286.0 284.2 285.6 288.9 291.3 293.5 294.5 295.2 297.8 299.5 301.0 303.0 305.9 306.8 306.0 304.4 301.8 301.0 300.8 301.8 310.2 309.8 308.2 305.9 303.6 301.5 299.3 298.5 300.0 302.1 303.5 303.6 302.2 299.7 297.5 296.3 296.4 296.8 298.6 302.6 311.8 322.0 333.8 349.0 368.8 393.3 407.1 410.7 407.0 402.3 401.2 401.7 403.9 405.7 403.5 396.8 387.4 378.6 377.8 381.4 384.0 384.7 383.5 382.5 380.8 377.3 378.4 383.5 390.0 392.7 390.5 387.6 385.3 382.7 381.0 382.8 383.9 382.2 379.6 379.3 380.2 383.1 386.0 386.5 385.4 384.3 383.7 384.4 386.2 388.2 388.5 385.0 378.6 360.4 333.7 328.2 332.4 340.2 348.9 339.6 334.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
+ }
+ ]
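
Each metadata file is a JSON list of segments. The `duration`, `text`, `phoneme`, `note_pitch`, and `note_type` fields are parallel whitespace-separated sequences (14 entries each in the segment above), `time` is the segment span in milliseconds, and `f0` is a frame-level pitch contour. A small validation sketch, inferred from these example files rather than from any documented schema:

```python
import json

TOKEN_FIELDS = ("duration", "text", "phoneme", "note_pitch", "note_type")

def check_segment(seg: dict) -> None:
    """Sanity-check one segment: the token-level fields must have equal lengths."""
    lengths = {field: len(seg[field].split()) for field in TOKEN_FIELDS}
    assert len(set(lengths.values())) == 1, f"misaligned token fields: {lengths}"
    start_ms, end_ms = seg["time"]
    assert start_ms < end_ms, "segment start must precede its end"

with open("example/audio/en_prompt.json", encoding="utf-8") as f:
    for segment in json.load(f):
        check_segment(segment)
        print(segment["index"], "ok")
```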
example/audio/en_prompt.mp3 ADDED
Binary file (86.8 kB).
 
example/audio/en_target.json ADDED
@@ -0,0 +1,16 @@
+ [
+ {
+ "index": "vocal_0_6900",
+ "language": "English",
+ "time": [
+ 0,
+ 6900
+ ],
+ "duration": "0.16 0.24 0.32 0.15 0.17 0.24 0.15 0.44 0.29 0.32 0.24 0.32 0.22 0.18 0.24 0.25 1.01 0.26 0.48 0.29 0.79 0.14",
+ "text": "<SP> Who says you're you're not pretty <SP> pretty <SP> Who says you're you're not beautiful beautiful <SP> Who says says <SP>",
+ "phoneme": "<SP> en_HH-UW1 en_S-EH1-Z en_Y-UH1-R en_Y-UH1-R en_N-AA1-T en_P-R-IH1-T-IY0 <SP> en_P-R-IH1-T-IY0 <SP> en_HH-UW1 en_S-EH1-Z en_Y-UH1-R en_Y-UH1-R en_N-AA1-T en_B-Y-UW1-T-AH0-F-AH0-L en_B-Y-UW1-T-AH0-F-AH0-L <SP> en_HH-UW1 en_S-EH1-Z en_S-EH1-Z <SP>",
+ "note_pitch": "0 68 67 65 63 63 66 67 70 66 68 67 65 63 63 67 65 63 65 61 58 0",
+ "note_type": "1 2 2 2 3 2 2 1 3 1 2 2 2 3 2 2 3 1 2 2 3 1",
+ "f0": "0.0 0.0 382.7 387.7 385.9 379.8 376.0 380.9 390.1 403.2 415.3 423.6 421.6 402.6 385.2 381.1 0.0 0.0 425.8 419.0 409.6 397.8 392.2 389.0 388.5 391.4 389.1 381.4 375.9 0.0 0.0 0.0 0.0 359.0 354.7 353.8 353.7 354.7 353.1 351.1 350.4 349.0 348.9 346.3 337.4 328.0 312.8 303.1 298.4 296.0 298.9 302.0 306.3 307.9 307.3 307.5 307.3 302.9 301.8 0.0 0.0 0.0 0.0 0.0 343.7 364.3 375.9 368.5 358.1 359.1 365.9 378.4 393.1 406.0 412.5 410.9 407.0 404.1 403.5 403.4 401.5 399.4 397.7 395.4 394.4 394.8 395.5 396.5 397.5 400.8 407.9 415.1 417.8 453.1 472.2 481.0 482.3 481.9 480.8 478.7 477.4 476.8 474.8 467.5 446.0 390.4 382.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 374.3 375.5 370.9 370.7 373.1 378.3 392.2 407.6 418.5 423.9 423.2 415.9 395.4 0.0 0.0 0.0 421.5 416.2 405.3 391.0 383.1 380.8 383.0 388.3 388.8 378.3 371.7 0.0 0.0 0.0 371.4 365.1 362.7 358.5 353.0 352.0 353.5 356.1 356.4 353.6 348.3 341.1 330.6 317.7 303.8 293.3 296.5 297.7 301.4 305.3 308.8 308.8 308.2 308.2 306.3 305.6 285.0 269.8 265.6 280.0 304.4 331.2 351.0 357.9 364.2 370.6 381.1 392.9 399.0 399.1 395.0 389.5 379.9 363.0 338.9 318.5 305.6 300.3 299.6 296.3 292.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 309.6 322.1 329.8 331.2 332.1 332.6 332.4 335.4 340.7 345.0 347.2 346.2 342.6 339.6 337.4 338.3 340.9 342.6 344.0 344.6 344.0 344.2 343.6 341.9 338.8 336.7 337.6 341.1 347.0 350.4 343.0 326.6 330.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 310.0 315.7 317.7 317.4 316.5 314.4 314.1 322.5 336.3 350.0 354.2 352.9 350.5 348.4 347.1 347.3 348.4 349.8 349.8 350.4 350.3 323.9 324.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 297.3 289.8 279.5 275.1 276.1 276.1 274.9 275.4 274.6 271.8 268.6 264.0 258.3 251.7 244.3 239.9 236.1 233.7 234.0 236.0 237.3 236.9 235.2 233.5 231.7 231.0 232.1 233.6 235.4 236.2 236.7 235.8 234.1 232.2 231.3 232.6 233.5 235.2 236.0 232.3 228.8 229.6 233.8 241.3 239.4 226.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
+ }
+ ]
example/audio/en_target.mp3 ADDED
Binary file (66.9 kB).
 
example/audio/music.json ADDED
@@ -0,0 +1,16 @@
+ [
+ {
+ "index": "vocal_240_51240",
+ "language": "Mandarin",
+ "time": [
+ 240,
+ 51240
+ ],
+ "duration": "0.21 0.18 0.34 0.38 0.26 0.22 0.50 0.20 0.33 0.13 0.60 0.22 0.38 0.52 0.30 0.12 1.82 0.98 0.20 0.26 0.38 0.36 0.58 0.54 0.26 0.50 1.46 3.03 0.24 0.24 0.29 0.25 0.20 0.14 0.80 0.22 0.54 0.30 0.24 0.16 0.58 0.28 0.38 1.73 0.79 0.24 0.30 0.34 0.30 0.30 0.34 0.21 0.21 0.36 0.34 0.23 1.85 1.90 0.23 0.39 0.68 0.50 0.31 0.43 0.76 0.38 2.00 1.87 0.68 0.72 0.56 0.62 0.80 0.40 0.42 1.68 1.79 0.70 0.66 0.54 0.24 0.48 0.68 0.40 2.34 0.14",
+ "text": "<SP> 只 是 因 为 为 在 人 群 群 中 多 看 了 你 一 眼 <SP> 再 也 没 能 忘 掉 你 容 颜 <SP> 梦 想 着 着 偶 偶 然 然 有 <SP> 一 一 天 再 相 见 <SP> 从 此 我 开 始 始 孤 孤 单 思 念 念 <SP> 想 想 你 时 你 你 在 天 边 <SP> 想 你 时 你 在 眼 前 前 <SP> 想 你 时 你 你 在 脑 海 <SP>",
+ "phoneme": "<SP> zh_zhi3 zh_shi4 zh_yin1 zh_wei4 zh_wei2 zh_zai4 zh_ren2 zh_qun2 zh_qun2 zh_zhong1 zh_duo1 zh_kan4 zh_le5 zh_ni3 zh_yi1 zh_yan3 <SP> zh_zai4 zh_ye3 zh_mei2 zh_neng2 zh_wang4 zh_diao4 zh_ni3 zh_rong2 zh_yan2 <SP> zh_meng4 zh_xiang3 zh_zhe5 zh_zhe5 zh_ou3 zh_ou3 zh_ran2 zh_ran2 zh_you3 <SP> zh_yi1 zh_yi1 zh_tian1 zh_zai4 zh_xiang1 zh_jian4 <SP> zh_cong2 zh_ci3 zh_wo3 zh_kai1 zh_shi3 zh_shi3 zh_gu1 zh_gu1 zh_dan1 zh_si1 zh_nian4 zh_nian4 <SP> zh_xiang3 zh_xiang3 zh_ni3 zh_shi2 zh_ni3 zh_ni3 zh_zai4 zh_tian1 zh_bian1 <SP> zh_xiang3 zh_ni3 zh_shi2 zh_ni3 zh_zai4 zh_yan3 zh_qian2 zh_qian2 <SP> zh_xiang3 zh_ni3 zh_shi2 zh_ni3 zh_ni3 zh_zai4 zh_nao3 zh_hai3 <SP>",
+ "note_pitch": "0 64 64 64 66 68 66 66 66 64 64 64 64 66 64 60 61 0 63 63 63 64 66 63 61 59 56 0 68 66 66 68 68 66 66 64 64 0 64 66 61 61 61 64 0 62 63 63 64 64 66 65 66 61 58 59 56 0 69 71 66 68 68 71 66 64 61 0 66 61 68 66 64 63 61 59 0 71 66 68 68 71 66 64 61 0",
+ "note_type": "1 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 3 2 3 2 3 2 1 2 3 2 2 2 2 1 2 2 2 2 2 3 2 3 2 2 2 3 1 2 3 2 2 2 3 2 2 2 1 2 2 2 2 2 2 2 3 1 2 2 2 2 3 2 2 2 1",
+ "f0": "0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 318.9 332.2 331.2 323.9 312.3 0.0 0.0 0.0 0.0 0.0 0.0 340.9 342.3 340.2 337.4 333.5 329.3 327.8 328.3 329.1 330.5 331.9 331.5 329.2 327.6 327.4 327.7 329.6 330.5 329.5 328.1 328.5 330.1 330.4 329.5 328.9 331.0 333.6 332.8 331.7 330.2 330.6 330.9 329.5 327.6 332.4 343.5 363.9 368.7 369.3 368.2 366.7 371.3 382.0 392.9 402.3 409.8 413.8 416.3 416.4 415.4 414.1 413.4 415.3 417.4 418.3 413.6 416.0 0.0 0.0 0.0 0.0 368.2 362.5 361.3 362.0 362.9 364.3 366.4 367.5 368.4 368.9 368.1 367.6 369.7 371.1 371.1 369.5 368.0 367.3 366.6 367.1 369.5 372.0 371.3 368.2 368.1 369.6 369.9 373.4 376.5 360.2 285.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 384.0 376.7 372.0 371.7 373.9 376.3 375.5 361.2 333.0 317.8 319.0 329.1 333.5 333.2 333.2 333.3 328.0 314.5 0.0 0.0 0.0 320.6 326.5 332.9 335.7 334.1 329.9 327.4 326.3 326.6 329.1 330.5 330.0 328.0 326.7 328.6 331.1 332.4 332.1 332.8 333.2 333.1 331.8 328.6 318.9 293.2 323.6 327.5 330.0 332.8 331.1 319.4 272.2 0.0 0.0 0.0 0.0 278.5 294.5 310.0 318.5 322.7 327.4 334.1 337.2 336.3 328.5 321.0 324.2 341.5 362.2 374.7 375.8 370.5 366.0 365.2 368.7 371.7 374.4 373.0 368.7 370.4 374.8 375.1 372.5 368.7 363.0 357.1 359.0 366.8 377.1 379.5 371.5 359.9 351.0 358.4 371.7 377.9 375.8 367.8 359.7 363.6 375.3 379.9 377.5 372.9 359.8 345.2 337.1 334.5 333.4 332.8 329.6 319.2 292.8 261.7 253.4 262.8 273.0 278.1 279.3 278.6 277.6 277.9 278.3 278.0 277.3 276.7 275.6 275.3 276.1 277.8 278.2 277.7 277.6 277.2 276.4 275.5 273.9 271.8 270.7 274.7 282.7 285.5 281.4 273.1 264.4 262.2 268.9 278.4 284.3 283.6 277.3 268.5 262.2 263.2 270.7 278.8 285.0 287.3 282.0 272.1 265.2 266.0 272.1 278.2 283.8 285.1 281.4 272.6 267.2 269.1 276.1 282.7 289.3 290.5 285.6 273.9 267.5 267.4 271.8 277.7 282.6 283.1 277.8 269.8 265.6 268.6 273.8 281.1 286.2 287.2 282.3 270.4 264.5 263.3 263.0 268.5 268.3 268.7 270.9 277.2 277.4 284.3 285.7 282.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 311.5 306.4 306.0 306.8 305.8 305.0 307.2 310.8 312.4 313.5 314.4 314.6 313.7 312.2 311.1 311.1 312.8 314.6 312.7 306.5 299.7 302.7 308.7 315.2 314.2 310.9 309.4 309.8 310.6 310.7 310.8 311.4 312.2 313.0 312.5 312.2 312.2 311.9 312.0 312.2 311.4 306.1 303.3 309.3 315.0 325.3 324.4 323.9 325.0 326.4 327.7 327.7 328.1 328.6 327.8 328.3 328.3 327.6 328.6 336.1 347.2 359.4 371.0 374.9 371.3 366.5 369.5 375.5 376.2 370.2 365.7 367.3 372.9 376.9 375.3 368.4 360.1 358.0 365.4 380.8 383.5 380.4 374.4 367.6 365.9 372.0 376.3 377.9 375.7 372.6 368.6 361.2 353.7 0.0 309.4 302.3 299.2 300.6 305.7 308.5 308.6 309.0 310.9 310.9 309.5 308.6 309.0 311.3 313.0 314.4 314.5 313.0 312.1 311.2 308.0 299.5 295.8 295.0 285.7 278.0 280.1 281.2 279.9 278.3 278.3 279.5 279.7 272.2 259.1 0.0 0.0 0.0 240.7 236.0 233.8 233.5 234.6 235.7 235.6 236.2 239.8 245.4 248.8 250.1 250.4 249.9 249.1 247.2 241.7 231.7 216.0 197.6 192.1 197.2 204.3 206.5 205.4 201.7 200.8 203.5 208.2 210.9 210.3 207.1 203.5 202.3 202.1 202.8 205.1 207.3 208.1 207.3 205.4 202.1 199.3 200.5 206.1 212.9 212.9 206.5 197.1 191.1 191.2 196.8 203.6 207.8 208.7 206.4 201.7 197.3 197.4 200.7 206.5 209.9 209.9 206.9 203.3 201.2 203.4 207.7 210.5 208.9 207.6 206.5 204.2 202.0 203.6 205.7 210.8 213.1 214.3 210.6 204.1 199.7 202.3 211.9 217.7 215.1 215.0 215.3 213.3 0.0 0.0 216.3 0.0 0.0 195.2 209.0 205.3 201.3 196.8 195.7 195.8 195.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 431.9 425.8 421.0 417.7 419.3 422.7 423.4 421.1 417.9 411.9 408.6 411.5 414.9 419.6 426.7 424.2 0.0 0.0 0.0 280.8 279.4 369.7 371.8 370.6 369.8 370.5 370.5 371.4 374.6 374.0 361.0 332.7 328.0 0.0 359.0 365.0 371.1 373.8 371.9 369.7 369.6 371.0 378.3 390.4 403.4 414.1 417.9 417.5 417.0 416.6 415.7 414.9 414.5 413.8 413.5 413.6 413.8 415.2 416.9 417.3 415.4 415.0 414.1 411.4 409.2 403.5 392.7 375.6 367.5 365.8 368.8 370.7 370.8 369.0 367.0 366.0 366.2 367.8 369.2 368.8 367.2 366.7 367.3 367.3 368.8 368.5 366.8 365.3 364.0 363.0 365.0 367.3 367.7 365.5 364.0 364.9 367.4 370.3 371.3 369.9 366.5 364.1 368.3 385.8 406.7 409.4 399.7 376.0 357.2 359.9 372.4 377.5 366.9 345.1 320.6 315.8 321.8 329.0 329.9 326.4 324.5 324.8 324.9 325.3 325.1 324.7 325.9 326.8 326.6 326.6 326.4 325.4 323.2 322.5 323.9 326.2 328.6 329.9 329.5 329.0 328.2 327.7 327.9 329.6 331.2 331.4 334.2 340.1 335.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 309.0 320.1 329.6 334.4 334.4 334.5 338.2 346.0 355.1 361.4 366.1 369.5 372.7 373.3 372.9 375.2 375.8 374.7 367.6 360.1 0.0 0.0 0.0 0.0 0.0 0.0 283.6 280.4 277.2 274.3 271.8 271.0 273.3 277.2 278.8 278.1 277.2 277.4 278.8 279.3 280.2 281.4 279.2 274.9 268.3 254.1 239.9 0.0 0.0 267.7 267.7 270.1 271.8 273.1 276.3 278.3 269.1 256.1 0.0 0.0 0.0 0.0 286.0 281.1 275.7 272.7 271.7 271.7 272.5 274.0 275.5 277.9 281.1 287.1 314.0 342.2 363.0 376.8 375.4 365.8 351.8 323.3 0.0 0.0 0.0 0.0 299.6 322.7 336.5 336.9 335.4 332.9 330.0 326.6 323.4 322.3 324.4 326.7 328.5 328.5 326.4 324.0 322.5 323.2 325.5 327.7 328.6 328.1 325.4 320.9 316.5 316.1 320.8 327.7 333.7 333.2 325.6 315.5 306.9 305.5 315.1 331.0 343.3 341.6 331.4 319.5 308.3 307.6 318.1 329.5 338.2 342.0 337.4 329.4 321.0 316.6 319.8 330.8 340.6 344.5 342.3 331.4 319.3 315.7 322.9 329.3 336.5 344.0 341.1 328.3 318.7 320.0 328.2 333.5 337.9 339.7 338.5 327.3 323.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 285.3 284.6 284.3 288.2 295.1 305.6 312.2 315.5 287.8 0.0 0.0 0.0 0.0 0.0 309.8 312.4 313.8 312.8 310.8 310.7 312.5 313.2 310.7 305.7 305.9 308.1 309.9 311.3 311.0 309.8 309.7 310.6 311.1 311.1 311.0 311.9 313.7 316.4 319.5 311.6 281.3 283.4 306.6 0.0 0.0 0.0 334.4 326.7 322.6 321.6 321.7 324.5 332.0 336.7 333.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 332.5 331.7 332.7 334.8 334.7 335.9 339.7 345.6 353.9 362.3 370.2 374.0 373.8 371.4 370.0 368.2 367.0 368.0 370.4 371.2 371.3 370.0 360.3 0.0 0.0 0.0 319.1 326.8 331.5 332.0 331.6 327.2 328.2 334.1 343.1 350.7 358.0 363.5 368.4 370.8 371.8 371.1 370.2 368.5 368.0 371.1 375.5 376.3 369.5 0.0 0.0 0.0 280.4 267.2 263.6 265.8 271.2 275.0 276.4 276.9 277.1 280.9 287.5 284.7 280.7 0.0 0.0 0.0 0.0 0.0 0.0 231.6 226.0 222.9 224.1 226.2 228.5 234.3 240.9 246.6 249.8 249.6 246.3 245.2 245.8 247.2 248.3 249.3 249.0 245.9 242.7 235.8 226.3 217.8 213.1 209.3 208.0 207.0 205.9 205.8 203.4 201.5 205.3 207.5 
209.1 210.4 208.9 203.2 199.2 199.3 201.2 205.6 206.3 204.0 202.7 202.2 203.3 206.1 205.6 201.9 198.8 195.5 195.6 198.7 204.2 214.5 218.3 214.3 207.0 199.1 192.4 189.9 193.6 201.0 210.6 212.4 209.2 202.0 195.2 190.0 190.4 196.0 204.2 209.2 208.9 203.0 195.8 190.7 189.8 194.6 200.8 208.7 213.5 210.5 202.0 193.9 187.7 187.8 193.1 199.8 202.8 204.1 203.8 200.4 197.0 193.9 191.6 192.5 198.4 204.8 203.9 203.5 201.3 200.2 198.4 198.5 201.0 204.0 204.6 205.4 202.0 199.2 194.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 426.6 432.4 436.6 437.2 434.6 433.0 434.0 453.1 473.8 492.8 503.2 501.8 491.3 479.6 478.1 489.7 503.6 508.3 503.2 489.5 482.3 484.2 495.7 505.4 506.6 501.6 495.4 492.7 497.1 500.4 496.1 490.1 480.4 453.8 412.6 373.6 357.7 354.5 356.8 361.0 365.0 367.8 369.5 370.1 371.1 371.3 370.8 370.1 369.7 369.0 369.5 370.4 371.1 370.9 370.2 370.8 371.3 372.7 380.2 386.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 435.0 427.8 417.0 411.6 412.6 411.2 410.7 412.5 413.7 414.9 414.8 415.7 415.4 414.0 413.8 414.6 415.2 416.3 415.2 413.6 413.4 414.9 416.8 416.4 414.7 413.8 413.2 414.7 417.5 420.7 428.2 443.1 459.8 481.7 502.3 510.8 509.1 497.8 490.6 493.5 500.8 508.3 511.8 504.6 493.5 487.3 492.3 507.2 518.8 514.8 496.4 489.6 490.4 491.8 493.3 0.0 0.0 0.0 0.0 0.0 391.9 368.8 360.2 358.3 360.0 362.0 361.8 360.9 360.3 361.8 365.7 368.7 369.3 367.7 364.7 365.6 368.4 370.4 371.7 369.0 366.8 367.5 367.9 369.4 374.6 384.6 396.8 388.2 382.8 0.0 0.0 0.0 0.0 0.0 352.6 339.5 328.2 326.0 326.5 326.9 328.2 329.4 329.4 329.3 337.3 349.1 350.9 343.6 333.4 326.5 327.1 331.5 333.4 326.8 318.4 0.0 0.0 0.0 282.0 283.6 281.6 279.5 273.2 267.9 268.8 273.2 277.5 279.8 278.7 276.8 273.3 268.7 265.0 266.1 272.4 282.3 284.9 279.4 268.5 255.3 253.0 259.9 270.2 281.5 284.3 279.7 270.1 260.2 256.4 260.1 266.6 273.0 279.6 283.6 282.3 272.3 262.1 255.8 259.8 269.6 274.9 278.7 278.4 271.5 261.7 255.8 256.8 263.9 272.1 279.7 278.4 269.9 259.0 255.9 262.1 269.2 274.5 279.6 282.1 280.7 274.3 269.9 270.9 270.7 273.8 276.9 274.9 271.1 266.5 266.7 269.4 276.9 282.9 282.0 279.5 274.9 271.2 267.9 263.1 270.2 281.3 285.5 283.9 280.6 271.1 262.8 263.9 263.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 362.9 364.0 366.8 366.6 363.6 359.8 357.7 359.9 367.6 371.6 365.4 359.3 358.6 359.5 365.5 375.7 378.1 371.5 359.5 350.8 358.6 374.7 381.7 376.8 363.6 353.3 356.9 367.8 378.3 380.8 375.7 372.3 372.6 374.7 372.9 365.2 346.4 314.8 285.6 277.3 275.4 275.3 277.5 279.2 279.2 278.8 277.6 276.4 276.5 277.6 278.8 279.9 281.8 282.7 281.0 279.2 278.6 279.4 279.9 279.2 278.7 281.7 285.8 289.9 288.7 284.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 428.0 417.9 410.0 409.5 411.2 414.0 417.5 417.9 418.1 419.3 419.0 417.8 417.1 417.2 416.6 415.8 415.0 415.1 414.5 410.9 401.5 388.2 371.5 359.2 363.0 371.7 375.7 370.6 359.6 353.4 358.7 367.0 374.5 378.2 373.7 362.8 
358.0 361.6 368.5 373.7 377.7 375.8 368.8 361.7 360.5 366.7 372.5 375.6 380.1 384.9 383.5 376.6 0.0 0.0 0.0 0.0 0.0 331.7 320.0 316.8 320.7 325.1 327.0 327.4 326.6 326.4 325.5 325.1 326.2 327.6 328.5 328.4 327.7 328.7 329.9 329.8 329.7 327.7 326.5 327.8 328.9 329.3 329.8 329.2 328.8 329.4 330.4 331.0 331.0 330.5 329.2 328.2 328.3 328.7 329.4 327.5 322.9 314.8 299.3 290.8 292.3 299.0 305.4 309.2 312.1 313.9 316.8 320.4 327.6 333.1 338.0 341.2 342.1 338.4 331.0 0.0 0.0 0.0 0.0 0.0 0.0 311.7 292.5 281.8 276.1 275.1 276.6 277.6 277.4 274.0 264.8 249.6 239.1 238.8 243.0 245.7 245.0 241.9 237.9 234.6 237.1 242.7 250.4 252.5 248.3 240.7 231.6 228.3 234.0 240.6 247.0 250.6 248.5 242.1 233.6 226.8 227.2 234.0 240.7 246.9 248.9 244.4 237.6 230.8 228.8 236.4 246.9 250.2 249.2 245.4 239.9 232.5 224.7 226.5 238.5 252.0 257.4 255.9 248.4 237.4 230.5 231.6 238.8 247.6 252.6 253.9 253.0 248.9 241.8 236.4 234.1 236.5 246.2 257.7 255.4 248.2 238.5 234.8 237.8 244.4 250.5 253.8 251.7 246.7 240.0 238.3 244.8 251.6 256.4 256.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 484.0 484.5 487.5 490.6 487.2 479.7 478.9 483.8 494.1 505.7 510.3 501.1 485.4 472.4 472.3 495.6 515.8 517.1 503.8 486.5 475.1 479.9 491.4 507.9 509.9 505.0 496.2 485.7 484.5 490.8 495.4 497.7 496.6 489.9 469.1 424.3 387.2 373.0 367.3 364.4 364.6 366.2 366.6 368.8 371.9 373.1 371.4 368.9 367.1 367.0 367.7 368.8 370.1 369.8 369.2 369.8 370.0 369.0 369.0 371.0 375.2 385.3 391.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 425.2 418.5 409.4 409.5 408.1 406.3 408.4 410.3 412.8 413.8 412.6 411.2 410.8 410.4 410.2 410.5 412.2 413.6 413.4 412.6 412.6 413.4 415.6 417.9 418.5 417.0 415.0 415.0 416.8 421.5 437.9 455.0 472.6 493.0 502.2 496.7 484.6 480.6 487.2 498.6 506.3 506.2 499.4 489.7 489.5 496.2 503.8 507.2 501.8 491.1 489.9 501.1 520.5 533.7 525.8 0.0 0.0 0.0 0.0 0.0 421.8 365.2 351.0 352.0 358.0 362.6 365.6 365.4 363.2 362.5 362.2 362.4 364.1 365.2 364.9 363.5 363.4 366.3 368.5 370.0 369.7 368.5 366.9 365.7 366.3 368.6 370.8 372.2 370.2 373.1 377.0 376.5 371.0 346.8 326.1 325.2 328.4 331.4 334.0 332.2 327.2 324.4 331.3 349.7 363.2 362.0 346.1 327.4 321.8 329.8 340.7 348.4 349.9 346.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 269.5 275.7 282.6 282.9 278.8 275.2 274.3 273.5 273.9 275.3 276.4 277.3 276.1 272.3 268.0 268.3 272.5 279.1 281.7 280.1 271.3 261.0 257.3 262.7 273.2 280.8 283.3 279.3 269.1 258.3 258.0 266.3 278.2 286.7 287.8 283.4 274.9 264.7 257.7 260.0 272.0 286.3 294.7 289.8 274.5 263.8 263.4 268.8 277.3 284.1 286.5 285.2 281.7 272.7 264.3 260.3 267.7 281.7 289.2 289.0 281.2 266.2 256.0 254.4 261.5 276.4 286.8 288.8 286.7 273.8 260.9 260.8 270.2 283.4 291.3 292.8 286.1 273.8 265.2 264.0 271.3 283.8 290.3 289.7 277.7 266.9 260.5 263.0 267.8 281.5 286.7 286.1 285.5 280.1 279.0 283.0 284.1 284.5 285.5 282.2 0.0 0.0 280.0 276.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
15
+ }
16
+ ]
example/audio/music.mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:04b35a7b9d03adc494c304af5c4413aa33a02a54a7110016d6e3b559843d90de
3
+ size 1243961
example/audio/yue_target.json ADDED
@@ -0,0 +1,16 @@
1
+ [
2
+ {
3
+ "index": "vocal_420_14370",
4
+ "language": "Cantonese",
5
+ "time": [
6
+ 420,
7
+ 14370
8
+ ],
9
+ "duration": "0.31 0.26 0.28 0.26 0.40 0.20 0.42 0.24 0.36 0.24 0.32 0.26 0.94 0.32 0.24 0.30 0.34 0.22 0.34 0.90 0.22 0.36 0.32 0.30 0.22 0.36 0.22 0.32 0.34 0.20 0.40 0.24 0.30 0.38 0.22 0.32 0.28 0.36 0.24 0.34 0.26 0.60",
10
+ "text": "<SP> 我 的 心 情 又 像 真 该 等 被 揭 开 嘴 巴 却 再 仰 千 台 人 潮 内 越 文 静 越 变 得 不 受 理 睬 睬 自 己 己 要 交 出 意 外",
11
+ "phoneme": "<SP> yue_ngo5 yue_dik1 yue_sam1 yue_cing4 yue_jau6 yue_zoeng6 yue_zan1 yue_goi1 yue_dang2 yue_bei6 yue_kit3 yue_hoi1 yue_zeoi2 yue_baa1 yue_koek3 yue_zoi3 yue_joeng5 yue_cin1 yue_toi4 yue_jan4 yue_ciu4 yue_noi6 yue_jyut6 yue_man4 yue_zing6 yue_jyut6 yue_bin3 yue_dak1 yue_bat1 yue_sau6 yue_lei5 yue_coi2 yue_coi2 yue_zi6 yue_gei2 yue_gei2 yue_jiu3 yue_gaau1 yue_ceot1 yue_ji3 yue_ngoi6",
12
+ "note_pitch": "0 52 57 59 55 57 59 62 60 58 54 57 59 59 57 55 54 53 57 51 50 54 57 58 54 57 59 61 64 59 54 54 57 59 51 56 58 57 56 56 55 52",
13
+ "note_type": "1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 3 2 2 2 2 2",
14
+ "f0": "0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 74.8 74.8 82.4 85.7 81.0 76.2 78.8 111.9 129.0 146.8 160.6 175.0 182.1 172.9 163.9 190.3 214.7 218.4 221.5 223.5 220.2 209.1 173.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 254.5 248.7 245.1 243.3 239.8 238.8 241.2 244.0 245.7 245.6 239.8 216.8 191.2 0.0 0.0 170.9 179.4 189.0 192.6 193.3 192.8 193.5 194.1 194.1 194.4 195.4 197.4 199.7 202.2 205.8 210.6 212.3 214.4 216.5 219.4 221.9 222.4 222.4 222.8 222.9 217.1 189.6 175.4 0.0 0.0 255.5 251.2 246.5 247.3 248.9 249.1 249.4 251.9 253.6 250.2 247.1 246.6 239.1 193.8 191.0 0.0 295.9 301.1 302.8 301.5 296.6 287.7 286.4 290.4 294.2 297.1 297.3 294.7 287.9 273.4 221.0 262.3 265.3 259.7 255.9 254.7 255.1 256.2 257.4 259.3 260.8 261.3 249.3 209.1 194.0 236.7 224.9 210.2 202.6 197.9 201.4 210.6 220.6 230.1 237.9 242.4 242.7 241.0 234.7 220.3 190.9 179.0 185.6 182.3 178.3 177.1 179.0 181.3 182.4 184.5 187.6 185.9 172.3 161.7 167.7 0.0 0.0 206.9 210.4 213.0 214.2 215.6 217.3 217.1 194.7 181.9 184.5 182.3 171.8 155.9 161.4 167.7 195.8 235.3 245.0 245.8 241.9 237.0 232.3 231.3 234.9 241.8 250.9 253.2 248.9 238.0 226.0 220.5 224.9 236.6 250.9 259.6 262.0 259.1 252.5 246.5 241.8 236.0 228.8 216.3 210.5 211.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 209.1 197.4 191.2 191.2 200.4 222.5 235.6 248.0 252.9 249.4 243.0 235.6 216.1 182.1 175.8 213.4 214.1 214.7 215.1 215.2 215.9 215.7 214.8 211.6 202.2 187.0 183.6 0.0 0.0 0.0 191.3 192.1 191.4 192.4 193.3 192.6 188.9 171.2 0.0 0.0 0.0 0.0 0.0 0.0 192.1 186.5 181.8 178.2 176.1 176.8 178.2 179.4 179.7 180.4 183.3 184.4 182.7 179.0 176.1 174.6 168.7 166.5 167.2 168.8 171.8 175.6 185.4 194.7 197.2 192.6 181.9 0.0 0.0 0.0 192.6 204.8 217.5 220.0 220.4 218.7 216.0 214.2 216.0 216.8 213.0 200.5 183.9 0.0 0.0 158.5 156.8 159.6 161.4 160.7 159.4 159.5 159.3 157.1 152.4 152.1 156.4 162.9 167.3 166.4 160.3 149.6 146.1 149.5 155.4 161.3 164.1 163.3 161.3 154.8 148.5 144.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 99.4 97.6 102.0 115.3 131.4 142.5 150.1 155.1 159.7 162.5 161.1 152.3 0.0 0.0 0.0 0.0 174.2 183.7 189.8 191.6 190.4 189.1 189.5 190.4 191.9 192.1 192.5 196.2 203.3 212.2 214.1 216.7 217.4 217.2 216.8 216.4 214.9 214.7 216.3 218.9 220.4 221.5 224.6 231.2 237.9 242.2 242.6 240.3 239.0 240.4 241.7 242.3 238.4 184.9 178.6 194.1 203.6 194.7 185.4 180.5 181.9 185.4 188.7 191.6 193.9 194.6 192.2 191.0 189.5 186.7 181.0 0.0 0.0 219.2 217.8 216.8 215.3 213.3 212.0 213.6 215.2 215.0 214.7 215.8 218.2 221.0 224.4 229.9 237.5 244.9 246.6 246.6 248.0 250.2 249.8 193.6 186.3 193.0 0.0 0.0 0.0 288.3 289.9 287.8 287.2 287.7 285.0 283.4 281.9 280.1 283.4 287.8 290.9 291.6 287.6 240.0 238.5 0.0 333.3 328.9 325.3 323.1 321.6 317.4 298.0 274.9 0.0 0.0 0.0 0.0 0.0 0.0 243.7 248.7 245.5 243.3 243.8 246.2 244.5 226.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 197.1 189.0 184.3 182.5 184.8 188.1 188.8 190.3 190.4 189.5 188.7 188.0 189.8 188.2 185.0 182.2 179.7 178.1 178.7 188.8 208.5 217.2 220.0 218.6 212.8 191.9 167.3 0.0 0.0 0.0 183.9 201.1 210.5 210.3 209.5 206.7 206.5 212.0 221.1 235.9 250.8 253.0 247.7 238.4 229.7 229.1 235.1 244.7 251.7 253.5 251.4 246.7 242.4 234.6 209.5 180.1 173.1 0.0 0.0 0.0 147.9 147.5 155.4 159.1 160.2 161.4 162.0 161.7 159.0 152.4 139.6 121.5 126.4 0.0 0.0 197.2 200.2 196.8 195.8 200.3 203.2 203.8 202.5 203.7 212.8 220.0 225.8 231.6 234.1 231.7 228.3 225.4 226.0 229.7 234.3 237.3 238.2 236.9 232.9 227.1 220.3 215.1 211.3 205.9 210.1 216.0 217.6 218.4 218.9 219.3 218.5 217.1 216.9 217.7 216.4 212.1 196.3 171.7 171.3 
210.9 203.0 194.9 194.8 199.1 205.4 210.5 214.9 219.6 224.8 225.1 221.4 212.4 198.5 0.0 0.0 204.9 204.1 208.9 212.7 212.8 213.2 214.9 214.4 208.4 189.0 159.7 160.4 193.1 198.9 196.0 192.8 193.0 195.3 195.2 194.5 193.6 193.9 192.5 192.3 192.2 183.0 164.8 150.0 147.3 150.7 155.9 160.8 163.6 164.8 161.8 156.6 155.0 160.5 166.2 167.3 165.1 162.0 154.3 142.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
15
+ }
16
+ ]
example/audio/yue_target.mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a699c2649eec48ed1e9a6caae2af918bf7d49e5e4ad39cf3cca0916942bc7db2
3
+ size 353361
example/audio/zh_prompt.json ADDED
@@ -0,0 +1,16 @@
1
+ [
2
+ {
3
+ "index": "vocal_320_10687",
4
+ "language": "Mandarin",
5
+ "time": [
6
+ 320,
7
+ 10687
8
+ ],
9
+ "duration": "0.23 0.34 0.26 0.70 0.52 0.46 0.36 0.44 0.14 0.24 0.64 0.47 0.51 1.10 0.28 0.38 0.32 0.32 0.38 0.32 0.31 0.19 1.45",
10
+ "text": "<SP> 除 了 想 你 你 <SP> 除 了 了 爱 你 你 <SP> 我 什 么 什 么 都 愿 愿 意",
11
+ "phoneme": "<SP> zh_chu2 zh_le5 zh_xiang3 zh_ni3 zh_ni3 <SP> zh_chu2 zh_le5 zh_le5 zh_ai4 zh_ni3 zh_ni3 <SP> zh_wo3 zh_shen2 zh_me5 zh_shen2 zh_me5 zh_dou1 zh_yuan4 zh_yuan4 zh_yi4",
12
+ "note_pitch": "0 62 65 67 67 69 0 67 69 67 65 67 69 67 67 66 64 64 60 60 65 67 0",
13
+ "note_type": "1 2 2 2 2 3 1 2 2 3 2 2 3 1 2 2 2 2 2 2 2 3 2",
14
+ "f0": "0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 294.1 288.2 290.6 294.6 295.7 292.8 291.5 294.4 295.4 294.7 293.5 292.2 294.4 295.8 293.2 297.7 320.1 338.1 348.3 348.7 344.4 342.6 346.2 354.8 356.9 353.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 401.1 403.8 405.0 402.1 398.4 395.6 393.3 392.0 391.4 390.3 390.2 390.3 390.3 391.3 392.8 393.6 391.7 390.5 391.4 391.6 391.5 393.5 393.8 390.8 387.4 387.8 389.3 390.8 392.2 391.6 390.2 389.8 389.1 388.4 390.0 395.5 397.2 396.7 395.5 395.1 394.6 394.9 395.6 395.2 394.6 395.4 395.9 394.0 391.7 390.7 391.7 392.6 391.6 395.7 405.8 441.7 462.5 463.8 450.2 430.9 414.5 415.2 426.7 439.8 454.4 462.9 447.8 422.5 400.6 403.2 423.5 451.3 482.3 492.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 435.9 414.3 406.2 402.8 398.2 396.7 396.7 395.2 398.3 416.9 441.6 446.9 441.5 437.3 435.3 432.9 431.6 432.3 434.8 436.7 433.8 422.8 406.7 390.9 382.0 381.7 384.7 382.7 368.1 357.4 355.6 355.1 352.4 348.1 346.4 348.7 351.6 355.0 354.4 351.7 349.9 349.2 348.1 346.0 345.4 344.2 344.4 345.5 346.6 349.0 349.7 349.1 349.5 349.6 349.7 349.4 349.5 352.2 354.5 355.7 355.6 356.9 359.4 361.5 363.8 360.4 354.3 357.2 363.9 372.4 382.6 399.1 402.9 400.6 395.5 390.5 388.9 390.2 391.1 391.9 391.5 390.4 390.2 391.3 391.6 391.0 388.6 386.2 389.1 403.8 430.2 441.8 449.5 448.1 443.2 438.3 432.9 430.7 434.6 442.0 447.6 446.0 440.3 434.7 431.1 435.7 442.3 445.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 423.8 402.1 398.5 397.6 393.4 390.5 391.4 402.5 427.6 442.5 442.5 435.1 430.1 430.8 439.9 447.4 442.2 426.1 412.3 399.8 391.8 389.5 388.5 387.8 386.4 384.7 384.5 387.9 391.2 391.8 392.9 393.8 392.0 392.0 395.4 398.1 398.2 396.3 393.1 391.1 388.9 386.6 383.1 381.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 357.8 367.6 370.9 365.4 359.3 355.7 358.1 372.7 396.8 404.0 398.4 392.6 389.2 388.8 383.6 362.3 341.1 325.1 326.3 327.9 331.3 333.3 326.0 319.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 356.4 353.5 349.0 346.5 338.4 328.4 323.7 326.2 334.2 338.4 331.6 309.8 280.6 256.4 252.8 256.1 258.3 256.9 257.6 261.1 259.7 258.7 258.6 260.2 262.2 262.4 262.9 264.2 263.4 260.2 256.4 253.2 238.7 223.3 231.6 257.7 258.1 258.4 258.8 258.0 256.8 255.8 254.0 255.1 258.1 261.9 263.7 262.6 256.7 253.1 250.2 246.7 258.7 294.4 327.1 342.6 346.0 344.1 341.2 342.3 345.2 350.1 364.7 382.5 396.1 396.1 389.2 381.8 381.1 387.2 397.0 399.3 390.7 374.3 360.1 350.6 346.6 347.4 350.7 354.4 354.3 351.7 349.8 348.4 346.9 347.0 348.4 349.5 351.0 352.3 353.6 353.3 350.5 348.3 345.5 344.3 344.4 347.0 350.6 352.0 351.0 350.8 350.1 347.6 345.7 347.0 350.3 351.7 350.7 348.5 346.9 347.7 349.0 349.1 348.5 346.8 346.3 348.4 349.0 349.2 351.1 349.6 348.3 350.5 351.1 348.0 347.6 349.1 351.3 356.0 361.3 360.6 354.0 341.0 316.2 302.9 302.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
15
+ }
16
+ ]
example/audio/zh_prompt.mp3 ADDED
Binary file (86.1 kB)
example/audio/zh_target.json ADDED
@@ -0,0 +1,16 @@
1
+ [
2
+ {
3
+ "index": "vocal_0_6710",
4
+ "language": "Mandarin",
5
+ "time": [
6
+ 0,
7
+ 6710
8
+ ],
9
+ "duration": "0.13 0.26 0.24 0.22 0.24 0.33 0.13 0.24 0.22 0.46 0.69 0.84 0.26 0.30 0.16 0.26 0.26 0.20 0.32 0.94",
10
+ "text": "<SP> 像 我 这 样 懦 懦 弱 的 人 人 <SP> 凡 事 都 要 留 留 几 分",
11
+ "phoneme": "<SP> zh_xiang4 zh_wo3 zh_zhe4 zh_yang4 zh_nuo4 zh_nuo4 zh_ruo4 zh_de5 zh_ren2 zh_ren2 <SP> zh_fan2 zh_shi4 zh_dou1 zh_yao4 zh_liu2 zh_liu2 zh_ji3 zh_fen1",
12
+ "note_pitch": "0 50 53 55 53 56 54 53 50 51 53 0 51 53 55 53 54 56 51 53",
13
+ "note_type": "1 2 2 2 2 2 3 2 2 2 3 1 2 2 2 2 2 3 2 2",
14
+ "f0": "0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 132.7 137.2 144.9 147.0 148.0 148.6 148.9 148.5 147.9 147.4 148.1 149.1 154.3 166.4 173.1 175.0 176.3 175.9 172.7 173.6 175.9 172.9 159.1 165.8 0.0 214.2 213.4 210.2 201.5 198.1 197.1 197.2 197.8 200.8 206.4 206.1 200.5 189.9 180.6 172.0 170.8 171.4 176.2 180.9 182.3 182.2 181.0 180.5 183.4 192.6 211.8 220.6 223.7 219.3 211.9 207.0 203.6 202.6 204.2 204.4 204.1 202.1 198.0 192.9 185.9 177.9 174.1 174.6 174.5 173.8 173.6 172.4 168.3 168.3 172.7 173.2 171.9 170.7 170.2 169.9 170.6 173.1 172.4 164.2 148.2 147.8 152.3 148.5 143.8 145.6 149.2 149.9 150.1 152.5 153.6 154.7 156.1 155.0 152.4 152.1 153.7 155.3 156.4 156.8 157.3 157.7 157.1 156.8 157.8 158.9 157.9 157.5 157.1 157.0 159.1 162.0 167.7 172.1 174.9 176.2 174.5 172.0 170.9 171.3 172.5 173.1 173.5 173.1 174.1 174.6 175.2 176.7 177.2 177.3 176.9 175.9 174.1 172.4 174.1 174.8 171.8 172.1 176.5 177.3 176.0 179.4 179.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 137.5 138.5 145.8 151.0 153.3 154.3 156.2 158.5 161.0 158.1 148.6 151.3 157.0 163.5 172.7 176.4 176.1 175.8 175.6 174.7 171.4 169.3 161.7 194.2 199.3 199.7 201.0 202.7 201.4 199.4 198.5 196.6 194.0 190.6 186.1 181.9 179.4 177.4 177.1 176.3 175.8 175.5 174.8 173.6 172.4 170.8 165.0 160.6 175.9 179.3 179.8 180.4 180.3 178.8 178.6 181.5 185.0 190.7 198.3 206.3 210.6 210.9 207.4 203.5 203.4 204.6 203.5 195.6 182.5 0.0 0.0 0.0 0.0 144.7 144.1 146.6 150.2 151.9 153.4 155.0 156.0 155.9 155.4 155.2 153.5 147.0 144.8 0.0 0.0 0.0 0.0 0.0 0.0 181.1 178.9 178.0 177.5 176.0 173.2 172.9 172.9 174.1 176.2 177.3 178.7 178.8 176.0 175.2 175.1 176.3 178.1 177.6 177.4 177.9 177.6 177.3 177.3 177.5 176.6 175.7 176.6 177.5 177.2 175.9 174.8 173.5 174.0 175.7 177.4 177.8 174.7"
15
+ }
16
+ ]
example/audio/zh_target.mp3 ADDED
Binary file (54.2 kB)
example/infer.sh ADDED
@@ -0,0 +1,28 @@
1
+ #!/bin/bash
2
+
3
+ script_dir=$(dirname "$(realpath "$0")")
4
+ root_dir=$(dirname "$script_dir")
5
+
6
+ cd "$root_dir" || exit
7
+ export PYTHONPATH=$root_dir:$PYTHONPATH
8
+
9
+ model_path=pretrained_models/SoulX-Singer/model.pt
10
+ config=soulxsinger/config/soulxsinger.yaml
11
+ prompt_wav_path=example/audio/zh_prompt.mp3
12
+ prompt_metadata_path=example/audio/zh_prompt.json
13
+ target_metadata_path=example/audio/music.json
14
+ phoneset_path=soulxsinger/utils/phoneme/phone_set.json
15
+ save_dir=example/generated/music
16
+ control=score # melody or score
17
+
18
+ python -m cli.inference \
19
+ --device cuda \
20
+ --model_path $model_path \
21
+ --config $config \
22
+ --prompt_wav_path $prompt_wav_path \
23
+ --prompt_metadata_path $prompt_metadata_path \
24
+ --target_metadata_path $target_metadata_path \
25
+ --phoneset_path $phoneset_path \
26
+ --save_dir $save_dir \
27
+ --auto_shift \
28
+ --pitch_shift 0
example/preprocess.sh ADDED
@@ -0,0 +1,41 @@
1
+ #!/bin/bash
2
+
3
+ script_dir=$(dirname "$(realpath "$0")")
4
+ root_dir=$(dirname "$script_dir")
5
+
6
+ cd "$root_dir" || exit
7
+ export PYTHONPATH=$root_dir:$PYTHONPATH
8
+
9
+ device=cuda
10
+
11
+
12
+ ####### Run Prompt Annotation #######
13
+ audio_path=example/audio/zh_prompt.mp3
14
+ save_dir=example/transcriptions/zh_prompt
15
+ language=Mandarin
16
+ vocal_sep=False
17
+ max_merge_duration=30000
18
+
19
+ python -m preprocess.pipeline \
20
+ --audio_path $audio_path \
21
+ --save_dir $save_dir \
22
+ --language $language \
23
+ --device $device \
24
+ --vocal_sep $vocal_sep \
25
+ --max_merge_duration $max_merge_duration
26
+
27
+
28
+ ####### Run Target Annotation #######
29
+ audio_path=example/audio/music.mp3
30
+ save_dir=example/transcriptions/music
31
+ language=Mandarin
32
+ vocal_sep=True
33
+ max_merge_duration=60000
34
+
35
+ python -m preprocess.pipeline \
36
+ --audio_path $audio_path \
37
+ --save_dir $save_dir \
38
+ --language $language \
39
+ --device $device \
40
+ --vocal_sep $vocal_sep \
41
+ --max_merge_duration $max_merge_duration
preprocess/README.md ADDED
@@ -0,0 +1,155 @@
1
+ # 🎵 SoulX-Singer-Preprocess
2
+
3
+ This module offers a comprehensive **singing transcription and editing toolkit** for real-world music audio, covering the pipeline from vocal extraction to high-level annotation and optimized for SVS dataset construction. By integrating state-of-the-art models, it transforms raw audio into structured singing data and supports the **customizable creation and editing of lyric-aligned MIDI scores**.
4
+
5
+
6
+ ## ✨ Features
7
+
8
+ The toolkit includes the following core modules:
9
+
10
+ - 🎤 **Clean Dry Vocal Extraction**
11
+ Extracts the lead vocal track from polyphonic music audio and applies dereverberation.
12
+
13
+ - 📝 **Lyrics Transcription**
14
+ Automatically transcribes lyrics from the clean vocal track.
15
+
16
+ - 🎶 **Note Transcription**
17
+ Converts singing voice into note-level representations for SVS.
18
+
19
+ - 🎼 **MIDI Editor**
20
+ Supports customizable creation and editing of MIDI scores integrated with lyrics.
21
+
22
+
23
+ ## 🔧 Python Environment
24
+
25
+ Before running the pipeline, set up the Python environment as follows:
26
+
27
+ 1. **Install Conda** (if not already installed): https://docs.conda.io/en/latest/miniconda.html
28
+
29
+ 2. **Activate or create a conda environment** (recommended Python 3.10):
30
+
31
+ - If you already have the `soulxsinger` environment:
32
+
33
+ ```bash
34
+ conda activate soulxsinger
35
+ ```
36
+
37
+ - Otherwise, create it first:
38
+
39
+ ```bash
40
+ conda create -n soulxsinger -y python=3.10
41
+ conda activate soulxsinger
42
+ ```
43
+
44
+ 3. **Install dependencies** from the `preprocess` directory:
45
+
46
+ ```bash
47
+ cd preprocess
48
+ pip install -r requirements.txt
49
+ ```
50
+
51
+ ## 📁 Data Preparation
52
+
53
+ Before running the pipeline, prepare the following inputs:
54
+
55
+ - **Prompt audio**
56
+ Reference audio that provides the target timbre and singing style.
57
+
58
+ - **Target audio**
59
+ Original vocal or music audio to be processed and transcribed.
60
+
61
+ Configure the corresponding parameters in:
62
+
63
+ ```
64
+ example/preprocess.sh
65
+ ```
66
+
67
+ Typical configuration includes:
68
+ - Input / output paths
69
+ - Module enable switches
70
+
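+ For reference, the corresponding variables in `example/preprocess.sh` look like this (the values below are the shipped example defaults):
+
+ ```bash
+ audio_path=example/audio/music.mp3      # input audio to transcribe
+ save_dir=example/transcriptions/music   # output directory for intermediate results
+ language=Mandarin                       # Mandarin, Cantonese, or English
+ vocal_sep=True                          # separate vocals first (set False for clean dry vocals)
+ max_merge_duration=60000                # maximum merged segment length in milliseconds
+ ```
+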
71
+ ## 🚀 Usage
72
+
73
+ After configuring `preprocess.sh`, run the transcription pipeline with:
74
+
75
+ ```bash
76
+ bash example/preprocess.sh
77
+ ```
78
+
79
+ The script will automatically execute the following steps:
80
+
81
+ 1. **Vocal separation and dereverberation**
82
+ 2. **F0 extraction and voice activity detection (VAD)**
83
+ 3. **Lyrics transcription**
84
+ 4. **Note transcription**
85
+
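+ The same pipeline can also be driven from Python; a minimal sketch mirroring the `main()` entry point of `preprocess/pipeline.py`:
+
+ ```python
+ from preprocess.pipeline import PreprocessPipeline
+
+ pipeline = PreprocessPipeline(
+     device="cuda",
+     language="Mandarin",
+     save_dir="example/transcriptions/music",
+     vocal_sep=True,
+     max_merge_duration=60000,
+ )
+ pipeline.run(audio_path="example/audio/music.mp3")
+ ```
+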
86
+ ---
87
+
88
+ After the pipeline completes, you will obtain **SoulX-Singer–style metadata** that can be directly used for Singing Voice Synthesis (SVS).
89
+
90
+ **Output paths:**
91
+ - The final metadata (**JSON file**) is written **in the same directory as your input audio**, with the **same filename** (e.g. `audio.mp3` → `audio.json`)
92
+ - All **intermediate results** (separated vocal and accompaniment, F0, VAD outputs, etc.) are also saved under the configured **`save_dir`**.
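+
+ Each metadata entry is a JSON object with word-level annotations; a minimal sketch for inspecting one (field names follow the shipped examples under `example/audio/`):
+
+ ```python
+ import json
+
+ with open("example/audio/music.json", encoding="utf-8") as f:
+     metadata = json.load(f)
+
+ seg = metadata[0]
+ print(seg["language"], seg["time"])  # language and [start_ms, end_ms] of the segment
+ print(seg["text"])                   # word-level lyrics; "<SP>" marks silence
+ print(seg["note_pitch"])             # per-word MIDI note numbers (0 = rest)
+ ```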
93
+
94
+ ⚠️ **Important Note**
95
+
96
+ Transcription errors—especially in **lyrics** and **note annotations**—can significantly affect the final SVS quality. We **strongly recommend manually reviewing and correcting** the generated metadata before inference.
97
+
98
+ To support this, we provide a **MIDI Editor** for editing lyrics, phoneme alignment, note pitches, and durations. The workflow is:
99
+
100
+ **Export metadata to MIDI** → edit in the MIDI Editor → **Import edited MIDI back to metadata** for SVS.
101
+
102
+ ---
103
+
104
+ #### Step 1: Metadata → MIDI (for editing)
105
+
106
+ Convert SoulX-Singer metadata to a MIDI file so you can open it in the MIDI Editor:
107
+
108
+ ```bash
109
+ preprocess_root=example/transcriptions/music
110
+
111
+ python -m preprocess.tools.midi_parser \
112
+ --meta2midi \
113
+ --meta "${preprocess_root}/metadata.json" \
114
+ --midi "${preprocess_root}/vocal.mid"
115
+ ```
116
+
117
+ #### Step 2: Edit in the MIDI Editor
118
+
119
+ Open the MIDI Editor (see [MIDI Editor Tutorial](tools/midi_editor/README.md)), load `vocal.mid`, and correct lyrics, pitches, or durations as needed. Save the result as e.g. `vocal_edited.mid`.
120
+
121
+ #### Step 3: MIDI → Metadata (for SoulX-Singer inference)
122
+
123
+ Convert the edited MIDI back into SoulX-Singer-style metadata (and cut wavs) for SVS:
124
+
125
+ ```bash
126
+ python -m preprocess.tools.midi_parser \
127
+ --midi2meta \
128
+ --midi "${preprocess_root}/vocal_edited.mid" \
129
+ --meta "${preprocess_root}/edit_metadata.json" \
130
+ --vocal "${preprocess_root}/vocal.wav" \
131
+ ```
132
+
133
+ Use `edit_metadata.json` (and the wavs under `edit_cut_wavs`) as the target metadata in your inference pipeline.
134
+
135
+
136
+ ## 🔗 References & Dependencies
137
+
138
+ This project builds upon the following excellent open-source works:
139
+
140
+ ### 🎧 Vocal Separation & Dereverberation
141
+ - [Music Source Separation Training](https://github.com/ZFTurbo/Music-Source-Separation-Training)
142
+ - [Lead Vocal Separation](https://huggingface.co/becruily/mel-band-roformer-karaoke)
143
+ - [Vocal Dereverberation](https://huggingface.co/anvuew/dereverb_mel_band_roformer)
144
+
145
+ ### 🎼 F0 Extraction
146
+ - [RMVPE](https://github.com/Dream-High/RMVPE)
147
+
148
+ ### 📝 Lyrics Transcription (ASR)
149
+ - [Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch)
150
+ - [Parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)
151
+
152
+ ### 🎶 Note Transcription
153
+ - [ROSVOT](https://github.com/RickyL-2000/ROSVOT)
154
+
155
+ We sincerely thank the authors of these repositories for their exceptional open-source contributions, which have been fundamental to the development of this toolkit.
preprocess/pipeline.py ADDED
@@ -0,0 +1,146 @@
1
+ import json
2
+ import shutil
3
+ import soundfile as sf
4
+ from pathlib import Path
5
+ import librosa
6
+
7
+ from preprocess.utils import convert_metadata, merge_short_segments
8
+
9
+ from preprocess.tools import (
10
+ F0Extractor,
11
+ VocalDetector,
12
+ VocalSeparator,
13
+ NoteTranscriber,
14
+ LyricTranscriber,
15
+ )
16
+
17
+
18
+ class PreprocessPipeline:
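+ # End-to-end flow: (optional) vocal separation -> f0 -> VAD segmentation -> lyrics -> notes -> merge & export metadata.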
19
+ def __init__(self, device: str, language: str, save_dir: str, vocal_sep: bool = True, max_merge_duration: int = 60000):
20
+ self.device = device
21
+ self.language = language
22
+ self.save_dir = save_dir
23
+ self.vocal_sep = vocal_sep
24
+ self.max_merge_duration = max_merge_duration
25
+
26
+ if vocal_sep:
27
+ self.vocal_separator = VocalSeparator(
28
+ sep_model_path="pretrained_models/SoulX-Singer-Preprocess/mel-band-roformer-karaoke/mel_band_roformer_karaoke_becruily.ckpt",
29
+ sep_config_path="pretrained_models/SoulX-Singer-Preprocess/mel-band-roformer-karaoke/config_karaoke_becruily.yaml",
30
+ der_model_path="pretrained_models/SoulX-Singer-Preprocess/dereverb_mel_band_roformer/dereverb_mel_band_roformer_anvuew_sdr_19.1729.ckpt",
31
+ der_config_path="pretrained_models/SoulX-Singer-Preprocess/dereverb_mel_band_roformer/dereverb_mel_band_roformer_anvuew.yaml",
32
+ device=device
33
+ )
34
+ else:
35
+ self.vocal_separator = None
36
+ self.f0_extractor = F0Extractor(
37
+ model_path="pretrained_models/SoulX-Singer-Preprocess/rmvpe/rmvpe.pt",
38
+ device=device,
39
+ )
40
+ self.vocal_detector = VocalDetector(
41
+ cut_wavs_output_dir= f"{save_dir}/cut_wavs",
42
+ )
43
+ self.lyric_transcriber = LyricTranscriber(
44
+ zh_model_path="pretrained_models/SoulX-Singer-Preprocess/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
45
+ en_model_path="pretrained_models/SoulX-Singer-Preprocess/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2.nemo",
46
+ device=device
47
+ )
48
+ self.note_transcriber = NoteTranscriber(
49
+ rosvot_model_path="pretrained_models/SoulX-Singer-Preprocess/rosvot/rosvot/model.pt",
50
+ rwbd_model_path="pretrained_models/SoulX-Singer-Preprocess/rosvot/rwbd/model.pt",
51
+ device=device
52
+ )
53
+
54
+ def run(
55
+ self,
56
+ audio_path: str,
57
+ vocal_sep: bool | None = None,
58
+ max_merge_duration: int | None = None,
59
+ language: str | None = None
60
+ ) -> None:
61
+ vocal_sep = self.vocal_sep if vocal_sep is None else vocal_sep
62
+ max_merge_duration = self.max_merge_duration if max_merge_duration is None else max_merge_duration
63
+ language = self.language if language is None else language
64
+ output_dir = Path(self.save_dir)
65
+ output_dir.mkdir(parents=True, exist_ok=True)
66
+
67
+ if vocal_sep:
68
+ # Perform vocal/accompaniment separation
69
+ sep = self.vocal_separator.process(audio_path)
70
+ vocal = sep.vocals_dereverbed.T
71
+ acc = sep.accompaniment.T
72
+ sample_rate = sep.sample_rate
73
+
74
+ vocal_path = output_dir / "vocal.wav"
75
+ acc_path = output_dir / "acc.wav"
76
+ sf.write(vocal_path, vocal, sample_rate)
77
+ sf.write(acc_path, acc, sample_rate)
78
+ else:
79
+ # Use the original audio as vocal source (no separation)
80
+ vocal, sample_rate = librosa.load(audio_path, sr=None, mono=True)
81
+ vocal_path = output_dir / "vocal.wav"
82
+ sf.write(vocal_path, vocal, sample_rate)
83
+
84
+ vocal_f0 = self.f0_extractor.process(str(vocal_path))
85
+ segments = self.vocal_detector.process(str(vocal_path), f0=vocal_f0)
86
+
87
+ metadata = []
88
+ for seg in segments:
89
+ self.f0_extractor.process(seg["wav_fn"], f0_path=seg["wav_fn"].replace(".wav", "_f0.npy"))
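+ # The saved *_f0.npy sits next to the cut wav; LyricTranscriber loads it to refine word/silence boundaries.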
90
+ words, durs = self.lyric_transcriber.process(
91
+ seg["wav_fn"], language
92
+ )
93
+ seg["words"] = words
94
+ seg["word_durs"] = durs
95
+ seg["language"] = language
96
+ metadata.append(
97
+ self.note_transcriber.process(seg, segment_info=seg)
98
+ )
99
+
100
+ merged = merge_short_segments(
101
+ vocal,
102
+ sample_rate,
103
+ metadata,
104
+ output_dir / "long_cut_wavs",
105
+ max_duration_ms=max_merge_duration,
106
+ )
107
+
108
+ final_metadata = []
109
+
110
+ for item in merged:
111
+ self.f0_extractor.process(item.wav_fn, f0_path=item.wav_fn.replace(".wav", "_f0.npy"))
112
+ final_metadata.append(convert_metadata(item))
113
+
114
+ with open(output_dir / "metadata.json", "w", encoding="utf-8") as f:
115
+ json.dump(final_metadata, f, ensure_ascii=False, indent=2)
116
+
117
+ shutil.copy(output_dir / "metadata.json", audio_path.replace(".wav", ".json").replace(".mp3", ".json").replace(".flac", ".json"))
118
+
119
+
120
+ def main(args):
121
+ pipeline = PreprocessPipeline(
122
+ device=args.device,
123
+ language=args.language,
124
+ save_dir=args.save_dir,
125
+ vocal_sep=args.vocal_sep,
126
+ max_merge_duration=args.max_merge_duration,
127
+ )
128
+ pipeline.run(
129
+ audio_path=args.audio_path,
130
+ language=args.language
131
+ )
132
+
133
+
134
+ if __name__ == "__main__":
135
+ import argparse
136
+
137
+ parser = argparse.ArgumentParser()
138
+ parser.add_argument("--audio_path", type=str, required=True, help="Path to the input audio file")
139
+ parser.add_argument("--save_dir", type=str, required=True, help="Directory to save the output files")
140
+ parser.add_argument("--language", type=str, default="Mandarin", help="Language of the audio")
141
+ parser.add_argument("--device", type=str, default="cuda:0", help="Device to run the models on")
142
+ parser.add_argument("--vocal_sep", type=bool, default=True, help="Whether to perform vocal separation")
143
+ parser.add_argument("--max_merge_duration", type=int, default=60000, help="Maximum merged segment duration in milliseconds")
144
+ args = parser.parse_args()
145
+
146
+ main(args)
preprocess/requirements.txt ADDED
@@ -0,0 +1,33 @@
1
+ beartype==0.22.9
2
+ einops==0.8.2
3
+ funasr==1.3.0
4
+ g2p_en==2.1.0
5
+ g2pM==0.1.2.5
6
+ librosa==0.11.0
7
+ loralib==0.1.2
8
+ matplotlib==3.10.8
9
+ mido==1.3.3
10
+ ml_collections==1.1.0
11
+ nemo_toolkit==2.6.1
12
+ nltk==3.9.2
13
+ numba==0.63.1
14
+ numpy==2.2.6
15
+ omegaconf==2.3.0
16
+ packaging==24.2
17
+ praat-parselmouth==0.4.7
18
+ pretty_midi==0.2.11
19
+ pyloudnorm==0.2.0
20
+ pyworld==0.3.5
21
+ rotary_embedding_torch==0.8.9
22
+ sageattention==1.0.6
23
+ scikit_learn==1.7.2
24
+ scipy==1.15.3
25
+ six==1.17.0
26
+ scikit_image==0.25.2
27
+ soundfile==0.13.1
28
+ ToJyutping==3.2.0
29
+ torch==2.10.0
30
+ torchaudio==2.10.0
31
+ tqdm==4.67.1
32
+ wandb==0.24.2
33
+ webrtcvad==2.0.10
preprocess/tools/__init__.py ADDED
@@ -0,0 +1,53 @@
1
+ """Preprocess tools.
2
+
3
+ This package provides a thin, stable import surface for common preprocess components.
4
+
5
+ Examples:
6
+ from preprocess.tools import (
7
+ F0Extractor,
8
+ VocalDetector,
9
+ VocalSeparator,
10
+ NoteTranscriber,
11
+ LyricTranscriber,
12
+ )
15
+
16
+ Note:
17
+ Keep these imports lightweight. If a tool pulls heavy dependencies at import time,
18
+ consider switching to lazy imports.
19
+ """
20
+
21
+ from __future__ import annotations
22
+
23
+ # Core tools
24
+ from .f0_extraction import F0Extractor
25
+ from .vocal_detection import VocalDetector
26
+
27
+ # Some tools may live outside this package in different layouts across branches.
28
+ # Keep the public surface stable while avoiding hard import failures.
29
+ try:
30
+ from .vocal_separation.model import VocalSeparator # type: ignore
31
+ except Exception: # pragma: no cover
32
+ VocalSeparator = None # type: ignore
33
+
34
+ try:
35
+ from .note_transcription.model import NoteTranscriber # type: ignore
36
+ except Exception: # pragma: no cover
37
+ NoteTranscriber = None # type: ignore
38
+ try:
39
+ from .lyric_transcription import LyricTranscriber
40
+ except Exception: # pragma: no cover
41
+ LyricTranscriber = None # type: ignore
42
+
43
+ __all__ = [
44
+ "F0Extractor",
45
+ "VocalDetector",
46
+ ]
47
+
48
+ if VocalSeparator is not None:
49
+ __all__.append("VocalSeparator")
50
+ if LyricTranscriber is not None:
51
+ __all__.append("LyricTranscriber")
52
+ if NoteTranscriber is not None:
53
+ __all__.append("NoteTranscriber")
preprocess/tools/f0_extraction.py ADDED
@@ -0,0 +1,527 @@
1
+ # https://github.com/Dream-High/RMVPE
2
+ import math
3
+ import time
4
+ import librosa
5
+ import numpy as np
6
+ from librosa.filters import mel
7
+ from scipy.interpolate import interp1d
8
+
9
+ from typing import Optional
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ import torch.nn.functional as F
14
+
15
+
16
+ class BiGRU(nn.Module):
17
+ def __init__(self, input_features, hidden_features, num_layers):
18
+ super(BiGRU, self).__init__()
19
+ self.gru = nn.GRU(
20
+ input_features,
21
+ hidden_features,
22
+ num_layers=num_layers,
23
+ batch_first=True,
24
+ bidirectional=True,
25
+ )
26
+
27
+ def forward(self, x):
28
+ return self.gru(x)[0]
29
+
30
+
31
+ class ConvBlockRes(nn.Module):
32
+ def __init__(self, in_channels, out_channels, momentum=0.01):
33
+ super(ConvBlockRes, self).__init__()
34
+ self.conv = nn.Sequential(
35
+ nn.Conv2d(
36
+ in_channels=in_channels,
37
+ out_channels=out_channels,
38
+ kernel_size=(3, 3),
39
+ stride=(1, 1),
40
+ padding=(1, 1),
41
+ bias=False,
42
+ ),
43
+ nn.BatchNorm2d(out_channels, momentum=momentum),
44
+ nn.ReLU(),
45
+ nn.Conv2d(
46
+ in_channels=out_channels,
47
+ out_channels=out_channels,
48
+ kernel_size=(3, 3),
49
+ stride=(1, 1),
50
+ padding=(1, 1),
51
+ bias=False,
52
+ ),
53
+ nn.BatchNorm2d(out_channels, momentum=momentum),
54
+ nn.ReLU(),
55
+ )
56
+ if in_channels != out_channels:
57
+ self.shortcut = nn.Conv2d(in_channels, out_channels, (1, 1))
58
+
59
+ def forward(self, x):
60
+ if not hasattr(self, "shortcut"):
61
+ return self.conv(x) + x
62
+ else:
63
+ return self.conv(x) + self.shortcut(x)
64
+
65
+
66
+ class ResEncoderBlock(nn.Module):
67
+ def __init__(self, in_channels, out_channels, kernel_size, n_blocks=1, momentum=0.01):
68
+ super(ResEncoderBlock, self).__init__()
69
+ self.n_blocks = n_blocks
70
+ self.conv = nn.ModuleList()
71
+ self.conv.append(ConvBlockRes(in_channels, out_channels, momentum))
72
+ for i in range(n_blocks - 1):
73
+ self.conv.append(ConvBlockRes(out_channels, out_channels, momentum))
74
+ self.kernel_size = kernel_size
75
+ if self.kernel_size is not None:
76
+ self.pool = nn.AvgPool2d(kernel_size=kernel_size)
77
+
78
+ def forward(self, x):
79
+ for conv in self.conv:
80
+ x = conv(x)
81
+ if self.kernel_size is not None:
82
+ return x, self.pool(x)
83
+ else:
84
+ return x
85
+
86
+
87
+ class Encoder(nn.Module):
88
+ def __init__(self, in_channels, in_size, n_encoders, kernel_size, n_blocks, out_channels=16, momentum=0.01):
89
+ super(Encoder, self).__init__()
90
+ self.n_encoders = n_encoders
91
+ self.bn = nn.BatchNorm2d(in_channels, momentum=momentum)
92
+ self.layers = nn.ModuleList()
93
+ self.latent_channels = []
94
+ for i in range(self.n_encoders):
95
+ self.layers.append(
96
+ ResEncoderBlock(in_channels, out_channels, kernel_size, n_blocks, momentum=momentum)
97
+ )
98
+ self.latent_channels.append([out_channels, in_size])
99
+ in_channels = out_channels
100
+ out_channels *= 2
101
+ in_size //= 2
102
+ self.out_size = in_size
103
+ self.out_channel = out_channels
104
+
105
+ def forward(self, x):
106
+ concat_tensors = []
107
+ x = self.bn(x)
108
+ for layer in self.layers:
109
+ t, x = layer(x)
110
+ concat_tensors.append(t)
111
+ return x, concat_tensors
112
+
113
+
114
+ class Intermediate(nn.Module):
115
+ def __init__(self, in_channels, out_channels, n_inters, n_blocks, momentum=0.01):
116
+ super(Intermediate, self).__init__()
117
+ self.n_inters = n_inters
118
+ self.layers = nn.ModuleList()
119
+ self.layers.append(ResEncoderBlock(in_channels, out_channels, None, n_blocks, momentum))
120
+ for i in range(self.n_inters - 1):
121
+ self.layers.append(ResEncoderBlock(out_channels, out_channels, None, n_blocks, momentum))
122
+
123
+ def forward(self, x):
124
+ for layer in self.layers:
125
+ x = layer(x)
126
+ return x
127
+
128
+
129
+ class ResDecoderBlock(nn.Module):
130
+ def __init__(self, in_channels, out_channels, stride, n_blocks=1, momentum=0.01):
131
+ super(ResDecoderBlock, self).__init__()
132
+ out_padding = (0, 1) if stride == (1, 2) else (1, 1)
133
+ self.n_blocks = n_blocks
134
+ self.conv1 = nn.Sequential(
135
+ nn.ConvTranspose2d(
136
+ in_channels=in_channels,
137
+ out_channels=out_channels,
138
+ kernel_size=(3, 3),
139
+ stride=stride,
140
+ padding=(1, 1),
141
+ output_padding=out_padding,
142
+ bias=False,
143
+ ),
144
+ nn.BatchNorm2d(out_channels, momentum=momentum),
145
+ nn.ReLU(),
146
+ )
147
+ self.conv2 = nn.ModuleList()
148
+ self.conv2.append(ConvBlockRes(out_channels * 2, out_channels, momentum))
149
+ for i in range(n_blocks - 1):
150
+ self.conv2.append(ConvBlockRes(out_channels, out_channels, momentum))
151
+
152
+ def forward(self, x, concat_tensor):
153
+ x = self.conv1(x)
154
+ x = torch.cat((x, concat_tensor), dim=1)
155
+ for conv2 in self.conv2:
156
+ x = conv2(x)
157
+ return x
158
+
159
+
160
+ class Decoder(nn.Module):
161
+ def __init__(self, in_channels, n_decoders, stride, n_blocks, momentum=0.01):
162
+ super(Decoder, self).__init__()
163
+ self.layers = nn.ModuleList()
164
+ self.n_decoders = n_decoders
165
+ for i in range(self.n_decoders):
166
+ out_channels = in_channels // 2
167
+ self.layers.append(
168
+ ResDecoderBlock(in_channels, out_channels, stride, n_blocks, momentum)
169
+ )
170
+ in_channels = out_channels
171
+
172
+ def forward(self, x, concat_tensors):
173
+ for i, layer in enumerate(self.layers):
174
+ x = layer(x, concat_tensors[-1 - i])
175
+ return x
176
+
177
+
178
+ class DeepUnet(nn.Module):
179
+ def __init__(self, kernel_size, n_blocks, en_de_layers=5, inter_layers=4, in_channels=1, en_out_channels=16):
180
+ super(DeepUnet, self).__init__()
181
+ self.encoder = Encoder(in_channels, 128, en_de_layers, kernel_size, n_blocks, en_out_channels)
182
+ self.intermediate = Intermediate(
183
+ self.encoder.out_channel // 2,
184
+ self.encoder.out_channel,
185
+ inter_layers,
186
+ n_blocks,
187
+ )
188
+ self.decoder = Decoder(self.encoder.out_channel, en_de_layers, kernel_size, n_blocks)
189
+
190
+ def forward(self, x):
191
+ x, concat_tensors = self.encoder(x)
192
+ x = self.intermediate(x)
193
+ x = self.decoder(x, concat_tensors)
194
+ return x
195
+
196
+
197
+ class E2E(nn.Module):
198
+ def __init__(self, n_blocks, n_gru, kernel_size, en_de_layers=5, inter_layers=4, in_channels=1, en_out_channels=16):
199
+ super(E2E, self).__init__()
200
+ self.unet = DeepUnet(kernel_size, n_blocks, en_de_layers, inter_layers, in_channels, en_out_channels)
201
+ self.cnn = nn.Conv2d(en_out_channels, 3, (3, 3), padding=(1, 1))
202
+ if n_gru:
203
+ self.fc = nn.Sequential(
204
+ BiGRU(3 * 128, 256, n_gru),
205
+ nn.Linear(512, 360),
206
+ nn.Dropout(0.25),
207
+ nn.Sigmoid(),
208
+ )
209
+ else:
210
+ self.fc = nn.Sequential(
211
+ nn.Linear(3 * 128, 360),
212
+ nn.Dropout(0.25),
213
+ nn.Sigmoid()
214
+ )
215
+
216
+ def forward(self, mel):
217
+ mel = mel.transpose(-1, -2).unsqueeze(1)
218
+ x = self.cnn(self.unet(mel)).transpose(1, 2).flatten(-2)
219
+ x = self.fc(x)
220
+ return x
221
+
222
+
223
+
224
+ class MelSpectrogram(torch.nn.Module):
225
+ def __init__(self, is_half, n_mel_channels, sampling_rate, win_length, hop_length,
226
+ n_fft=None, mel_fmin=0, mel_fmax=None, clamp=1e-5):
227
+ super().__init__()
228
+ n_fft = win_length if n_fft is None else n_fft
229
+ self.hann_window = {}
230
+ mel_basis = mel(
231
+ sr=sampling_rate,
232
+ n_fft=n_fft,
233
+ n_mels=n_mel_channels,
234
+ fmin=mel_fmin,
235
+ fmax=mel_fmax,
236
+ htk=True,
237
+ )
238
+ mel_basis = torch.from_numpy(mel_basis).float()
239
+ self.register_buffer("mel_basis", mel_basis)
240
+ self.n_fft = win_length if n_fft is None else n_fft
241
+ self.hop_length = hop_length
242
+ self.win_length = win_length
243
+ self.sampling_rate = sampling_rate
244
+ self.n_mel_channels = n_mel_channels
245
+ self.clamp = clamp
246
+ self.is_half = is_half
247
+
248
+ def forward(self, audio, keyshift=0, speed=1, center=True):
249
+ factor = 2 ** (keyshift / 12)
250
+ n_fft_new = int(np.round(self.n_fft * factor))
251
+ win_length_new = int(np.round(self.win_length * factor))
252
+ hop_length_new = int(np.round(self.hop_length * speed))
253
+
254
+ keyshift_key = str(keyshift) + "_" + str(audio.device)
255
+ if keyshift_key not in self.hann_window:
256
+ self.hann_window[keyshift_key] = torch.hann_window(win_length_new).to(audio.device)
257
+
258
+ fft = torch.stft(
259
+ audio,
260
+ n_fft=n_fft_new,
261
+ hop_length=hop_length_new,
262
+ win_length=win_length_new,
263
+ window=self.hann_window[keyshift_key],
264
+ center=center,
265
+ return_complex=True,
266
+ )
267
+ magnitude = torch.sqrt(fft.real.pow(2) + fft.imag.pow(2))
268
+
269
+ if keyshift != 0:
270
+ size = self.n_fft // 2 + 1
271
+ resize = magnitude.size(1)
272
+ if resize < size:
273
+ magnitude = F.pad(magnitude, (0, 0, 0, size - resize))
274
+ magnitude = magnitude[:, :size, :] * self.win_length / win_length_new
275
+
276
+ mel_output = torch.matmul(self.mel_basis, magnitude)
277
+ if self.is_half:
278
+ mel_output = mel_output.half()
279
+ log_mel_spec = torch.log(torch.clamp(mel_output, min=self.clamp))
280
+ return log_mel_spec
281
+
282
+
283
+
284
+ class RMVPE:
285
+ def __init__(self, model_path: str, is_half, device=None):
286
+ self.is_half = is_half
287
+ if device is None:
288
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
289
+ self.device = torch.device(device) if isinstance(device, str) else device
290
+
291
+ self.mel_extractor = MelSpectrogram(
292
+ is_half=is_half,
293
+ n_mel_channels=128,
294
+ sampling_rate=16000,
295
+ win_length=1024,
296
+ hop_length=160,
297
+ n_fft=None,
298
+ mel_fmin=30,
299
+ mel_fmax=8000
300
+ ).to(self.device)
301
+
302
+ model = E2E(n_blocks=4, n_gru=1, kernel_size=(2, 2))
303
+ ckpt = torch.load(model_path, map_location=self.device)
304
+ model.load_state_dict(ckpt)
305
+ model.eval()
306
+
307
+ if is_half:
308
+ model = model.half()
309
+ else:
310
+ model = model.float()
311
+
312
+ self.model = model.to(self.device)
313
+
314
+ cents_mapping = 20 * np.arange(360) + 1997.3794084376191
315
+ self.cents_mapping = np.pad(cents_mapping, (4, 4)) # 368
316
+
317
+ def mel2hidden(self, mel):
318
+ with torch.no_grad():
319
+ n_frames = mel.shape[-1]
320
+ n_pad = 32 * ((n_frames - 1) // 32 + 1) - n_frames
321
+ if n_pad > 0:
322
+ mel = F.pad(mel, (0, n_pad), mode="constant")
323
+ mel = mel.half() if self.is_half else mel.float()
324
+ hidden = self.model(mel)
325
+ return hidden[:, :n_frames]
326
+
327
+ def decode(self, hidden, thred=0.03):
328
+ cents_pred = self.to_local_average_cents(hidden, thred=thred)
329
+ f0 = 10 * (2 ** (cents_pred / 1200))
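+ # cents are measured against a 10 Hz reference, so unvoiced frames (cents 0) decode to exactly 10 Hz and are zeroed next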
330
+ f0[f0 == 10] = 0
331
+ return f0
332
+
333
+ def infer_from_audio(self, audio, thred=0.03):
334
+ if not torch.is_tensor(audio):
335
+ audio = torch.from_numpy(audio)
336
+
337
+ mel = self.mel_extractor(audio.float().to(self.device).unsqueeze(0), center=True)
338
+ hidden = self.mel2hidden(mel)
339
+ hidden = hidden.squeeze(0).cpu().numpy()
340
+
341
+ if self.is_half:
342
+ hidden = hidden.astype("float32")
343
+
344
+ f0 = self.decode(hidden, thred=thred)
345
+ return f0
346
+
347
+ def to_local_average_cents(self, salience, thred=0.05):
348
+ center = np.argmax(salience, axis=1)
349
+ salience = np.pad(salience, ((0, 0), (4, 4)))
350
+ center += 4
351
+
352
+ todo_salience = []
353
+ todo_cents_mapping = []
354
+ starts = center - 4
355
+ ends = center + 5
356
+
357
+ for idx in range(salience.shape[0]):
358
+ todo_salience.append(salience[:, starts[idx]:ends[idx]][idx])
359
+ todo_cents_mapping.append(self.cents_mapping[starts[idx]:ends[idx]])
360
+
361
+ todo_salience = np.array(todo_salience)
362
+ todo_cents_mapping = np.array(todo_cents_mapping)
363
+ product_sum = np.sum(todo_salience * todo_cents_mapping, 1)
364
+ weight_sum = np.sum(todo_salience, 1)
365
+ divided = product_sum / weight_sum
366
+
367
+ maxx = np.max(salience, axis=1)
368
+ divided[maxx <= thred] = 0
369
+
370
+ return divided
371
+
372
+ class F0Extractor:
373
+ """Extract frame-level f0 from singing voice.
374
+
375
+ Wrapper around an RMVPE network that:
376
+ 1) loads the checkpoint once in ``__init__``
377
+ 2) exposes a simple :py:meth:`process` API and optionally saves ``*_f0.npy``.
378
+ """
379
+ def __init__(
380
+ self,
381
+ model_path: str,
382
+ device: str = "cpu",
383
+ *,
384
+ is_half: bool = False,
385
+ input_sr: int = 16000,
386
+ target_sr: int = 24000,
387
+ hop_size: int = 480,
388
+ max_duration: float = 300,
389
+ thred: float = 0.03,
390
+ verbose: bool = True,
391
+ ):
392
+ """Initialize the f0 extractor.
393
+
394
+ Args:
395
+ model_path: Path to RMVPE checkpoint.
396
+ device: Torch device string, e.g. ``"cuda:0"`` / ``"cpu"``.
397
+ is_half: Whether to run the model in fp16.
398
+ input_sr: Input resample rate used by RMVPE frontend.
399
+ target_sr: Target sample rate for the output f0 grid.
400
+ hop_size: Target hop size for the output f0 grid.
401
+ max_duration: Max duration (seconds) for interpolation grid.
402
+ thred: Voicing threshold used when decoding salience.
403
+ verbose: Whether to print verbose logs.
404
+ """
405
+ self.model_path = model_path
406
+ self.input_sr = input_sr
407
+ self.target_sr = target_sr
408
+ self.hop_size = hop_size
409
+ self.max_duration = max_duration
410
+ self.thred = thred
411
+
412
+ self.verbose = verbose
413
+
414
+ self.model = RMVPE(model_path, is_half=is_half, device=device)
415
+
416
+ if self.verbose:
417
+ print(
418
+ "[f0 extraction] init success:",
419
+ f"device={device}",
420
+ f"model_path={model_path}",
421
+ f"is_half={is_half}",
422
+ f"input_sr={input_sr}",
423
+ f"target_sr={target_sr}",
424
+ f"hop_size={hop_size}",
425
+ f"thred={thred}",
426
+ )
427
+
428
+ @staticmethod
429
+ def interpolate_f0(
430
+ f0_16k: np.ndarray,
431
+ original_length: int,
432
+ original_sr: int,
433
+ *,
434
+ target_sr: int = 48000,
435
+ hop_size: int = 256,
436
+ max_duration: float = 20.0,
437
+ ) -> np.ndarray:
438
+ """Interpolate f0 from RMVPE's 16k hop grid to target mel hop grid."""
439
+ mel_target_sr = target_sr
440
+ mel_hop_size = hop_size
441
+ mel_max_duration = max_duration
442
+
443
+ batch_max_length = int(mel_max_duration * mel_target_sr / mel_hop_size)
444
+ duration_in_seconds = original_length / original_sr
445
+ effective_target_length = int(duration_in_seconds * mel_target_sr)
446
+ original_frames = math.ceil(effective_target_length / mel_hop_size)
447
+ target_frames = min(original_frames, batch_max_length)
448
+
449
+ rmvpe_hop = 160
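+ # RMVPE frames are 160 / 16000 = 10 ms apart; the target grid spacing is hop_size / target_sr (480 / 24000 = 20 ms with the class defaults)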
450
+ t_16k = np.arange(len(f0_16k)) * (rmvpe_hop / 16000.0)
451
+ t_target = np.arange(target_frames) * (mel_hop_size / float(mel_target_sr))
452
+
453
+ if len(f0_16k) > 0:
454
+ f_interp = interp1d(
455
+ t_16k,
456
+ f0_16k,
457
+ kind="linear",
458
+ bounds_error=False,
459
+ fill_value=0.0,
460
+ assume_sorted=True,
461
+ )
462
+ f0 = f_interp(t_target)
463
+ else:
464
+ f0 = np.zeros(target_frames)
465
+
466
+ if len(f0) != target_frames:
467
+ f0 = (
468
+ f0[:target_frames]
469
+ if len(f0) > target_frames
470
+ else np.pad(f0, (0, target_frames - len(f0)), "constant")
471
+ )
472
+
473
+ return f0
474
+
475
+ def process(self, audio_path: str, *, f0_path: str | None = None, verbose: Optional[bool] = None) -> np.ndarray:
476
+ """Run f0 extraction for a single wav.
477
+
478
+ Args:
479
+ audio_path: Path to the input wav file.
480
+ f0_path: if is not None, save the f0 data to this path.
481
+ verbose: Override instance-level verbose flag for this call.
482
+
483
+ Returns:
484
+ np.ndarray: shape ``[T]``, f0 in Hz (0 for unvoiced).
485
+ """
486
+ verbose = self.verbose if verbose is None else verbose
487
+ if verbose:
488
+ print(f"[f0 extraction] process: start: {audio_path}")
489
+ t0 = time.time()
490
+
491
+ audio, _ = librosa.load(audio_path, sr=self.input_sr)
492
+ f0_16k = self.model.infer_from_audio(audio, thred=self.thred)
493
+ f0 = self.interpolate_f0(
494
+ f0_16k,
495
+ original_length=audio.shape[-1],
496
+ original_sr=self.input_sr,
497
+ target_sr=self.target_sr,
498
+ hop_size=self.hop_size,
499
+ max_duration=self.max_duration,
500
+ )
501
+
502
+ if verbose:
503
+ dt = time.time() - t0
504
+ voiced_ratio = float(np.mean(f0 > 0)) if len(f0) else 0.0
505
+ print(
506
+ "[f0 extraction] process: done:",
507
+ f"frames={len(f0)}",
508
+ f"voiced_ratio={voiced_ratio:.3f}",
509
+ f"time={dt:.3f}s",
510
+ )
511
+ if f0_path is not None:
512
+ np.save(f0_path, f0)
513
+
514
+ return f0
515
+
516
+
517
+ if __name__ == "__main__":
518
+ model_path = (
519
+ "pretrained_models/rmvpe/rmvpe.pt"
520
+ )
521
+ audio_path = "./outputs/transcription/test.wav"
522
+
523
+ pe = F0Extractor(
524
+ model_path,
525
+ device="cuda",
526
+ )
527
+ f0 = pe.process(audio_path)
preprocess/tools/g2p.py ADDED
@@ -0,0 +1,72 @@
1
+ import re
2
+
3
+ import ToJyutping
4
+ from g2pM import G2pM
5
+ from g2p_en import G2p as G2pE
6
+
7
+ _EN_WORD_RE = re.compile(r"^[A-Za-z]+(?:'[A-Za-z]+)*$")
8
+ _ZH_WORD_RE = re.compile(r"[\u4e00-\u9fff]")
9
+
10
+ EN_FLAG = "en_"
11
+ YUE_FLAG = "yue_"
12
+ ZH_FLAG = "zh_"
13
+
14
+ g2p_zh = G2pM()
15
+ g2p_en = G2pE()
16
+
17
+
18
+ def is_chinese_char(word: str) -> bool:
19
+ if len(word) != 1:
20
+ return False
21
+ return bool(_ZH_WORD_RE.fullmatch(word))
22
+
23
+ def is_english_word(word: str) -> bool:
24
+ if not word:
25
+ return False
26
+ return bool(_EN_WORD_RE.fullmatch(word))
27
+
28
+ def g2p_cantonese(sent):
29
+ return ToJyutping.get_jyutping_list(sent) # with tone
30
+
31
+ def g2p_mandarin(sent):
32
+ return g2p_zh(sent, tone=True, char_split=False)
33
+
34
+ def g2p_english(word):
35
+ return g2p_en(word)
36
+
37
+ def g2p_transform(words, lang):
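+ # e.g. g2p_transform(["<SP>", "除", "了"], "Mandarin") -> ["<SP>", "zh_chu2", "zh_le5"] (cf. the phoneme strings in example/audio/*.json)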
38
+
39
+ zh_words = []
40
+ transformed_words = [0] * len(words)
41
+
42
+ for idx, w in enumerate(words):
43
+ if w == "<SP>":
44
+ transformed_words[idx] = w
45
+ continue
46
+
47
+ w = w.replace("?", "").replace(".", "").replace("!", "").replace(",", "")
48
+
49
+ if is_chinese_char(w):
50
+ zh_words.append([idx, w])
51
+ else:
52
+ if is_english_word(w):
53
+ w = EN_FLAG + "-".join(g2p_english(w.lower()))
54
+ else:
55
+ w = "<SP>"
56
+ transformed_words[idx] = w
57
+
58
+ sent = "".join([k[1] for k in zh_words])
59
+
60
+ # zh (zh and yue) transformer to g2p
61
+ if len(sent) > 0:
62
+ if lang == "Cantonese":
63
+ g2pm_rst = g2p_cantonese(sent) # with tone
64
+ g2pm_rst = [YUE_FLAG + k[1] for k in g2pm_rst]
65
+ else:
66
+ g2pm_rst = g2p_mandarin(sent)
67
+ g2pm_rst = [ZH_FLAG + k for k in g2pm_rst]
68
+ for p, w in zip([k[0] for k in zh_words], g2pm_rst):
69
+ transformed_words[p] = w
70
+
71
+ return transformed_words
72
+
preprocess/tools/lyric_transcription.py ADDED
@@ -0,0 +1,279 @@
1
+ # https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary
2
+ # https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
3
+ import os
4
+ import re
5
+ import time
6
+ from typing import Any, Dict, List, Tuple
7
+
8
+ import librosa
9
+ import numpy as np
10
+ from funasr import AutoModel
11
+
12
+
13
+ def _build_words_with_gaps(raw_words, raw_timestamps, wav_fn: str):
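+ # Insert "<SP>" (silence) tokens wherever the ASR timestamps leave a gap, so word durations tile the whole clip.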
14
+ words, word_durs = [], []
15
+ prev = 0.0
16
+ for w, t in zip(raw_words, raw_timestamps):
17
+ s, e = float(t[0]), float(t[1])
18
+ if s > prev:
19
+ words.append("<SP>")
20
+ word_durs.append(s - prev)
21
+ words.append(w)
22
+ word_durs.append(e - s)
23
+ prev = e
24
+
25
+ wav_len = librosa.get_duration(path=wav_fn)  # librosa >= 0.10 renamed the "filename" keyword to "path"
26
+ if wav_len > prev:
27
+ if len(words) == 0:
28
+ words.append("<SP>")
29
+ word_durs.append(wav_len)
30
+ return words, word_durs
31
+ if words[-1] != "<SP>":
32
+ words.append("<SP>")
33
+ word_durs.append(wav_len - prev)
34
+ else:
35
+ word_durs[-1] += wav_len - prev
36
+
37
+ return words, word_durs
38
+
39
+ def _word_dur_post_process(words, word_durs, f0):
40
+ """Post-process word durations using f0 to better place silences.
41
+ """
42
+ # f0 time grid parameters
43
+ sr = 24000 # f0 sample rate
44
+ hop_length = 480 # f0 hop length
45
+
46
+ # Convert word durations (seconds) to frame boundaries on the f0 grid.
47
+ boundaries = np.cumsum([
48
+ 0,
49
+ *[
50
+ int(dur * sr / hop_length)
51
+ for dur in word_durs
52
+ ],
53
+ ]).tolist()
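+ # e.g. word_durs=[0.23, 0.34] on the 24 kHz / 480-sample grid -> boundaries=[0, 11, 28] (frame indices)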
54
+
55
+ sil_tolerance = 5 # tolerance frames for silence detection
56
+ ext_tolerance = 5 # tolerance frames for vocal extension
57
+
58
+ new_words: list[str] = []
59
+ new_word_durs: list[float] = []
60
+ if words:
61
+ new_words.append(words[0])
62
+ new_word_durs.append(word_durs[0])
63
+
64
+ for i in range(1, len(words)):
65
+ word = words[i]
66
+ if word == "<SP>":
67
+ start_frame = boundaries[i]
68
+ end_frame = boundaries[i + 1]
69
+
70
+ num_frames = end_frame - start_frame
71
+ frame_idx = start_frame
72
+
73
+ # Find first region with at least 5 consecutive "unvoiced" frames.
74
+ unvoiced_count = 0
75
+ while frame_idx < end_frame:
76
+ if f0[frame_idx] <= 1: # unvoiced
77
+ unvoiced_count += 1
78
+ if unvoiced_count >= sil_tolerance:
79
+ frame_idx -= sil_tolerance - 1 # step back to the first frame of the unvoiced run
80
+ break
81
+ else:
82
+ unvoiced_count = 0
83
+ frame_idx += 1
84
+
85
+ voice_frames = frame_idx - start_frame
86
+
87
+ if voice_frames >= int(num_frames * 0.9): # over 90% voiced
88
+ # Mostly voiced: drop the "<SP>" and merge its duration into the previous word.
89
+ new_word_durs[-1] += word_durs[i]
90
+ elif voice_frames >= ext_tolerance: # at least ext_tolerance voiced frames at the head
91
+ # Extend the previous word by the voiced head; keep the remainder as "<SP>".
92
+ dur = voice_frames * hop_length / sr
93
+ new_word_durs[-1] += dur
94
+ new_words.append("<SP>")
95
+ new_word_durs.append(word_durs[i] - dur)
96
+ else:
97
+ # Too short to adjust, keep as-is.
98
+ new_words.append(word)
99
+ new_word_durs.append(word_durs[i])
100
+ else:
101
+ new_words.append(word)
102
+ new_word_durs.append(word_durs[i])
103
+
104
+ return new_words, new_word_durs
105
+
106
+
107
+ class _ASRZhModel:
108
+ """Mandarin/Cantonese ASR wrapper."""
109
+
110
+ def __init__(self, model_path: str, device: str):
111
+ self.model = AutoModel(
112
+ model=model_path,
113
+ disable_update=True,
114
+ device=device,
115
+ )
116
+
117
+ def process(self, wav_fn):
118
+ out = self.model.generate(wav_fn, output_timestamp=True)[0]
119
+ raw_words = out["text"].replace("@", "").split(" ")
120
+ raw_timestamps = [[t[0] / 1000, t[1] / 1000] for t in out["timestamp"]]
121
+ words, word_durs = _build_words_with_gaps(raw_words, raw_timestamps, wav_fn)
122
+
123
+ if os.path.exists(wav_fn.replace(".wav", "_f0.npy")):
124
+ words, word_durs = _word_dur_post_process(
125
+ words, word_durs, np.load(wav_fn.replace(".wav", "_f0.npy"))
126
+ )
127
+
128
+ return words, word_durs
129
+
130
+
131
+ class _ASREnModel:
132
+ """English ASR wrapper for NeMo Parakeet-TDT."""
133
+
134
+ def __init__(self, model_path: str, device: str):
135
+ try:
136
+ import nemo.collections.asr as nemo_asr # type: ignore
137
+ except Exception as e: # pragma: no cover
138
+ raise ImportError(
139
+ "NeMo (nemo_toolkit) is required for ASR English but is not available in this Python env. "
140
+ "Install it in the active environment, then retry."
141
+ ) from e
142
+
143
+ self.model = nemo_asr.models.ASRModel.restore_from(
144
+ restore_path=model_path,
145
+ map_location=device,
146
+ )
147
+ self.model.eval()
148
+
149
+ @staticmethod
150
+ def _clean_word(word: str) -> str:
151
+ return re.sub(r"[\?\.,:]", "", word).strip()
152
+
153
+ @staticmethod
154
+ def _extract_word_segments(output: Any) -> List[Dict[str, Any]]:
155
+ ts = getattr(output, "timestamp", None)
156
+ if not ts or not isinstance(ts, dict):
157
+ return []
158
+ word_ts = ts.get("word")
159
+ return word_ts if isinstance(word_ts, list) else []
160
+
161
+ def process(self, wav_fn: str) -> Tuple[List[str], List[float]]:
162
+ outputs = self.model.transcribe(
163
+ [wav_fn],
164
+ timestamps=True,
165
+ batch_size=1,
166
+ num_workers=0,
167
+ )
168
+ output = outputs[0] if outputs else None
169
+
170
+ raw_words: List[str] = []
171
+ raw_timestamps: List[List[float]] = []
172
+ if output is not None:
173
+ for w in self._extract_word_segments(output):
174
+ s, e = float(w.get("start", 0.0)), float(w.get("end", 0.0))
175
+ word = self._clean_word(str(w.get("word", "")))
176
+ if word:
177
+ raw_words.append(word)
178
+ raw_timestamps.append([s, e])
179
+
180
+ words, durs = _build_words_with_gaps(raw_words, raw_timestamps, wav_fn)
181
+
182
+ if os.path.exists(wav_fn.replace(".wav", "_f0.npy")):
183
+ words, durs = _word_dur_post_process(
184
+ words, durs, np.load(wav_fn.replace(".wav", "_f0.npy"))
185
+ )
186
+
187
+ return words, durs
188
+
189
+
190
+ class LyricTranscriber:
191
+ """Transcribe lyrics from singing voice segment
192
+ """
193
+
194
+ def __init__(
195
+ self,
196
+ zh_model_path: str,
197
+ en_model_path: str,
198
+ device: str = "cuda",
199
+ *,
200
+ verbose: bool = True,
201
+ ):
202
+ """Initialize lyric transcriber.
203
+
204
+ Args:
205
+ zh_model_path (str): Path to the Chinese model file.
206
+ en_model_path (str): Path to the English model file.
207
+ device (str): Device to use for tensor operations.
208
+ verbose (bool): Whether to print verbose logs.
209
+ """
210
+ self.verbose = verbose
211
+ self.device = device
212
+ self.zh_model_path = zh_model_path
213
+ self.en_model_path = en_model_path
214
+
215
+ if self.verbose:
216
+ print(
217
+ "[lyric transcription] init: start:",
218
+ f"device={device}",
219
+ f"model_path={zh_model_path}",
220
+ )
221
+
222
+ # Always initialize Chinese ASR.
223
+ self.zh_model = _ASRZhModel(device=device, model_path=zh_model_path)
224
+
225
+ # English ASR is initialized lazily on the first English request, to avoid the slow NeMo import at startup.
226
+ self.en_model = None
227
+
228
+ if self.verbose:
229
+ print("[lyric transcription] init: success")
230
+
231
+ def process(self, wav_fn, language: str | None = "Mandarin", *, verbose: bool | None = None):
232
+ """ Lyric transcriber process
233
+
234
+ Args:
235
+ wav_fn (str): Path to the audio file.
236
+ language (str | None): Language of the audio. Defaults to "Mandarin". Supports "Mandarin", "Cantonese" and "English".
237
+ verbose (bool | None): Whether to print verbose logs. Defaults to None.
238
+ """
239
+ v = self.verbose if verbose is None else verbose
240
+ if language not in {"Mandarin", "Cantonese", "English"}:
241
+ raise ValueError(f"Unsupported language: {language}, should be one of ['Mandarin', 'Cantonese', 'English']")
242
+ if v:
243
+ print(f"[lyric transcription] process: start: wav_fn={wav_fn} language={language}")
244
+ t0 = time.time()
245
+
246
+ lang = language.lower()
247
+ if lang == "english":
248
+ if self.en_model is None:
249
+ # Lazy-load NeMo model only when English is actually used.
250
+ if v:
251
+ print("[lyric transcription] init English ASR, please make sure NeMo is installed")
252
+ self.en_model = _ASREnModel(model_path=self.en_model_path, device=self.device)
253
+ out = self.en_model.process(wav_fn)
254
+ else:
255
+ out = self.zh_model.process(wav_fn)
256
+
257
+ if v:
258
+ words, durs = out
259
+ n_words = len(words) if isinstance(words, list) else 0
260
+ dur_sum = float(sum(durs)) if isinstance(durs, list) else 0.0
261
+ dt = time.time() - t0
262
+ print(
263
+ "[lyric transcription] process: done:",
264
+ f"n_words={n_words}",
265
+ f"dur_sum={dur_sum:.3f}s",
266
+ f"time={dt:.3f}s",
267
+ )
268
+
269
+ return out
270
+
271
+
272
+ if __name__ == "__main__":
273
+ m = LyricTranscriber(
274
+ zh_model_path="pretrained_models/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
275
+ en_model_path="pretrained_models/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2.nemo",
276
+ device="cuda"
277
+ )
278
+ print(m.process("example/test/asr_zh.wav", language="Mandarin"))
279
+ print(m.process("example/test/asr_en.wav", language="English"))
preprocess/tools/midi_parser.py ADDED
@@ -0,0 +1,669 @@
1
+ """
2
+ SoulX-Singer MIDI <-> metadata converter.
3
+
4
+ Converts between SoulX-Singer-style metadata JSON (with note_text, note_dur,
5
+ note_pitch, note_type per segment) and standard MIDI files. Uses an internal
6
+ Note dataclass (start_s, note_dur, note_text, note_pitch, note_type) as the
7
+ intermediate representation.
8
+ """
9
+ import os
10
+ import json
11
+ import shutil
12
+ from dataclasses import dataclass
13
+ from typing import Any, List, Tuple, Union
14
+
15
+ import librosa
16
+ import mido
17
+ from soundfile import write
18
+
19
+ from .f0_extraction import F0Extractor
20
+ from .g2p import g2p_transform
21
+
22
+
23
+ # Audio and segmenting constants (used by _edit_data_to_meta)
24
+ SAMPLE_RATE = 44100
25
+ DEFAULT_LANGUAGE = "Mandarin"
26
+ MAX_GAP_SEC = 5.0 # gap (sec) above which we start a new segment
27
+ MAX_SEGMENT_DUR_SUM_SEC = 60.0 # max cumulative note duration per segment (sec)
28
+ MIN_GAP_THRESHOLD_SEC = 0.001 # ignore gaps smaller than this
29
+ LONG_SILENCE_THRESHOLD_SEC = 0.05 # treat as separate <SP> if gap larger
30
+ MAX_LEADING_SP_DUR_SEC = 2.0 # cap leading silence in a segment to this (sec)
31
+ DEFAULT_RMVPE_MODEL_PATH = "pretrained_models/SoulX-Singer-Preprocess/rmvpe/rmvpe.pt"
32
+
33
+
34
+ @dataclass
35
+ class Note:
36
+ """Single note: text, duration (seconds), pitch (MIDI), type. start_s is absolute start time in seconds (for ordering / MIDI)."""
37
+ start_s: float
38
+ note_dur: float
39
+ note_text: str
40
+ note_pitch: int
41
+ note_type: int
42
+
43
+ @property
44
+ def end_s(self) -> float:
45
+ return self.start_s + self.note_dur
46
+
47
+
48
+
49
+ def remove_duplicate_segments(meta_data: List[dict]) -> None:
50
+ """Merge consecutive identical notes (same text, pitch, type) within each segment. Mutates meta_data in place."""
51
+ for idx, segment in enumerate(meta_data):
52
+ texts = segment["note_text"]
53
+ durs = segment["note_dur"]
54
+ pitches = segment["note_pitch"]
55
+ types = segment["note_type"]
56
+ new_texts = []
57
+ new_durs = []
58
+ new_pitches = []
59
+ new_types = []
60
+ for i in range(len(texts)):
61
+ if i == 0:
62
+ new_texts.append(texts[i])
63
+ new_durs.append(durs[i])
64
+ new_pitches.append(pitches[i])
65
+ new_types.append(types[i])
66
+ continue
67
+ t, d, p, ty = texts[i], durs[i], pitches[i], types[i]
68
+ if t == "<SP>" and texts[i - 1] == "<SP>":
69
+ new_durs[-1] += d
70
+ continue
71
+ if t == texts[i - 1] and p == pitches[i - 1] and ty == types[i - 1]:
72
+ new_durs[-1] += d
73
+ else:
74
+ new_texts.append(t)
75
+ new_durs.append(d)
76
+ new_pitches.append(p)
77
+ new_types.append(ty)
78
+ meta_data[idx]["note_text"] = new_texts
79
+ meta_data[idx]["note_dur"] = new_durs
80
+ meta_data[idx]["note_pitch"] = new_pitches
81
+ meta_data[idx]["note_type"] = new_types
82
+
83
+ def meta2notes(meta_path: str) -> List[Note]:
84
+ """Parse SoulX-Singer metadata JSON into a flat list of Note (absolute start_s)."""
85
+ with open(meta_path, "r", encoding="utf-8") as f:
86
+ segments = json.load(f)
87
+ if not isinstance(segments, list):
88
+ raise ValueError(f"Metadata must be a list of segments, got {type(segments).__name__}")
89
+ if not segments:
90
+ raise ValueError("Metadata has no segments.")
91
+
92
+ notes: List[Note] = []
93
+ for seg in segments:
94
+ offset_s = seg["time"][0] / 1000
95
+ words = [str(x).replace("<AP>", "<SP>") for x in seg["text"].split()]
96
+ word_durs = [float(x) for x in seg["duration"].split()]
97
+ pitches = [int(x) for x in seg["note_pitch"].split()]
98
+ types = [int(x) if words[i] != "<SP>" else 1 for i, x in enumerate(seg["note_type"].split())]
99
+ if len(words) != len(word_durs) or len(word_durs) != len(pitches) or len(pitches) != len(types):
100
+ raise ValueError(
101
+ f"Length mismatch in segment {seg.get('item_name', '?')}: "
102
+ "note_text, note_dur, note_pitch, note_type must have same length"
103
+ )
104
+ current_s = offset_s
105
+ for text, dur, pitch, type_ in zip(words, word_durs, pitches, types):
106
+ notes.append(
107
+ Note(
108
+ start_s=current_s,
109
+ note_dur=float(dur),
110
+ note_text=str(text),
111
+ note_pitch=int(pitch),
112
+ note_type=int(type_),
113
+ )
114
+ )
115
+ current_s += float(dur)
116
+ return notes
117
+
118
+ def _append_segment_to_meta(
119
+ meta_path_str: str,
120
+ cut_wavs_output_dir: str,
121
+ vocal_file: str,
122
+ audio_data: Any,
123
+ meta_data: List[dict],
124
+ note_start: List[float],
125
+ note_end: List[float],
126
+ note_text: List[Any],
127
+ note_pitch: List[Any],
128
+ note_type: List[Any],
129
+ note_dur: List[float],
130
+ end_time_ms_override: float | None = None,
131
+ ) -> None:
132
+ """Write one segment wav and append one segment dict to meta_data. Caller clears note_* lists after."""
133
+ base_name = os.path.splitext(os.path.basename(meta_path_str))[0]
134
+ item_name = f"{base_name}_{len(meta_data)}"
135
+ wav_fn = os.path.join(cut_wavs_output_dir, f"{item_name}.wav")
136
+ start_ms = int(note_start[0] * 1000)
137
+ end_ms = (
138
+ int(end_time_ms_override)
139
+ if end_time_ms_override is not None
140
+ else int(note_end[-1] * 1000)
141
+ )
142
+ start_sample = int(note_start[0] * SAMPLE_RATE)
143
+ end_sample = int(note_end[-1] * SAMPLE_RATE)
144
+ write(wav_fn, audio_data[start_sample:end_sample], SAMPLE_RATE)
145
+ meta_data.append({
146
+ "item_name": item_name,
147
+ "wav_fn": wav_fn,
148
+ "origin_wav_fn": vocal_file,
149
+ "start_time_ms": start_ms,
150
+ "end_time_ms": end_ms,
151
+ "language": DEFAULT_LANGUAGE,
152
+ "note_text": list(note_text),
153
+ "note_pitch": list(note_pitch),
154
+ "note_type": list(note_type),
155
+ "note_dur": list(note_dur),
156
+ })
157
+
158
+
159
+ def convert_meta(meta_data: List[dict], rmvpe_model_path, device="cuda"):
160
+ pitch_extractor = F0Extractor(rmvpe_model_path, device=device, verbose=False)
161
+ converted_data = []
162
+
163
+ for item in meta_data:
164
+ wav_fn = item.get("wav_fn")
165
+ if not wav_fn or not os.path.isfile(wav_fn):
166
+ raise FileNotFoundError(f"Segment wav file not found: {wav_fn}")
167
+ f0 = pitch_extractor.process(wav_fn)
168
+ converted_item = {
169
+ "index": item.get("item_name"),
170
+ "language": item.get("language"),
171
+ "time": [item.get("start_time_ms", 0), item.get("end_time_ms", sum(item["note_dur"]) * 1000)],
172
+ "duration": " ".join(str(round(x, 2)) for x in item.get("note_dur", [])),
173
+ "text": " ".join(item.get("note_text", [])),
174
+ "phoneme": " ".join(g2p_transform(item.get("note_text", []), DEFAULT_LANGUAGE)),
175
+ "note_pitch": " ".join(str(x) for x in item.get("note_pitch", [])),
176
+ "note_type": " ".join(str(x) for x in item.get("note_type", [])),
177
+ "f0": " ".join(str(round(float(x), 1)) for x in f0),
178
+ }
179
+ converted_data.append(converted_item)
180
+
181
+ return converted_data
182
+
183
+
184
+ def _edit_data_to_meta(
185
+ meta_path_str: str,
186
+ edit_data: List[dict],
187
+ vocal_file: str,
188
+ rmvpe_model_path: str | None = None,
189
+ device: str = "cuda",
190
+ ) -> None:
191
+ """Write SoulX-Singer metadata JSON from edit_data (list of {start, end, note_text, note_pitch, note_type})."""
192
+ # Use a fixed temporary directory for cut wavs
193
+ cut_wavs_output_dir = os.path.join(os.path.dirname(vocal_file), "cut_wavs_tmp")
194
+ os.makedirs(cut_wavs_output_dir, exist_ok=True)
195
+
196
+ note_text: List[Any] = []
197
+ note_pitch: List[Any] = []
198
+ note_type: List[Any] = []
199
+ note_dur: List[float] = []
200
+ note_start: List[float] = []
201
+ note_end: List[float] = []
202
+ prev_end = 0.0
203
+ meta_data: List[dict] = []
204
+ audio_data, _ = librosa.load(vocal_file, sr=SAMPLE_RATE, mono=True)
205
+ dur_sum = 0.0
206
+
207
+ for entry in edit_data:
208
+ start = float(entry["start"])
209
+ end = float(entry["end"])
210
+ text = entry["note_text"]
211
+ pitch = entry["note_pitch"]
212
+ type_ = entry["note_type"]
213
+
214
+ if text == "" or pitch == "" or type_ == "":
215
+ note_text.append("<SP>")
216
+ note_pitch.append(0)
217
+ note_type.append(1)
218
+ note_dur.append(end - start)
219
+ note_start.append(start)
220
+ note_end.append(end)
221
+ prev_end = end
222
+ dur_sum += end - start
223
+ continue
224
+
225
+ if (
226
+ len(note_text) > 0
227
+ and note_text[-1] == "<SP>"
228
+ and note_dur[-1] > MAX_LEADING_SP_DUR_SEC
229
+ ):
230
+ cut_time = note_dur[-1] - MAX_LEADING_SP_DUR_SEC
231
+ note_dur[-1] = MAX_LEADING_SP_DUR_SEC
232
+ end_ms_override = note_end[-1] * 1000 - cut_time * 1000
233
+ _append_segment_to_meta(
234
+ meta_path_str,
235
+ cut_wavs_output_dir,
236
+ vocal_file,
237
+ audio_data,
238
+ meta_data,
239
+ note_start,
240
+ note_end,
241
+ note_text,
242
+ note_pitch,
243
+ note_type,
244
+ note_dur,
245
+ end_time_ms_override=end_ms_override,
246
+ )
247
+ note_text = []
248
+ note_pitch = []
249
+ note_type = []
250
+ note_dur = []
251
+ note_start = []
252
+ note_end = []
253
+ prev_end = start
254
+ dur_sum = 0.0
255
+
256
+ gap_from_prev = start - prev_end
257
+ gap_from_last_note = (start - note_end[-1]) if note_end else 0.0
258
+ if (
259
+ gap_from_prev >= MAX_GAP_SEC
260
+ or gap_from_last_note >= MAX_GAP_SEC
261
+ or dur_sum >= MAX_SEGMENT_DUR_SUM_SEC
262
+ ):
263
+ if len(note_text) > 0:
264
+ _append_segment_to_meta(
265
+ meta_path_str,
266
+ cut_wavs_output_dir,
267
+ vocal_file,
268
+ audio_data,
269
+ meta_data,
270
+ note_start,
271
+ note_end,
272
+ note_text,
273
+ note_pitch,
274
+ note_type,
275
+ note_dur,
276
+ )
277
+ note_text = []
278
+ note_pitch = []
279
+ note_type = []
280
+ note_dur = []
281
+ note_start = []
282
+ note_end = []
283
+ prev_end = start
284
+ dur_sum = 0.0
285
+
286
+ if start - prev_end > MIN_GAP_THRESHOLD_SEC:
287
+ if start - prev_end > LONG_SILENCE_THRESHOLD_SEC or len(note_text) == 0:
288
+ note_text.append("<SP>")
289
+ note_pitch.append(0)
290
+ note_type.append(1)
291
+ note_dur.append(start - prev_end)
292
+ note_start.append(prev_end)
293
+ note_end.append(start)
294
+ else:
295
+ if len(note_dur) > 0:
296
+ note_dur[-1] += start - prev_end
297
+ note_end[-1] = start
298
+
299
+ prev_end = end
300
+ note_text.append(text)
301
+ note_pitch.append(int(pitch))
302
+ note_type.append(int(type_))
303
+ note_dur.append(end - start)
304
+ note_start.append(start)
305
+ note_end.append(end)
306
+ dur_sum += end - start
307
+
308
+ if len(note_text) > 0:
309
+ _append_segment_to_meta(
310
+ meta_path_str,
311
+ cut_wavs_output_dir,
312
+ vocal_file,
313
+ audio_data,
314
+ meta_data,
315
+ note_start,
316
+ note_end,
317
+ note_text,
318
+ note_pitch,
319
+ note_type,
320
+ note_dur,
321
+ )
322
+
323
+ remove_duplicate_segments(meta_data)
324
+
325
+ _rmvpe_path = rmvpe_model_path or DEFAULT_RMVPE_MODEL_PATH
326
+ converted_data = convert_meta(meta_data, _rmvpe_path, device)
327
+
328
+ with open(meta_path_str, "w", encoding="utf-8") as f:
329
+ json.dump(converted_data, f, ensure_ascii=False, indent=2)
330
+
331
+ # Clean up temporary cut wavs directory
332
+ try:
333
+ shutil.rmtree(cut_wavs_output_dir, ignore_errors=True)
334
+ except Exception:
335
+ pass
336
+
337
+
338
+ def notes2meta(
339
+ notes: List[Note],
340
+ meta_path: str,
341
+ vocal_file: str,
342
+ rmvpe_model_path: str | None = None,
343
+ device: str = "cuda",
344
+ ) -> None:
345
+ """Write SoulX-Singer metadata JSON from a list of Note (segmenting + wav cuts)."""
346
+ edit_data = [
347
+ {
348
+ "start": n.start_s,
349
+ "end": n.end_s,
350
+ "note_text": n.note_text,
351
+ "note_pitch": str(n.note_pitch),
352
+ "note_type": str(n.note_type),
353
+ }
354
+ for n in notes
355
+ ]
356
+ _edit_data_to_meta(
357
+ str(meta_path),
358
+ edit_data,
359
+ vocal_file,
360
+ rmvpe_model_path=rmvpe_model_path,
361
+ device=device,
362
+ )
363
+
364
+
365
+ @dataclass(frozen=True)
366
+ class MidiDefaults:
367
+ ticks_per_beat: int = 500
368
+ tempo: int = 500000 # microseconds per beat (120 BPM)
369
+ time_signature: Tuple[int, int] = (4, 4)
370
+ velocity: int = 64
371
+
372
+
373
+ def _seconds_to_ticks(seconds: float, ticks_per_beat: int, tempo: int) -> int:
374
+ return int(round(seconds * ticks_per_beat * 1_000_000 / tempo))
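+ # e.g. with MidiDefaults (ticks_per_beat=500, tempo=500000 us/beat): 1.0 s -> round(1.0 * 500 * 1e6 / 5e5) = 1000 ticks.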
375
+
376
+
377
+ def notes2midi(
378
+ notes: List[Note],
379
+ midi_path: str,
380
+ defaults: MidiDefaults | None = None,
381
+ ) -> None:
382
+ """Write MIDI file from a list of Note."""
383
+ defaults = defaults or MidiDefaults()
384
+ if not notes:
385
+ raise ValueError("Empty note list.")
386
+
387
+ events: List[Tuple[int, int, Union[mido.Message, mido.MetaMessage]]] = []
388
+ for n in notes:
389
+ start_s = n.start_s
390
+ end_s = n.end_s
391
+ if end_s <= start_s:
392
+ continue
393
+
394
+ start_ticks = _seconds_to_ticks(
395
+ start_s, defaults.ticks_per_beat, defaults.tempo
396
+ )
397
+ end_ticks = _seconds_to_ticks(
398
+ end_s, defaults.ticks_per_beat, defaults.tempo
399
+ )
400
+ if end_ticks <= start_ticks:
401
+ end_ticks = start_ticks + 1
402
+
403
+ lyric = n.note_text
404
+ # mido stores meta-message text as latin-1; round-trip the UTF-8 bytes
+ # through latin-1 so non-ASCII lyrics survive in the saved file.
+ try:
405
+ lyric = lyric.encode("utf-8").decode("latin1")
406
+ except (UnicodeEncodeError, UnicodeDecodeError):
407
+ pass
408
+ if n.note_type == 3:
409
+ lyric = "-"
410
+
411
+ events.append(
412
+ (start_ticks, 1, mido.MetaMessage("lyrics", text=lyric, time=0))
413
+ )
414
+ events.append(
415
+ (
416
+ start_ticks,
417
+ 2,
418
+ mido.Message(
419
+ "note_on",
420
+ note=n.note_pitch,
421
+ velocity=defaults.velocity,
422
+ time=0,
423
+ ),
424
+ )
425
+ )
426
+ events.append(
427
+ (
428
+ end_ticks,
429
+ 0,
430
+ mido.Message("note_off", note=n.note_pitch, velocity=0, time=0),
431
+ )
432
+ )
433
+
434
+ events.sort(key=lambda x: (x[0], x[1]))
435
+
436
+ mid = mido.MidiFile(ticks_per_beat=defaults.ticks_per_beat)
437
+ track = mido.MidiTrack()
438
+ mid.tracks.append(track)
439
+
440
+ track.append(mido.MetaMessage("set_tempo", tempo=defaults.tempo, time=0))
441
+ track.append(
442
+ mido.MetaMessage(
443
+ "time_signature",
444
+ numerator=defaults.time_signature[0],
445
+ denominator=defaults.time_signature[1],
446
+ time=0,
447
+ )
448
+ )
449
+
450
+ last_tick = 0
451
+ for tick, _, msg in events:
452
+ msg.time = max(0, tick - last_tick)
453
+ track.append(msg)
454
+ last_tick = tick
455
+
456
+ track.append(mido.MetaMessage("end_of_track", time=0))
457
+ mid.save(midi_path)
458
+
459
+
460
+ def midi2notes(midi_path: str) -> List[Note]:
461
+ """Parse MIDI file into a list of Note. Merges all tracks; tempo from last set_tempo event."""
462
+ mid = mido.MidiFile(midi_path)
463
+ ticks_per_beat = mid.ticks_per_beat
464
+ tempo = 500000
465
+
466
+ raw_notes: List[dict] = []
467
+ lyrics: List[Tuple[int, str]] = []
468
+
469
+ for track in mid.tracks:
470
+ abs_ticks = 0
471
+ active = {}
472
+ for msg in track:
473
+ abs_ticks += msg.time
474
+ if msg.type == "set_tempo":
475
+ tempo = msg.tempo
476
+ elif msg.type == "lyrics":
477
+ text = msg.text
478
+ # Undo the latin-1 round trip applied when the lyrics were written (see notes2midi).
+ try:
479
+ text = text.encode("latin1").decode("utf-8")
480
+ except Exception:
481
+ pass
482
+ lyrics.append((abs_ticks, text))
483
+ elif msg.type == "note_on":
484
+ key = (msg.channel, msg.note)
485
+ if msg.velocity > 0:
486
+ active[key] = (abs_ticks, msg.velocity)
487
+ else:
488
+ if key in active:
489
+ start_ticks, vel = active.pop(key)
490
+ raw_notes.append(
491
+ {
492
+ "midi": msg.note,
493
+ "start_ticks": start_ticks,
494
+ "duration_ticks": abs_ticks - start_ticks,
495
+ "velocity": vel,
496
+ "lyric": "",
497
+ }
498
+ )
499
+ elif msg.type == "note_off":
500
+ key = (msg.channel, msg.note)
501
+ if key in active:
502
+ start_ticks, vel = active.pop(key)
503
+ raw_notes.append(
504
+ {
505
+ "midi": msg.note,
506
+ "start_ticks": start_ticks,
507
+ "duration_ticks": abs_ticks - start_ticks,
508
+ "velocity": vel,
509
+ "lyric": "",
510
+ }
511
+ )
512
+
513
+ if not raw_notes:
514
+ raise ValueError("No notes found in MIDI file")
515
+
516
+ for n in raw_notes:
517
+ n["end_ticks"] = n["start_ticks"] + n["duration_ticks"]
518
+
519
+ raw_notes.sort(key=lambda n: n["start_ticks"])
520
+ lyrics.sort(key=lambda x: x[0])
521
+
522
+ trimmed = []
523
+ for note in raw_notes:
524
+ while trimmed:
525
+ prev = trimmed[-1]
526
+ if note["start_ticks"] < prev["end_ticks"]:
527
+ prev["end_ticks"] = note["start_ticks"]
528
+ prev["duration_ticks"] = prev["end_ticks"] - prev["start_ticks"]
529
+ if prev["duration_ticks"] <= 0:
530
+ trimmed.pop()
531
+ continue
532
+ break
533
+ trimmed.append(note)
534
+ raw_notes = trimmed
535
+
536
+ tolerance = ticks_per_beat // 100
537
+ lyric_idx = 0
538
+ for note in raw_notes:
539
+ while lyric_idx < len(lyrics) and lyrics[lyric_idx][0] < note["start_ticks"] - tolerance:
540
+ lyric_idx += 1
541
+ if lyric_idx < len(lyrics):
542
+ lyric_ticks, lyric_text = lyrics[lyric_idx]
543
+ if abs(lyric_ticks - note["start_ticks"]) <= tolerance:
544
+ note["lyric"] = lyric_text
545
+ lyric_idx += 1
546
+
547
+ def ticks_to_seconds(ticks: int) -> float:
548
+ return (ticks / ticks_per_beat) * (tempo / 1_000_000)
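+ # e.g. with ticks_per_beat=500 and tempo=500000: 1000 ticks -> (1000 / 500) * 0.5 = 1.0 s.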
549
+
550
+ result: List[Note] = []
551
+ prev_end_s = 0.0
552
+ for idx, n in enumerate(raw_notes):
553
+ start_s = ticks_to_seconds(n["start_ticks"])
554
+ end_s = ticks_to_seconds(n["end_ticks"])
555
+ if prev_end_s > start_s:
556
+ start_s = prev_end_s
557
+ dur_s = end_s - start_s
558
+ if dur_s <= 0:
559
+ continue
560
+
561
+ lyric = n.get("lyric", "")
562
+ if not lyric:
563
+ tp = 2
564
+ text = "啦"
565
+ elif lyric == "<SP>":
566
+ tp = 1
567
+ text = "<SP>"
568
+ elif lyric == "-":
569
+ tp = 3
570
+ text = raw_notes[idx - 1].get("lyric", "-") if idx > 0 else "-"
571
+ else:
572
+ tp = 2
573
+ text = lyric
574
+
575
+ result.append(
576
+ Note(
577
+ start_s=start_s,
578
+ note_dur=dur_s,
579
+ note_text=text,
580
+ note_pitch=n["midi"],
581
+ note_type=tp,
582
+ )
583
+ )
584
+ prev_end_s = end_s
585
+
586
+ return result
587
+
588
+
589
+ def meta2midi(meta_path: str, midi_path: str, defaults: MidiDefaults | None = None) -> None:
590
+ """Convert SoulX-Singer metadata JSON to MIDI file (meta -> List[Note] -> midi)."""
591
+ notes = meta2notes(meta_path)
592
+ notes2midi(notes, midi_path, defaults)
593
+ print(f"Saved MIDI to {midi_path}")
594
+
595
+
596
+ def midi2meta(
597
+ midi_path: str,
598
+ meta_path: str,
599
+ vocal_file: str,
600
+ rmvpe_model_path: str | None = None,
601
+ device: str = "cuda",
602
+ ) -> None:
603
+ """Convert MIDI file to SoulX-Singer metadata JSON (midi -> List[Note] -> meta)."""
604
+ meta_dir = os.path.dirname(meta_path)
605
+ if meta_dir:
606
+ os.makedirs(meta_dir, exist_ok=True)
607
+ # cut_wavs will be written to a fixed temporary directory inside _edit_data_to_meta
608
+ notes = midi2notes(midi_path)
609
+ notes2meta(
610
+ notes,
611
+ meta_path,
612
+ vocal_file,
613
+ rmvpe_model_path=rmvpe_model_path,
614
+ device=device,
615
+ )
616
+ print(f"Saved Meta to {meta_path}")
617
+
618
+
619
+ if __name__ == "__main__":
620
+ import argparse
621
+
622
+ parser = argparse.ArgumentParser(
623
+ description="Convert SoulX-Singer metadata JSON <-> MIDI."
624
+ )
625
+ parser.add_argument("--meta", type=str, help="Path to metadata JSON")
626
+ parser.add_argument("--midi", type=str, help="Path to MIDI file")
627
+ parser.add_argument("--vocal", type=str, help="Path to vocal wav (for midi2meta)")
628
+ parser.add_argument(
629
+ "--meta2midi",
630
+ action="store_true",
631
+ help="Convert meta -> midi (requires --meta and --midi)",
632
+ )
633
+ parser.add_argument(
634
+ "--midi2meta",
635
+ action="store_true",
636
+ help="Convert midi -> meta (requires --midi, --meta, --vocal, --cut_wavs_dir)",
637
+ )
638
+ parser.add_argument(
639
+ "--rmvpe_model_path",
640
+ type=str,
641
+ help="Path to RMVPE model",
642
+ default="pretrained_models/SoulX-Singer-Preprocess/rmvpe/rmvpe.pt",
643
+ )
644
+ parser.add_argument(
645
+ "--device",
646
+ type=str,
647
+ help="Device to use for RMVPE",
648
+ default="cuda",
649
+ )
650
+ args = parser.parse_args()
651
+
652
+ if args.meta2midi:
653
+ if not args.meta or not args.midi:
654
+ parser.error("--meta2midi requires --meta and --midi")
655
+ meta2midi(args.meta, args.midi)
656
+ elif args.midi2meta:
657
+ if not args.midi or not args.meta or not args.vocal:
658
+ parser.error(
659
+ "--midi2meta requires --midi, --meta, --vocal"
660
+ )
661
+ midi2meta(
662
+ args.midi,
663
+ args.meta,
664
+ args.vocal,
665
+ rmvpe_model_path=args.rmvpe_model_path,
666
+ device=args.device,
667
+ )
668
+ else:
669
+ parser.print_help()
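A hedged usage sketch of the converter API (the file paths are placeholders, and importing the module as `preprocess.tools.midi_parser` assumes the repo root is on PYTHONPATH):

    from preprocess.tools.midi_parser import meta2midi, midi2notes

    meta2midi("song.json", "song.mid")     # meta -> List[Note] -> MIDI
    for n in midi2notes("song.mid")[:3]:   # MIDI -> List[Note]
        print(f"{n.start_s:.2f}s dur={n.note_dur:.2f}s "
              f"pitch={n.note_pitch} type={n.note_type} text={n.note_text}")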
preprocess/tools/note_transcription/__init__.py ADDED
File without changes
preprocess/tools/note_transcription/model.py ADDED
@@ -0,0 +1,522 @@
1
+ # https://github.com/RickyL-2000/ROSVOT
2
+ import math
3
+ import sys
4
+ import traceback
5
+ import json
6
+ import time
7
+ from pathlib import Path
8
+ from typing import Any, Dict, Optional
9
+
10
+ import librosa
11
+ import numpy as np
12
+ import torch
13
+ import matplotlib.pyplot as plt
14
+
15
+ from .utils.os_utils import safe_path
16
+ from .utils.commons.hparams import set_hparams
17
+ from .utils.commons.ckpt_utils import load_ckpt
18
+ from .utils.commons.dataset_utils import pad_or_cut_xd
19
+ from .utils.audio.mel import MelNet
20
+ from .utils.audio.pitch_utils import (
21
+ norm_interp_f0,
22
+ denorm_f0,
23
+ f0_to_coarse,
24
+ boundary2Interval,
25
+ save_midi,
26
+ midi_to_hz,
27
+ )
28
+ from .utils.rosvot_utils import (
29
+ get_mel_len,
30
+ align_word,
31
+ regulate_real_note_itv,
32
+ regulate_ill_slur,
33
+ bd_to_durs,
34
+ )
35
+ from .modules.pe.rmvpe import RMVPE
36
+ from .modules.rosvot.rosvot import MidiExtractor, WordbdExtractor
37
+
38
+
39
+ @torch.no_grad()
40
+ def infer_sample(
41
+ item: Dict[str, Any],
42
+ hparams: Dict[str, Any],
43
+ models: Dict[str, Any],
44
+ device: torch.device,
45
+ *,
46
+ save_dir: Optional[str] = None,
47
+ apply_rwbd: Optional[bool] = None,
48
+ # outputs
49
+ save_plot: bool = False,
50
+ no_save_midi: bool = True,
51
+ no_save_npy: bool = True,
52
+ verbose: bool = False,
53
+ ) -> Dict[str, Any]:
54
+ if "item_name" not in item or "wav_fn" not in item:
55
+ raise ValueError('item must contain keys: "item_name" and "wav_fn"')
56
+
57
+ item_name = item["item_name"]
58
+ wav_src = item["wav_fn"]
59
+
60
+ # Decide RWBD usage
61
+ if apply_rwbd is None:
62
+ apply_rwbd_ = ("word_durs" not in item)
63
+ else:
64
+ apply_rwbd_ = bool(apply_rwbd)
65
+
66
+ # Models
67
+ model = models["model"]
68
+ mel_net = models["mel_net"]
69
+ pe = models.get("pe")
70
+ wbd_predictor = models.get("wbd_predictor")
71
+
72
+ if wbd_predictor is None and apply_rwbd_:
73
+ raise ValueError("apply_rwbd is True but wbd_predictor model is not provided in models")
74
+
75
+ # ---- Prepare Data ----
76
+ if isinstance(wav_src, str):
77
+ wav, _ = librosa.core.load(wav_src, sr=hparams["audio_sample_rate"])
78
+ else:
79
+ wav = wav_src
80
+ if not isinstance(wav, np.ndarray):
81
+ wav = np.asarray(wav)
82
+ wav = wav.astype(np.float32)
83
+
84
+ # Calculate timestamps and alignment lengths
85
+ wav_len_samples = wav.shape[-1]
86
+ mel_len = get_mel_len(wav_len_samples, hparams["hop_size"])
87
+
88
+ # Word boundary preparation
89
+ mel2word = None
90
+ word_durs_filtered = None
91
+
92
+ if not apply_rwbd_:
93
+ if "word_durs" not in item:
94
+ raise ValueError('apply_rwbd=False but item has no "word_durs"')
95
+
96
+ wd_raw = list(item["word_durs"])
97
+ min_word_dur = hparams.get("min_word_dur", 20) / 1000
98
+ word_durs_filtered = []
99
+
100
+ for i, wd in enumerate(wd_raw):
101
+ if wd < min_word_dur:
102
+ if i == 0 and len(wd_raw) > 1:
103
+ wd_raw[i + 1] += wd
104
+ elif len(word_durs_filtered) > 0:
105
+ word_durs_filtered[-1] += wd
106
+ else:
107
+ word_durs_filtered.append(wd)
108
+
109
+ mel2word, _ = align_word(word_durs_filtered, mel_len, hparams["hop_size"], hparams["audio_sample_rate"])
110
+ mel2word = np.asarray(mel2word)
111
+ if mel2word.size > 0 and mel2word[0] == 0:
112
+ mel2word = mel2word + 1
113
+
114
+ mel2word_len = int(np.sum(mel2word > 0))
115
+ real_len = min(mel_len, mel2word_len)
116
+ else:
117
+ real_len = min(mel_len, hparams["max_frames"])
118
+
119
+ T = math.ceil(min(real_len, hparams["max_frames"]) / hparams["frames_multiple"]) * hparams["frames_multiple"]
120
+
121
+ # ---- Input Tensors & Padding ----
122
+ target_samples = T * hparams["hop_size"]
123
+ wav_t = torch.from_numpy(wav).float().to(device).unsqueeze(0) # [1, L]
124
+ if wav_t.shape[-1] < target_samples:
125
+ wav_t = pad_or_cut_xd(wav_t, target_samples, 1)
126
+
127
+ # ---- Pitch Extraction ----
128
+ if pe is not None:
129
+ f0s, uvs = pe.get_pitch_batch(
130
+ wav_t,
131
+ sample_rate=hparams["audio_sample_rate"],
132
+ hop_size=hparams["hop_size"],
133
+ lengths=[real_len],
134
+ fmax=hparams["f0_max"],
135
+ fmin=hparams["f0_min"],
136
+ )
137
+ f0_1d, uv_1d = norm_interp_f0(f0s[0][:T])
138
+ f0_t = pad_or_cut_xd(torch.FloatTensor(f0_1d).to(device), T, 0).unsqueeze(0)
139
+ uv_t = pad_or_cut_xd(torch.FloatTensor(uv_1d).to(device), T, 0).long().unsqueeze(0)
140
+ pitch_coarse = f0_to_coarse(denorm_f0(f0_t, uv_t)).to(device)
141
+ f0_np = denorm_f0(f0_t, uv_t)[0].detach().cpu().numpy()[:real_len]
142
+ else:
143
+ f0_t = uv_t = pitch_coarse = None
144
+ f0_np = None
145
+
146
+ # ---- Mel Extraction ----
147
+ mel = mel_net(wav_t) # [1, T_padded, C]
148
+ mel = pad_or_cut_xd(mel, T, 1)
149
+
150
+ # Construct non-padding mask
151
+ mel_nonpadding_mask = torch.zeros(1, T, device=device)
152
+ mel_nonpadding_mask[:, :real_len] = 1.0
153
+
154
+ # Apply mask to mel (zero out padding)
155
+ mel = (mel.transpose(1, 2) * mel_nonpadding_mask.unsqueeze(1)).transpose(1, 2)
156
+ # Re-calculate non_padding bool mask
157
+ mel_nonpadding = mel.abs().sum(-1) > 0
158
+
159
+ # ---- Word Boundary ----
160
+ word_durs_used = None
161
+ if apply_rwbd_:
162
+ mel_input = mel[:, :, : hparams.get("wbd_use_mel_bins", 80)]
163
+ wbd_outputs = wbd_predictor(
164
+ mel=mel_input,
165
+ pitch=pitch_coarse,
166
+ uv=uv_t,
167
+ non_padding=mel_nonpadding,
168
+ train=False,
169
+ )
170
+ word_bd = wbd_outputs["word_bd_pred"] # [1, T]
171
+ else:
172
+ # Construct word_bd from provided durs
173
+ mel2word_t = pad_or_cut_xd(torch.LongTensor(mel2word).to(device), T, 0)
174
+ word_bd = torch.zeros_like(mel2word_t)
175
+ # Vectorized check
176
+ word_bd[1:] = (mel2word_t[1:] != mel2word_t[:-1]).long()
177
+ word_bd[real_len:] = 0
178
+ word_bd = word_bd.unsqueeze(0) # [1, T]
179
+
180
+ word_durs_used = np.array(word_durs_filtered)
181
+
182
+ # ---- Main Inference ----
183
+ mel_input = mel[:, :, : hparams.get("use_mel_bins", 80)]
184
+ outputs = model(
185
+ mel=mel_input,
186
+ word_bd=word_bd,
187
+ pitch=pitch_coarse,
188
+ uv=uv_t,
189
+ non_padding=mel_nonpadding,
190
+ train=False,
191
+ )
192
+
193
+ note_lengths = outputs["note_lengths"].detach().cpu().numpy()
194
+ note_bd_pred = outputs["note_bd_pred"][0].detach().cpu().numpy()[:real_len]
195
+ note_pred = outputs["note_pred"][0].detach().cpu().numpy()[: note_lengths[0]]
196
+ note_bd_logits = torch.sigmoid(outputs["note_bd_logits"])[0].detach().cpu().numpy()[:real_len]
197
+
198
+ if note_pred.shape == (0,):
199
+ if verbose:
200
+ print(f"skip {item_name}: no notes detected")
201
+ return {
202
+ "item_name": item_name,
203
+ "pitches": [],
204
+ "note_durs": [],
205
+ "note2words": None,
206
+ }
207
+
208
+ # ---- Post-Processing & Regulation ----
209
+ note_itv_pred = boundary2Interval(note_bd_pred)
210
+ note2words = None
211
+
212
+ if apply_rwbd_:
213
+ word_bd_np = outputs['word_bd_pred'][0].detach().cpu().numpy()[:real_len]
214
+ word_durs_derived = np.array(bd_to_durs(word_bd_np)) * hparams['hop_size'] / hparams['audio_sample_rate']
215
+ word_durs_for_reg = word_durs_derived
216
+ word_bd_for_reg = word_bd_np
217
+ else:
218
+ word_bd_for_reg = word_bd[0].detach().cpu().numpy()[:real_len]
219
+ word_durs_for_reg = word_durs_used
220
+
221
+ should_regulate = hparams.get("infer_regulate_real_note_itv", True) and (not apply_rwbd_)
222
+
223
+ if should_regulate and (word_durs_for_reg is not None):
224
+ try:
225
+ note_itv_pred_secs, note2words = regulate_real_note_itv(
226
+ note_itv_pred,
227
+ note_bd_pred,
228
+ word_bd_for_reg,
229
+ word_durs_for_reg,
230
+ hparams["hop_size"],
231
+ hparams["audio_sample_rate"],
232
+ )
233
+ note_pred, note_itv_pred_secs, note2words = regulate_ill_slur(note_pred, note_itv_pred_secs, note2words)
234
+ except Exception as err:
235
+ if verbose:
236
+ _, exc_value, exc_tb = sys.exc_info()
237
+ tb = traceback.extract_tb(exc_tb)[-1]
238
+ print(f"postprocess failed: {err}: {exc_value} in {tb[0]}:{tb[1]} '{tb[2]}' in {tb[3]}")
239
+ # Fallback
240
+ note_itv_pred_secs = note_itv_pred * hparams["hop_size"] / hparams["audio_sample_rate"]
241
+ note2words = None
242
+ else:
243
+ note_itv_pred_secs = note_itv_pred * hparams["hop_size"] / hparams["audio_sample_rate"]
244
+
245
+ # ---- Output ----
246
+ note_durs = [float((itv[1] - itv[0])) for itv in note_itv_pred_secs]
247
+
248
+ out = {
249
+ "item_name": item_name,
250
+ "pitches": note_pred.tolist(),
251
+ "note_durs": note_durs,
252
+ "note2words": note2words.tolist() if note2words is not None else None,
253
+ }
254
+
255
+ # ---- Saving ----
256
+ if save_dir is not None:
257
+ save_dir_path = Path(save_dir)
258
+ save_dir_path.mkdir(parents=True, exist_ok=True)
259
+ fn = str(item_name)
260
+
261
+ if not no_save_midi:
262
+ save_midi(note_pred, note_itv_pred_secs, safe_path(save_dir_path / "midi" / f"{fn}.mid"))
263
+
264
+ if not no_save_npy:
265
+ np.save(safe_path(save_dir_path / "npy" / f"[note]{fn}.npy"), out, allow_pickle=True)
266
+
267
+ if save_plot:
268
+ fig = plt.figure()
269
+ if f0_np is not None:
270
+ plt.plot(f0_np, color="red", label="f0")
271
+
272
+ midi_pred = np.zeros(note_bd_pred.shape[0], dtype=np.float32)
273
+ itvs = np.round(note_itv_pred_secs * hparams["audio_sample_rate"] / hparams["hop_size"]).astype(int)
274
+ for i, itv in enumerate(itvs):
275
+ midi_pred[itv[0] : itv[1]] = note_pred[i]
276
+ plt.plot(midi_to_hz(midi_pred), color="blue", label="pred midi")
277
+ plt.plot(note_bd_logits * 100, color="green", label="note bd logits x100")
278
+ plt.legend()
279
+ plt.tight_layout()
280
+ plt.savefig(safe_path(save_dir_path / "plot" / f"[MIDI]{fn}.png"), format="png")
281
+ plt.close(fig)
282
+
283
+ return out
284
+
285
+
286
+ def load_rosvot_models(ckpt, config="", wbd_ckpt="", wbd_config="", device="cuda:0", verbose=False, thr=0.85):
287
+ """
288
+ Load models once to reuse across multiple items.
289
+ """
290
+ dev = torch.device(device)
291
+
292
+ # 1. Hparams
293
+ config_path = Path(ckpt).with_name("config.yaml") if config == "" else config
294
+ pe_ckpt = Path(ckpt).parent.parent / "rmvpe/model.pt"
295
+ hparams = set_hparams(
296
+ config=config_path,
297
+ print_hparams=verbose,
298
+ hparams_str=f"note_bd_threshold={thr}",
299
+ )
300
+
301
+ # 2. Main Model
302
+ model = MidiExtractor(hparams)
303
+ load_ckpt(model, ckpt, verbose=verbose)
304
+ model.eval().to(dev)
305
+
306
+ # 3. MelNet
307
+ mel_net = MelNet(hparams)
308
+ mel_net.to(dev)
309
+
310
+ # 4. Pitch Extractor
311
+ pe = None
312
+ if hparams.get("use_pitch_embed", False):
313
+ pe = RMVPE(pe_ckpt, device=dev)
314
+
315
+ # 5. Word Boundary Predictor (optional but we load if ckpt provided or needed)
316
+ wbd_predictor = None
317
+ if wbd_ckpt:
318
+ wbd_config_path = Path(wbd_ckpt).with_name("config.yaml") if wbd_config == "" else wbd_config
319
+ wbd_hparams = set_hparams(
320
+ config=wbd_config_path,
321
+ print_hparams=False,
322
+ hparams_str="",
323
+ )
324
+ hparams.update({
325
+ "wbd_use_mel_bins": wbd_hparams["use_mel_bins"],
326
+ "min_word_dur": wbd_hparams["min_word_dur"],
327
+ })
328
+ wbd_predictor = WordbdExtractor(wbd_hparams)
329
+ load_ckpt(wbd_predictor, wbd_ckpt, verbose=verbose)
330
+ wbd_predictor.eval().to(dev)
331
+
332
+ models = {
333
+ "model": model,
334
+ "mel_net": mel_net,
335
+ "pe": pe,
336
+ "wbd_predictor": wbd_predictor
337
+ }
338
+ return hparams, models
339
+
340
+
341
+ class NoteTranscriber:
342
+ """Note transcription wrapper based on ROSVOT.
343
+
344
+ Loads ROSVOT and optional RWBD models once in ``__init__`` and
345
+ exposes a :py:meth:`process` API that turns an item dict into
346
+ aligned note metadata for downstream SVS.
347
+ """
348
+
349
+ def __init__(
350
+ self,
351
+ rosvot_model_path: str,
352
+ rwbd_model_path: str,
353
+ *,
354
+ rosvot_config_path: str = "",
355
+ rwbd_config_path: str = "",
356
+ device: str = "cuda:0",
357
+ thr: float = 0.85,
358
+ verbose: bool = True,
359
+ ):
360
+ """Initialize the note transcriber.
361
+
362
+ Args:
363
+ rosvot_model_path: Path to the main ROSVOT checkpoint.
364
+ rosvot_config_path: Optional config YAML path for ROSVOT.
365
+ rwbd_model_path: Word-boundary (RWBD) checkpoint path.
366
+ rwbd_config_path: Optional config YAML path for RWBD.
367
+ device: Torch device string, e.g. ``"cuda:0"`` / ``"cpu"``.
368
+ thr: Note boundary threshold.
369
+ verbose: Whether to print verbose logs.
370
+ """
371
+ self.verbose = verbose
372
+ self.device = torch.device(device)
373
+ self.hparams, self.models = load_rosvot_models(
374
+ ckpt=rosvot_model_path,
375
+ config=rosvot_config_path,
376
+ wbd_ckpt=rwbd_model_path,
377
+ wbd_config=rwbd_config_path,
378
+ device=device,
379
+ verbose=verbose,
380
+ thr=thr,
381
+ )
382
+
383
+ if self.verbose:
384
+ print(
385
+ "[note transcription] init success:",
386
+ f"device={self.device}",
387
+ f"rosvot_model_path={rosvot_model_path}",
388
+ f"rwbd_model_path={rwbd_model_path if rwbd_model_path else 'None'}",
389
+ f"thr={thr}",
390
+ )
391
+
392
+ def process(
393
+ self,
394
+ item: Dict[str, Any],
395
+ *,
396
+ segment_info: Optional[Dict[str, Any]] = None,
397
+ save_dir: Optional[str] = None,
398
+ apply_rwbd: Optional[bool] = None,
399
+ save_plot: bool = False,
400
+ no_save_midi: bool = True,
401
+ no_save_npy: bool = True,
402
+ verbose: Optional[bool] = None,
403
+ ) -> Dict[str, Any]:
404
+ """Run ROSVOT on a single item and post-process outputs.
405
+
406
+ Args:
407
+ item: Input metadata dict with at least ``item_name`` and ``wav_fn``.
408
+ segment_info: Optional segment metadata for sliced audio.
409
+ save_dir: Optional directory for debug artifacts (plots, midis).
410
+ apply_rwbd: Whether to run RWBD-based word boundary refinement.
411
+ save_plot: Whether to save diagnostic plots.
412
+ no_save_midi: If True, skip saving midi.
413
+ no_save_npy: If True, skip saving numpy intermediates.
414
+ verbose: Override instance-level verbose flag for this call.
415
+
416
+ Returns:
417
+ Dict with aligned note information for downstream SVS.
418
+ """
419
+ v = self.verbose if verbose is None else verbose
420
+ if v:
421
+ item_name = item.get("item_name", "")
422
+ wav_fn = item.get("wav_fn", "")
423
+ print(f"[note transcription] process: start: item_name={item_name} wav_fn={wav_fn}")
424
+ t0 = time.time()
425
+
426
+ rosvot_out = infer_sample(
427
+ item,
428
+ self.hparams,
429
+ self.models,
430
+ device=self.device,
431
+ save_dir=save_dir,
432
+ apply_rwbd=apply_rwbd,
433
+ save_plot=save_plot,
434
+ no_save_midi=no_save_midi,
435
+ no_save_npy=no_save_npy,
436
+ verbose=v,
437
+ )
438
+
439
+ out = self.post_process(
440
+ metadata=item,
441
+ segment_info=segment_info,
442
+ rosvot_out=rosvot_out,
443
+ )
444
+
445
+ if v:
446
+ dt = time.time() - t0
447
+ print(
448
+ "[note transcription] process: done:",
449
+ f"item_name={out.get('item_name','')}",
450
+ f"n_notes={len(out.get('note_pitch', []) or [])}",
451
+ f"time={dt:.3f}s",
452
+ )
453
+
454
+ return out
455
+
456
+ @staticmethod
457
+ def _normalize_note2words(note2words: list[int]) -> list[int]:
458
+ if not note2words:
459
+ return []
460
+ normalized = [note2words[0]]
461
+ for idx in range(1, len(note2words)):
462
+ if note2words[idx] < normalized[-1]:
463
+ normalized.append(normalized[-1])
464
+ else:
465
+ normalized.append(note2words[idx])
466
+ return normalized
467
+
468
+ @staticmethod
469
+ def _build_ep_types(note2words: list[int], align_words: list[str]) -> list[int]:
470
+ ep_types: list[int] = []
471
+ prev = -1
472
+ for i, w in zip(note2words, align_words):
473
+ if w == "<SP>":
474
+ ep_types.append(1)
475
+ else:
476
+ ep_types.append(2 if i != prev else 3)
477
+ prev = i
478
+ return ep_types
479
+
480
+ def post_process(
481
+ self,
482
+ *,
483
+ metadata: Dict[str, Any],
484
+ segment_info: Dict[str, Any],
485
+ rosvot_out: Dict[str, Any],
486
+ ) -> Dict[str, Any]:
487
+ """Build aligned note metadata using ROSVOT outputs."""
488
+ note2words_raw = rosvot_out.get("note2words") or []
489
+ note2words = self._normalize_note2words(note2words_raw)
490
+ align_words = [
491
+ metadata["words"][idx - 1]
492
+ for idx in note2words_raw
493
+ if 0 < idx <= len(metadata["words"])
494
+ ]
495
+ ep_types = self._build_ep_types(note2words, align_words) if align_words else []
496
+
497
+ return {
498
+ "item_name": rosvot_out.get("item_name", "") if not segment_info else segment_info["item_name"],
499
+ "wav_fn": metadata.get("wav_fn", "") if not segment_info else segment_info["wav_fn"],
500
+ "origin_wav_fn": metadata.get("origin_wav_fn", "") if not segment_info else segment_info["origin_wav_fn"],
501
+ "start_time_ms": "" if not segment_info else segment_info["start_time_ms"],
502
+ "end_time_ms": "" if not segment_info else segment_info["end_time_ms"],
503
+ "language": metadata.get("language", ""),
504
+ "note_text": align_words,
505
+ "note_dur": rosvot_out.get("note_durs", []),
506
+ "note_type": ep_types,
507
+ "note_pitch": rosvot_out.get("pitches", []),
508
+ }
509
+
510
+ if __name__ == "__main__":
511
+
512
+ items = json.load(open("example/test/rosvot_input.json", "r"))
513
+ item = items[0]
514
+
515
+ m = NoteTranscriber(
516
+ rosvot_model_path="pretrained_models/rosvot/rosvot/model.pt",
517
+ rwbd_model_path="pretrained_models/rosvot/rwbd/model.pt",
518
+ device="cuda"
519
+ )
520
+ out = m.process(item)
521
+
522
+ print(out)
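The expected shape of the input item is implicit in `infer_sample` and `post_process`; a minimal sketch (all field values are placeholders):

    item = {
        "item_name": "demo_0",                 # required
        "wav_fn": "example/test/demo_0.wav",   # required: path or waveform array
        # Optional per-word durations in seconds; when present, RWBD is skipped
        # because apply_rwbd defaults to ("word_durs" not in item).
        "word_durs": [0.40, 0.52, 0.61],
        # Consumed by post_process to attach lyrics to notes via note2words.
        "words": ["<SP>", "你", "好"],
    }
    out = m.process(item)  # keys: note_text, note_dur, note_type, note_pitch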
preprocess/tools/note_transcription/modules/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """ROSVOT model submodules."""
preprocess/tools/note_transcription/modules/commons/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Common ROSVOT layers and utilities."""
preprocess/tools/note_transcription/modules/commons/conformer/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Conformer layers for ROSVOT."""
preprocess/tools/note_transcription/modules/commons/conformer/conformer.py ADDED
@@ -0,0 +1,96 @@
1
+ from torch import nn
2
+ from .espnet_positional_embedding import RelPositionalEncoding, ScaledPositionalEncoding, PositionalEncoding
3
+ from .espnet_transformer_attn import RelPositionMultiHeadedAttention, MultiHeadedAttention
4
+ from .layers import Swish, ConvolutionModule, EncoderLayer, MultiLayeredConv1d
5
+ from ..layers import Embedding
6
+
7
+
8
+ class ConformerLayers(nn.Module):
9
+ def __init__(self, hidden_size, num_layers, kernel_size=9, dropout=0.0, num_heads=4,
10
+ use_last_norm=True, save_hidden=False):
11
+ super().__init__()
12
+ self.use_last_norm = use_last_norm
13
+ self.layers = nn.ModuleList()
14
+ positionwise_layer = MultiLayeredConv1d
15
+ positionwise_layer_args = (hidden_size, hidden_size * 4, 1, dropout)
16
+ self.pos_embed = RelPositionalEncoding(hidden_size, dropout)
17
+ self.encoder_layers = nn.ModuleList([EncoderLayer(
18
+ hidden_size,
19
+ RelPositionMultiHeadedAttention(num_heads, hidden_size, 0.0),
20
+ positionwise_layer(*positionwise_layer_args),
21
+ positionwise_layer(*positionwise_layer_args),
22
+ ConvolutionModule(hidden_size, kernel_size, Swish()),
23
+ dropout,
24
+ ) for _ in range(num_layers)])
25
+ if self.use_last_norm:
26
+ self.layer_norm = nn.LayerNorm(hidden_size)
27
+ else:
28
+ self.layer_norm = nn.Linear(hidden_size, hidden_size)
29
+ self.save_hidden = save_hidden
30
+ if save_hidden:
31
+ self.hiddens = []
32
+
33
+ def forward(self, x, padding_mask=None):
34
+ """
35
+
36
+ :param x: [B, T, H]
37
+ :param padding_mask: [B, T] (currently unused; the non-padding mask is derived from x)
38
+ :return: [B, T, H]
39
+ """
40
+ self.hiddens = []
41
+ nonpadding_mask = x.abs().sum(-1) > 0
42
+ x = self.pos_embed(x)
43
+ for l in self.encoder_layers:
44
+ x, mask = l(x, nonpadding_mask[:, None, :])
45
+ if self.save_hidden:
46
+ self.hiddens.append(x[0])
47
+ x = x[0]
48
+ x = self.layer_norm(x) * nonpadding_mask.float()[:, :, None]
49
+ return x
50
+
51
+ class FastConformerLayers(ConformerLayers):
52
+ def __init__(self, hidden_size, num_layers, kernel_size=9, dropout=0.0, num_heads=4,
53
+ use_last_norm=True, save_hidden=False):
54
+ # Only run nn.Module.__init__ here; this class rebuilds the positional
+ # embedding and attention below instead of reusing ConformerLayers'.
+ super(ConformerLayers, self).__init__()
55
+ self.use_last_norm = use_last_norm
56
+ self.layers = nn.ModuleList()
57
+ positionwise_layer = MultiLayeredConv1d
58
+ positionwise_layer_args = (hidden_size, hidden_size * 4, 1, dropout)
59
+ self.pos_embed = PositionalEncoding(hidden_size, dropout)
60
+ self.encoder_layers = nn.ModuleList([EncoderLayer(
61
+ hidden_size,
62
+ MultiHeadedAttention(num_heads, hidden_size, 0.0, flash=True),
63
+ positionwise_layer(*positionwise_layer_args),
64
+ positionwise_layer(*positionwise_layer_args),
65
+ ConvolutionModule(hidden_size, kernel_size, Swish()),
66
+ dropout,
67
+ ) for _ in range(num_layers)])
68
+ if self.use_last_norm:
69
+ self.layer_norm = nn.LayerNorm(hidden_size)
70
+ else:
71
+ self.layer_norm = nn.Linear(hidden_size, hidden_size)
72
+ self.save_hidden = save_hidden
73
+ if save_hidden:
74
+ self.hiddens = []
75
+
76
+ class ConformerEncoder(ConformerLayers):
77
+ def __init__(self, hidden_size, dict_size, num_layers=None):
78
+ conformer_enc_kernel_size = 9
79
+ super().__init__(hidden_size, num_layers, conformer_enc_kernel_size)
80
+ self.embed = Embedding(dict_size, hidden_size, padding_idx=0)
81
+
82
+ def forward(self, x):
83
+ """
84
+
85
+ :param x: [B, T] token ids
86
+ :return: [B x T x C]
87
+ """
88
+ x = self.embed(x) # [B, T, H]
89
+ x = super(ConformerEncoder, self).forward(x)
90
+ return x
91
+
92
+
93
+ class ConformerDecoder(ConformerLayers):
94
+ def __init__(self, hidden_size, num_layers):
95
+ conformer_dec_kernel_size = 9
96
+ super().__init__(hidden_size, num_layers, conformer_dec_kernel_size)
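A quick shape check for ConformerLayers (a sketch; assumes torch is installed and the class above is in scope). All-zero frames act as padding because the mask is derived from x.abs().sum(-1) > 0:

    import torch

    layers = ConformerLayers(hidden_size=256, num_layers=2)
    x = torch.randn(2, 128, 256)
    x[1, 100:] = 0.0                       # simulate right-padding on item 1
    y = layers(x)
    print(y.shape)                         # torch.Size([2, 128, 256])
    print(float(y[1, 100:].abs().sum()))   # ~0.0: padded frames stay zeroed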
preprocess/tools/note_transcription/modules/commons/conformer/espnet_positional_embedding.py ADDED
@@ -0,0 +1,113 @@
1
+ import math
2
+ import torch
3
+
4
+
5
+ class PositionalEncoding(torch.nn.Module):
6
+ """Positional encoding.
7
+ Args:
8
+ d_model (int): Embedding dimension.
9
+ dropout_rate (float): Dropout rate.
10
+ max_len (int): Maximum input length.
11
+ reverse (bool): Whether to reverse the input position.
12
+ """
13
+
14
+ def __init__(self, d_model, dropout_rate, max_len=5000, reverse=False):
15
+ """Construct an PositionalEncoding object."""
16
+ super(PositionalEncoding, self).__init__()
17
+ self.d_model = d_model
18
+ self.reverse = reverse
19
+ self.xscale = math.sqrt(self.d_model)
20
+ self.dropout = torch.nn.Dropout(p=dropout_rate)
21
+ self.pe = None
22
+ self.extend_pe(torch.tensor(0.0).expand(1, max_len))
23
+
24
+ def extend_pe(self, x):
25
+ """Reset the positional encodings."""
26
+ if self.pe is not None:
27
+ if self.pe.size(1) >= x.size(1):
28
+ if self.pe.dtype != x.dtype or self.pe.device != x.device:
29
+ self.pe = self.pe.to(dtype=x.dtype, device=x.device)
30
+ return
31
+ pe = torch.zeros(x.size(1), self.d_model)
32
+ if self.reverse:
33
+ position = torch.arange(
34
+ x.size(1) - 1, -1, -1.0, dtype=torch.float32
35
+ ).unsqueeze(1)
36
+ else:
37
+ position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)
38
+ div_term = torch.exp(
39
+ torch.arange(0, self.d_model, 2, dtype=torch.float32)
40
+ * -(math.log(10000.0) / self.d_model)
41
+ )
42
+ pe[:, 0::2] = torch.sin(position * div_term)
43
+ pe[:, 1::2] = torch.cos(position * div_term)
44
+ pe = pe.unsqueeze(0)
45
+ self.pe = pe.to(device=x.device, dtype=x.dtype)
46
+
47
+ def forward(self, x: torch.Tensor):
48
+ """Add positional encoding.
49
+ Args:
50
+ x (torch.Tensor): Input tensor (batch, time, `*`).
51
+ Returns:
52
+ torch.Tensor: Encoded tensor (batch, time, `*`).
53
+ """
54
+ self.extend_pe(x)
55
+ x = x * self.xscale + self.pe[:, : x.size(1)]
56
+ return self.dropout(x)
57
+
58
+
59
+ class ScaledPositionalEncoding(PositionalEncoding):
60
+ """Scaled positional encoding module.
61
+ See Sec. 3.2 https://arxiv.org/abs/1809.08895
62
+ Args:
63
+ d_model (int): Embedding dimension.
64
+ dropout_rate (float): Dropout rate.
65
+ max_len (int): Maximum input length.
66
+ """
67
+
68
+ def __init__(self, d_model, dropout_rate, max_len=5000):
69
+ """Initialize class."""
70
+ super().__init__(d_model=d_model, dropout_rate=dropout_rate, max_len=max_len)
71
+ self.alpha = torch.nn.Parameter(torch.tensor(1.0))
72
+
73
+ def reset_parameters(self):
74
+ """Reset parameters."""
75
+ self.alpha.data = torch.tensor(1.0)
76
+
77
+ def forward(self, x):
78
+ """Add positional encoding.
79
+ Args:
80
+ x (torch.Tensor): Input tensor (batch, time, `*`).
81
+ Returns:
82
+ torch.Tensor: Encoded tensor (batch, time, `*`).
83
+ """
84
+ self.extend_pe(x)
85
+ x = x + self.alpha * self.pe[:, : x.size(1)]
86
+ return self.dropout(x)
87
+
88
+
89
+ class RelPositionalEncoding(PositionalEncoding):
90
+ """Relative positional encoding module.
91
+ See : Appendix B in https://arxiv.org/abs/1901.02860
92
+ Args:
93
+ d_model (int): Embedding dimension.
94
+ dropout_rate (float): Dropout rate.
95
+ max_len (int): Maximum input length.
96
+ """
97
+
98
+ def __init__(self, d_model, dropout_rate, max_len=5000):
99
+ """Initialize class."""
100
+ super().__init__(d_model, dropout_rate, max_len, reverse=True)
101
+
102
+ def forward(self, x):
103
+ """Compute positional encoding.
104
+ Args:
105
+ x (torch.Tensor): Input tensor (batch, time, `*`).
106
+ Returns:
107
+ torch.Tensor: Encoded tensor (batch, time, `*`).
108
+ torch.Tensor: Positional embedding tensor (1, time, `*`).
109
+ """
110
+ self.extend_pe(x)
111
+ x = x * self.xscale
112
+ pos_emb = self.pe[:, : x.size(1)]
113
+ return self.dropout(x), self.dropout(pos_emb)
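The three classes differ mainly in what they return; a small sketch (assumes torch is installed and the classes above are in scope):

    import torch

    x = torch.randn(2, 50, 64)
    print(PositionalEncoding(d_model=64, dropout_rate=0.0)(x).shape)  # [2, 50, 64]

    x_scaled, pos_emb = RelPositionalEncoding(64, 0.0)(x)  # tuple for rel-pos attention
    print(x_scaled.shape, pos_emb.shape)                   # [2, 50, 64] and [1, 50, 64]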
preprocess/tools/note_transcription/modules/commons/conformer/espnet_transformer_attn.py ADDED
@@ -0,0 +1,198 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+
4
+ # Copyright 2019 Shigeki Karita
5
+ # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
6
+
7
+ """Multi-Head Attention layer definition."""
8
+
9
+ from packaging import version
10
+ import math
11
+
12
+ import numpy
13
+ import torch
14
+ from torch import nn
15
+
16
+
17
+ class MultiHeadedAttention(nn.Module):
18
+ """Multi-Head Attention layer.
19
+ Args:
20
+ n_head (int): The number of heads.
21
+ n_feat (int): The number of features.
22
+ dropout_rate (float): Dropout rate.
23
+ """
24
+
25
+ def __init__(self, n_head, n_feat, dropout_rate, flash=False):
26
+ """Construct an MultiHeadedAttention object."""
27
+ super(MultiHeadedAttention, self).__init__()
28
+ assert n_feat % n_head == 0
29
+ # We assume d_v always equals d_k
30
+ self.d_k = n_feat // n_head
31
+ self.h = n_head
32
+ self.linear_q = nn.Linear(n_feat, n_feat)
33
+ self.linear_k = nn.Linear(n_feat, n_feat)
34
+ self.linear_v = nn.Linear(n_feat, n_feat)
35
+ self.linear_out = nn.Linear(n_feat, n_feat)
36
+ self.attn = None
37
+ self.dropout = nn.Dropout(p=dropout_rate)
38
+ self.dropout_rate = dropout_rate
39
+ self.flash = flash
40
+
41
+ def forward_qkv(self, query, key, value):
42
+ """Transform query, key and value.
43
+ Args:
44
+ query (torch.Tensor): Query tensor (#batch, time1, size).
45
+ key (torch.Tensor): Key tensor (#batch, time2, size).
46
+ value (torch.Tensor): Value tensor (#batch, time2, size).
47
+ Returns:
48
+ torch.Tensor: Transformed query tensor (#batch, n_head, time1, d_k).
49
+ torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).
50
+ torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
51
+ """
52
+ n_batch = query.size(0)
53
+ q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
54
+ k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
55
+ v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
56
+ q = q.transpose(1, 2) # (batch, head, time1, d_k)
57
+ k = k.transpose(1, 2) # (batch, head, time2, d_k)
58
+ v = v.transpose(1, 2) # (batch, head, time2, d_k)
59
+
60
+ return q, k, v
61
+
62
+ def forward_attention(self, value, scores, mask):
63
+ """Compute attention context vector.
64
+ Args:
65
+ value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k).
66
+ scores (torch.Tensor): Attention score (#batch, n_head, time1, time2).
67
+ mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2).
68
+ Returns:
69
+ torch.Tensor: Transformed value (#batch, time1, d_model)
70
+ weighted by the attention score (#batch, time1, time2).
71
+ """
72
+ n_batch = value.size(0)
73
+ if mask is not None:
74
+ mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
75
+ min_value = float(
76
+ numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min
77
+ )
78
+ scores = scores.masked_fill(mask, min_value)
79
+ self.attn = torch.softmax(scores, dim=-1).masked_fill(
80
+ mask, 0.0
81
+ ) # (batch, head, time1, time2)
82
+ else:
83
+ self.attn = torch.softmax(scores, dim=-1) # (batch, head, time1, time2)
84
+
85
+ p_attn = self.dropout(self.attn)
86
+ x = torch.matmul(p_attn, value) # (batch, head, time1, d_k)
87
+ x = (
88
+ x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
89
+ ) # (batch, time1, d_model)
90
+
91
+ return self.linear_out(x) # (batch, time1, d_model)
92
+
93
+ def forward(self, query, key, value, mask):
94
+ """Compute scaled dot product attention.
95
+ Args:
96
+ query (torch.Tensor): Query tensor (#batch, time1, size).
97
+ key (torch.Tensor): Key tensor (#batch, time2, size).
98
+ value (torch.Tensor): Value tensor (#batch, time2, size).
99
+ mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
100
+ (#batch, time1, time2).
101
+ Returns:
102
+ torch.Tensor: Output tensor (#batch, time1, d_model).
103
+ """
104
+ q, k, v = self.forward_qkv(query, key, value)
105
+ if version.parse(torch.__version__) >= version.parse("2.0") and self.flash:
106
+ n_batch = value.size(0)
107
+ x = torch.nn.functional.scaled_dot_product_attention(
108
+ q, k, v, attn_mask=mask.unsqueeze(1) if mask is not None else None, dropout_p=self.dropout_rate)
109
+ x = (
110
+ x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
111
+ ) # (batch, time1, d_model)
112
+ return self.linear_out(x)
113
+ else:
114
+ scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
115
+ return self.forward_attention(v, scores, mask)
116
+
117
+
118
+ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
119
+ """Multi-Head Attention layer with relative position encoding.
120
+ Paper: https://arxiv.org/abs/1901.02860
121
+ Args:
122
+ n_head (int): The number of heads.
123
+ n_feat (int): The number of features.
124
+ dropout_rate (float): Dropout rate.
125
+ """
126
+
127
+ def __init__(self, n_head, n_feat, dropout_rate):
128
+ """Construct an RelPositionMultiHeadedAttention object."""
129
+ super().__init__(n_head, n_feat, dropout_rate)
130
+ # linear transformation for positional ecoding
131
+ self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)
132
+ # these two learnable bias are used in matrix c and matrix d
133
+ # as described in https://arxiv.org/abs/1901.02860 Section 3.3
134
+ self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))
135
+ self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))
136
+ torch.nn.init.xavier_uniform_(self.pos_bias_u)
137
+ torch.nn.init.xavier_uniform_(self.pos_bias_v)
138
+
139
+ def rel_shift(self, x, zero_triu=False):
140
+ """Compute relative positinal encoding.
141
+ Args:
142
+ x (torch.Tensor): Input tensor (batch, time, size).
143
+ zero_triu (bool): If true, return the lower triangular part of the matrix.
144
+ Returns:
145
+ torch.Tensor: Output tensor.
146
+ """
147
+ zero_pad = torch.zeros((*x.size()[:3], 1), device=x.device, dtype=x.dtype)
148
+ x_padded = torch.cat([zero_pad, x], dim=-1)
149
+
150
+ x_padded = x_padded.view(*x.size()[:2], x.size(3) + 1, x.size(2))
151
+ x = x_padded[:, :, 1:].view_as(x)
152
+
153
+ if zero_triu:
154
+ ones = torch.ones((x.size(2), x.size(3)))
155
+ x = x * torch.tril(ones, x.size(3) - x.size(2))[None, None, :, :]
156
+
157
+ return x
158
+
159
+ def forward(self, query, key, value, pos_emb, mask):
160
+ """Compute 'Scaled Dot Product Attention' with rel. positional encoding.
161
+ Args:
162
+ query (torch.Tensor): Query tensor (#batch, time1, size).
163
+ key (torch.Tensor): Key tensor (#batch, time2, size).
164
+ value (torch.Tensor): Value tensor (#batch, time2, size).
165
+ pos_emb (torch.Tensor): Positional embedding tensor (#batch, time2, size).
166
+ mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
167
+ (#batch, time1, time2).
168
+ Returns:
169
+ torch.Tensor: Output tensor (#batch, time1, d_model).
170
+ """
171
+ q, k, v = self.forward_qkv(query, key, value)
172
+ q = q.transpose(1, 2) # (batch, time1, head, d_k)
173
+
174
+ n_batch_pos = pos_emb.size(0)
175
+ p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
176
+ p = p.transpose(1, 2) # (batch, head, time1, d_k)
177
+
178
+ # (batch, head, time1, d_k)
179
+ q_with_bias_u = (q + self.pos_bias_u).transpose(1, 2)
180
+ # (batch, head, time1, d_k)
181
+ q_with_bias_v = (q + self.pos_bias_v).transpose(1, 2)
182
+
183
+ # compute attention score
184
+ # first compute matrix a and matrix c
185
+ # as described in https://arxiv.org/abs/1901.02860 Section 3.3
186
+ # (batch, head, time1, time2)
187
+ matrix_ac = torch.matmul(q_with_bias_u, k.transpose(-2, -1))
188
+
189
+ # compute matrix b and matrix d
190
+ # (batch, head, time1, time2)
191
+ matrix_bd = torch.matmul(q_with_bias_v, p.transpose(-2, -1))
192
+ matrix_bd = self.rel_shift(matrix_bd)
193
+
194
+ scores = (matrix_ac + matrix_bd) / math.sqrt(
195
+ self.d_k
196
+ ) # (batch, head, time1, time2)
197
+
198
+ return self.forward_attention(v, scores, mask)
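A quick shape check for the plain `MultiHeadedAttention` defined above. The import path is derived from this commit's file layout; the head count, feature size, and mask are illustrative:

```python
# Illustrative usage sketch; shapes follow the docstrings above.
import torch
from preprocess.tools.note_transcription.modules.commons.conformer.espnet_transformer_attn import (
    MultiHeadedAttention,
)

attn = MultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.1)
x = torch.randn(2, 50, 256)                     # (batch, time, n_feat)
mask = torch.ones(2, 1, 50, dtype=torch.bool)   # (batch, 1, time2); 1 = valid position
out = attn(x, x, x, mask)                       # self-attention
print(out.shape)                                # torch.Size([2, 50, 256])
```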
preprocess/tools/note_transcription/modules/commons/conformer/layers.py ADDED
@@ -0,0 +1,260 @@
+from torch import nn
+import torch
+
+from ..layers import LayerNorm
+
+
+class ConvolutionModule(nn.Module):
+    """ConvolutionModule in Conformer model.
+    Args:
+        channels (int): The number of channels of conv layers.
+        kernel_size (int): Kernel size of conv layers.
+    """
+
+    def __init__(self, channels, kernel_size, activation=nn.ReLU(), bias=True):
+        """Construct a ConvolutionModule object."""
+        super(ConvolutionModule, self).__init__()
+        # kernel_size should be an odd number for 'SAME' padding
+        assert (kernel_size - 1) % 2 == 0
+
+        self.pointwise_conv1 = nn.Conv1d(
+            channels,
+            2 * channels,
+            kernel_size=1,
+            stride=1,
+            padding=0,
+            bias=bias,
+        )
+        self.depthwise_conv = nn.Conv1d(
+            channels,
+            channels,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+            groups=channels,
+            bias=bias,
+        )
+        self.norm = nn.BatchNorm1d(channels)
+        self.pointwise_conv2 = nn.Conv1d(
+            channels,
+            channels,
+            kernel_size=1,
+            stride=1,
+            padding=0,
+            bias=bias,
+        )
+        self.activation = activation
+
+    def forward(self, x):
+        """Compute convolution module.
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, channels).
+        Returns:
+            torch.Tensor: Output tensor (#batch, time, channels).
+        """
+        # exchange the temporal dimension and the feature dimension
+        x = x.transpose(1, 2)
+
+        # GLU mechanism
+        x = self.pointwise_conv1(x)  # (batch, 2*channel, dim)
+        x = nn.functional.glu(x, dim=1)  # (batch, channel, dim)
+
+        # 1D Depthwise Conv
+        x = self.depthwise_conv(x)
+        x = self.activation(self.norm(x))
+
+        x = self.pointwise_conv2(x)
+
+        return x.transpose(1, 2)
+
+
+class MultiLayeredConv1d(torch.nn.Module):
+    """Multi-layered conv1d for Transformer block.
+    This is a module of multi-layered conv1d designed
+    to replace the position-wise feed-forward network
+    in a Transformer block, as introduced in
+    `FastSpeech: Fast, Robust and Controllable Text to Speech`_.
+    .. _`FastSpeech: Fast, Robust and Controllable Text to Speech`:
+        https://arxiv.org/pdf/1905.09263.pdf
+    """
+
+    def __init__(self, in_chans, hidden_chans, kernel_size, dropout_rate):
+        """Initialize MultiLayeredConv1d module.
+        Args:
+            in_chans (int): Number of input channels.
+            hidden_chans (int): Number of hidden channels.
+            kernel_size (int): Kernel size of conv1d.
+            dropout_rate (float): Dropout rate.
+        """
+        super(MultiLayeredConv1d, self).__init__()
+        self.w_1 = torch.nn.Conv1d(
+            in_chans,
+            hidden_chans,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+        )
+        self.w_2 = torch.nn.Conv1d(
+            hidden_chans,
+            in_chans,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+        )
+        self.dropout = torch.nn.Dropout(dropout_rate)
+
+    def forward(self, x):
+        """Calculate forward propagation.
+        Args:
+            x (torch.Tensor): Batch of input tensors (B, T, in_chans).
+        Returns:
+            torch.Tensor: Batch of output tensors (B, T, in_chans).
+        """
+        x = torch.relu(self.w_1(x.transpose(-1, 1))).transpose(-1, 1)
+        return self.w_2(self.dropout(x).transpose(-1, 1)).transpose(-1, 1)
+
+
+class Swish(torch.nn.Module):
+    """Construct a Swish activation object."""
+
+    def forward(self, x):
+        """Return Swish activation function."""
+        return x * torch.sigmoid(x)
+
+
+class EncoderLayer(nn.Module):
+    """Encoder layer module.
+    Args:
+        size (int): Input dimension.
+        self_attn (torch.nn.Module): Self-attention module instance.
+            `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance
+            can be used as the argument.
+        feed_forward (torch.nn.Module): Feed-forward module instance.
+            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
+            can be used as the argument.
+        feed_forward_macaron (torch.nn.Module): Additional feed-forward module instance.
+            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
+            can be used as the argument.
+        conv_module (torch.nn.Module): Convolution module instance.
+            `ConvolutionModule` instance can be used as the argument.
+        dropout_rate (float): Dropout rate.
+        normalize_before (bool): Whether to use layer_norm before the first block.
+        concat_after (bool): Whether to concat attention layer's input and output.
+            If True, an additional linear layer will be applied,
+            i.e. x -> x + linear(concat(x, att(x)));
+            if False, no additional linear layer will be applied, i.e. x -> x + att(x).
+    """
+
+    def __init__(
+        self,
+        size,
+        self_attn,
+        feed_forward,
+        feed_forward_macaron,
+        conv_module,
+        dropout_rate,
+        normalize_before=True,
+        concat_after=False,
+    ):
+        """Construct an EncoderLayer object."""
+        super(EncoderLayer, self).__init__()
+        self.self_attn = self_attn
+        self.feed_forward = feed_forward
+        self.feed_forward_macaron = feed_forward_macaron
+        self.conv_module = conv_module
+        self.norm_ff = LayerNorm(size)  # for the FNN module
+        self.norm_mha = LayerNorm(size)  # for the MHA module
+        if feed_forward_macaron is not None:
+            self.norm_ff_macaron = LayerNorm(size)
+            self.ff_scale = 0.5
+        else:
+            self.ff_scale = 1.0
+        if self.conv_module is not None:
+            self.norm_conv = LayerNorm(size)  # for the CNN module
+            self.norm_final = LayerNorm(size)  # for the final output of the block
+        self.dropout = nn.Dropout(dropout_rate)
+        self.size = size
+        self.normalize_before = normalize_before
+        self.concat_after = concat_after
+        if self.concat_after:
+            self.concat_linear = nn.Linear(size + size, size)
+
+    def forward(self, x_input, mask, cache=None):
+        """Compute encoded features.
+        Args:
+            x_input (Union[Tuple, torch.Tensor]): Input tensor w/ or w/o pos emb.
+                - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)].
+                - w/o pos emb: Tensor (#batch, time, size).
+            mask (torch.Tensor): Mask tensor for the input (#batch, time).
+            cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
+        Returns:
+            torch.Tensor: Output tensor (#batch, time, size).
+            torch.Tensor: Mask tensor (#batch, time).
+        """
+        if isinstance(x_input, tuple):
+            x, pos_emb = x_input[0], x_input[1]
+        else:
+            x, pos_emb = x_input, None
+
+        # whether to use macaron style
+        if self.feed_forward_macaron is not None:
+            residual = x
+            if self.normalize_before:
+                x = self.norm_ff_macaron(x)
+            x = residual + self.ff_scale * self.dropout(self.feed_forward_macaron(x))
+            if not self.normalize_before:
+                x = self.norm_ff_macaron(x)
+
+        # multi-headed self-attention module
+        residual = x
+        if self.normalize_before:
+            x = self.norm_mha(x)
+
+        if cache is None:
+            x_q = x
+        else:
+            assert cache.shape == (x.shape[0], x.shape[1] - 1, self.size)
+            x_q = x[:, -1:, :]
+            residual = residual[:, -1:, :]
+            mask = None if mask is None else mask[:, -1:, :]
+
+        if pos_emb is not None:
+            x_att = self.self_attn(x_q, x, x, pos_emb, mask)
+        else:
+            x_att = self.self_attn(x_q, x, x, mask)
+
+        if self.concat_after:
+            x_concat = torch.cat((x, x_att), dim=-1)
+            x = residual + self.concat_linear(x_concat)
+        else:
+            x = residual + self.dropout(x_att)
+        if not self.normalize_before:
+            x = self.norm_mha(x)
+
+        # convolution module
+        if self.conv_module is not None:
+            residual = x
+            if self.normalize_before:
+                x = self.norm_conv(x)
+            x = residual + self.dropout(self.conv_module(x))
+            if not self.normalize_before:
+                x = self.norm_conv(x)
+
+        # feed forward module
+        residual = x
+        if self.normalize_before:
+            x = self.norm_ff(x)
+        x = residual + self.ff_scale * self.dropout(self.feed_forward(x))
+        if not self.normalize_before:
+            x = self.norm_ff(x)
+
+        if self.conv_module is not None:
+            x = self.norm_final(x)
+
+        if cache is not None:
+            x = torch.cat([cache, x], dim=1)
+
+        if pos_emb is not None:
+            return (x, pos_emb), mask
+
+        return x, mask
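The `EncoderLayer` signature above shows how a full Conformer block is assembled from the pieces in this file and the attention file. The following sketch wires one up with illustrative sizes; it is not code from this commit:

```python
# Illustrative assembly of one Conformer block from the modules above.
import torch
from preprocess.tools.note_transcription.modules.commons.conformer.espnet_transformer_attn import (
    MultiHeadedAttention,
)
from preprocess.tools.note_transcription.modules.commons.conformer.layers import (
    ConvolutionModule, EncoderLayer, MultiLayeredConv1d, Swish,
)

size = 256  # illustrative model dimension
layer = EncoderLayer(
    size=size,
    self_attn=MultiHeadedAttention(4, size, 0.1),
    feed_forward=MultiLayeredConv1d(size, 1024, kernel_size=3, dropout_rate=0.1),
    feed_forward_macaron=MultiLayeredConv1d(size, 1024, kernel_size=3, dropout_rate=0.1),
    conv_module=ConvolutionModule(size, kernel_size=31, activation=Swish()),
    dropout_rate=0.1,
)
x = torch.randn(2, 50, size)
out, mask = layer(x, torch.ones(2, 1, 50, dtype=torch.bool))  # no pos emb -> plain tensor out
print(out.shape)  # torch.Size([2, 50, 256])
```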
preprocess/tools/note_transcription/modules/commons/conv.py ADDED
@@ -0,0 +1,175 @@
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from .layers import LayerNorm, Embedding
+
+
+class LambdaLayer(nn.Module):
+    def __init__(self, lambd):
+        super(LambdaLayer, self).__init__()
+        self.lambd = lambd
+
+    def forward(self, x):
+        return self.lambd(x)
+
+
+def init_weights_func(m):
+    classname = m.__class__.__name__
+    if classname.find("Conv1d") != -1:
+        torch.nn.init.xavier_uniform_(m.weight)
+
+
+def get_norm_builder(norm_type, channels, ln_eps=1e-6):
+    if norm_type == 'bn':
+        norm_builder = lambda: nn.BatchNorm1d(channels)
+    elif norm_type == 'in':
+        norm_builder = lambda: nn.InstanceNorm1d(channels, affine=True)
+    elif norm_type == 'gn':
+        norm_builder = lambda: nn.GroupNorm(8, channels)
+    elif norm_type == 'ln':
+        norm_builder = lambda: LayerNorm(channels, dim=1, eps=ln_eps)
+    else:
+        norm_builder = lambda: nn.Identity()
+    return norm_builder
+
+
+def get_act_builder(act_type):
+    if act_type == 'gelu':
+        act_builder = lambda: nn.GELU()
+    elif act_type == 'relu':
+        act_builder = lambda: nn.ReLU(inplace=True)
+    elif act_type == 'leakyrelu':
+        act_builder = lambda: nn.LeakyReLU(negative_slope=0.01, inplace=True)
+    elif act_type == 'swish':
+        act_builder = lambda: nn.SiLU(inplace=True)
+    else:
+        act_builder = lambda: nn.Identity()
+    return act_builder
+
+
+class ResidualBlock(nn.Module):
+    """Applies n residual blocks of (norm -> conv -> activation -> conv)."""
+
+    def __init__(self, channels, kernel_size, dilation, n=2, norm_type='bn', dropout=0.0,
+                 c_multiple=2, ln_eps=1e-12, act_type='gelu'):
+        super(ResidualBlock, self).__init__()
+
+        norm_builder = get_norm_builder(norm_type, channels, ln_eps)
+        act_builder = get_act_builder(act_type)
+
+        self.blocks = [
+            nn.Sequential(
+                norm_builder(),
+                nn.Conv1d(channels, c_multiple * channels, kernel_size, dilation=dilation,
+                          padding=(dilation * (kernel_size - 1)) // 2),
+                LambdaLayer(lambda x: x * kernel_size ** -0.5),
+                act_builder(),
+                nn.Conv1d(c_multiple * channels, channels, 1, dilation=dilation),
+            )
+            for i in range(n)
+        ]
+
+        self.blocks = nn.ModuleList(self.blocks)
+        self.dropout = dropout
+
+    def forward(self, x):
+        nonpadding = (x.abs().sum(1) > 0).float()[:, None, :]
+        for b in self.blocks:
+            x_ = b(x)
+            if self.dropout > 0 and self.training:
+                x_ = F.dropout(x_, self.dropout, training=self.training)
+            x = x + x_
+            x = x * nonpadding
+        return x
+
+
+class ConvBlocks(nn.Module):
+    """Decodes the expanded phoneme encoding into spectrograms."""
+
+    def __init__(self, hidden_size, out_dims, dilations, kernel_size,
+                 norm_type='ln', layers_in_block=2, c_multiple=2,
+                 dropout=0.0, ln_eps=1e-5,
+                 init_weights=True, is_BTC=True, num_layers=None, post_net_kernel=3, act_type='gelu'):
+        super(ConvBlocks, self).__init__()
+        self.is_BTC = is_BTC
+        if num_layers is not None:
+            dilations = [1] * num_layers
+        self.res_blocks = nn.Sequential(
+            *[ResidualBlock(hidden_size, kernel_size, d,
+                            n=layers_in_block, norm_type=norm_type, c_multiple=c_multiple,
+                            dropout=dropout, ln_eps=ln_eps, act_type=act_type)
+              for d in dilations],
+        )
+        norm = get_norm_builder(norm_type, hidden_size, ln_eps)()
+        self.last_norm = norm
+        self.post_net1 = nn.Conv1d(hidden_size, out_dims, kernel_size=post_net_kernel,
+                                   padding=post_net_kernel // 2)
+        if init_weights:
+            self.apply(init_weights_func)
+
+    def forward(self, x, nonpadding=None):
+        """
+        :param x: [B, T, H]
+        :return: [B, T, H]
+        """
+        if self.is_BTC:
+            x = x.transpose(1, 2)
+        if nonpadding is None:
+            nonpadding = (x.abs().sum(1) > 0).float()[:, None, :]
+        elif self.is_BTC:
+            nonpadding = nonpadding.transpose(1, 2)
+        x = self.res_blocks(x) * nonpadding
+        x = self.last_norm(x) * nonpadding
+        x = self.post_net1(x) * nonpadding
+        if self.is_BTC:
+            x = x.transpose(1, 2)
+        return x
+
+
+class TextConvEncoder(ConvBlocks):
+    def __init__(self, dict_size, hidden_size, out_dims, dilations, kernel_size,
+                 norm_type='ln', layers_in_block=2, c_multiple=2,
+                 dropout=0.0, ln_eps=1e-5, init_weights=True, num_layers=None, post_net_kernel=3):
+        super().__init__(hidden_size, out_dims, dilations, kernel_size,
+                         norm_type, layers_in_block, c_multiple,
+                         dropout, ln_eps, init_weights, num_layers=num_layers,
+                         post_net_kernel=post_net_kernel)
+        self.embed_tokens = Embedding(dict_size, hidden_size, 0)
+        self.embed_scale = math.sqrt(hidden_size)
+
+    def forward(self, txt_tokens):
+        """
+        :param txt_tokens: [B, T]
+        :return: {
+            'encoder_out': [B x T x C]
+        }
+        """
+        x = self.embed_scale * self.embed_tokens(txt_tokens)
+        return super().forward(x)
+
+
+class ConditionalConvBlocks(ConvBlocks):
+    def __init__(self, hidden_size, c_cond, c_out, dilations, kernel_size,
+                 norm_type='ln', layers_in_block=2, c_multiple=2,
+                 dropout=0.0, ln_eps=1e-5, init_weights=True, is_BTC=True, num_layers=None):
+        super().__init__(hidden_size, c_out, dilations, kernel_size,
+                         norm_type, layers_in_block, c_multiple,
+                         dropout, ln_eps, init_weights, is_BTC=False, num_layers=num_layers)
+        self.g_prenet = nn.Conv1d(c_cond, hidden_size, 3, padding=1)
+        self.is_BTC_ = is_BTC
+        if init_weights:
+            self.g_prenet.apply(init_weights_func)
+
+    def forward(self, x, cond, nonpadding=None):
+        if self.is_BTC_:
+            x = x.transpose(1, 2)
+            cond = cond.transpose(1, 2)
+            if nonpadding is not None:
+                nonpadding = nonpadding.transpose(1, 2)
+        if nonpadding is None:
+            nonpadding = x.abs().sum(1)[:, None]
+        x = x + self.g_prenet(cond)
+        x = x * nonpadding
+        x = super(ConditionalConvBlocks, self).forward(x)  # input needs to be BTC
+        if self.is_BTC_:
+            x = x.transpose(1, 2)
+        return x
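A minimal shape check for `ConvBlocks`, assuming the import path implied by the file location; all hyperparameters are illustrative. Rows that are entirely zero are treated as padding by the non-padding mask:

```python
# Illustrative only; dilations and sizes are made up for the example.
import torch
from preprocess.tools.note_transcription.modules.commons.conv import ConvBlocks

net = ConvBlocks(hidden_size=192, out_dims=80, dilations=[1, 2, 4] * 2, kernel_size=5)
x = torch.randn(2, 100, 192)   # [B, T, H]; zero frames would be masked out
y = net(x)
print(y.shape)                  # torch.Size([2, 100, 80])
```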
preprocess/tools/note_transcription/modules/commons/layers.py ADDED
@@ -0,0 +1,85 @@
+import torch
+from torch import nn
+from torch.autograd import Function
+
+
+class LayerNorm(torch.nn.LayerNorm):
+    """Layer normalization module.
+    :param int nout: output dim size
+    :param int dim: dimension to be normalized
+    """
+
+    def __init__(self, nout, dim=-1, eps=1e-5):
+        """Construct a LayerNorm object."""
+        super(LayerNorm, self).__init__(nout, eps=eps)
+        self.dim = dim
+
+    def forward(self, x):
+        """Apply layer normalization.
+        :param torch.Tensor x: input tensor
+        :return: layer normalized tensor
+        :rtype: torch.Tensor
+        """
+        if self.dim == -1:
+            return super(LayerNorm, self).forward(x)
+        return super(LayerNorm, self).forward(x.transpose(1, -1)).transpose(1, -1)
+
+
+class Reshape(nn.Module):
+    def __init__(self, *args):
+        super(Reshape, self).__init__()
+        self.shape = args
+
+    def forward(self, x):
+        return x.view(self.shape)
+
+
+class Permute(nn.Module):
+    def __init__(self, *args):
+        super(Permute, self).__init__()
+        self.args = args
+
+    def forward(self, x):
+        return x.permute(self.args)
+
+
+def Linear(in_features, out_features, bias=True, init_type='xavier'):
+    m = nn.Linear(in_features, out_features, bias)
+    if init_type == 'xavier':
+        nn.init.xavier_uniform_(m.weight)
+    elif init_type == 'kaiming':
+        nn.init.kaiming_normal_(m.weight, mode='fan_in')
+    if bias:
+        nn.init.constant_(m.bias, 0.)
+    return m
+
+
+def Embedding(num_embeddings, embedding_dim, padding_idx=None, init_type='normal'):
+    m = nn.Embedding(num_embeddings, embedding_dim, padding_idx=padding_idx)
+    if init_type == 'normal':
+        nn.init.normal_(m.weight, mean=0, std=embedding_dim ** -0.5)
+    elif init_type == 'kaiming':
+        nn.init.kaiming_normal_(m.weight, mode='fan_in')
+    if padding_idx is not None:
+        nn.init.constant_(m.weight[padding_idx], 0)
+    return m
+
+
+class GradientReverseFunction(Function):
+    @staticmethod
+    def forward(ctx, input, coeff=1.):
+        ctx.coeff = coeff
+        output = input * 1.0
+        return output
+
+    @staticmethod
+    def backward(ctx, grad_output):
+        return grad_output.neg() * ctx.coeff, None
+
+
+class GRL(nn.Module):
+    def __init__(self):
+        super(GRL, self).__init__()
+
+    def forward(self, *input):
+        return GradientReverseFunction.apply(*input)
+
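`GradientReverseFunction`/`GRL` above implement a gradient reversal layer, as used for adversarial objectives: identity on the forward pass, negated (and optionally scaled) gradients on the backward pass. A small sanity check, illustrative rather than from this commit:

```python
# Verify that GRL passes activations through but flips gradient sign.
import torch
from preprocess.tools.note_transcription.modules.commons.layers import GRL

grl = GRL()
x = torch.ones(3, requires_grad=True)
y = grl(x)           # forward is the identity
y.sum().backward()
print(x.grad)        # tensor([-1., -1., -1.]) -- gradients are negated
```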
preprocess/tools/note_transcription/modules/commons/rel_transformer.py ADDED
@@ -0,0 +1,378 @@
+import math
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+from .layers import Embedding
+
+
+def convert_pad_shape(pad_shape):
+    l = pad_shape[::-1]
+    pad_shape = [item for sublist in l for item in sublist]
+    return pad_shape
+
+
+def shift_1d(x):
+    x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1]
+    return x
+
+
+def sequence_mask(length, max_length=None):
+    if max_length is None:
+        max_length = length.max()
+    x = torch.arange(max_length, dtype=length.dtype, device=length.device)
+    return x.unsqueeze(0) < length.unsqueeze(1)
+
+
+class Encoder(nn.Module):
+    def __init__(self, hidden_channels, filter_channels, n_heads, n_layers, kernel_size=1, p_dropout=0.,
+                 window_size=None, block_length=None, pre_ln=False, **kwargs):
+        super().__init__()
+        self.hidden_channels = hidden_channels
+        self.filter_channels = filter_channels
+        self.n_heads = n_heads
+        self.n_layers = n_layers
+        self.kernel_size = kernel_size
+        self.p_dropout = p_dropout
+        self.window_size = window_size
+        self.block_length = block_length
+        self.pre_ln = pre_ln
+
+        self.drop = nn.Dropout(p_dropout)
+        self.attn_layers = nn.ModuleList()
+        self.norm_layers_1 = nn.ModuleList()
+        self.ffn_layers = nn.ModuleList()
+        self.norm_layers_2 = nn.ModuleList()
+        for i in range(self.n_layers):
+            self.attn_layers.append(
+                MultiHeadAttention(hidden_channels, hidden_channels, n_heads, window_size=window_size,
+                                   p_dropout=p_dropout, block_length=block_length))
+            self.norm_layers_1.append(LayerNorm(hidden_channels))
+            self.ffn_layers.append(
+                FFN(hidden_channels, hidden_channels, filter_channels, kernel_size, p_dropout=p_dropout))
+            self.norm_layers_2.append(LayerNorm(hidden_channels))
+        if pre_ln:
+            self.last_ln = LayerNorm(hidden_channels)
+
+    def forward(self, x, x_mask):
+        attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
+        for i in range(self.n_layers):
+            x = x * x_mask
+            x_ = x
+            if self.pre_ln:
+                x = self.norm_layers_1[i](x)
+            y = self.attn_layers[i](x, x, attn_mask)
+            y = self.drop(y)
+            x = x_ + y
+            if not self.pre_ln:
+                x = self.norm_layers_1[i](x)
+
+            x_ = x
+            if self.pre_ln:
+                x = self.norm_layers_2[i](x)
+            y = self.ffn_layers[i](x, x_mask)
+            y = self.drop(y)
+            x = x_ + y
+            if not self.pre_ln:
+                x = self.norm_layers_2[i](x)
+        if self.pre_ln:
+            x = self.last_ln(x)
+        x = x * x_mask
+        return x
+
+
+class MultiHeadAttention(nn.Module):
+    def __init__(self, channels, out_channels, n_heads, window_size=None, heads_share=True, p_dropout=0.,
+                 block_length=None, proximal_bias=False, proximal_init=False):
+        super().__init__()
+        assert channels % n_heads == 0
+
+        self.channels = channels
+        self.out_channels = out_channels
+        self.n_heads = n_heads
+        self.window_size = window_size
+        self.heads_share = heads_share
+        self.block_length = block_length
+        self.proximal_bias = proximal_bias
+        self.p_dropout = p_dropout
+        self.attn = None
+
+        self.k_channels = channels // n_heads
+        self.conv_q = nn.Conv1d(channels, channels, 1)
+        self.conv_k = nn.Conv1d(channels, channels, 1)
+        self.conv_v = nn.Conv1d(channels, channels, 1)
+        if window_size is not None:
+            n_heads_rel = 1 if heads_share else n_heads
+            rel_stddev = self.k_channels ** -0.5
+            self.emb_rel_k = nn.Parameter(torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev)
+            self.emb_rel_v = nn.Parameter(torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev)
+        self.conv_o = nn.Conv1d(channels, out_channels, 1)
+        self.drop = nn.Dropout(p_dropout)
+
+        nn.init.xavier_uniform_(self.conv_q.weight)
+        nn.init.xavier_uniform_(self.conv_k.weight)
+        if proximal_init:
+            self.conv_k.weight.data.copy_(self.conv_q.weight.data)
+            self.conv_k.bias.data.copy_(self.conv_q.bias.data)
+        nn.init.xavier_uniform_(self.conv_v.weight)
+
+    def forward(self, x, c, attn_mask=None):
+        q = self.conv_q(x)
+        k = self.conv_k(c)
+        v = self.conv_v(c)
+
+        x, self.attn = self.attention(q, k, v, mask=attn_mask)
+
+        x = self.conv_o(x)
+        return x
+
+    def attention(self, query, key, value, mask=None):
+        # reshape [b, d, t] -> [b, n_h, t, d_k]
+        b, d, t_s, t_t = (*key.size(), query.size(2))
+        query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3)
+        key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+        value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+
+        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.k_channels)
+        if self.window_size is not None:
+            assert t_s == t_t, "Relative attention is only available for self-attention."
+            key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s)
+            rel_logits = self._matmul_with_relative_keys(query, key_relative_embeddings)
+            rel_logits = self._relative_position_to_absolute_position(rel_logits)
+            scores_local = rel_logits / math.sqrt(self.k_channels)
+            scores = scores + scores_local
+        if self.proximal_bias:
+            assert t_s == t_t, "Proximal bias is only available for self-attention."
+            scores = scores + self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype)
+        if mask is not None:
+            scores = scores.masked_fill(mask == 0, -1e4)
+            if self.block_length is not None:
+                block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length)
+                scores = scores * block_mask + -1e4 * (1 - block_mask)
+        p_attn = F.softmax(scores, dim=-1)  # [b, n_h, t_t, t_s]
+        p_attn = self.drop(p_attn)
+        output = torch.matmul(p_attn, value)
+        if self.window_size is not None:
+            relative_weights = self._absolute_position_to_relative_position(p_attn)
+            value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s)
+            output = output + self._matmul_with_relative_values(relative_weights, value_relative_embeddings)
+        output = output.transpose(2, 3).contiguous().view(b, d, t_t)  # [b, n_h, t_t, d_k] -> [b, d, t_t]
+        return output, p_attn
+
+    def _matmul_with_relative_values(self, x, y):
+        """
+        x: [b, h, l, m]
+        y: [h or 1, m, d]
+        ret: [b, h, l, d]
+        """
+        ret = torch.matmul(x, y.unsqueeze(0))
+        return ret
+
+    def _matmul_with_relative_keys(self, x, y):
+        """
+        x: [b, h, l, d]
+        y: [h or 1, m, d]
+        ret: [b, h, l, m]
+        """
+        ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1))
+        return ret
+
+    def _get_relative_embeddings(self, relative_embeddings, length):
+        max_relative_position = 2 * self.window_size + 1
+        # Pad first before slice to avoid using cond ops.
+        pad_length = max(length - (self.window_size + 1), 0)
+        slice_start_position = max((self.window_size + 1) - length, 0)
+        slice_end_position = slice_start_position + 2 * length - 1
+        if pad_length > 0:
+            padded_relative_embeddings = F.pad(
+                relative_embeddings,
+                convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]))
+        else:
+            padded_relative_embeddings = relative_embeddings
+        used_relative_embeddings = padded_relative_embeddings[:, slice_start_position:slice_end_position]
+        return used_relative_embeddings
+
+    def _relative_position_to_absolute_position(self, x):
+        """
+        x: [b, h, l, 2*l-1]
+        ret: [b, h, l, l]
+        """
+        batch, heads, length, _ = x.size()
+        # Concat columns of pad to shift from relative to absolute indexing.
+        x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]]))
+
+        # Concat extra elements so to add up to shape (len+1, 2*len-1).
+        x_flat = x.view([batch, heads, length * 2 * length])
+        x_flat = F.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [0, length - 1]]))
+
+        # Reshape and slice out the padded elements.
+        x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[:, :, :length, length - 1:]
+        return x_final
+
+    def _absolute_position_to_relative_position(self, x):
+        """
+        x: [b, h, l, l]
+        ret: [b, h, l, 2*l-1]
+        """
+        batch, heads, length, _ = x.size()
+        # pad along column
+        x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]]))
+        x_flat = x.view([batch, heads, length ** 2 + length * (length - 1)])
+        # add 0's in the beginning that will skew the elements after reshape
+        x_flat = F.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [length, 0]]))
+        x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:]
+        return x_final
+
+    def _attention_bias_proximal(self, length):
+        """Bias for self-attention to encourage attention to close positions.
+        Args:
+            length: an integer scalar.
+        Returns:
+            a Tensor with shape [1, 1, length, length]
+        """
+        r = torch.arange(length, dtype=torch.float32)
+        diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1)
+        return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0)
+
+
+class FFN(nn.Module):
+    def __init__(self, in_channels, out_channels, filter_channels, kernel_size, p_dropout=0., activation=None):
+        super().__init__()
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.filter_channels = filter_channels
+        self.kernel_size = kernel_size
+        self.p_dropout = p_dropout
+        self.activation = activation
+
+        self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size, padding=kernel_size // 2)
+        self.conv_2 = nn.Conv1d(filter_channels, out_channels, 1)
+        self.drop = nn.Dropout(p_dropout)
+
+    def forward(self, x, x_mask):
+        x = self.conv_1(x * x_mask)
+        if self.activation == "gelu":
+            x = x * torch.sigmoid(1.702 * x)
+        else:
+            x = torch.relu(x)
+        x = self.drop(x)
+        x = self.conv_2(x * x_mask)
+        return x * x_mask
+
+
+class LayerNorm(nn.Module):
+    def __init__(self, channels, eps=1e-4):
+        super().__init__()
+        self.channels = channels
+        self.eps = eps
+
+        self.gamma = nn.Parameter(torch.ones(channels))
+        self.beta = nn.Parameter(torch.zeros(channels))
+
+    def forward(self, x):
+        n_dims = len(x.shape)
+        mean = torch.mean(x, 1, keepdim=True)
+        variance = torch.mean((x - mean) ** 2, 1, keepdim=True)
+
+        x = (x - mean) * torch.rsqrt(variance + self.eps)
+
+        shape = [1, -1] + [1] * (n_dims - 2)
+        x = x * self.gamma.view(*shape) + self.beta.view(*shape)
+        return x
+
+
+class ConvReluNorm(nn.Module):
+    def __init__(self, in_channels, hidden_channels, out_channels, kernel_size, n_layers, p_dropout):
+        super().__init__()
+        self.in_channels = in_channels
+        self.hidden_channels = hidden_channels
+        self.out_channels = out_channels
+        self.kernel_size = kernel_size
+        self.n_layers = n_layers
+        self.p_dropout = p_dropout
+        assert n_layers > 1, "Number of layers should be larger than 1."
+
+        self.conv_layers = nn.ModuleList()
+        self.norm_layers = nn.ModuleList()
+        self.conv_layers.append(nn.Conv1d(in_channels, hidden_channels, kernel_size, padding=kernel_size // 2))
+        self.norm_layers.append(LayerNorm(hidden_channels))
+        self.relu_drop = nn.Sequential(
+            nn.ReLU(),
+            nn.Dropout(p_dropout))
+        for _ in range(n_layers - 1):
+            self.conv_layers.append(nn.Conv1d(hidden_channels, hidden_channels, kernel_size, padding=kernel_size // 2))
+            self.norm_layers.append(LayerNorm(hidden_channels))
+        self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
+        self.proj.weight.data.zero_()
+        self.proj.bias.data.zero_()
+
+    def forward(self, x, x_mask):
+        x_org = x
+        for i in range(self.n_layers):
+            x = self.conv_layers[i](x * x_mask)
+            x = self.norm_layers[i](x)
+            x = self.relu_drop(x)
+        x = x_org + self.proj(x)
+        return x * x_mask
+
+
+class RelTransformerEncoder(nn.Module):
+    def __init__(self,
+                 n_vocab,
+                 out_channels,
+                 hidden_channels,
+                 filter_channels,
+                 n_heads,
+                 n_layers,
+                 kernel_size,
+                 p_dropout=0.0,
+                 window_size=4,
+                 block_length=None,
+                 prenet=True,
+                 pre_ln=True,
+                 ):
+
+        super().__init__()
+
+        self.n_vocab = n_vocab
+        self.out_channels = out_channels
+        self.hidden_channels = hidden_channels
+        self.filter_channels = filter_channels
+        self.n_heads = n_heads
+        self.n_layers = n_layers
+        self.kernel_size = kernel_size
+        self.p_dropout = p_dropout
+        self.window_size = window_size
+        self.block_length = block_length
+        self.prenet = prenet
+        if n_vocab > 0:
+            self.emb = Embedding(n_vocab, hidden_channels, padding_idx=0)
+
+        if prenet:
+            self.pre = ConvReluNorm(hidden_channels, hidden_channels, hidden_channels,
+                                    kernel_size=5, n_layers=3, p_dropout=0)
+        self.encoder = Encoder(
+            hidden_channels,
+            filter_channels,
+            n_heads,
+            n_layers,
+            kernel_size,
+            p_dropout,
+            window_size=window_size,
+            block_length=block_length,
+            pre_ln=pre_ln,
+        )
+
+    def forward(self, x, x_mask=None):
+        if self.n_vocab > 0:
+            x_lengths = (x > 0).long().sum(-1)
+            x = self.emb(x) * math.sqrt(self.hidden_channels)  # [b, t, h]
+        else:
+            x_lengths = (x.abs().sum(-1) > 0).long().sum(-1)
+        x = torch.transpose(x, 1, -1)  # [b, h, t]
+        x_mask = torch.unsqueeze(sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
+
+        if self.prenet:
+            x = self.pre(x, x_mask)
+        x = self.encoder(x, x_mask)
+        return x.transpose(1, 2)
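A sketch of running a padded token batch through `RelTransformerEncoder`; since the embedding uses `padding_idx=0`, sequence lengths are inferred from the non-zero tokens. All sizes are illustrative, not values used by this commit:

```python
# Illustrative usage of RelTransformerEncoder with a padded token batch.
import torch
from preprocess.tools.note_transcription.modules.commons.rel_transformer import RelTransformerEncoder

enc = RelTransformerEncoder(
    n_vocab=100, out_channels=192, hidden_channels=192,
    filter_channels=768, n_heads=2, n_layers=4, kernel_size=3,
)
tokens = torch.tensor([[5, 7, 9, 0, 0], [3, 4, 6, 8, 2]])  # [B, T]; 0 = padding
out = enc(tokens)
print(out.shape)  # torch.Size([2, 5, 192]); padded frames are masked to zero
```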
preprocess/tools/note_transcription/modules/commons/rnn.py ADDED
@@ -0,0 +1,261 @@
+import torch
+from torch import nn
+import torch.nn.functional as F
+
+
+class PreNet(nn.Module):
+    def __init__(self, in_dims, fc1_dims=256, fc2_dims=128, dropout=0.5):
+        super().__init__()
+        self.fc1 = nn.Linear(in_dims, fc1_dims)
+        self.fc2 = nn.Linear(fc1_dims, fc2_dims)
+        self.p = dropout
+
+    def forward(self, x):
+        x = self.fc1(x)
+        x = F.relu(x)
+        x = F.dropout(x, self.p, training=self.training)
+        x = self.fc2(x)
+        x = F.relu(x)
+        x = F.dropout(x, self.p, training=self.training)
+        return x
+
+
+class HighwayNetwork(nn.Module):
+    def __init__(self, size):
+        super().__init__()
+        self.W1 = nn.Linear(size, size)
+        self.W2 = nn.Linear(size, size)
+        self.W1.bias.data.fill_(0.)
+
+    def forward(self, x):
+        x1 = self.W1(x)
+        x2 = self.W2(x)
+        g = torch.sigmoid(x2)
+        y = g * F.relu(x1) + (1. - g) * x
+        return y
+
+
+class BatchNormConv(nn.Module):
+    def __init__(self, in_channels, out_channels, kernel, relu=True):
+        super().__init__()
+        self.conv = nn.Conv1d(in_channels, out_channels, kernel, stride=1, padding=kernel // 2, bias=False)
+        self.bnorm = nn.BatchNorm1d(out_channels)
+        self.relu = relu
+
+    def forward(self, x):
+        x = self.conv(x)
+        x = F.relu(x) if self.relu is True else x
+        return self.bnorm(x)
+
+
+class ConvNorm(torch.nn.Module):
+    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
+                 padding=None, dilation=1, bias=True, w_init_gain='linear'):
+        super(ConvNorm, self).__init__()
+        if padding is None:
+            assert (kernel_size % 2 == 1)
+            padding = int(dilation * (kernel_size - 1) / 2)
+
+        self.conv = torch.nn.Conv1d(in_channels, out_channels,
+                                    kernel_size=kernel_size, stride=stride,
+                                    padding=padding, dilation=dilation,
+                                    bias=bias)
+
+        torch.nn.init.xavier_uniform_(
+            self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
+
+    def forward(self, signal):
+        conv_signal = self.conv(signal)
+        return conv_signal
+
+
+class CBHG(nn.Module):
+    def __init__(self, K, in_channels, channels, proj_channels, num_highways):
+        super().__init__()
+
+        # List of all rnns to call `flatten_parameters()` on
+        self._to_flatten = []
+
+        self.bank_kernels = [i for i in range(1, K + 1)]
+        self.conv1d_bank = nn.ModuleList()
+        for k in self.bank_kernels:
+            conv = BatchNormConv(in_channels, channels, k)
+            self.conv1d_bank.append(conv)
+
+        self.maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
+
+        self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
+        self.conv_project2 = BatchNormConv(proj_channels[0], proj_channels[1], 3, relu=False)
+
+        # Fix the highway input if necessary
+        if proj_channels[-1] != channels:
+            self.highway_mismatch = True
+            self.pre_highway = nn.Linear(proj_channels[-1], channels, bias=False)
+        else:
+            self.highway_mismatch = False
+
+        self.highways = nn.ModuleList()
+        for i in range(num_highways):
+            hn = HighwayNetwork(channels)
+            self.highways.append(hn)
+
+        self.rnn = nn.GRU(channels, channels, batch_first=True, bidirectional=True)
+        self._to_flatten.append(self.rnn)
+
+        # Avoid fragmentation of RNN parameters and associated warning
+        self._flatten_parameters()
+
+    def forward(self, x):
+        # Although we `_flatten_parameters()` on init, when using DataParallel
+        # the model gets replicated, making it no longer guaranteed that the
+        # weights are contiguous in GPU memory. Hence, we must call it again
+        self._flatten_parameters()
+
+        # Save these for later
+        residual = x
+        seq_len = x.size(-1)
+        conv_bank = []
+
+        # Convolution Bank
+        for conv in self.conv1d_bank:
+            c = conv(x)  # Convolution
+            conv_bank.append(c[:, :, :seq_len])
+
+        # Stack along the channel axis
+        conv_bank = torch.cat(conv_bank, dim=1)
+
+        # dump the last padding to fit residual
+        x = self.maxpool(conv_bank)[:, :, :seq_len]
+
+        # Conv1d projections
+        x = self.conv_project1(x)
+        x = self.conv_project2(x)
+
+        # Residual Connect
+        x = x + residual
+
+        # Through the highways
+        x = x.transpose(1, 2)
+        if self.highway_mismatch is True:
+            x = self.pre_highway(x)
+        for h in self.highways:
+            x = h(x)
+
+        # And then the RNN
+        x, _ = self.rnn(x)
+        return x
+
+    def _flatten_parameters(self):
+        """Calls `flatten_parameters` on all the RNNs used by this module. Used
+        to improve efficiency and avoid PyTorch warnings."""
+        [m.flatten_parameters() for m in self._to_flatten]
+
+
+class TacotronEncoder(nn.Module):
+    def __init__(self, embed_dims, num_chars, cbhg_channels, K, num_highways, dropout):
+        super().__init__()
+        self.embedding = nn.Embedding(num_chars, embed_dims)
+        self.pre_net = PreNet(embed_dims, embed_dims, embed_dims, dropout=dropout)
+        self.cbhg = CBHG(K=K, in_channels=cbhg_channels, channels=cbhg_channels,
+                         proj_channels=[cbhg_channels, cbhg_channels],
+                         num_highways=num_highways)
+        self.proj_out = nn.Linear(cbhg_channels * 2, cbhg_channels)
+
+    def forward(self, x):
+        x = self.embedding(x)
+        x = self.pre_net(x)
+        x.transpose_(1, 2)
+        x = self.cbhg(x)
+        x = self.proj_out(x)
+        return x
+
+
+class RNNEncoder(nn.Module):
+    def __init__(self, num_chars, embedding_dim, n_convolutions=3, kernel_size=5):
+        super(RNNEncoder, self).__init__()
+        self.embedding = nn.Embedding(num_chars, embedding_dim, padding_idx=0)
+        convolutions = []
+        for _ in range(n_convolutions):
+            conv_layer = nn.Sequential(
+                ConvNorm(embedding_dim,
+                         embedding_dim,
+                         kernel_size=kernel_size, stride=1,
+                         padding=int((kernel_size - 1) / 2),
+                         dilation=1, w_init_gain='relu'),
+                nn.BatchNorm1d(embedding_dim))
+            convolutions.append(conv_layer)
+        self.convolutions = nn.ModuleList(convolutions)
+
+        self.lstm = nn.LSTM(embedding_dim, int(embedding_dim / 2), 1,
+                            batch_first=True, bidirectional=True)
+
+    def forward(self, x):
+        input_lengths = (x > 0).sum(-1)
+        input_lengths = input_lengths.cpu().numpy()
+
+        x = self.embedding(x)
+        x = x.transpose(1, 2)  # [B, H, T]
+        for conv in self.convolutions:
+            x = F.dropout(F.relu(conv(x)), 0.5, self.training) + x
+        x = x.transpose(1, 2)  # [B, T, H]
+
+        # PyTorch tensors are not reversible, hence the conversion
+        x = nn.utils.rnn.pack_padded_sequence(x, input_lengths, batch_first=True, enforce_sorted=False)
+
+        self.lstm.flatten_parameters()
+        outputs, _ = self.lstm(x)
+        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
+
+        return outputs
+
+
+class DecoderRNN(torch.nn.Module):
+    def __init__(self, hidden_size, decoder_rnn_dim, dropout):
+        super(DecoderRNN, self).__init__()
+        self.in_conv1d = nn.Sequential(
+            torch.nn.Conv1d(
+                in_channels=hidden_size,
+                out_channels=hidden_size,
+                kernel_size=9, padding=4,
+            ),
+            torch.nn.ReLU(),
+            torch.nn.Conv1d(
+                in_channels=hidden_size,
+                out_channels=hidden_size,
+                kernel_size=9, padding=4,
+            ),
+        )
+        self.ln = nn.LayerNorm(hidden_size)
+        if decoder_rnn_dim == 0:
+            decoder_rnn_dim = hidden_size * 2
+        self.rnn = torch.nn.LSTM(
+            input_size=hidden_size,
+            hidden_size=decoder_rnn_dim,
+            num_layers=1,
+            batch_first=True,
+            bidirectional=True,
+            dropout=dropout
+        )
+        self.rnn.flatten_parameters()
+        self.conv1d = torch.nn.Conv1d(
+            in_channels=decoder_rnn_dim * 2,
+            out_channels=hidden_size,
+            kernel_size=3,
+            padding=1,
+        )
+
+    def forward(self, x):
+        input_masks = x.abs().sum(-1).ne(0).data[:, :, None]
+        input_lengths = input_masks.sum([-1, -2])
+        input_lengths = input_lengths.cpu().numpy()
+
+        x = self.in_conv1d(x.transpose(1, 2)).transpose(1, 2)
+        x = self.ln(x)
+        x = nn.utils.rnn.pack_padded_sequence(x, input_lengths, batch_first=True, enforce_sorted=False)
+        self.rnn.flatten_parameters()
+        x, _ = self.rnn(x)  # [B, T, C]
+        x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True)
+        x = x * input_masks
+        pre_mel = self.conv1d(x.transpose(1, 2)).transpose(1, 2)  # [B, T, C]
+        pre_mel = pre_mel * input_masks
+        return pre_mel
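An illustrative call of `TacotronEncoder`, which chains the `PreNet`, `CBHG` bank, and output projection defined above. Sizes are made up for the example and are not values used by this commit:

```python
# Illustrative shape check for TacotronEncoder.
import torch
from preprocess.tools.note_transcription.modules.commons.rnn import TacotronEncoder

enc = TacotronEncoder(embed_dims=256, num_chars=100, cbhg_channels=256,
                      K=8, num_highways=4, dropout=0.5)
x = torch.randint(0, 100, (2, 50))  # [B, T] character IDs
out = enc(x)
print(out.shape)                     # torch.Size([2, 50, 256])
```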
preprocess/tools/note_transcription/modules/commons/transformer.py ADDED
@@ -0,0 +1,751 @@
1
+ import math
2
+ import torch
3
+ from torch import nn
4
+ from torch.nn import Parameter, Linear
5
+ from .layers import LayerNorm, Embedding
6
+ from ...utils.nn.seq_utils import (
7
+ get_incremental_state,
8
+ set_incremental_state,
9
+ softmax,
10
+ make_positions,
11
+ )
12
+ import torch.nn.functional as F
13
+
14
+ DEFAULT_MAX_SOURCE_POSITIONS = 2000
15
+ DEFAULT_MAX_TARGET_POSITIONS = 2000
16
+
17
+
18
+ class SinusoidalPositionalEmbedding(nn.Module):
19
+ """This module produces sinusoidal positional embeddings of any length.
20
+
21
+ Padding symbols are ignored.
22
+ """
23
+
24
+ def __init__(self, embedding_dim, padding_idx, init_size=1024):
25
+ super().__init__()
26
+ self.embedding_dim = embedding_dim
27
+ self.padding_idx = padding_idx
28
+ self.weights = SinusoidalPositionalEmbedding.get_embedding(
29
+ init_size,
30
+ embedding_dim,
31
+ padding_idx,
32
+ )
33
+ self.register_buffer('_float_tensor', torch.FloatTensor(1))
34
+
35
+ @staticmethod
36
+ def get_embedding(num_embeddings, embedding_dim, padding_idx=None):
37
+ """Build sinusoidal embeddings.
38
+
39
+ This matches the implementation in tensor2tensor, but differs slightly
40
+ from the description in Section 3.5 of "Attention Is All You Need".
41
+ """
42
+ half_dim = embedding_dim // 2
43
+ emb = math.log(10000) / (half_dim - 1)
44
+ emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -emb)
45
+ emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)
46
+ emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)
47
+ if embedding_dim % 2 == 1:
48
+ # zero pad
49
+ emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)
50
+ if padding_idx is not None:
51
+ emb[padding_idx, :] = 0
52
+ return emb
53
+
54
+ def forward(self, input, incremental_state=None, timestep=None, positions=None, **kwargs):
55
+ """Input is expected to be of size [bsz x seqlen]."""
56
+ bsz, seq_len = input.shape[:2]
57
+ max_pos = self.padding_idx + 1 + seq_len
58
+ if self.weights is None or max_pos > self.weights.size(0):
59
+ # recompute/expand embeddings if needed
60
+ self.weights = SinusoidalPositionalEmbedding.get_embedding(
61
+ max_pos,
62
+ self.embedding_dim,
63
+ self.padding_idx,
64
+ )
65
+ self.weights = self.weights.to(self._float_tensor)
66
+
67
+ if incremental_state is not None:
68
+ # positions is the same for every token when decoding a single step
69
+ pos = timestep.view(-1)[0] + 1 if timestep is not None else seq_len
70
+ return self.weights[self.padding_idx + pos, :].expand(bsz, 1, -1)
71
+
72
+ positions = make_positions(input, self.padding_idx) if positions is None else positions
73
+ return self.weights.index_select(0, positions.view(-1)).view(bsz, seq_len, -1).detach()
74
+
75
+ def max_positions(self):
76
+ """Maximum number of supported positions."""
77
+ return int(1e5) # an arbitrary large number
78
+
79
+
80
+ class TransformerFFNLayer(nn.Module):
81
+ def __init__(self, hidden_size, filter_size, padding="SAME", kernel_size=1, dropout=0., act='gelu'):
82
+ super().__init__()
83
+ self.kernel_size = kernel_size
84
+ self.dropout = dropout
85
+ self.act = act
86
+ if padding == 'SAME':
87
+ self.ffn_1 = nn.Conv1d(hidden_size, filter_size, kernel_size, padding=kernel_size // 2)
88
+ elif padding == 'LEFT':
89
+ self.ffn_1 = nn.Sequential(
90
+ nn.ConstantPad1d((kernel_size - 1, 0), 0.0),
91
+ nn.Conv1d(hidden_size, filter_size, kernel_size)
92
+ )
93
+ self.ffn_2 = Linear(filter_size, hidden_size)
94
+
95
+ def forward(self, x, incremental_state=None):
96
+ # x: T x B x C
97
+ if incremental_state is not None:
98
+ saved_state = self._get_input_buffer(incremental_state)
99
+ if 'prev_input' in saved_state:
100
+ prev_input = saved_state['prev_input']
101
+ x = torch.cat((prev_input, x), dim=0)
102
+ x = x[-self.kernel_size:]
103
+ saved_state['prev_input'] = x
104
+ self._set_input_buffer(incremental_state, saved_state)
105
+
106
+ x = self.ffn_1(x.permute(1, 2, 0)).permute(2, 0, 1)
107
+ x = x * self.kernel_size ** -0.5
108
+
109
+ if incremental_state is not None:
110
+ x = x[-1:]
111
+ if self.act == 'gelu':
112
+ x = F.gelu(x)
113
+ if self.act == 'relu':
114
+ x = F.relu(x)
115
+ x = F.dropout(x, self.dropout, training=self.training)
116
+ x = self.ffn_2(x)
117
+ return x
118
+
119
+ def _get_input_buffer(self, incremental_state):
120
+ return get_incremental_state(
121
+ self,
122
+ incremental_state,
123
+ 'f',
124
+ ) or {}
125
+
126
+ def _set_input_buffer(self, incremental_state, buffer):
127
+ set_incremental_state(
128
+ self,
129
+ incremental_state,
130
+ 'f',
131
+ buffer,
132
+ )
133
+
134
+ def clear_buffer(self, incremental_state):
135
+ if incremental_state is not None:
136
+ saved_state = self._get_input_buffer(incremental_state)
137
+ if 'prev_input' in saved_state:
138
+ del saved_state['prev_input']
139
+ self._set_input_buffer(incremental_state, saved_state)
140
+
141
+
+ class MultiheadAttention(nn.Module):
+     def __init__(self, embed_dim, num_heads, kdim=None, vdim=None, dropout=0., bias=True,
+                  add_bias_kv=False, add_zero_attn=False, self_attention=False,
+                  encoder_decoder_attention=False):
+         super().__init__()
+         self.embed_dim = embed_dim
+         self.kdim = kdim if kdim is not None else embed_dim
+         self.vdim = vdim if vdim is not None else embed_dim
+         self.qkv_same_dim = self.kdim == embed_dim and self.vdim == embed_dim
+
+         self.num_heads = num_heads
+         self.dropout = dropout
+         self.head_dim = embed_dim // num_heads
+         assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
+         self.scaling = self.head_dim ** -0.5
+
+         self.self_attention = self_attention
+         self.encoder_decoder_attention = encoder_decoder_attention
+
+         assert not self.self_attention or self.qkv_same_dim, 'Self-attention requires query, key and ' \
+                                                              'value to be of the same size'
+
+         if self.qkv_same_dim:
+             self.in_proj_weight = Parameter(torch.Tensor(3 * embed_dim, embed_dim))
+         else:
+             self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim))
+             self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim))
+             self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
+
+         if bias:
+             self.in_proj_bias = Parameter(torch.Tensor(3 * embed_dim))
+         else:
+             self.register_parameter('in_proj_bias', None)
+
+         self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
+
+         if add_bias_kv:
+             self.bias_k = Parameter(torch.Tensor(1, 1, embed_dim))
+             self.bias_v = Parameter(torch.Tensor(1, 1, embed_dim))
+         else:
+             self.bias_k = self.bias_v = None
+
+         self.add_zero_attn = add_zero_attn
+
+         self.reset_parameters()
+
+         self.enable_torch_version = hasattr(F, "multi_head_attention_forward")
+         self.last_attn_probs = None
+
+     def reset_parameters(self):
+         if self.qkv_same_dim:
+             nn.init.xavier_uniform_(self.in_proj_weight)
+         else:
+             nn.init.xavier_uniform_(self.k_proj_weight)
+             nn.init.xavier_uniform_(self.v_proj_weight)
+             nn.init.xavier_uniform_(self.q_proj_weight)
+
+         nn.init.xavier_uniform_(self.out_proj.weight)
+         if self.in_proj_bias is not None:
+             nn.init.constant_(self.in_proj_bias, 0.)
+             nn.init.constant_(self.out_proj.bias, 0.)
+         if self.bias_k is not None:
+             nn.init.xavier_normal_(self.bias_k)
+         if self.bias_v is not None:
+             nn.init.xavier_normal_(self.bias_v)
+
+     def forward(
+         self,
+         query, key, value,
+         key_padding_mask=None,
+         incremental_state=None,
+         need_weights=True,
+         static_kv=False,
+         attn_mask=None,
+         before_softmax=False,
+         need_head_weights=False,
+         enc_dec_attn_constraint_mask=None,
+         reset_attn_weight=None
+     ):
+         """Input shape: Time x Batch x Channel
+
+         Args:
+             key_padding_mask (ByteTensor, optional): mask to exclude
+                 keys that are pads, of shape `(batch, src_len)`, where
+                 padding elements are indicated by 1s.
+             need_weights (bool, optional): return the attention weights,
+                 averaged over heads (default: True).
+             attn_mask (ByteTensor, optional): typically used to
+                 implement causal attention, where the mask prevents the
+                 attention from looking forward in time (default: None).
+             before_softmax (bool, optional): return the raw attention
+                 weights and values before the attention softmax.
+             need_head_weights (bool, optional): return the attention
+                 weights for each head. Implies *need_weights*. Default:
+                 return the average attention weights over all heads.
+         """
+         if need_head_weights:
+             need_weights = True
+
+         tgt_len, bsz, embed_dim = query.size()
+         assert embed_dim == self.embed_dim
+         assert list(query.size()) == [tgt_len, bsz, embed_dim]
+         if self.enable_torch_version and incremental_state is None and not static_kv and reset_attn_weight is None:
+             if self.qkv_same_dim:
+                 return F.multi_head_attention_forward(query, key, value,
+                                                       self.embed_dim, self.num_heads,
+                                                       self.in_proj_weight,
+                                                       self.in_proj_bias, self.bias_k, self.bias_v,
+                                                       self.add_zero_attn, self.dropout,
+                                                       self.out_proj.weight, self.out_proj.bias,
+                                                       self.training, key_padding_mask, need_weights,
+                                                       attn_mask)
+             else:
+                 return F.multi_head_attention_forward(query, key, value,
+                                                       self.embed_dim, self.num_heads,
+                                                       torch.empty([0]),
+                                                       self.in_proj_bias, self.bias_k, self.bias_v,
+                                                       self.add_zero_attn, self.dropout,
+                                                       self.out_proj.weight, self.out_proj.bias,
+                                                       self.training, key_padding_mask, need_weights,
+                                                       attn_mask, use_separate_proj_weight=True,
+                                                       q_proj_weight=self.q_proj_weight,
+                                                       k_proj_weight=self.k_proj_weight,
+                                                       v_proj_weight=self.v_proj_weight)
+
+         if incremental_state is not None:
+             saved_state = self._get_input_buffer(incremental_state)
+             if 'prev_key' in saved_state:
+                 # previous time steps are cached - no need to recompute
+                 # key and value if they are static
+                 if static_kv:
+                     assert self.encoder_decoder_attention and not self.self_attention
+                     key = value = None
+         else:
+             saved_state = None
+
+         if self.self_attention:
+             # self-attention
+             q, k, v = self.in_proj_qkv(query)
+         elif self.encoder_decoder_attention:
+             # encoder-decoder attention
+             q = self.in_proj_q(query)
+             if key is None:
+                 assert value is None
+                 k = v = None
+             else:
+                 k = self.in_proj_k(key)
+                 v = self.in_proj_v(key)
+         else:
+             q = self.in_proj_q(query)
+             k = self.in_proj_k(key)
+             v = self.in_proj_v(value)
+         q *= self.scaling
+
+         if self.bias_k is not None:
+             assert self.bias_v is not None
+             k = torch.cat([k, self.bias_k.repeat(1, bsz, 1)])
+             v = torch.cat([v, self.bias_v.repeat(1, bsz, 1)])
+             if attn_mask is not None:
+                 attn_mask = torch.cat([attn_mask, attn_mask.new_zeros(attn_mask.size(0), 1)], dim=1)
+             if key_padding_mask is not None:
+                 key_padding_mask = torch.cat(
+                     [key_padding_mask, key_padding_mask.new_zeros(key_padding_mask.size(0), 1)], dim=1)
+
+         q = q.contiguous().view(tgt_len, bsz * self.num_heads, self.head_dim).transpose(0, 1)
+         if k is not None:
+             k = k.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)
+         if v is not None:
+             v = v.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)
+
+         if saved_state is not None:
+             # saved states are stored with shape (bsz, num_heads, seq_len, head_dim)
+             if 'prev_key' in saved_state:
+                 prev_key = saved_state['prev_key'].view(bsz * self.num_heads, -1, self.head_dim)
+                 if static_kv:
+                     k = prev_key
+                 else:
+                     k = torch.cat((prev_key, k), dim=1)
+             if 'prev_value' in saved_state:
+                 prev_value = saved_state['prev_value'].view(bsz * self.num_heads, -1, self.head_dim)
+                 if static_kv:
+                     v = prev_value
+                 else:
+                     v = torch.cat((prev_value, v), dim=1)
+             if 'prev_key_padding_mask' in saved_state and saved_state['prev_key_padding_mask'] is not None:
+                 prev_key_padding_mask = saved_state['prev_key_padding_mask']
+                 if static_kv:
+                     key_padding_mask = prev_key_padding_mask
+                 else:
+                     key_padding_mask = torch.cat((prev_key_padding_mask, key_padding_mask), dim=1)
+
+             saved_state['prev_key'] = k.view(bsz, self.num_heads, -1, self.head_dim)
+             saved_state['prev_value'] = v.view(bsz, self.num_heads, -1, self.head_dim)
+             saved_state['prev_key_padding_mask'] = key_padding_mask
+
+             self._set_input_buffer(incremental_state, saved_state)
+
+         src_len = k.size(1)
+
+         # This is part of a workaround to get around fork/join parallelism
+         # not supporting Optional types.
+         if key_padding_mask is not None and key_padding_mask.shape == torch.Size([]):
+             key_padding_mask = None
+
+         if key_padding_mask is not None:
+             assert key_padding_mask.size(0) == bsz
+             assert key_padding_mask.size(1) == src_len
+
+         if self.add_zero_attn:
+             src_len += 1
+             k = torch.cat([k, k.new_zeros((k.size(0), 1) + k.size()[2:])], dim=1)
+             v = torch.cat([v, v.new_zeros((v.size(0), 1) + v.size()[2:])], dim=1)
+             if attn_mask is not None:
+                 attn_mask = torch.cat([attn_mask, attn_mask.new_zeros(attn_mask.size(0), 1)], dim=1)
+             if key_padding_mask is not None:
+                 key_padding_mask = torch.cat(
+                     [key_padding_mask, torch.zeros(key_padding_mask.size(0), 1).type_as(key_padding_mask)], dim=1)
+
+         attn_weights = torch.bmm(q, k.transpose(1, 2))
+         attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz)
+
+         assert list(attn_weights.size()) == [bsz * self.num_heads, tgt_len, src_len]
+
+         if attn_mask is not None:
+             if len(attn_mask.shape) == 2:
+                 attn_mask = attn_mask.unsqueeze(0)
+             elif len(attn_mask.shape) == 3:
+                 attn_mask = attn_mask[:, None].repeat([1, self.num_heads, 1, 1]).reshape(
+                     bsz * self.num_heads, tgt_len, src_len)
+             attn_weights = attn_weights + attn_mask
+
+         if enc_dec_attn_constraint_mask is not None:  # bs x head x L_kv
+             attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+             attn_weights = attn_weights.masked_fill(
+                 enc_dec_attn_constraint_mask.unsqueeze(2).bool(),
+                 -1e8,
+             )
+             attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
+         if key_padding_mask is not None:
+             # don't attend to padding symbols
+             attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+             attn_weights = attn_weights.masked_fill(
+                 key_padding_mask.unsqueeze(1).unsqueeze(2),
+                 -1e8,
+             )
+             attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
+         attn_logits = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+
+         if before_softmax:
+             return attn_weights, v
+
+         attn_weights_float = softmax(attn_weights, dim=-1)
+         attn_weights = attn_weights_float.type_as(attn_weights)
+         attn_probs = F.dropout(attn_weights_float.type_as(attn_weights), p=self.dropout, training=self.training)
+
+         if reset_attn_weight is not None:
+             if reset_attn_weight:
+                 self.last_attn_probs = attn_probs.detach()
+             else:
+                 assert self.last_attn_probs is not None
+                 attn_probs = self.last_attn_probs
+         attn = torch.bmm(attn_probs, v)
+         assert list(attn.size()) == [bsz * self.num_heads, tgt_len, self.head_dim]
+         attn = attn.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
+         attn = self.out_proj(attn)
+
+         if need_weights:
+             attn_weights = attn_weights_float.view(bsz, self.num_heads, tgt_len, src_len).transpose(1, 0)
+             if not need_head_weights:
+                 # average attention weights over heads
+                 attn_weights = attn_weights.mean(dim=0)
+         else:
+             attn_weights = None
+
+         return attn, (attn_weights, attn_logits)
+
+     def in_proj_qkv(self, query):
+         return self._in_proj(query).chunk(3, dim=-1)
+
+     def in_proj_q(self, query):
+         if self.qkv_same_dim:
+             return self._in_proj(query, end=self.embed_dim)
+         else:
+             bias = self.in_proj_bias
+             if bias is not None:
+                 bias = bias[:self.embed_dim]
+             return F.linear(query, self.q_proj_weight, bias)
+
+     def in_proj_k(self, key):
+         if self.qkv_same_dim:
+             return self._in_proj(key, start=self.embed_dim, end=2 * self.embed_dim)
+         else:
+             weight = self.k_proj_weight
+             bias = self.in_proj_bias
+             if bias is not None:
+                 bias = bias[self.embed_dim:2 * self.embed_dim]
+             return F.linear(key, weight, bias)
+
+     def in_proj_v(self, value):
+         if self.qkv_same_dim:
+             return self._in_proj(value, start=2 * self.embed_dim)
+         else:
+             weight = self.v_proj_weight
+             bias = self.in_proj_bias
+             if bias is not None:
+                 bias = bias[2 * self.embed_dim:]
+             return F.linear(value, weight, bias)
+
+     def _in_proj(self, input, start=0, end=None):
+         weight = self.in_proj_weight
+         bias = self.in_proj_bias
+         weight = weight[start:end, :]
+         if bias is not None:
+             bias = bias[start:end]
+         return F.linear(input, weight, bias)
+
+     def _get_input_buffer(self, incremental_state):
+         return get_incremental_state(
+             self,
+             incremental_state,
+             'attn_state',
+         ) or {}
+
+     def _set_input_buffer(self, incremental_state, buffer):
+         set_incremental_state(
+             self,
+             incremental_state,
+             'attn_state',
+             buffer,
+         )
+
+     def apply_sparse_mask(self, attn_weights, tgt_len, src_len, bsz):
+         return attn_weights
+
+     def clear_buffer(self, incremental_state=None):
+         if incremental_state is not None:
+             saved_state = self._get_input_buffer(incremental_state)
+             if 'prev_key' in saved_state:
+                 del saved_state['prev_key']
+             if 'prev_value' in saved_state:
+                 del saved_state['prev_value']
+             self._set_input_buffer(incremental_state, saved_state)
+
+
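For reference, a minimal usage sketch of the MultiheadAttention module above. Shapes follow its docstring (Time x Batch x Channel); the sizes are illustrative assumptions, not values taken from this repository:

import torch

attn = MultiheadAttention(embed_dim=256, num_heads=4, self_attention=True)
x = torch.randn(50, 2, 256)                 # [T, B, C] query/key/value for self-attention
pad = torch.zeros(2, 50, dtype=torch.bool)  # [B, T]; 1s/True mark padded keys
out, attn_info = attn(query=x, key=x, value=x, key_padding_mask=pad)
# out: [50, 2, 256]; attn_info carries the attention weights (its exact
# structure depends on whether the fused F.multi_head_attention_forward
# fast path was taken).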
+ class EncSALayer(nn.Module):
+     def __init__(self, c, num_heads, dropout, attention_dropout=0.1,
+                  relu_dropout=0.1, kernel_size=9, padding='SAME', act='gelu'):
+         super().__init__()
+         self.c = c
+         self.dropout = dropout
+         self.num_heads = num_heads
+         if num_heads > 0:
+             self.layer_norm1 = LayerNorm(c)
+             self.self_attn = MultiheadAttention(
+                 self.c, num_heads, self_attention=True, dropout=attention_dropout, bias=False)
+         self.layer_norm2 = LayerNorm(c)
+         self.ffn = TransformerFFNLayer(
+             c, 4 * c, kernel_size=kernel_size, dropout=relu_dropout, padding=padding, act=act)
+
+     def forward(self, x, encoder_padding_mask=None, **kwargs):
+         layer_norm_training = kwargs.get('layer_norm_training', None)
+         if layer_norm_training is not None:
+             self.layer_norm1.training = layer_norm_training
+             self.layer_norm2.training = layer_norm_training
+         if self.num_heads > 0:
+             residual = x
+             x = self.layer_norm1(x)
+             x, _ = self.self_attn(
+                 query=x,
+                 key=x,
+                 value=x,
+                 key_padding_mask=encoder_padding_mask
+             )
+             x = F.dropout(x, self.dropout, training=self.training)
+             x = residual + x
+             x = x * (1 - encoder_padding_mask.float()).transpose(0, 1)[..., None]
+
+         residual = x
+         x = self.layer_norm2(x)
+         x = self.ffn(x)
+         x = F.dropout(x, self.dropout, training=self.training)
+         x = residual + x
+         x = x * (1 - encoder_padding_mask.float()).transpose(0, 1)[..., None]
+         return x
+
+
+ class DecSALayer(nn.Module):
+     def __init__(self, c, num_heads, dropout, attention_dropout=0.1, relu_dropout=0.1,
+                  kernel_size=9, act='gelu'):
+         super().__init__()
+         self.c = c
+         self.dropout = dropout
+         self.layer_norm1 = LayerNorm(c)
+         self.self_attn = MultiheadAttention(
+             c, num_heads, self_attention=True, dropout=attention_dropout, bias=False
+         )
+         self.layer_norm2 = LayerNorm(c)
+         self.encoder_attn = MultiheadAttention(
+             c, num_heads, encoder_decoder_attention=True, dropout=attention_dropout, bias=False,
+         )
+         self.layer_norm3 = LayerNorm(c)
+         self.ffn = TransformerFFNLayer(
+             c, 4 * c, padding='LEFT', kernel_size=kernel_size, dropout=relu_dropout, act=act)
+
+     def forward(
+         self,
+         x,
+         encoder_out=None,
+         encoder_padding_mask=None,
+         incremental_state=None,
+         self_attn_mask=None,
+         self_attn_padding_mask=None,
+         attn_out=None,
+         reset_attn_weight=None,
+         **kwargs,
+     ):
+         layer_norm_training = kwargs.get('layer_norm_training', None)
+         if layer_norm_training is not None:
+             self.layer_norm1.training = layer_norm_training
+             self.layer_norm2.training = layer_norm_training
+             self.layer_norm3.training = layer_norm_training
+         residual = x
+         x = self.layer_norm1(x)
+         x, _ = self.self_attn(
+             query=x,
+             key=x,
+             value=x,
+             key_padding_mask=self_attn_padding_mask,
+             incremental_state=incremental_state,
+             attn_mask=self_attn_mask
+         )
+         x = F.dropout(x, self.dropout, training=self.training)
+         x = residual + x
+
+         attn_logits = None
+         if encoder_out is not None or attn_out is not None:
+             residual = x
+             x = self.layer_norm2(x)
+             if encoder_out is not None:
+                 x, attn = self.encoder_attn(
+                     query=x,
+                     key=encoder_out,
+                     value=encoder_out,
+                     key_padding_mask=encoder_padding_mask,
+                     incremental_state=incremental_state,
+                     static_kv=True,
+                     enc_dec_attn_constraint_mask=get_incremental_state(self, incremental_state,
+                                                                        'enc_dec_attn_constraint_mask'),
+                     reset_attn_weight=reset_attn_weight
+                 )
+                 attn_logits = attn[1]
+             elif attn_out is not None:
+                 x = self.encoder_attn.in_proj_v(attn_out)
+             x = F.dropout(x, self.dropout, training=self.training)
+             x = residual + x
+
+         residual = x
+         x = self.layer_norm3(x)
+         x = self.ffn(x, incremental_state=incremental_state)
+         x = F.dropout(x, self.dropout, training=self.training)
+         x = residual + x
+         return x, attn_logits
+
+     def clear_buffer(self, input, encoder_out=None, encoder_padding_mask=None, incremental_state=None):
+         self.encoder_attn.clear_buffer(incremental_state)
+         self.ffn.clear_buffer(incremental_state)
+
+     def set_buffer(self, name, tensor, incremental_state):
+         return set_incremental_state(self, incremental_state, name, tensor)
+
+
+ class TransformerEncoderLayer(nn.Module):
+     def __init__(self, hidden_size, dropout, kernel_size=9, num_heads=2):
+         super().__init__()
+         self.hidden_size = hidden_size
+         self.dropout = dropout
+         self.num_heads = num_heads
+         self.op = EncSALayer(
+             hidden_size, num_heads, dropout=dropout,
+             attention_dropout=0.0, relu_dropout=dropout,
+             kernel_size=kernel_size)
+
+     def forward(self, x, **kwargs):
+         return self.op(x, **kwargs)
+
+
+ class TransformerDecoderLayer(nn.Module):
+     def __init__(self, hidden_size, dropout, kernel_size=9, num_heads=2):
+         super().__init__()
+         self.hidden_size = hidden_size
+         self.dropout = dropout
+         self.num_heads = num_heads
+         self.op = DecSALayer(
+             hidden_size, num_heads, dropout=dropout,
+             attention_dropout=0.0, relu_dropout=dropout,
+             kernel_size=kernel_size)
+
+     def forward(self, x, **kwargs):
+         return self.op(x, **kwargs)
+
+     def clear_buffer(self, *args):
+         return self.op.clear_buffer(*args)
+
+     def set_buffer(self, *args):
+         return self.op.set_buffer(*args)
+
+
+ class FFTBlocks(nn.Module):
+     def __init__(self, hidden_size, num_layers, ffn_kernel_size=9, dropout=0.0,
+                  num_heads=2, use_pos_embed=True, use_last_norm=True,
+                  use_pos_embed_alpha=True):
+         super().__init__()
+         self.num_layers = num_layers
+         embed_dim = self.hidden_size = hidden_size
+         self.dropout = dropout
+         self.use_pos_embed = use_pos_embed
+         self.use_last_norm = use_last_norm
+         if use_pos_embed:
+             self.max_source_positions = DEFAULT_MAX_TARGET_POSITIONS
+             self.padding_idx = 0
+             self.pos_embed_alpha = nn.Parameter(torch.Tensor([1])) if use_pos_embed_alpha else 1
+             self.embed_positions = SinusoidalPositionalEmbedding(
+                 embed_dim, self.padding_idx, init_size=DEFAULT_MAX_TARGET_POSITIONS,
+             )
+
+         self.layers = nn.ModuleList([])
+         self.layers.extend([
+             TransformerEncoderLayer(self.hidden_size, self.dropout,
+                                     kernel_size=ffn_kernel_size, num_heads=num_heads)
+             for _ in range(self.num_layers)
+         ])
+         if self.use_last_norm:
+             self.layer_norm = nn.LayerNorm(embed_dim)
+         else:
+             self.layer_norm = None
+
+     def forward(self, x, padding_mask=None, attn_mask=None, return_hiddens=False):
+         """
+         :param x: [B, T, C]
+         :param padding_mask: [B, T]
+         :return: [B, T, C] or [L, B, T, C]
+         """
+         padding_mask = x.abs().sum(-1).eq(0).data if padding_mask is None else padding_mask
+         nonpadding_mask_TB = 1 - padding_mask.transpose(0, 1).float()[:, :, None]  # [T, B, 1]
+         if self.use_pos_embed:
+             positions = self.pos_embed_alpha * self.embed_positions(x[..., 0])
+             x = x + positions
+             x = F.dropout(x, p=self.dropout, training=self.training)
+         # B x T x C -> T x B x C
+         x = x.transpose(0, 1) * nonpadding_mask_TB
+         hiddens = []
+         for layer in self.layers:
+             x = layer(x, encoder_padding_mask=padding_mask, attn_mask=attn_mask) * nonpadding_mask_TB
+             hiddens.append(x)
+         if self.use_last_norm:
+             x = self.layer_norm(x) * nonpadding_mask_TB
+         if return_hiddens:
+             x = torch.stack(hiddens, 0)  # [L, T, B, C]
+             x = x.transpose(1, 2)  # [L, B, T, C]
+         else:
+             x = x.transpose(0, 1)  # [B, T, C]
+         return x
+
+
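A minimal sketch of driving the FFTBlocks stack above; the hidden size, batch size, and sequence length are illustrative assumptions:

import torch

blocks = FFTBlocks(hidden_size=256, num_layers=4, num_heads=2)
x = torch.randn(2, 100, 256)                     # [B, T, C] frame-level features
padding = torch.zeros(2, 100, dtype=torch.bool)  # [B, T]; True marks padded frames
y = blocks(x, padding_mask=padding)              # [B, T, C]
hs = blocks(x, padding_mask=padding, return_hiddens=True)  # [L, B, T, C], one slice per layer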
+ class FastSpeechEncoder(FFTBlocks):
+     def __init__(self, dict_size, hidden_size=256, num_layers=4, kernel_size=9, num_heads=2,
+                  dropout=0.0):
+         super().__init__(hidden_size, num_layers, kernel_size, num_heads=num_heads,
+                          use_pos_embed=False, dropout=dropout)  # use_pos_embed_alpha for compatibility
+         self.embed_tokens = Embedding(dict_size, hidden_size, 0)
+         self.embed_scale = math.sqrt(hidden_size)
+         self.padding_idx = 0
+         self.embed_positions = SinusoidalPositionalEmbedding(
+             hidden_size, self.padding_idx, init_size=DEFAULT_MAX_TARGET_POSITIONS,
+         )
+
+     def forward(self, txt_tokens, attn_mask=None):
+         """
+         :param txt_tokens: [B, T]
+         :return: [B, T, C] encoder output
+         """
+         encoder_padding_mask = txt_tokens.eq(self.padding_idx).data
+         x = self.forward_embedding(txt_tokens)  # [B, T, H]
+         if self.num_layers > 0:
+             x = super(FastSpeechEncoder, self).forward(x, encoder_padding_mask, attn_mask=attn_mask)
+         return x
+
+     def forward_embedding(self, txt_tokens):
+         # embed tokens and positions
+         x = self.embed_scale * self.embed_tokens(txt_tokens)
+         positions = self.embed_positions(txt_tokens)
+         x = x + positions
+         x = F.dropout(x, p=self.dropout, training=self.training)
+         return x
+
+
+ class FastSpeechDecoder(FFTBlocks):
+     def __init__(self, hidden_size=256, num_layers=4, kernel_size=9, num_heads=2):
+         super().__init__(hidden_size, num_layers, kernel_size, num_heads=num_heads)
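And a matching sketch for the token-level encoder; the vocabulary size and token ids below are made up for illustration:

import torch

encoder = FastSpeechEncoder(dict_size=100, hidden_size=256, num_layers=4)
tokens = torch.randint(1, 100, (2, 30))  # [B, T] token ids; id 0 is reserved for padding
out = encoder(tokens)                    # [B, T, C] hidden states, padded positions zeroed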
preprocess/tools/note_transcription/modules/commons/wavenet.py ADDED
@@ -0,0 +1,109 @@
+ import torch
+ from torch import nn
+ from packaging import version
+
+ def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
+     n_channels_int = n_channels[0]
+     in_act = input_a + input_b
+     t_act = torch.tanh(in_act[:, :n_channels_int, :])
+     s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
+     acts = t_act * s_act
+     return acts
+
+ jit_fused_add_tanh_sigmoid_multiply = fused_add_tanh_sigmoid_multiply
+
+ def script_function():
+     if version.parse(torch.__version__) >= version.parse('2.0'):
+         global jit_fused_add_tanh_sigmoid_multiply
+         jit_fused_add_tanh_sigmoid_multiply = torch.jit.script(fused_add_tanh_sigmoid_multiply)
+
+
+ class WN(torch.nn.Module):
+     def __init__(self, hidden_size, kernel_size, dilation_rate, n_layers, c_cond=0,
+                  p_dropout=0, share_cond_layers=False, is_BTC=False):
+         super(WN, self).__init__()
+         assert (kernel_size % 2 == 1)
+         assert (hidden_size % 2 == 0)
+         self.is_BTC = is_BTC
+         self.hidden_size = hidden_size
+         self.kernel_size = kernel_size
+         self.dilation_rate = dilation_rate
+         self.n_layers = n_layers
+         self.gin_channels = c_cond
+         self.p_dropout = p_dropout
+         self.share_cond_layers = share_cond_layers
+
+         self.in_layers = torch.nn.ModuleList()
+         self.res_skip_layers = torch.nn.ModuleList()
+         self.drop = nn.Dropout(p_dropout)
+
+         if c_cond != 0 and not share_cond_layers:
+             cond_layer = torch.nn.Conv1d(c_cond, 2 * hidden_size * n_layers, 1)
+             self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
+
+         for i in range(n_layers):
+             dilation = dilation_rate ** i
+             padding = int((kernel_size * dilation - dilation) / 2)
+             in_layer = torch.nn.Conv1d(hidden_size, 2 * hidden_size, kernel_size,
+                                        dilation=dilation, padding=padding)
+             in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
+             self.in_layers.append(in_layer)
+
+             # last one is not necessary
+             if i < n_layers - 1:
+                 res_skip_channels = 2 * hidden_size
+             else:
+                 res_skip_channels = hidden_size
+
+             res_skip_layer = torch.nn.Conv1d(hidden_size, res_skip_channels, 1)
+             res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
+             self.res_skip_layers.append(res_skip_layer)
+
+         script_function()
+
+     def forward(self, x, nonpadding=None, cond=None):
+         if self.is_BTC:
+             x = x.transpose(1, 2)
+             cond = cond.transpose(1, 2) if cond is not None else None
+             nonpadding = nonpadding.transpose(1, 2) if nonpadding is not None else None
+         if nonpadding is None:
+             nonpadding = 1
+         output = torch.zeros_like(x)
+         n_channels_tensor = torch.IntTensor([self.hidden_size])
+
+         if cond is not None and not self.share_cond_layers:
+             cond = self.cond_layer(cond)
+
+         for i in range(self.n_layers):
+             x_in = self.in_layers[i](x)
+             x_in = self.drop(x_in)
+             if cond is not None:
+                 cond_offset = i * 2 * self.hidden_size
+                 cond_l = cond[:, cond_offset:cond_offset + 2 * self.hidden_size, :]
+             else:
+                 cond_l = torch.zeros_like(x_in)
+
+             if version.parse(torch.__version__) >= version.parse('2.0'):
+                 acts = jit_fused_add_tanh_sigmoid_multiply(x_in, cond_l, n_channels_tensor)
+             else:
+                 acts = fused_add_tanh_sigmoid_multiply(x_in, cond_l, n_channels_tensor)
+
+             res_skip_acts = self.res_skip_layers[i](acts)
+             if i < self.n_layers - 1:
+                 x = (x + res_skip_acts[:, :self.hidden_size, :]) * nonpadding
+                 output = output + res_skip_acts[:, self.hidden_size:, :]
+             else:
+                 output = output + res_skip_acts
+         output = output * nonpadding
+         if self.is_BTC:
+             output = output.transpose(1, 2)
+         return output
+
+     def remove_weight_norm(self):
+         def remove_weight_norm(m):
+             try:
+                 nn.utils.remove_weight_norm(m)
+             except ValueError:  # this module didn't have weight norm
+                 return
+
+         self.apply(remove_weight_norm)
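A minimal sketch of the WaveNet-style conditioning stack above; the channel sizes are illustrative assumptions (e.g. an 80-bin mel-spectrogram as the condition):

import torch

wn = WN(hidden_size=192, kernel_size=3, dilation_rate=2, n_layers=4, c_cond=80)
x = torch.randn(2, 192, 100)    # [B, C, T]; the default layout is channels-first
cond = torch.randn(2, 80, 100)  # conditioning features, projected internally per layer
mask = torch.ones(2, 1, 100)    # nonpadding mask, broadcast over channels
y = wn(x, nonpadding=mask, cond=cond)  # [B, C, T]
wn.remove_weight_norm()         # optional: strip weight norm before inference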
preprocess/tools/note_transcription/modules/pe/__init__.py ADDED
@@ -0,0 +1 @@
+ """Pitch extractor modules for ROSVOT."""