JeffreyZhou798 committed
Commit 7ee408c (verified)
1 Parent(s): 65e3901

Upload 111 files

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. preprocess/README.md +155 -0
  2. preprocess/pipeline.py +161 -0
  3. preprocess/requirements.txt +34 -0
  4. preprocess/tools/__init__.py +53 -0
  5. preprocess/tools/f0_extraction.py +527 -0
  6. preprocess/tools/g2p.py +72 -0
  7. preprocess/tools/lyric_transcription.py +283 -0
  8. preprocess/tools/midi_editor/README.md +170 -0
  9. preprocess/tools/midi_editor/README_CN.md +170 -0
  10. preprocess/tools/midi_editor/eslint.config.js +23 -0
  11. preprocess/tools/midi_editor/index.html +13 -0
  12. preprocess/tools/midi_editor/package-lock.json +0 -0
  13. preprocess/tools/midi_editor/package.json +39 -0
  14. preprocess/tools/midi_editor/postcss.config.js +6 -0
  15. preprocess/tools/midi_editor/public/vite.svg +1 -0
  16. preprocess/tools/midi_editor/src/App.css +834 -0
  17. preprocess/tools/midi_editor/src/App.tsx +675 -0
  18. preprocess/tools/midi_editor/src/components/AudioTrack.tsx +182 -0
  19. preprocess/tools/midi_editor/src/components/LyricTable.tsx +301 -0
  20. preprocess/tools/midi_editor/src/components/PianoRoll.tsx +704 -0
  21. preprocess/tools/midi_editor/src/constants.ts +8 -0
  22. preprocess/tools/midi_editor/src/i18n.ts +196 -0
  23. preprocess/tools/midi_editor/src/index.css +37 -0
  24. preprocess/tools/midi_editor/src/lib/midi.ts +224 -0
  25. preprocess/tools/midi_editor/src/main.tsx +10 -0
  26. preprocess/tools/midi_editor/src/store/useMidiStore.ts +78 -0
  27. preprocess/tools/midi_editor/src/types.ts +17 -0
  28. preprocess/tools/midi_editor/tailwind.config.js +33 -0
  29. preprocess/tools/midi_editor/tsconfig.app.json +28 -0
  30. preprocess/tools/midi_editor/tsconfig.json +7 -0
  31. preprocess/tools/midi_editor/tsconfig.node.json +26 -0
  32. preprocess/tools/midi_editor/vite.config.ts +7 -0
  33. preprocess/tools/midi_parser.py +598 -0
  34. preprocess/tools/note_transcription/__init__.py +0 -0
  35. preprocess/tools/note_transcription/model.py +531 -0
  36. preprocess/tools/note_transcription/modules/__init__.py +1 -0
  37. preprocess/tools/note_transcription/modules/commons/__init__.py +1 -0
  38. preprocess/tools/note_transcription/modules/commons/conformer/__init__.py +1 -0
  39. preprocess/tools/note_transcription/modules/commons/conformer/conformer.py +96 -0
  40. preprocess/tools/note_transcription/modules/commons/conformer/espnet_positional_embedding.py +113 -0
  41. preprocess/tools/note_transcription/modules/commons/conformer/espnet_transformer_attn.py +198 -0
  42. preprocess/tools/note_transcription/modules/commons/conformer/layers.py +260 -0
  43. preprocess/tools/note_transcription/modules/commons/conv.py +175 -0
  44. preprocess/tools/note_transcription/modules/commons/layers.py +85 -0
  45. preprocess/tools/note_transcription/modules/commons/rel_transformer.py +378 -0
  46. preprocess/tools/note_transcription/modules/commons/rnn.py +261 -0
  47. preprocess/tools/note_transcription/modules/commons/transformer.py +751 -0
  48. preprocess/tools/note_transcription/modules/commons/wavenet.py +109 -0
  49. preprocess/tools/note_transcription/modules/pe/__init__.py +1 -0
  50. preprocess/tools/note_transcription/modules/pe/rmvpe/__init__.py +6 -0
preprocess/README.md ADDED
@@ -0,0 +1,155 @@
1
+ # 🎵 SoulX-Singer-Preprocess
2
+
3
+ This module provides a comprehensive **singing transcription and editing toolkit** for real-world music audio, covering the full pipeline from vocal extraction to high-level annotation, optimized for SVS (Singing Voice Synthesis) dataset construction. By integrating state-of-the-art models, it transforms raw audio into structured singing data and supports the **customizable creation and editing of lyric-aligned MIDI scores**.
4
+
5
+
6
+ ## ✨ Features
7
+
8
+ The toolkit includes the following core modules:
9
+
10
+ - 🎤 **Clean Dry Vocal Extraction**
11
+ Extracts the lead vocal track from polyphonic music audio and applies dereverberation.
12
+
13
+ - 📝 **Lyrics Transcription**
14
+ Automatically transcribes lyrics from the clean vocal track.
15
+
16
+ - 🎶 **Note Transcription**
17
+ Converts singing voice into note-level representations for SVS.
18
+
19
+ - 🎼 **MIDI Editor**
20
+ Supports customizable creation and editing of MIDI scores integrated with lyrics.
21
+
22
+
23
+ ## 🔧 Python Environment
24
+
25
+ Before running the pipeline, set up the Python environment as follows:
26
+
27
+ 1. **Install Conda** (if not already installed): https://docs.conda.io/en/latest/miniconda.html
28
+
29
+ 2. **Activate or create a conda environment** (recommended Python 3.10):
30
+
31
+ - If you already have the `soulxsinger` environment:
32
+
33
+ ```bash
34
+ conda activate soulxsinger
35
+ ```
36
+
37
+ - Otherwise, create it first:
38
+
39
+ ```bash
40
+ conda create -n soulxsinger -y python=3.10
41
+ conda activate soulxsinger
42
+ ```
43
+
44
+ 3. **Install dependencies** from the `preprocess` directory:
45
+
46
+ ```bash
47
+ cd preprocess
48
+ pip install -r requirements.txt
49
+ ```
50
+
51
+ ## 📁 Data Preparation
52
+
53
+ Before running the pipeline, prepare the following inputs:
54
+
55
+ - **Prompt audio**
56
+ Reference audio that provides timbre and style.
57
+
58
+ - **Target audio**
59
+ Original vocal or music audio to be processed and transcribed.
60
+
61
+ Configure the corresponding parameters in:
62
+
63
+ ```
64
+ example/preprocess.sh
65
+ ```
66
+
67
+ Typical configuration includes (see the sketch after this list):
68
+ - Input / output paths
69
+ - Module enable switches
70
+
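+ The exact contents of `example/preprocess.sh` are not reproduced in this commit. As a rough, hedged sketch based on the CLI flags defined in `preprocess/pipeline.py` (shown later in this diff), the script is expected to look roughly like the following, with placeholder paths:
+
+ ```bash
+ # Hypothetical sketch of example/preprocess.sh -- paths are placeholders,
+ # the flags mirror the argparse options in preprocess/pipeline.py.
+ audio_path=example/audio/target_song.mp3     # target audio to transcribe
+ save_dir=example/transcriptions/music        # directory for intermediate results
+
+ python -m preprocess.pipeline \
+ --audio_path "${audio_path}" \
+ --save_dir "${save_dir}" \
+ --language Mandarin \
+ --device cuda:0 \
+ --vocal_sep True \
+ --midi_transcribe True \
+ --max_merge_duration 60000
+ ```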
71
+ ## 🚀 Usage
72
+
73
+ After configuring `preprocess.sh`, run the transcription pipeline with:
74
+
75
+ ```bash
76
+ bash example/preprocess.sh
77
+ ```
78
+
79
+ The script will automatically execute the following steps:
80
+
81
+ 1. **Vocal separation and dereverberation**
82
+ 2. **F0 extraction and voice activity detection (VAD)**
83
+ 3. **Lyrics transcription**
84
+ 4. **Note transcription**
85
+
86
+ ---
87
+
88
+ After the pipeline completes, you will obtain **SoulX-Singer–style metadata** that can be directly used for Singing Voice Synthesis (SVS).
89
+
90
+ **Output paths:**
91
+ - The final metadata (**JSON file**) is written **in the same directory as your input audio**, with the **same filename** (e.g. `audio.mp3` → `audio.json`)
92
+ - All **intermediate results** (separated vocal and accompaniment, F0, VAD outputs, etc.) are also saved under the configured **`save_dir`**; a rough layout sketch follows.
93
+
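+ As a rough sketch derived from the paths that `preprocess/pipeline.py` writes in this commit (exact contents depend on the enabled modules), `save_dir` will look roughly like:
+
+ ```
+ save_dir/
+ ├── vocal.wav          # separated (or original) lead vocal
+ ├── acc.wav            # accompaniment (only when vocal separation is enabled)
+ ├── vocal_f0.npy       # frame-level F0 of the full vocal track
+ ├── cut_wavs/          # VAD-cut segments and their *_f0.npy files
+ ├── long_cut_wavs/     # merged segments and their *_f0.npy files
+ └── metadata.json      # final SoulX-Singer-style metadata
+ ```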
94
+ ⚠️ **Important Note**
95
+
96
+ Transcription errors—especially in **lyrics** and **note annotations**—can significantly affect the final SVS quality. We **strongly recommend manually reviewing and correcting** the generated metadata before inference.
97
+
98
+ To support this, we provide a **MIDI Editor** for editing lyrics, phoneme alignment, note pitches, and durations. The workflow is:
99
+
100
+ **Export metadata to MIDI** → edit in the MIDI Editor → **Import edited MIDI back to metadata** for SVS.
101
+
102
+ ---
103
+
104
+ #### Step 1: Metadata → MIDI (for editing)
105
+
106
+ Convert SoulX-Singer metadata to a MIDI file so you can open it in the MIDI Editor:
107
+
108
+ ```bash
109
+ preprocess_root=example/transcriptions/music
110
+
111
+ python -m preprocess.tools.midi_parser \
112
+ --meta2midi \
113
+ --meta "${preprocess_root}/metadata.json" \
114
+ --midi "${preprocess_root}/vocal.mid"
115
+ ```
116
+
117
+ #### Step 2: Edit in the MIDI Editor
118
+
119
+ Open the MIDI Editor (see [MIDI Editor Tutorial](tools/midi_editor/README.md)), load `vocal.mid`, and correct lyrics, pitches, or durations as needed. Save the result as e.g. `vocal_edited.mid`.
120
+
121
+ #### Step 3: MIDI → Metadata (for SoulX-Singer inference)
122
+
123
+ Convert the edited MIDI back into SoulX-Singer-style metadata (and cut wavs) for SVS:
124
+
125
+ ```bash
126
+ python -m preprocess.tools.midi_parser \
127
+ --midi2meta \
128
+ --midi "${preprocess_root}/vocal_edited.mid" \
129
+ --meta "${preprocess_root}/edit_metadata.json" \
130
+ --vocal "${preprocess_root}/vocal.wav"
131
+ ```
132
+
133
+ Use `edit_metadata.json` (and the wavs under `edit_cut_wavs`) as the target metadata in your inference pipeline.
134
+
135
+
136
+ ## 🔗 References & Dependencies
137
+
138
+ This project builds upon the following excellent open-source works:
139
+
140
+ ### 🎧 Vocal Separation & Dereverberation
141
+ - [Music Source Separation Training](https://github.com/ZFTurbo/Music-Source-Separation-Training)
142
+ - [Lead Vocal Separation](https://huggingface.co/becruily/mel-band-roformer-karaoke)
143
+ - [Vocal Dereverberation](https://huggingface.co/anvuew/dereverb_mel_band_roformer)
144
+
145
+ ### 🎼 F0 Extraction
146
+ - [RMVPE](https://github.com/Dream-High/RMVPE)
147
+
148
+ ### 📝 Lyrics Transcription (ASR)
149
+ - [Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch)
150
+ - [Parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)
151
+
152
+ ### 🎶 Note Transcription
153
+ - [ROSVOT](https://github.com/RickyL-2000/ROSVOT)
154
+
155
+ We sincerely thank the authors of these repositories for their exceptional open-source contributions, which have been fundamental to the development of this toolkit.
preprocess/pipeline.py ADDED
@@ -0,0 +1,161 @@
1
+ import json
2
+ import shutil
3
+ import soundfile as sf
4
+ from pathlib import Path
5
+ import librosa
6
+
7
+ from preprocess.utils import convert_metadata, merge_short_segments
8
+
9
+ from preprocess.tools import (
10
+ F0Extractor,
11
+ VocalDetector,
12
+ VocalSeparator,
13
+ NoteTranscriber,
14
+ LyricTranscriber,
15
+ )
16
+
17
+
18
+ class PreprocessPipeline:
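+ """End-to-end preprocessing pipeline for one song.
+
+ Runs optional vocal separation and dereverberation, RMVPE f0 extraction and,
+ when ``midi_transcribe`` is enabled, VAD segmentation, lyric transcription,
+ note transcription and short-segment merging, finally writing
+ SoulX-Singer-style ``metadata.json`` under ``save_dir``.
+ """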
19
+ def __init__(self, device: str, language: str, save_dir: str, vocal_sep: bool = True, max_merge_duration: int = 60000, midi_transcribe: bool = True):
20
+ self.device = device
21
+ self.language = language
22
+ self.save_dir = save_dir
23
+ self.vocal_sep = vocal_sep
24
+ self.max_merge_duration = max_merge_duration
25
+ self.midi_transcribe = midi_transcribe
26
+
27
+ if vocal_sep:
28
+ self.vocal_separator = VocalSeparator(
29
+ sep_model_path="pretrained_models/SoulX-Singer-Preprocess/mel-band-roformer-karaoke/mel_band_roformer_karaoke_becruily.ckpt",
30
+ sep_config_path="pretrained_models/SoulX-Singer-Preprocess/mel-band-roformer-karaoke/config_karaoke_becruily.yaml",
31
+ der_model_path="pretrained_models/SoulX-Singer-Preprocess/dereverb_mel_band_roformer/dereverb_mel_band_roformer_anvuew_sdr_19.1729.ckpt",
32
+ der_config_path="pretrained_models/SoulX-Singer-Preprocess/dereverb_mel_band_roformer/dereverb_mel_band_roformer_anvuew.yaml",
33
+ device=device
34
+ )
35
+ else:
36
+ self.vocal_separator = None
37
+ self.f0_extractor = F0Extractor(
38
+ model_path="pretrained_models/SoulX-Singer-Preprocess/rmvpe/rmvpe.pt",
39
+ device=device,
40
+ )
41
+ if self.midi_transcribe:
42
+ self.vocal_detector = VocalDetector(
43
+ cut_wavs_output_dir=f"{save_dir}/cut_wavs",
44
+ )
45
+ self.lyric_transcriber = LyricTranscriber(
46
+ zh_model_path="pretrained_models/SoulX-Singer-Preprocess/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
47
+ en_model_path="pretrained_models/SoulX-Singer-Preprocess/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2.nemo",
48
+ device=device
49
+ )
50
+ self.note_transcriber = NoteTranscriber(
51
+ rosvot_model_path="pretrained_models/SoulX-Singer-Preprocess/rosvot/rosvot/model.pt",
52
+ rwbd_model_path="pretrained_models/SoulX-Singer-Preprocess/rosvot/rwbd/model.pt",
53
+ device=device
54
+ )
55
+ else:
56
+ self.vocal_detector = None
57
+ self.lyric_transcriber = None
58
+ self.note_transcriber = None
59
+
60
+ def run(
61
+ self,
62
+ audio_path: str,
63
+ vocal_sep: bool = None,
64
+ max_merge_duration: int = None,
65
+ language: str = None,
66
+ ) -> None:
67
+ vocal_sep = self.vocal_sep if vocal_sep is None else vocal_sep
68
+ max_merge_duration = self.max_merge_duration if max_merge_duration is None else max_merge_duration
69
+ language = self.language if language is None else language
70
+ output_dir = Path(self.save_dir)
71
+ output_dir.mkdir(parents=True, exist_ok=True)
72
+
73
+ if vocal_sep:
74
+ # Perform vocal/accompaniment separation
75
+ sep = self.vocal_separator.process(audio_path)
76
+ vocal = sep.vocals_dereverbed.T
77
+ acc = sep.accompaniment.T
78
+ sample_rate = sep.sample_rate
79
+
80
+ vocal_path = output_dir / "vocal.wav"
81
+ acc_path = output_dir / "acc.wav"
82
+ sf.write(vocal_path, vocal, sample_rate)
83
+ sf.write(acc_path, acc, sample_rate)
84
+ else:
85
+ # Use the original audio as vocal source (no separation)
86
+ vocal, sample_rate = librosa.load(audio_path, sr=None, mono=True)
87
+ vocal_path = output_dir / "vocal.wav"
88
+ sf.write(vocal_path, vocal, sample_rate)
89
+
90
+ vocal_f0 = self.f0_extractor.process(str(vocal_path), f0_path=str(vocal_path).replace(".wav", "_f0.npy"))
91
+
92
+ if not self.midi_transcribe or self.vocal_detector is None or self.lyric_transcriber is None or self.note_transcriber is None:
93
+ return
94
+
95
+ segments = self.vocal_detector.process(str(vocal_path), f0=vocal_f0)
96
+
97
+ metadata = []
98
+ for seg in segments:
99
+ self.f0_extractor.process(seg["wav_fn"], f0_path=seg["wav_fn"].replace(".wav", "_f0.npy"))
100
+ words, durs = self.lyric_transcriber.process(
101
+ seg["wav_fn"], language
102
+ )
103
+ seg["words"] = words
104
+ seg["word_durs"] = durs
105
+ seg["language"] = language
106
+ metadata.append(
107
+ self.note_transcriber.process(seg, segment_info=seg)
108
+ )
109
+
110
+ merged = merge_short_segments(
111
+ vocal,
112
+ sample_rate,
113
+ metadata,
114
+ output_dir / "long_cut_wavs",
115
+ max_duration_ms=max_merge_duration,
116
+ )
117
+
118
+ final_metadata = []
119
+
120
+ for item in merged:
121
+ self.f0_extractor.process(item.wav_fn, f0_path=item.wav_fn.replace(".wav", "_f0.npy"))
122
+ final_metadata.append(convert_metadata(item))
123
+
124
+ with open(output_dir / "metadata.json", "w", encoding="utf-8") as f:
125
+ json.dump(final_metadata, f, ensure_ascii=False, indent=2)
126
+
127
+ shutil.copy(output_dir / "metadata.json", audio_path.replace(".wav", ".json").replace(".mp3", ".json").replace(".flac", ".json"))
128
+
129
+
130
+ def main(args):
131
+ pipeline = PreprocessPipeline(
132
+ device=args.device,
133
+ language=args.language,
134
+ save_dir=args.save_dir,
135
+ vocal_sep=args.vocal_sep,
136
+ max_merge_duration=args.max_merge_duration,
137
+ midi_transcribe=args.midi_transcribe,
138
+ )
139
+ pipeline.run(
140
+ audio_path=args.audio_path,
141
+ language=args.language,
142
+ )
143
+
144
+
145
+ if __name__ == "__main__":
146
+ import argparse
147
+
148
+ parser = argparse.ArgumentParser()
149
+ parser.add_argument("--audio_path", type=str, required=True, help="Path to the input audio file")
150
+ parser.add_argument("--save_dir", type=str, required=True, help="Directory to save the output files")
151
+ parser.add_argument("--language", type=str, default="Mandarin", help="Language of the audio")
152
+ parser.add_argument("--device", type=str, default="cuda:0", help="Device to run the models on")
153
+ parser.add_argument("--vocal_sep", type=str, default="True", help="Whether to perform vocal separation")
154
+ parser.add_argument("--max_merge_duration", type=int, default=60000, help="Maximum merged segment duration in milliseconds")
155
+ parser.add_argument("--midi_transcribe", type=str, default="True", help="Whether to do MIDI transcription")
156
+ args = parser.parse_args()
157
+
158
+ args.vocal_sep = args.vocal_sep.lower() == "true"
159
+ args.midi_transcribe = args.midi_transcribe.lower() == "true"
160
+
161
+ main(args)
preprocess/requirements.txt ADDED
@@ -0,0 +1,34 @@
1
+ beartype==0.22.9
2
+ einops==0.8.2
3
+ funasr==1.3.0
4
+ g2p_en==2.1.0
5
+ g2pM==0.1.2.5
6
+ librosa==0.11.0
7
+ loralib==0.1.2
8
+ matplotlib==3.10.8
9
+ mido==1.3.3
10
+ ml_collections==1.1.0
11
+ nemo_toolkit==2.6.1
12
+ nltk==3.9.2
13
+ numba==0.63.1
14
+ numpy==2.2.6
15
+ omegaconf==2.3.0
16
+ packaging==24.2
17
+ praat-parselmouth==0.4.7
18
+ pretty_midi==0.2.11
19
+ pyloudnorm==0.2.0
20
+ pyworld==0.3.5
21
+ rotary_embedding_torch==0.8.9
22
+ sageattention==1.0.6
23
+ scikit_learn==1.7.2
24
+ scipy==1.15.3
25
+ six==1.17.0
26
+ setuptools==81.0.0
27
+ scikit_image==0.25.2
28
+ soundfile==0.13.1
29
+ ToJyutping==3.2.0
30
+ torch==2.10.0
31
+ torchaudio==2.10.0
32
+ tqdm==4.67.1
33
+ wandb==0.24.2
34
+ webrtcvad==2.0.10
preprocess/tools/__init__.py ADDED
@@ -0,0 +1,53 @@
1
+ """Preprocess tools.
2
+
3
+ This package provides a thin, stable import surface for common preprocess components.
4
+
5
+ Examples:
6
+ from preprocess.tools import (
7
+ F0Extractor,
8
+ VocalDetector,
+ VocalSeparator,
+ NoteTranscriber,
+ LyricTranscriber,
14
+ )
15
+
16
+ Note:
17
+ Keep these imports lightweight. If a tool pulls heavy dependencies at import time,
18
+ consider switching to lazy imports.
19
+ """
20
+
21
+ from __future__ import annotations
22
+
23
+ # Core tools
24
+ from .f0_extraction import F0Extractor
25
+ from .vocal_detection import VocalDetector
26
+
27
+ # Some tools may live outside this package in different layouts across branches.
28
+ # Keep the public surface stable while avoiding hard import failures.
29
+ try:
30
+ from .vocal_separation.model import VocalSeparator # type: ignore
31
+ except Exception: # pragma: no cover
32
+ VocalSeparator = None # type: ignore
33
+
34
+ try:
35
+ from .note_transcription.model import NoteTranscriber # type: ignore
36
+ except Exception: # pragma: no cover
37
+ NoteTranscriber = None # type: ignore
38
+ try:
39
+ from .lyric_transcription import LyricTranscriber
40
+ except Exception: # pragma: no cover
41
+ LyricTranscriber = None # type: ignore
42
+
43
+ __all__ = [
44
+ "F0Extractor",
45
+ "VocalDetector",
46
+ ]
47
+
48
+ if VocalSeparator is not None:
49
+ __all__.append("VocalSeparator")
50
+ if LyricTranscriber is not None:
51
+ __all__.append("LyricTranscriber")
52
+ if NoteTranscriber is not None:
53
+ __all__.append("NoteTranscriber")
preprocess/tools/f0_extraction.py ADDED
@@ -0,0 +1,527 @@
1
+ # https://github.com/Dream-High/RMVPE
2
+ import math
3
+ import time
4
+ import librosa
5
+ import numpy as np
6
+ from librosa.filters import mel
7
+ from scipy.interpolate import interp1d
8
+
9
+ from typing import Optional
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ import torch.nn.functional as F
14
+
15
+
16
+ class BiGRU(nn.Module):
17
+ def __init__(self, input_features, hidden_features, num_layers):
18
+ super(BiGRU, self).__init__()
19
+ self.gru = nn.GRU(
20
+ input_features,
21
+ hidden_features,
22
+ num_layers=num_layers,
23
+ batch_first=True,
24
+ bidirectional=True,
25
+ )
26
+
27
+ def forward(self, x):
28
+ return self.gru(x)[0]
29
+
30
+
31
+ class ConvBlockRes(nn.Module):
32
+ def __init__(self, in_channels, out_channels, momentum=0.01):
33
+ super(ConvBlockRes, self).__init__()
34
+ self.conv = nn.Sequential(
35
+ nn.Conv2d(
36
+ in_channels=in_channels,
37
+ out_channels=out_channels,
38
+ kernel_size=(3, 3),
39
+ stride=(1, 1),
40
+ padding=(1, 1),
41
+ bias=False,
42
+ ),
43
+ nn.BatchNorm2d(out_channels, momentum=momentum),
44
+ nn.ReLU(),
45
+ nn.Conv2d(
46
+ in_channels=out_channels,
47
+ out_channels=out_channels,
48
+ kernel_size=(3, 3),
49
+ stride=(1, 1),
50
+ padding=(1, 1),
51
+ bias=False,
52
+ ),
53
+ nn.BatchNorm2d(out_channels, momentum=momentum),
54
+ nn.ReLU(),
55
+ )
56
+ if in_channels != out_channels:
57
+ self.shortcut = nn.Conv2d(in_channels, out_channels, (1, 1))
58
+
59
+ def forward(self, x):
60
+ if not hasattr(self, "shortcut"):
61
+ return self.conv(x) + x
62
+ else:
63
+ return self.conv(x) + self.shortcut(x)
64
+
65
+
66
+ class ResEncoderBlock(nn.Module):
67
+ def __init__(self, in_channels, out_channels, kernel_size, n_blocks=1, momentum=0.01):
68
+ super(ResEncoderBlock, self).__init__()
69
+ self.n_blocks = n_blocks
70
+ self.conv = nn.ModuleList()
71
+ self.conv.append(ConvBlockRes(in_channels, out_channels, momentum))
72
+ for i in range(n_blocks - 1):
73
+ self.conv.append(ConvBlockRes(out_channels, out_channels, momentum))
74
+ self.kernel_size = kernel_size
75
+ if self.kernel_size is not None:
76
+ self.pool = nn.AvgPool2d(kernel_size=kernel_size)
77
+
78
+ def forward(self, x):
79
+ for conv in self.conv:
80
+ x = conv(x)
81
+ if self.kernel_size is not None:
82
+ return x, self.pool(x)
83
+ else:
84
+ return x
85
+
86
+
87
+ class Encoder(nn.Module):
88
+ def __init__(self, in_channels, in_size, n_encoders, kernel_size, n_blocks, out_channels=16, momentum=0.01):
89
+ super(Encoder, self).__init__()
90
+ self.n_encoders = n_encoders
91
+ self.bn = nn.BatchNorm2d(in_channels, momentum=momentum)
92
+ self.layers = nn.ModuleList()
93
+ self.latent_channels = []
94
+ for i in range(self.n_encoders):
95
+ self.layers.append(
96
+ ResEncoderBlock(in_channels, out_channels, kernel_size, n_blocks, momentum=momentum)
97
+ )
98
+ self.latent_channels.append([out_channels, in_size])
99
+ in_channels = out_channels
100
+ out_channels *= 2
101
+ in_size //= 2
102
+ self.out_size = in_size
103
+ self.out_channel = out_channels
104
+
105
+ def forward(self, x):
106
+ concat_tensors = []
107
+ x = self.bn(x)
108
+ for layer in self.layers:
109
+ t, x = layer(x)
110
+ concat_tensors.append(t)
111
+ return x, concat_tensors
112
+
113
+
114
+ class Intermediate(nn.Module):
115
+ def __init__(self, in_channels, out_channels, n_inters, n_blocks, momentum=0.01):
116
+ super(Intermediate, self).__init__()
117
+ self.n_inters = n_inters
118
+ self.layers = nn.ModuleList()
119
+ self.layers.append(ResEncoderBlock(in_channels, out_channels, None, n_blocks, momentum))
120
+ for i in range(self.n_inters - 1):
121
+ self.layers.append(ResEncoderBlock(out_channels, out_channels, None, n_blocks, momentum))
122
+
123
+ def forward(self, x):
124
+ for layer in self.layers:
125
+ x = layer(x)
126
+ return x
127
+
128
+
129
+ class ResDecoderBlock(nn.Module):
130
+ def __init__(self, in_channels, out_channels, stride, n_blocks=1, momentum=0.01):
131
+ super(ResDecoderBlock, self).__init__()
132
+ out_padding = (0, 1) if stride == (1, 2) else (1, 1)
133
+ self.n_blocks = n_blocks
134
+ self.conv1 = nn.Sequential(
135
+ nn.ConvTranspose2d(
136
+ in_channels=in_channels,
137
+ out_channels=out_channels,
138
+ kernel_size=(3, 3),
139
+ stride=stride,
140
+ padding=(1, 1),
141
+ output_padding=out_padding,
142
+ bias=False,
143
+ ),
144
+ nn.BatchNorm2d(out_channels, momentum=momentum),
145
+ nn.ReLU(),
146
+ )
147
+ self.conv2 = nn.ModuleList()
148
+ self.conv2.append(ConvBlockRes(out_channels * 2, out_channels, momentum))
149
+ for i in range(n_blocks - 1):
150
+ self.conv2.append(ConvBlockRes(out_channels, out_channels, momentum))
151
+
152
+ def forward(self, x, concat_tensor):
153
+ x = self.conv1(x)
154
+ x = torch.cat((x, concat_tensor), dim=1)
155
+ for conv2 in self.conv2:
156
+ x = conv2(x)
157
+ return x
158
+
159
+
160
+ class Decoder(nn.Module):
161
+ def __init__(self, in_channels, n_decoders, stride, n_blocks, momentum=0.01):
162
+ super(Decoder, self).__init__()
163
+ self.layers = nn.ModuleList()
164
+ self.n_decoders = n_decoders
165
+ for i in range(self.n_decoders):
166
+ out_channels = in_channels // 2
167
+ self.layers.append(
168
+ ResDecoderBlock(in_channels, out_channels, stride, n_blocks, momentum)
169
+ )
170
+ in_channels = out_channels
171
+
172
+ def forward(self, x, concat_tensors):
173
+ for i, layer in enumerate(self.layers):
174
+ x = layer(x, concat_tensors[-1 - i])
175
+ return x
176
+
177
+
178
+ class DeepUnet(nn.Module):
179
+ def __init__(self, kernel_size, n_blocks, en_de_layers=5, inter_layers=4, in_channels=1, en_out_channels=16):
180
+ super(DeepUnet, self).__init__()
181
+ self.encoder = Encoder(in_channels, 128, en_de_layers, kernel_size, n_blocks, en_out_channels)
182
+ self.intermediate = Intermediate(
183
+ self.encoder.out_channel // 2,
184
+ self.encoder.out_channel,
185
+ inter_layers,
186
+ n_blocks,
187
+ )
188
+ self.decoder = Decoder(self.encoder.out_channel, en_de_layers, kernel_size, n_blocks)
189
+
190
+ def forward(self, x):
191
+ x, concat_tensors = self.encoder(x)
192
+ x = self.intermediate(x)
193
+ x = self.decoder(x, concat_tensors)
194
+ return x
195
+
196
+
197
+ class E2E(nn.Module):
198
+ def __init__(self, n_blocks, n_gru, kernel_size, en_de_layers=5, inter_layers=4, in_channels=1, en_out_channels=16):
199
+ super(E2E, self).__init__()
200
+ self.unet = DeepUnet(kernel_size, n_blocks, en_de_layers, inter_layers, in_channels, en_out_channels)
201
+ self.cnn = nn.Conv2d(en_out_channels, 3, (3, 3), padding=(1, 1))
202
+ if n_gru:
203
+ self.fc = nn.Sequential(
204
+ BiGRU(3 * 128, 256, n_gru),
205
+ nn.Linear(512, 360),
206
+ nn.Dropout(0.25),
207
+ nn.Sigmoid(),
208
+ )
209
+ else:
210
+ self.fc = nn.Sequential(
211
+ nn.Linear(3 * 128, 360),
212
+ nn.Dropout(0.25),
213
+ nn.Sigmoid()
214
+ )
215
+
216
+ def forward(self, mel):
217
+ mel = mel.transpose(-1, -2).unsqueeze(1)
218
+ x = self.cnn(self.unet(mel)).transpose(1, 2).flatten(-2)
219
+ x = self.fc(x)
220
+ return x
221
+
222
+
223
+
224
+ class MelSpectrogram(torch.nn.Module):
225
+ def __init__(self, is_half, n_mel_channels, sampling_rate, win_length, hop_length,
226
+ n_fft=None, mel_fmin=0, mel_fmax=None, clamp=1e-5):
227
+ super().__init__()
228
+ n_fft = win_length if n_fft is None else n_fft
229
+ self.hann_window = {}
230
+ mel_basis = mel(
231
+ sr=sampling_rate,
232
+ n_fft=n_fft,
233
+ n_mels=n_mel_channels,
234
+ fmin=mel_fmin,
235
+ fmax=mel_fmax,
236
+ htk=True,
237
+ )
238
+ mel_basis = torch.from_numpy(mel_basis).float()
239
+ self.register_buffer("mel_basis", mel_basis)
240
+ self.n_fft = win_length if n_fft is None else n_fft
241
+ self.hop_length = hop_length
242
+ self.win_length = win_length
243
+ self.sampling_rate = sampling_rate
244
+ self.n_mel_channels = n_mel_channels
245
+ self.clamp = clamp
246
+ self.is_half = is_half
247
+
248
+ def forward(self, audio, keyshift=0, speed=1, center=True):
249
+ factor = 2 ** (keyshift / 12)
250
+ n_fft_new = int(np.round(self.n_fft * factor))
251
+ win_length_new = int(np.round(self.win_length * factor))
252
+ hop_length_new = int(np.round(self.hop_length * speed))
253
+
254
+ keyshift_key = str(keyshift) + "_" + str(audio.device)
255
+ if keyshift_key not in self.hann_window:
256
+ self.hann_window[keyshift_key] = torch.hann_window(win_length_new).to(audio.device)
257
+
258
+ fft = torch.stft(
259
+ audio,
260
+ n_fft=n_fft_new,
261
+ hop_length=hop_length_new,
262
+ win_length=win_length_new,
263
+ window=self.hann_window[keyshift_key],
264
+ center=center,
265
+ return_complex=True,
266
+ )
267
+ magnitude = torch.sqrt(fft.real.pow(2) + fft.imag.pow(2))
268
+
269
+ if keyshift != 0:
270
+ size = self.n_fft // 2 + 1
271
+ resize = magnitude.size(1)
272
+ if resize < size:
273
+ magnitude = F.pad(magnitude, (0, 0, 0, size - resize))
274
+ magnitude = magnitude[:, :size, :] * self.win_length / win_length_new
275
+
276
+ mel_output = torch.matmul(self.mel_basis, magnitude)
277
+ if self.is_half:
278
+ mel_output = mel_output.half()
279
+ log_mel_spec = torch.log(torch.clamp(mel_output, min=self.clamp))
280
+ return log_mel_spec
281
+
282
+
283
+
284
+ class RMVPE:
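+ """Thin inference wrapper around the RMVPE pitch-estimation network.
+
+ Computes a 128-bin log-mel spectrogram (16 kHz input, hop 160), runs the E2E
+ model to obtain per-frame salience over 360 cent bins, and decodes it into
+ f0 in Hz by local weighted averaging; frames below ``thred`` are set to 0.
+ """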
285
+ def __init__(self, model_path: str, is_half, device=None):
286
+ self.is_half = is_half
287
+ if device is None:
288
+ device = "cuda:0" if torch.cuda.is_available() else "cpu"
289
+ self.device = torch.device(device) if isinstance(device, str) else device
290
+
291
+ self.mel_extractor = MelSpectrogram(
292
+ is_half=is_half,
293
+ n_mel_channels=128,
294
+ sampling_rate=16000,
295
+ win_length=1024,
296
+ hop_length=160,
297
+ n_fft=None,
298
+ mel_fmin=30,
299
+ mel_fmax=8000
300
+ ).to(self.device)
301
+
302
+ model = E2E(n_blocks=4, n_gru=1, kernel_size=(2, 2))
303
+ ckpt = torch.load(model_path, map_location=self.device)
304
+ model.load_state_dict(ckpt)
305
+ model.eval()
306
+
307
+ if is_half:
308
+ model = model.half()
309
+ else:
310
+ model = model.float()
311
+
312
+ self.model = model.to(self.device)
313
+
314
+ cents_mapping = 20 * np.arange(360) + 1997.3794084376191
315
+ self.cents_mapping = np.pad(cents_mapping, (4, 4)) # 368
316
+
317
+ def mel2hidden(self, mel):
318
+ with torch.no_grad():
319
+ n_frames = mel.shape[-1]
320
+ n_pad = 32 * ((n_frames - 1) // 32 + 1) - n_frames
321
+ if n_pad > 0:
322
+ mel = F.pad(mel, (0, n_pad), mode="constant")
323
+ mel = mel.half() if self.is_half else mel.float()
324
+ hidden = self.model(mel)
325
+ return hidden[:, :n_frames]
326
+
327
+ def decode(self, hidden, thred=0.03):
328
+ cents_pred = self.to_local_average_cents(hidden, thred=thred)
329
+ f0 = 10 * (2 ** (cents_pred / 1200))
330
+ f0[f0 == 10] = 0
331
+ return f0
332
+
333
+ def infer_from_audio(self, audio, thred=0.03):
334
+ if not torch.is_tensor(audio):
335
+ audio = torch.from_numpy(audio)
336
+
337
+ mel = self.mel_extractor(audio.float().to(self.device).unsqueeze(0), center=True)
338
+ hidden = self.mel2hidden(mel)
339
+ hidden = hidden.squeeze(0).cpu().numpy()
340
+
341
+ if self.is_half:
342
+ hidden = hidden.astype("float32")
343
+
344
+ f0 = self.decode(hidden, thred=thred)
345
+ return f0
346
+
347
+ def to_local_average_cents(self, salience, thred=0.05):
348
+ center = np.argmax(salience, axis=1)
349
+ salience = np.pad(salience, ((0, 0), (4, 4)))
350
+ center += 4
351
+
352
+ todo_salience = []
353
+ todo_cents_mapping = []
354
+ starts = center - 4
355
+ ends = center + 5
356
+
357
+ for idx in range(salience.shape[0]):
358
+ todo_salience.append(salience[:, starts[idx]:ends[idx]][idx])
359
+ todo_cents_mapping.append(self.cents_mapping[starts[idx]:ends[idx]])
360
+
361
+ todo_salience = np.array(todo_salience)
362
+ todo_cents_mapping = np.array(todo_cents_mapping)
363
+ product_sum = np.sum(todo_salience * todo_cents_mapping, 1)
364
+ weight_sum = np.sum(todo_salience, 1)
365
+ divided = product_sum / weight_sum
366
+
367
+ maxx = np.max(salience, axis=1)
368
+ divided[maxx <= thred] = 0
369
+
370
+ return divided
371
+
372
+ class F0Extractor:
373
+ """Extract frame-level f0 from singing voice.
374
+
375
+ Wrapper around an RMVPE network that:
376
+ 1) loads the checkpoint once in ``__init__``
377
+ 2) exposes a simple :py:meth:`process` API and optionally saves ``*_f0.npy``.
378
+ """
379
+ def __init__(
380
+ self,
381
+ model_path: str,
382
+ device: str = "cpu",
383
+ *,
384
+ is_half: bool = False,
385
+ input_sr: int = 16000,
386
+ target_sr: int = 24000,
387
+ hop_size: int = 480,
388
+ max_duration: float = 300,
389
+ thred: float = 0.03,
390
+ verbose: bool = True,
391
+ ):
392
+ """Initialize the f0 extractor.
393
+
394
+ Args:
395
+ model_path: Path to RMVPE checkpoint.
396
+ device: Torch device string, e.g. ``"cuda:0"`` / ``"cpu"``.
397
+ is_half: Whether to run the model in fp16.
398
+ input_sr: Input resample rate used by RMVPE frontend.
399
+ target_sr: Target sample rate for the output f0 grid.
400
+ hop_size: Target hop size for the output f0 grid.
401
+ max_duration: Max duration (seconds) for interpolation grid.
402
+ thred: Voicing threshold used when decoding salience.
403
+ verbose: Whether to print verbose logs.
404
+ """
405
+ self.model_path = model_path
406
+ self.input_sr = input_sr
407
+ self.target_sr = target_sr
408
+ self.hop_size = hop_size
409
+ self.max_duration = max_duration
410
+ self.thred = thred
411
+
412
+ self.verbose = verbose
413
+
414
+ self.model = RMVPE(model_path, is_half=is_half, device=device)
415
+
416
+ if self.verbose:
417
+ print(
418
+ "[f0 extraction] init success:",
419
+ f"device={device}",
420
+ f"model_path={model_path}",
421
+ f"is_half={is_half}",
422
+ f"input_sr={input_sr}",
423
+ f"target_sr={target_sr}",
424
+ f"hop_size={hop_size}",
425
+ f"thred={thred}",
426
+ )
427
+
428
+ @staticmethod
429
+ def interpolate_f0(
430
+ f0_16k: np.ndarray,
431
+ original_length: int,
432
+ original_sr: int,
433
+ *,
434
+ target_sr: int = 48000,
435
+ hop_size: int = 256,
436
+ max_duration: float = 20.0,
437
+ ) -> np.ndarray:
438
+ """Interpolate f0 from RMVPE's 16k hop grid to target mel hop grid."""
439
+ mel_target_sr = target_sr
440
+ mel_hop_size = hop_size
441
+ mel_max_duration = max_duration
442
+
443
+ batch_max_length = int(mel_max_duration * mel_target_sr / mel_hop_size)
444
+ duration_in_seconds = original_length / original_sr
445
+ effective_target_length = int(duration_in_seconds * mel_target_sr)
446
+ original_frames = math.ceil(effective_target_length / mel_hop_size)
447
+ target_frames = min(original_frames, batch_max_length)
448
+
449
+ rmvpe_hop = 160
450
+ t_16k = np.arange(len(f0_16k)) * (rmvpe_hop / 16000.0)
451
+ t_target = np.arange(target_frames) * (mel_hop_size / float(mel_target_sr))
452
+
453
+ if len(f0_16k) > 0:
454
+ f_interp = interp1d(
455
+ t_16k,
456
+ f0_16k,
457
+ kind="linear",
458
+ bounds_error=False,
459
+ fill_value=0.0,
460
+ assume_sorted=True,
461
+ )
462
+ f0 = f_interp(t_target)
463
+ else:
464
+ f0 = np.zeros(target_frames)
465
+
466
+ if len(f0) != target_frames:
467
+ f0 = (
468
+ f0[:target_frames]
469
+ if len(f0) > target_frames
470
+ else np.pad(f0, (0, target_frames - len(f0)), "constant")
471
+ )
472
+
473
+ return f0
474
+
475
+ def process(self, audio_path: str, *, f0_path: str | None = None, verbose: Optional[bool] = None) -> np.ndarray:
476
+ """Run f0 extraction for a single wav.
477
+
478
+ Args:
479
+ audio_path: Path to the input wav file.
480
+ f0_path: if is not None, save the f0 data to this path.
481
+ verbose: Override instance-level verbose flag for this call.
482
+
483
+ Returns:
484
+ np.ndarray: shape ``[T]``, f0 in Hz (0 for unvoiced).
485
+ """
486
+ verbose = self.verbose if verbose is None else verbose
487
+ if verbose:
488
+ print(f"[f0 extraction] process: start: {audio_path}")
489
+ t0 = time.time()
490
+
491
+ audio, _ = librosa.load(audio_path, sr=self.input_sr)
492
+ f0_16k = self.model.infer_from_audio(audio, thred=self.thred)
493
+ f0 = self.interpolate_f0(
494
+ f0_16k,
495
+ original_length=audio.shape[-1],
496
+ original_sr=self.input_sr,
497
+ target_sr=self.target_sr,
498
+ hop_size=self.hop_size,
499
+ max_duration=self.max_duration,
500
+ )
501
+
502
+ if verbose:
503
+ dt = time.time() - t0
504
+ voiced_ratio = float(np.mean(f0 > 0)) if len(f0) else 0.0
505
+ print(
506
+ "[f0 extraction] process: done:",
507
+ f"frames={len(f0)}",
508
+ f"voiced_ratio={voiced_ratio:.3f}",
509
+ f"time={dt:.3f}s",
510
+ )
511
+ if f0_path is not None:
512
+ np.save(f0_path, f0)
513
+
514
+ return f0
515
+
516
+
517
+ if __name__ == "__main__":
518
+ model_path = (
519
+ "pretrained_models/SoulX-Singer-Preprocess/rmvpe/rmvpe.pt"
520
+ )
521
+ audio_path = "example/audio/zh_prompt.mp3"
522
+
523
+ pe = F0Extractor(
524
+ model_path,
525
+ device="cuda",
526
+ )
527
+ f0 = pe.process(audio_path, f0_path="example/audio/zh_prompt_f0.npy")
preprocess/tools/g2p.py ADDED
@@ -0,0 +1,72 @@
1
+ import re
2
+
3
+ import ToJyutping
4
+ from g2pM import G2pM
5
+ from g2p_en import G2p as G2pE
6
+
7
+ _EN_WORD_RE = re.compile(r"^[A-Za-z]+(?:'[A-Za-z]+)*$")
8
+ _ZH_WORD_RE = re.compile(r"[\u4e00-\u9fff]")
9
+
10
+ EN_FLAG = "en_"
11
+ YUE_FLAG = "yue_"
12
+ ZH_FLAG = "zh_"
13
+
14
+ g2p_zh = G2pM()
15
+ g2p_en = G2pE()
16
+
17
+
18
+ def is_chinese_char(word: str) -> bool:
19
+ if len(word) != 1:
20
+ return False
21
+ return bool(_ZH_WORD_RE.fullmatch(word))
22
+
23
+ def is_english_word(word: str) -> bool:
24
+ if not word:
25
+ return False
26
+ return bool(_EN_WORD_RE.fullmatch(word))
27
+
28
+ def g2p_cantonese(sent):
29
+ return ToJyutping.get_jyutping_list(sent) # with tone
30
+
31
+ def g2p_mandarin(sent):
32
+ return g2p_zh(sent, tone=True, char_split=False)
33
+
34
+ def g2p_english(word):
35
+ return g2p_en(word)
36
+
37
+ def g2p_transform(words, lang):
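+ """Convert a word list into phoneme tokens for the given language.
+
+ "<SP>" tokens are kept as-is; English words are run through g2p_en and joined
+ with "-" under the ``en_`` prefix; Chinese characters are batched through
+ Mandarin (g2pM) or Cantonese (ToJyutping) G2P and written back to their
+ original positions with ``zh_`` / ``yue_`` prefixes. Anything else becomes "<SP>".
+ """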
38
+
39
+ zh_words = []
40
+ transformed_words = [0] * len(words)
41
+
42
+ for idx, w in enumerate(words):
43
+ if w == "<SP>":
44
+ transformed_words[idx] = w
45
+ continue
46
+
47
+ w = w.replace("?", "").replace(".", "").replace("!", "").replace(",", "")
48
+
49
+ if is_chinese_char(w):
50
+ zh_words.append([idx, w])
51
+ else:
52
+ if is_english_word(w):
53
+ w = EN_FLAG + "-".join(g2p_english(w.lower()))
54
+ else:
55
+ w = "<SP>"
56
+ transformed_words[idx] = w
57
+
58
+ sent = "".join([k[1] for k in zh_words])
59
+
60
+ # zh (zh and yue) transformer to g2p
61
+ if len(sent) > 0:
62
+ if lang == "Cantonese":
63
+ g2pm_rst = g2p_cantonese(sent) # with tone
64
+ g2pm_rst = [YUE_FLAG + k[1] for k in g2pm_rst]
65
+ else:
66
+ g2pm_rst = g2p_mandarin(sent)
67
+ g2pm_rst = [ZH_FLAG + k for k in g2pm_rst]
68
+ for p, w in zip([k[0] for k in zh_words], g2pm_rst):
69
+ transformed_words[p] = w
70
+
71
+ return transformed_words
72
+
preprocess/tools/lyric_transcription.py ADDED
@@ -0,0 +1,283 @@
1
+ # https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary
2
+ # https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
3
+ import os
4
+ import re
5
+ import time
6
+ from typing import Any, Dict, List, Tuple
7
+
8
+ import librosa
9
+ import numpy as np
10
+ from funasr import AutoModel
11
+
12
+
13
+ def _build_words_with_gaps(raw_words, raw_timestamps, wav_fn: str):
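+ """Build word and duration lists from ASR output, inserting "<SP>" gap tokens.
+
+ Gaps between consecutive word timestamps, and any trailing gap up to the end
+ of the wav, become "<SP>" entries so the durations cover the full audio length.
+ """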
14
+ words, word_durs = [], []
15
+ prev = 0.0
16
+ for w, t in zip(raw_words, raw_timestamps):
17
+ s, e = float(t[0]), float(t[1])
18
+ if s > prev:
19
+ words.append("<SP>")
20
+ word_durs.append(s - prev)
21
+ words.append(w)
22
+ word_durs.append(e - s)
23
+ prev = e
24
+
25
+ wav_len = librosa.get_duration(path=wav_fn)
26
+ if wav_len > prev:
27
+ if len(words) == 0:
28
+ words.append("<SP>")
29
+ word_durs.append(wav_len)
30
+ return words, word_durs
31
+ if words[-1] != "<SP>":
32
+ words.append("<SP>")
33
+ word_durs.append(wav_len - prev)
34
+ else:
35
+ word_durs[-1] += wav_len - prev
36
+
37
+ return words, word_durs
38
+
39
+ def _word_dur_post_process(words, word_durs, f0):
40
+ """Post-process word durations using f0 to better place silences.
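+
+ Each "<SP>" span is re-examined on the f0 frame grid (24 kHz, hop 480): spans
+ that are almost entirely voiced are merged into the preceding word, and
+ partially voiced spans are split so that only the unvoiced tail remains "<SP>".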
41
+ """
42
+ # f0 time grid parameters
43
+ sr = 24000 # f0 sample rate
44
+ hop_length = 480 # f0 hop length
45
+
46
+ # Convert word durations (seconds) to frame boundaries on the f0 grid.
47
+ boundaries = np.cumsum([
48
+ 0,
49
+ *[
50
+ int(dur * sr / hop_length)
51
+ for dur in word_durs
52
+ ],
53
+ ]).tolist()
54
+
55
+ sil_tolerance = 5 # tolerance frames for silence detection
56
+ ext_tolerance = 5 # tolerance frames for vocal extension
57
+
58
+ new_words: list[str] = []
59
+ new_word_durs: list[float] = []
60
+ if words:
61
+ new_words.append(words[0])
62
+ new_word_durs.append(word_durs[0])
63
+
64
+ for i in range(1, len(words)):
65
+ word = words[i]
66
+ if word == "<SP>":
67
+ start_frame = boundaries[i]
68
+ end_frame = boundaries[i + 1]
69
+
70
+ num_frames = end_frame - start_frame
71
+ frame_idx = start_frame
72
+
73
+ # Find first region with at least 5 consecutive "unvoiced" frames.
74
+ unvoiced_count = 0
75
+ while frame_idx < end_frame:
76
+ if f0[frame_idx] <= 1: # unvoiced
77
+ unvoiced_count += 1
78
+ if unvoiced_count >= sil_tolerance:
79
+ frame_idx -= sil_tolerance - 1 # back to the last voiced frame
80
+ break
81
+ else:
82
+ unvoiced_count = 0
83
+ frame_idx += 1
84
+
85
+ voice_frames = frame_idx - start_frame
86
+
87
+ if voice_frames >= int(num_frames * 0.9): # over 90% voiced
88
+ # Treat the whole "<SP>" as silence and merge into previous word.
89
+ new_word_durs[-1] += word_durs[i]
90
+ elif voice_frames >= ext_tolerance: # over 5 frames voiced
91
+ # Split the "<SP>" into two parts: leading silence and tail kept as "<SP>".
92
+ dur = voice_frames * hop_length / sr
93
+ new_word_durs[-1] += dur
94
+ new_words.append("<SP>")
95
+ new_word_durs.append(word_durs[i] - dur)
96
+ else:
97
+ # Too short to adjust, keep as-is.
98
+ new_words.append(word)
99
+ new_word_durs.append(word_durs[i])
100
+ else:
101
+ new_words.append(word)
102
+ new_word_durs.append(word_durs[i])
103
+
104
+ return new_words, new_word_durs
105
+
106
+
107
+ class _ASRZhModel:
108
+ """Mandarin/Cantonese ASR wrapper."""
109
+
110
+ def __init__(self, model_path: str, device: str):
111
+ self.model = AutoModel(
112
+ model=model_path,
113
+ disable_update=True,
114
+ device=device,
115
+ )
116
+
117
+ def process(self, wav_fn):
118
+ out = self.model.generate(wav_fn, output_timestamp=True)[0]
119
+ raw_words = out["text"].replace("@", "").split(" ")
120
+ raw_timestamps = [[t[0] / 1000, t[1] / 1000] for t in out["timestamp"]]
121
+ words, word_durs = _build_words_with_gaps(raw_words, raw_timestamps, wav_fn)
122
+
123
+ f0_path = os.path.splitext(wav_fn)[0] + "_f0.npy"
124
+ if os.path.exists(f0_path):
125
+ words, word_durs = _word_dur_post_process(
126
+ words, word_durs, np.load(f0_path)
127
+ )
128
+
129
+ return words, word_durs
130
+
131
+
132
+ class _ASREnModel:
133
+ """English ASR wrapper for NeMo Parakeet-TDT."""
134
+
135
+ def __init__(self, model_path: str, device: str):
136
+ try:
137
+ import nemo.collections.asr as nemo_asr # type: ignore
138
+ except Exception as e: # pragma: no cover
139
+ raise ImportError(
140
+ "NeMo (nemo_toolkit) is required for ASR English but is not available in this Python env. "
141
+ "Install it in the active environment, then retry."
142
+ ) from e
143
+
144
+ self.model = nemo_asr.models.ASRModel.restore_from(
145
+ restore_path=model_path,
146
+ map_location=device,
147
+ )
148
+ self.model.eval()
149
+
150
+ @staticmethod
151
+ def _clean_word(word: str) -> str:
152
+ return re.sub(r"[\?\.,:]", "", word).strip()
153
+
154
+ @staticmethod
155
+ def _extract_word_segments(output: Any) -> List[Dict[str, Any]]:
156
+ ts = getattr(output, "timestamp", None)
157
+ if not ts or not isinstance(ts, dict):
158
+ return []
159
+ word_ts = ts.get("word")
160
+ return word_ts if isinstance(word_ts, list) else []
161
+
162
+ def process(self, wav_fn: str) -> Tuple[List[str], List[float]]:
163
+ outputs = self.model.transcribe(
164
+ [wav_fn],
165
+ timestamps=True,
166
+ batch_size=1,
167
+ num_workers=0,
168
+ )
169
+ output = outputs[0] if outputs else None
170
+
171
+ raw_words: List[str] = []
172
+ raw_timestamps: List[List[float]] = []
173
+ if output is not None:
174
+ for w in self._extract_word_segments(output):
175
+ s, e = float(w.get("start", 0.0)), float(w.get("end", 0.0))
176
+ word = self._clean_word(str(w.get("word", "")))
177
+ if word:
178
+ raw_words.append(word)
179
+ raw_timestamps.append([s, e])
180
+
181
+ words, durs = _build_words_with_gaps(raw_words, raw_timestamps, wav_fn)
182
+
183
+ f0_path = os.path.splitext(wav_fn)[0] + "_f0.npy"
184
+ if os.path.exists(f0_path):
185
+ words, durs = _word_dur_post_process(
186
+ words, durs, np.load(f0_path)
187
+ )
188
+
189
+ return words, durs
190
+
191
+
192
+ class LyricTranscriber:
193
+ """Transcribe lyrics from singing voice segment
194
+ """
195
+
196
+ def __init__(
197
+ self,
198
+ zh_model_path: str,
199
+ en_model_path: str,
200
+ device: str = "cuda",
201
+ *,
202
+ verbose: bool = True,
203
+ ):
204
+ """Initialize lyric transcriber.
205
+
206
+ Args:
207
+ zh_model_path (str): Path to the Chinese model file.
208
+ en_model_path (str): Path to the English model file.
209
+ device (str): Device to use for tensor operations.
210
+ verbose (bool): Whether to print verbose logs.
211
+ """
212
+ self.verbose = verbose
213
+ self.device = device
214
+ self.zh_model_path = zh_model_path
215
+ self.en_model_path = en_model_path
216
+
217
+ if self.verbose:
218
+ print(
219
+ "[lyric transcription] init: start:",
220
+ f"device={device}",
221
+ f"model_path={zh_model_path}",
222
+ )
223
+
224
+ # Always initialize Chinese ASR.
225
+ self.zh_model = _ASRZhModel(device=device, model_path=zh_model_path)
226
+
227
+ # English ASR will be lazily initialized on first English request to avoid long waiting cost when importing NeMo
228
+ self.en_model = None
229
+
230
+ if self.verbose:
231
+ print("[lyric transcription] init: success")
232
+
233
+ def process(self, wav_fn, language: str | None = "Mandarin", *, verbose: bool | None = None):
234
+ """ Lyric transcriber process
235
+
236
+ Args:
237
+ wav_fn (str): Path to the audio file.
238
+ language (str | None): Language of the audio. Defaults to "Mandarin". Supports "Mandarin", "Cantonese" and "English".
239
+ verbose (bool | None): Whether to print verbose logs. Defaults to None.
240
+ """
241
+ v = self.verbose if verbose is None else verbose
242
+ if language not in {"Mandarin", "Cantonese", "English"}:
243
+ raise ValueError(f"Unsupported language: {language}, should be one of ['Mandarin', 'Cantonese', 'English']")
244
+ if v:
245
+ print(f"[lyric transcription] process: start: wav_fn={wav_fn} language={language}")
246
+ t0 = time.time()
247
+
248
+ lang = (language or "auto").lower()
249
+ if lang in {"english"}:
250
+ if self.en_model is None:
251
+ # Lazy-load NeMo model only when English is actually used.
252
+ if v:
253
+ print("[lyric transcription] init English ASR start, please make sure NeMo is installed and wait for a while")
254
+ self.en_model = _ASREnModel(model_path=self.en_model_path, device=self.device)
255
+ if v:
256
+ print("[lyric transcription] init English ASR success")
257
+ out = self.en_model.process(wav_fn)
258
+ else:
259
+ out = self.zh_model.process(wav_fn)
260
+
261
+ if v:
262
+ words, durs = out
263
+ n_words = len(words) if isinstance(words, list) else 0
264
+ dur_sum = float(sum(durs)) if isinstance(durs, list) else 0.0
265
+ dt = time.time() - t0
266
+ print(
267
+ "[lyric transcription] process: done:",
268
+ f"n_words={n_words}",
269
+ f"dur_sum={dur_sum:.3f}s",
270
+ f"time={dt:.3f}s",
271
+ )
272
+
273
+ return out
274
+
275
+
276
+ if __name__ == "__main__":
277
+ m = LyricTranscriber(
278
+ zh_model_path="pretrained_models/SoulX-Singer-Preprocess/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
279
+ en_model_path="pretrained_models/SoulX-Singer-Preprocess/parakeet-tdt-0.6b-v2/parakeet-tdt-0.6b-v2.nemo",
280
+ device="cuda"
281
+ )
282
+ print(m.process("example/audio/zh_prompt.mp3", language="Mandarin"))
283
+ print(m.process("example/audio/en_prompt.mp3", language="English"))
preprocess/tools/midi_editor/README.md ADDED
@@ -0,0 +1,170 @@
1
+ # 🎹 MIDI Editor - Web-based Singing MIDI Editor
2
+
3
+ [English](README.md) | [简体中文](README_CN.md)
4
+
5
+ A full-featured web MIDI editor for singing voice preprocessing. It supports real-time drag editing of MIDI notes, lyric editing, audio waveform alignment, and importing/exporting MIDI files with lyrics.
6
+
7
+ ![MIDI Editor](https://img.shields.io/badge/React-19.2-blue) ![TypeScript](https://img.shields.io/badge/TypeScript-5.9-blue) ![Vite](https://img.shields.io/badge/Vite-7.2-purple)
8
+
9
+ ## ✨ Features
10
+
11
+ ### 🎼 Piano Roll Editing
12
+
13
+ - **Visual note editing**: Full range from C1 to C8 with intuitive piano key layout
14
+ - **Drag operations**:
15
+ - Move notes: drag note blocks to adjust position and pitch
16
+ - Resize start: drag the left edge to adjust start time
17
+ - Resize end: drag the right edge to adjust end time
18
+ - **Quick pitch adjust**:
19
+ - Command/Ctrl + Up/Down to nudge selected note pitch
20
+ - Use the Transpose control in the toolbar to shift all notes at once
21
+ - **Double-click to add**: Add new notes quickly in empty areas
22
+ - **Piano key preview**: Click a key on the left to audition the pitch
23
+
24
+ ### 🔍 Zoom & Navigation
25
+
26
+ - **Horizontal zoom**
27
+ - **Vertical zoom**
28
+ - **Dynamic snapping**: finer snap granularity at higher zoom (min 0.01s)
29
+ - **Auto scroll**: keep the playhead visible during playback
30
+
31
+ ### 📝 Lyric Editing
32
+
33
+ - **Inline editing**: edit lyrics for each note in the side list
34
+ - **Batch fill**: enter lyrics and auto-fill notes in order
35
+ - **Fill from selection**: start batch fill from the currently selected note
36
+ - **Precise fields**: edit PITCH, START, and END directly
37
+ - **Confirm edits**: press Enter or click ✓ to confirm, avoiding accidental changes
38
+
39
+ ### 🎵 Audio Alignment
40
+
41
+ - **Waveform display**: import audio to display waveform, synced with the MIDI timeline
42
+ - **Formats**: MP3, WAV, OGG, FLAC, M4A, AAC
43
+ - **Sync playback**: play audio and MIDI together with independent volume control
44
+ - **Click to seek**: click waveform or timeline to seek
45
+
46
+ ### ⚠️ Overlap Detection
47
+
48
+ - **Visual highlight**: overlapping notes blink in red
49
+ - **One-click fix**: remove all overlaps automatically
50
+
51
+ ### 📥 Import & Export
52
+
53
+ - **MIDI import**: parse standard MIDI files with automatic lyric metadata extraction
54
+ - **MIDI export**: export MIDI files with lyric information
55
+
56
+ ### 🎨 UI & UX
57
+
58
+ - **Theme toggle**: light and dark modes
59
+ - **Responsive layout**: adapts to window size
60
+ - **SVG grid**: cross-browser grid rendering
61
+ - **Status feedback**: real-time state and error tips
62
+
63
+ ## 🚀 Quick Start
64
+
65
+ ### Requirements
66
+
67
+ - Node.js 18+
68
+ - npm or yarn
69
+
70
+ ### Install
71
+
72
+ ```bash
73
+ # Install dependencies
74
+ npm install
75
+
76
+ # Start dev server
77
+ npm run dev
78
+
79
+ # Expose to LAN
80
+ npm run dev -- --host 0.0.0.0
81
+ ```
82
+
83
+ ### Build
84
+
85
+ ```bash
86
+ # Build for production
87
+ npm run build
88
+
89
+ # Preview build
90
+ npm run preview
91
+ ```
92
+
93
+ ## 📖 Usage
94
+
95
+ ### Basic Workflow
96
+
97
+ 1. **Import MIDI**: click Import MIDI and select a .mid file
98
+ 2. **Edit notes**: drag notes in the piano roll to adjust time and pitch
99
+ 3. **Add lyrics**: edit lyrics in the right-side list, or use batch fill
100
+ 4. **Align audio** (optional): import reference audio for side-by-side editing
101
+ 5. **Export**: click Export MIDI to save
102
+
103
+ ### Shortcuts
104
+
105
+ | Action | Description |
106
+ |------|------|
107
+ | Double-click piano roll | Add a new note |
108
+ | Double-click note | Edit lyric |
109
+ | Drag note | Move note and pitch |
110
+ | Drag note edges | Resize note |
111
+ | Backspace / Delete | Delete selected note |
112
+ | Enter | Confirm value edits |
113
+ | Escape | Cancel value edits |
114
+ | Ctrl(Command) + Wheel | Horizontal zoom |
115
+ | Ctrl(Command) + Shift(Option) + Wheel | Vertical zoom |
116
+
117
+ ### Playback Controls
118
+
119
+ | Button | Description |
120
+ |------|------|
121
+ | ⏮ | Go to start |
122
+ | ⏪ | Back 2 seconds |
123
+ | ▶ / ⏸ | Play / Pause |
124
+ | ⏩ | Forward 2 seconds |
125
+ | ⏭ | Go to end |
126
+ | Selection | Play selected region |
127
+
128
+ ## 🛠 Tech Stack
129
+
130
+ - **Frontend**: React 19 + TypeScript
131
+ - **Build**: Vite 7
132
+ - **State**: Zustand
133
+ - **Audio**: Tone.js
134
+ - **Waveform**: WaveSurfer.js
135
+ - **MIDI**: @tonejs/midi
136
+ - **Styles**: CSS with custom variables
137
+
138
+ ## 📁 Project Structure
139
+
140
+ ```
141
+ .
142
+ ├── eslint.config.js
143
+ ├── index.html
144
+ ├── package.json
145
+ ├── postcss.config.js
146
+ ├── README.md
147
+ ├── README_CN.md
148
+ ├── tailwind.config.js
149
+ ├── tsconfig.app.json
150
+ ├── tsconfig.json
151
+ ├── tsconfig.node.json
152
+ ├── vite.config.ts
153
+ ├── public/
154
+ └── src/
155
+ ├── App.css # Main styles (theme variables, layout, components)
156
+ ├── App.tsx # Main app component (transport, import/export, transpose)
157
+ ├── constants.ts # Constants (grid width, row height, pitch range)
158
+ ├── i18n.ts # Internationalization (zh/en translations, smart lyric tokenizer)
159
+ ├── index.css # Global styles (Tailwind, root font, theme gradients)
160
+ ├── main.tsx # React entry point
161
+ ├── types.ts # Type definitions (NoteEvent, TimeSignature, etc.)
162
+ ├── components/
163
+ │ ├── AudioTrack.tsx # Audio waveform display component
164
+ │ ├── LyricTable.tsx # Lyric editing table component
165
+ │ └── PianoRoll.tsx # Piano roll editor component
166
+ ├── lib/
167
+ │ └── midi.ts # MIDI import/export utilities (UTF-8 lyric encoding)
168
+ └── store/
169
+ └── useMidiStore.ts # Zustand state management
170
+ ```
preprocess/tools/midi_editor/README_CN.md ADDED
@@ -0,0 +1,170 @@
1
+ # 🎹 MIDI Editor - 网页端歌声 MIDI 编辑器
2
+
3
+ [English](README.md) | [简体中文](README_CN.md)
4
+
5
+ 一个功能完整的网页端歌声 MIDI 文件编辑器。支持实时拖拽调整 MIDI 音符、歌词编辑、音频波形对齐,以及导入导出含歌词的 MIDI 文件。
6
+
7
+ ![MIDI Editor](https://img.shields.io/badge/React-19.2-blue) ![TypeScript](https://img.shields.io/badge/TypeScript-5.9-blue) ![Vite](https://img.shields.io/badge/Vite-7.2-purple)
8
+
9
+ ## ✨ 功能特性
10
+
11
+ ### 🎼 钢琴卷帘编辑
12
+
13
+ - **可视化音符编辑**:支持 C1-C8 全音域显示,直观的钢琴键布局
14
+ - **拖拽操作**:
15
+ - 移动音符:拖拽音符块调整位置和音高
16
+ - 调整音头:拖拽音符左边缘调整开始时间
17
+ - 调整音尾:拖拽音符右边缘调整结束时间
18
+ - **快捷音高调整**:
19
+ - Command/Ctrl + 上/下键调整选中音符的音高
20
+ - 通过功能区的移调功能来整体移动音高
21
+ - **双击添加**:在钢琴卷帘空白处双击快速添加新音符
22
+ - **钢琴键试听**:点击左侧钢琴键可试听对应音高
23
+
24
+ ### 🔍 缩放与导航
25
+
26
+ - **水平缩放**
27
+ - **垂直缩放**
28
+ - **动态精度**:缩放越大,音符调整的 snap 粒度越精细(最小 0.01 秒)
29
+ - **自动滚动**:播放时播放头自动保持可见
30
+
31
+ ### 📝 歌词编辑
32
+
33
+ - **实时编辑**:右侧列表直接编辑每个音符的歌词
34
+ - **批量填充**:输入一段歌词,按字顺序自动填充到音符
35
+ - **从选中开始**:批量填充可从当前选中的音符开始
36
+ - **精确调整**:可直接编辑 PITCH(音高)、START(开始时间)、END(结束时间)
37
+ - **确认机制**:修改数值后按 Enter 或点击 ✓ 确认,避免误操作
38
+
39
+ ### 🎵 音频对齐
40
+
41
+ - **波形显示**:导入音频后显示波形,与 MIDI 同步滚动
42
+ - **格式支持**:MP3、WAV、OGG、FLAC、M4A、AAC
43
+ - **同步播放**:音频与 MIDI 同步播放,可分别调整音量大小
44
+ - **点击定位**:点击波形或时间尺可快速定位播放位置
45
+
46
+ ### ⚠️ 重叠检测
47
+
48
+ - **可视化标注**:时间重叠的音符显示为红色并闪烁
49
+ - **一键修复**:点击消除重叠按钮自动修复所有重叠
50
+
51
+ ### 📥 导入导出
52
+
53
+ - **MIDI 导入**:支持标准 MIDI 文件,自动解析歌词元数据
54
+ - **MIDI 导出**:导出包含歌词信息的 MIDI 文件
55
+
56
+ ### 🎨 界面特性
57
+
58
+ - **主题切换**:支持浅色/深色主题
59
+ - **响应式布局**:自适应窗口大小
60
+ - **SVG 网格**:跨浏览器兼容的网格渲染
61
+ - **状态提示**:实时显示操作状态和错误信息
62
+
63
+ ## 🚀 快速开始
64
+
65
+ ### 环境要求
66
+
67
+ - Node.js 18+
68
+ - npm 或 yarn
69
+
70
+ ### 安装
71
+
72
+ ```bash
73
+ # 安装依赖
74
+ npm install
75
+
76
+ # 启动开发服务器
77
+ npm run dev
78
+
79
+ # 在局域网启动
80
+ npm run dev -- --host 0.0.0.0
81
+ ```
82
+
83
+ ### 构建
84
+
85
+ ```bash
86
+ # 构建生产版本
87
+ npm run build
88
+
89
+ # 预览构建结果
90
+ npm run preview
91
+ ```
92
+
93
+ ## 📖 使用指南
94
+
95
+ ### 基本工作流
96
+
97
+ 1. **导入 MIDI**:点击导入 MIDI 按钮选择 .mid 文件
98
+ 2. **编辑音符**:在钢琴卷帘中拖拽调整音符位置和时长
99
+ 3. **添加歌词**:在右侧列表中输入句级别的歌词或单字编辑
100
+ 4. **对齐音频**(可选):导入参考音频进行对照编辑
101
+ 5. **导出文件**:点击导出含歌词 MIDI 保存文件
102
+
103
+ ### 快捷操作
104
+
105
+ | 操作 | 说明 |
106
+ |------|------|
107
+ | 双击钢琴卷帘 | 添加新音符 |
108
+ | 双击音符 | 修改歌词 |
109
+ | 拖拽音符 | 移动音符位置/音高 |
110
+ | 拖拽音符边缘 | 调整音符时长 |
111
+ | Backspace / Delete | 删除选中音符 |
112
+ | Enter | 确认数值修改 |
113
+ | Escape | 取消数值修改 |
114
+ | Ctrl(Command) + 滚轮 | 水平缩放 |
115
+ | Ctrl(Command) + Shift(Option) + 滚轮 | 垂直缩放 |
116
+
117
+ ### 播放控制
118
+
119
+ | 按钮 | 功能 |
120
+ |------|------|
121
+ | ⏮ | 回到开头 |
122
+ | ⏪ | 后退 2 秒 |
123
+ | ▶ / ⏸ | 播放 / 暂停 |
124
+ | ⏩ | 前进 2 秒 |
125
+ | ⏭ | 跳到结尾 |
126
+ | 选定区域 | 播放选定区域 |
127
+
128
+ ## 🛠 技术栈
129
+
130
+ - **前端框架**:React 19 + TypeScript
131
+ - **构建工具**:Vite 7
132
+ - **状态管理**:Zustand
133
+ - **音频引擎**:Tone.js
134
+ - **波形显示**:WaveSurfer.js
135
+ - **MIDI 解析**:@tonejs/midi
136
+ - **样式**:CSS(自定义变量主题)
137
+
138
+ ## 📁 项目结构
139
+
140
+ ```
141
+ .
142
+ ├── eslint.config.js
143
+ ├── index.html
144
+ ├── package.json
145
+ ├── postcss.config.js
146
+ ├── README.md
147
+ ├── README_CN.md
148
+ ├── tailwind.config.js
149
+ ├── tsconfig.app.json
150
+ ├── tsconfig.json
151
+ ├── tsconfig.node.json
152
+ ├── vite.config.ts
153
+ ├── public/
154
+ └── src/
155
+ ├── App.css # 主样式(含主题变量、布局、组件样式)
156
+ ├── App.tsx # 主应用组件(走带、导入导出、移调等)
157
+ ├── constants.ts # 常量定义(网格宽度、行高、音域范围)
158
+ ├── i18n.ts # 国际化(中英文翻译、歌词智能分词器)
159
+ ├── index.css # 全局样式(Tailwind、根字体、主题渐变)
160
+ ├── main.tsx # React 入口
161
+ ├── types.ts # 类型定义(NoteEvent、TimeSignature 等)
162
+ ├── components/
163
+ │ ├── AudioTrack.tsx # 音频波形显示组件
164
+ │ ├── LyricTable.tsx # 歌词编辑表格组件
165
+ │ └── PianoRoll.tsx # 钢琴卷帘编辑器组件
166
+ ├── lib/
167
+ │ └── midi.ts # MIDI 导入导出工具(含 UTF-8 歌词编解码)
168
+ └── store/
169
+ └── useMidiStore.ts # Zustand 状态管理
170
+ ```
preprocess/tools/midi_editor/eslint.config.js ADDED
@@ -0,0 +1,23 @@
1
+ import js from '@eslint/js'
2
+ import globals from 'globals'
3
+ import reactHooks from 'eslint-plugin-react-hooks'
4
+ import reactRefresh from 'eslint-plugin-react-refresh'
5
+ import tseslint from 'typescript-eslint'
6
+ import { defineConfig, globalIgnores } from 'eslint/config'
7
+
8
+ export default defineConfig([
9
+ globalIgnores(['dist']),
10
+ {
11
+ files: ['**/*.{ts,tsx}'],
12
+ extends: [
13
+ js.configs.recommended,
14
+ tseslint.configs.recommended,
15
+ reactHooks.configs.flat.recommended,
16
+ reactRefresh.configs.vite,
17
+ ],
18
+ languageOptions: {
19
+ ecmaVersion: 2020,
20
+ globals: globals.browser,
21
+ },
22
+ },
23
+ ])
preprocess/tools/midi_editor/index.html ADDED
@@ -0,0 +1,13 @@
1
+ <!doctype html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8" />
5
+ <link rel="icon" type="image/svg+xml" href="/vite.svg" />
6
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
7
+ <title>SoulX-Singer MIDI Editor</title>
8
+ </head>
9
+ <body>
10
+ <div id="root"></div>
11
+ <script type="module" src="/src/main.tsx"></script>
12
+ </body>
13
+ </html>
preprocess/tools/midi_editor/package-lock.json ADDED
The diff for this file is too large to render. See raw diff
 
preprocess/tools/midi_editor/package.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "name": "midi-editor",
3
+ "private": true,
4
+ "version": "0.0.0",
5
+ "type": "module",
6
+ "scripts": {
7
+ "dev": "vite",
8
+ "build": "tsc -b && vite build",
9
+ "lint": "eslint .",
10
+ "preview": "vite preview"
11
+ },
12
+ "dependencies": {
13
+ "@tonejs/midi": "^2.0.28",
14
+ "class-variance-authority": "^0.7.1",
15
+ "nanoid": "^5.1.6",
16
+ "react": "^19.2.0",
17
+ "react-dom": "^19.2.0",
18
+ "tone": "^15.1.22",
19
+ "wavesurfer.js": "^7.12.1",
20
+ "zustand": "^5.0.10"
21
+ },
22
+ "devDependencies": {
23
+ "@eslint/js": "^9.39.1",
24
+ "@types/node": "^24.10.1",
25
+ "@types/react": "^19.2.5",
26
+ "@types/react-dom": "^19.2.3",
27
+ "@vitejs/plugin-react": "^5.1.1",
28
+ "autoprefixer": "^10.4.20",
29
+ "eslint": "^9.39.1",
30
+ "eslint-plugin-react-hooks": "^7.0.1",
31
+ "eslint-plugin-react-refresh": "^0.4.24",
32
+ "globals": "^16.5.0",
33
+ "postcss": "^8.4.47",
34
+ "tailwindcss": "^3.4.15",
35
+ "typescript": "~5.9.3",
36
+ "typescript-eslint": "^8.46.4",
37
+ "vite": "^7.2.4"
38
+ }
39
+ }
preprocess/tools/midi_editor/postcss.config.js ADDED
@@ -0,0 +1,6 @@
1
+ export default {
2
+ plugins: {
3
+ tailwindcss: {},
4
+ autoprefixer: {},
5
+ },
6
+ }
preprocess/tools/midi_editor/public/vite.svg ADDED
preprocess/tools/midi_editor/src/App.css ADDED
@@ -0,0 +1,834 @@
1
+ .app-shell {
2
+ padding: 24px;
3
+ color: var(--text-primary);
4
+ width: 100%;
5
+ max-width: 100%;
6
+ margin: 0;
7
+ height: 100vh;
8
+ max-height: 100vh;
9
+ display: flex;
10
+ flex-direction: column;
11
+ overflow: hidden;
12
+ box-sizing: border-box;
13
+ }
14
+
15
+ .topbar {
16
+ display: flex;
17
+ align-items: center;
18
+ justify-content: space-between;
19
+ gap: 24px;
20
+ background: var(--panel-strong);
21
+ border: 1px solid var(--border-subtle);
22
+ border-radius: 16px;
23
+ padding: 20px 24px;
24
+ box-shadow: var(--shadow-panel);
25
+ }
26
+
27
+ .topbar h1 {
28
+ margin: 4px 0 0 0;
29
+ font-size: 26px;
30
+ letter-spacing: -0.5px;
31
+ }
32
+
33
+ .eyebrow {
34
+ margin: 0;
35
+ text-transform: uppercase;
36
+ font-size: 12px;
37
+ letter-spacing: 2px;
38
+ color: var(--text-muted);
39
+ }
40
+
41
+ .muted {
42
+ margin: 6px 0 0 0;
43
+ color: var(--text-muted);
44
+ }
45
+
46
+ .actions {
47
+ display: flex;
48
+ gap: 10px;
49
+ align-items: center;
50
+ }
51
+
52
+ .transpose-group {
53
+ display: flex;
54
+ align-items: center;
55
+ }
56
+
57
+ .transpose-select {
58
+ padding: 10px 10px;
59
+ border-radius: 12px;
60
+ border: 1px solid var(--border-soft);
61
+ background: var(--button-soft-bg);
62
+ color: var(--button-soft-text);
63
+ font-weight: 600;
64
+ font-size: 14px;
65
+ cursor: pointer;
66
+ outline: none;
67
+ appearance: none;
68
+ -webkit-appearance: none;
69
+ background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='10' height='6'%3E%3Cpath d='M0 0l5 6 5-6z' fill='%23888'/%3E%3C/svg%3E");
70
+ background-repeat: no-repeat;
71
+ background-position: right 10px center;
72
+ padding-right: 26px;
73
+ transition: transform 140ms ease, box-shadow 140ms ease, background 140ms ease;
74
+ }
75
+
76
+ .transpose-select:hover {
77
+ transform: translateY(-1px);
78
+ }
79
+
80
+ .transpose-select:focus {
81
+ border-color: var(--accent);
82
+ }
83
+
84
+ .icon-toggle {
85
+ width: 40px;
86
+ height: 40px;
87
+ border-radius: 999px;
88
+ border: 1px solid var(--border-soft);
89
+ background: var(--button-ghost-bg);
90
+ color: var(--button-ghost-text);
91
+ display: inline-flex;
92
+ align-items: center;
93
+ justify-content: center;
94
+ font-size: 18px;
95
+ cursor: pointer;
96
+ }
97
+
98
+ .icon-toggle:hover {
99
+ transform: translateY(-1px);
100
+ }
101
+
102
+ .lang-label {
103
+ font-size: 14px;
104
+ font-weight: 700;
105
+ line-height: 1;
106
+ }
107
+
108
+ .audio-bar {
109
+ margin-top: 14px;
110
+ padding: 12px 16px;
111
+ border-radius: 14px;
112
+ background: var(--panel-strong);
113
+ border: 1px solid var(--border-subtle);
114
+ display: flex;
115
+ align-items: center;
116
+ justify-content: space-between;
117
+ gap: 16px;
118
+ }
119
+
120
+ .audio-left {
121
+ display: flex;
122
+ align-items: center;
123
+ gap: 12px;
124
+ }
125
+
126
+ .audio-hint {
127
+ color: var(--text-muted);
128
+ font-size: 12px;
129
+ }
130
+
131
+ .audio-right {
132
+ display: flex;
133
+ align-items: center;
134
+ gap: 20px;
135
+ }
136
+
137
+ .volume-control {
138
+ display: flex;
139
+ align-items: center;
140
+ gap: 8px;
141
+ }
142
+
143
+ .volume-label {
144
+ font-size: 12px;
145
+ color: var(--text-muted);
146
+ min-width: 32px;
147
+ }
148
+
149
+ .volume-slider {
150
+ width: 80px;
151
+ height: 4px;
152
+ cursor: pointer;
153
+ accent-color: var(--accent);
154
+ }
155
+
156
+ .volume-value {
157
+ font-size: 11px;
158
+ color: var(--text-muted);
159
+ min-width: 36px;
160
+ text-align: right;
161
+ }
162
+
163
+ .toggle {
164
+ display: inline-flex;
165
+ align-items: center;
166
+ gap: 8px;
167
+ font-size: 13px;
168
+ color: var(--text-primary);
169
+ }
170
+
171
+ .panel {
172
+ margin-top: 18px;
173
+ background: var(--panel);
174
+ border: 1px solid var(--border-subtle);
175
+ border-radius: 16px;
176
+ padding: 18px;
177
+ box-shadow: var(--shadow-panel);
178
+ display: flex;
179
+ flex-direction: column;
180
+ flex: 1;
181
+ min-height: 0;
182
+ overflow: hidden;
183
+ }
184
+
185
+ .panel-split {
186
+ display: grid;
187
+ grid-template-columns: minmax(0, 1fr) 360px;
188
+ gap: 16px;
189
+ align-items: stretch;
190
+ flex: 1;
191
+ min-height: 0;
192
+ max-height: 100%;
193
+ overflow: hidden;
194
+ }
195
+
196
+ .panel-main {
197
+ min-width: 0;
198
+ display: flex;
199
+ flex-direction: column;
200
+ min-height: 0;
201
+ max-height: 100%;
202
+ overflow: hidden;
203
+ }
204
+
205
+ .panel-side {
206
+ display: flex;
207
+ flex-direction: column;
208
+ gap: 16px;
209
+ width: 360px;
210
+ max-width: 360px;
211
+ /* Use absolute positioning to enforce height */
212
+ position: relative;
213
+ overflow: hidden;
214
+ }
215
+
216
+ .controls {
217
+ display: grid;
218
+ grid-template-columns: repeat(auto-fit, minmax(180px, 1fr));
219
+ gap: 14px;
220
+ align-items: center;
221
+ background: var(--panel-soft);
222
+ padding: 12px 14px;
223
+ border-radius: 12px;
224
+ border: 1px solid var(--border-soft);
225
+ flex-shrink: 0;
226
+ }
227
+
228
+ .controls label {
229
+ display: block;
230
+ font-size: 12px;
231
+ text-transform: uppercase;
232
+ letter-spacing: 1px;
233
+ color: var(--text-muted);
234
+ margin-bottom: 4px;
235
+ }
236
+
237
+ .controls input[type='number'] {
238
+ width: 100%;
239
+ padding: 10px 12px;
240
+ border-radius: 10px;
241
+ border: 1px solid var(--border-soft);
242
+ background: var(--input-bg);
243
+ color: var(--text-primary);
244
+ }
245
+
246
+ .timesig {
247
+ display: flex;
248
+ align-items: center;
249
+ gap: 6px;
250
+ }
251
+
252
+ .timesig span {
253
+ font-weight: 700;
254
+ color: var(--text-muted);
255
+ }
256
+
257
+ .transport {
258
+ display: flex;
259
+ gap: 6px;
260
+ align-items: center;
261
+ grid-column: 1 / -1;
262
+ }
263
+
264
+ .transport button {
265
+ padding: 6px 10px !important;
266
+ font-size: 13px !important;
267
+ min-width: 0;
268
+ }
269
+
270
+ .status {
271
+ grid-column: 1 / -1;
272
+ color: var(--text-muted);
273
+ font-size: 13px;
274
+ }
275
+
276
+ .transport-divider {
277
+ width: 1px;
278
+ height: 20px;
279
+ background: var(--border-soft);
280
+ margin: 0 2px;
281
+ flex-shrink: 0;
282
+ }
283
+
284
+ .selection-btn {
285
+ font-size: 12px !important;
286
+ padding: 6px 10px !important;
287
+ white-space: nowrap;
288
+ }
289
+
290
+ .selection-btn.active {
291
+ background: var(--accent) !important;
292
+ color: white !important;
293
+ }
294
+
295
+ .button,
296
+ .actions button,
297
+ .transport button,
298
+ .ghost,
299
+ .primary,
300
+ .json-btn,
301
+ .soft {
302
+ cursor: pointer;
303
+ border-radius: 12px;
304
+ border: 1px solid transparent;
305
+ padding: 10px 14px;
306
+ font-weight: 600;
307
+ transition: transform 140ms ease, box-shadow 140ms ease, background 140ms ease, border 140ms ease;
308
+ color: #0f1528;
309
+ }
310
+
311
+ .ghost {
312
+ background: var(--button-ghost-bg);
313
+ color: var(--button-ghost-text);
314
+ border-color: var(--border-soft);
315
+ }
316
+
317
+ .primary {
318
+ background: linear-gradient(135deg, var(--accent), var(--accent-strong));
319
+ color: var(--button-primary-text);
320
+ box-shadow: 0 8px 26px rgba(72, 228, 194, 0.2);
321
+ }
322
+
323
+ .json-btn {
324
+ background: linear-gradient(135deg, #f59e0b, #d97706);
325
+ color: #fff;
326
+ box-shadow: 0 8px 26px rgba(245, 158, 11, 0.2);
327
+ }
328
+
329
+ .soft {
330
+ background: var(--button-soft-bg);
331
+ color: var(--button-soft-text);
332
+ border: 1px solid var(--border-soft);
333
+ }
334
+
335
+ .ghost:disabled,
336
+ .primary:disabled,
337
+ .json-btn:disabled,
338
+ .soft:disabled {
339
+ opacity: 0.6;
340
+ cursor: not-allowed;
341
+ }
342
+
343
+ .ghost:hover,
344
+ .primary:hover,
345
+ .json-btn:hover,
346
+ .soft:hover {
347
+ transform: translateY(-1px);
348
+ }
349
+
350
+ .piano-shell {
351
+ border-radius: 12px;
352
+ background: var(--panel-strong);
353
+ border: 1px solid var(--border-subtle);
354
+ overflow: hidden;
355
+ flex: 1;
356
+ min-height: 0;
357
+ max-height: 100%;
358
+ display: flex;
359
+ flex-direction: column;
360
+ }
361
+
362
+ .ruler {
363
+ position: relative;
364
+ height: 32px;
365
+ background: var(--panel-soft);
366
+ border-bottom: 1px solid var(--border-soft);
367
+ min-width: 100%;
368
+ }
369
+
370
+ .ruler-shell {
371
+ display: flex;
372
+ }
373
+
374
+ .ruler-spacer {
375
+ background: var(--panel-soft);
376
+ border-bottom: 1px solid var(--border-soft);
377
+ height: 32px;
378
+ }
379
+
380
+ .ruler-scroll {
381
+ overflow: hidden;
382
+ flex: 1;
383
+ height: 32px;
384
+ cursor: pointer;
385
+ }
386
+
387
+ .measure-mark {
388
+ position: absolute;
389
+ top: 0;
390
+ height: 100%;
391
+ display: flex;
392
+ flex-direction: column;
393
+ align-items: flex-start;
394
+ font-size: 10px;
395
+ color: var(--text-muted);
396
+ padding-left: 4px;
397
+ border-left: 1px solid var(--border-soft);
398
+ }
399
+
400
+ .measure-mark span {
401
+ margin-top: 2px;
402
+ }
403
+
404
+ .ruler-playhead {
405
+ position: absolute;
406
+ top: 0;
407
+ width: 2px;
408
+ height: 100%;
409
+ background: #ff7043;
410
+ pointer-events: none;
411
+ z-index: 10;
412
+ }
413
+
414
+ .ruler-scroll.selecting {
415
+ cursor: crosshair;
416
+ }
417
+
418
+ .selection-range {
419
+ position: absolute;
420
+ top: 0;
421
+ height: 100%;
422
+ background: rgba(66, 165, 245, 0.35);
423
+ border-left: 2px solid #42a5f5;
424
+ border-right: 2px solid #42a5f5;
425
+ pointer-events: none;
426
+ z-index: 5;
427
+ }
428
+
429
+ .grid-selection-range {
430
+ position: absolute;
431
+ top: 0;
432
+ background: rgba(66, 165, 245, 0.15);
433
+ border-left: 2px dashed #42a5f5;
434
+ border-right: 2px dashed #42a5f5;
435
+ pointer-events: none;
436
+ z-index: 1;
437
+ }
438
+
439
+ .roll-body {
440
+ display: flex;
441
+ flex: 1;
442
+ min-height: 0;
443
+ overflow: hidden;
444
+ }
445
+
446
+ .pitch-rail {
447
+ background: var(--panel-strong);
448
+ border-right: 1px solid var(--border-subtle);
449
+ color: var(--text-primary);
450
+ font-size: 12px;
451
+ text-align: right;
452
+ overflow: hidden;
453
+ flex-shrink: 0;
454
+ height: 100%;
455
+ }
456
+
457
+ .pitch-cell {
458
+ border-bottom: 1px solid var(--border-soft);
459
+ display: flex;
460
+ align-items: center;
461
+ justify-content: flex-end;
462
+ padding: 0 4px;
463
+ font-variant-numeric: tabular-nums;
464
+ box-sizing: border-box;
465
+ }
466
+
467
+ .pitch-white {
468
+ background: rgba(255, 255, 255, 0.06);
469
+ color: var(--text-primary);
470
+ }
471
+
472
+ .pitch-black {
473
+ background: rgba(0, 0, 0, 0.35);
474
+ color: rgba(233, 238, 247, 0.9);
475
+ }
476
+
477
+ .pitch-c {
478
+ background: rgba(100, 150, 255, 0.15);
479
+ font-weight: 600;
480
+ }
481
+
482
+ .pitch-label {
483
+ font-size: 10px;
484
+ }
485
+
486
+ .roll-grid {
487
+ position: relative;
488
+ overflow: auto;
489
+ flex: 1;
490
+ min-height: 0;
491
+ background-color: var(--grid-bg);
492
+ }
493
+
494
+ .grid-content {
495
+ background-color: var(--grid-bg);
496
+ }
497
+
498
+ .grid-svg {
499
+ shape-rendering: crispEdges;
500
+ }
501
+
502
+ .grid-overlay {
503
+ position: relative;
504
+ }
505
+
506
+ .note-chip {
507
+ position: absolute;
508
+ background: linear-gradient(135deg, var(--accent), var(--accent-strong));
509
+ border-radius: 6px;
510
+ border: 1px solid rgba(255, 255, 255, 0.16);
511
+ box-shadow: 0 10px 22px rgba(0, 0, 0, 0.25);
512
+ display: flex;
513
+ align-items: center;
514
+ justify-content: center;
515
+ color: var(--note-text);
516
+ font-weight: 700;
517
+ user-select: none;
518
+ box-sizing: border-box;
519
+ }
520
+
521
+ .note-active {
522
+ outline: 2px solid #ff7043;
523
+ z-index: 2;
524
+ }
525
+
526
+ .note-overlap {
527
+ background: linear-gradient(135deg, #ef5350 0%, #ff7043 100%) !important;
528
+ animation: pulse-overlap 1s ease-in-out infinite;
529
+ }
530
+
531
+ /* Selected overlapping note - more visible outline */
532
+ .note-overlap.note-active {
533
+ outline: 3px solid #1e40af;
534
+ outline-offset: 1px;
535
+ box-shadow: 0 0 12px rgba(30, 64, 175, 0.8);
536
+ animation: none;
537
+ }
538
+
539
+ @keyframes pulse-overlap {
540
+ 0%, 100% { opacity: 1; }
541
+ 50% { opacity: 0.7; }
542
+ }
543
+
544
+ .playhead {
545
+ position: absolute;
546
+ top: 0;
547
+ width: 2px;
548
+ background: #ff7043;
549
+ box-shadow: 0 0 12px rgba(255, 112, 67, 0.6);
550
+ pointer-events: none;
551
+ z-index: 20;
552
+ }
553
+
554
+ .pitch-rail-inner {
555
+ will-change: transform;
556
+ }
557
+
558
+ .note-label {
559
+ width: 100%;
560
+ text-align: center;
561
+ font-size: 12px;
562
+ padding: 0 12px;
563
+ overflow: hidden;
564
+ text-overflow: ellipsis;
565
+ white-space: nowrap;
566
+ }
567
+
568
+ .note-handle {
569
+ position: absolute;
570
+ top: 0;
571
+ width: 8px;
572
+ height: 100%;
573
+ background: rgba(255, 255, 255, 0.25);
574
+ cursor: ew-resize;
575
+ }
576
+
577
+ .note-handle.start {
578
+ left: 0;
579
+ border-radius: 6px 0 0 6px;
580
+ }
581
+
582
+ .note-handle.end {
583
+ right: 0;
584
+ border-radius: 0 6px 6px 0;
585
+ }
586
+
587
+ .lyric-container {
588
+ flex: 1;
589
+ min-height: 0;
590
+ position: relative;
591
+ }
592
+
593
+ .lyric-card {
594
+ border: 1px solid rgba(255, 255, 255, 0.06);
595
+ border-radius: 12px;
596
+ background: var(--panel-soft);
597
+ overflow: hidden;
598
+ display: flex;
599
+ flex-direction: column;
600
+ /* Force fixed height with absolute positioning */
601
+ position: absolute;
602
+ top: 0;
603
+ left: 0;
604
+ right: 0;
605
+ bottom: 0;
606
+ }
607
+
608
+ .lyric-bulk {
609
+ display: flex;
610
+ gap: 8px;
611
+ padding: 10px 12px;
612
+ border-bottom: 1px solid rgba(255, 255, 255, 0.06);
613
+ align-items: center;
614
+ }
615
+
616
+ .lyric-bulk-input {
617
+ flex: 1;
618
+ padding: 8px 10px;
619
+ border-radius: 10px;
620
+ border: 1px solid var(--border-soft);
621
+ background: var(--input-bg);
622
+ color: var(--text-primary);
623
+ resize: vertical;
624
+ }
625
+
626
+ .lyric-header,
627
+ .lyric-row {
628
+ display: grid;
629
+ grid-template-columns: 1.4fr 0.5fr 0.5fr 0.5fr;
630
+ gap: 8px;
631
+ padding: 10px 12px;
632
+ align-items: center;
633
+ }
634
+
635
+ .lyric-header {
636
+ font-size: 12px;
637
+ text-transform: uppercase;
638
+ letter-spacing: 1px;
639
+ color: var(--text-muted);
640
+ border-bottom: 1px solid rgba(255, 255, 255, 0.06);
641
+ }
642
+
643
+ .lyric-list {
644
+ overflow-y: auto;
645
+ overflow-x: hidden;
646
+ flex: 1;
647
+ min-height: 0;
648
+ }
649
+
650
+ .lyric-row {
651
+ border-bottom: 1px solid rgba(255, 255, 255, 0.04);
652
+ }
653
+
654
+ .lyric-row:hover {
655
+ background: rgba(255, 255, 255, 0.03);
656
+ }
657
+
658
+ .lyric-row-active {
659
+ background: rgba(72, 228, 194, 0.08);
660
+ border-left: 3px solid #48e4c2;
661
+ }
662
+
663
+ .lyric-input {
664
+ width: 100%;
665
+ padding: 8px 10px;
666
+ border-radius: 10px;
667
+ border: 1px solid var(--border-soft);
668
+ background: var(--input-bg);
669
+ color: var(--text-primary);
670
+ }
671
+
672
+ .lyric-meta {
673
+ color: var(--text-muted);
674
+ font-variant-numeric: tabular-nums;
675
+ }
676
+
677
+ .editable-cell {
678
+ position: relative;
679
+ display: flex;
680
+ align-items: center;
681
+ gap: 2px;
682
+ }
683
+
684
+ .lyric-meta-input {
685
+ width: 100%;
686
+ padding: 2px 4px;
687
+ border: 1px solid transparent;
688
+ border-radius: 4px;
689
+ background: transparent;
690
+ color: var(--text-muted);
691
+ font-size: 12px;
692
+ font-variant-numeric: tabular-nums;
693
+ text-align: center;
694
+ outline: none;
695
+ transition: border-color 0.15s, background-color 0.15s;
696
+ }
697
+
698
+ .lyric-meta-input:hover {
699
+ background: var(--surface-elevated);
700
+ }
701
+
702
+ .lyric-meta-input:focus {
703
+ border-color: var(--accent);
704
+ background: var(--surface-elevated);
705
+ color: var(--text-primary);
706
+ }
707
+
708
+ .lyric-meta-dirty {
709
+ border-color: #f59e0b !important;
710
+ background: rgba(245, 158, 11, 0.1) !important;
711
+ }
712
+
713
+ .confirm-btn {
714
+ flex-shrink: 0;
715
+ width: 18px;
716
+ height: 18px;
717
+ padding: 0;
718
+ border: none;
719
+ border-radius: 4px;
720
+ background: #22c55e;
721
+ color: white;
722
+ font-size: 12px;
723
+ font-weight: bold;
724
+ cursor: pointer;
725
+ display: flex;
726
+ align-items: center;
727
+ justify-content: center;
728
+ transition: background 0.15s;
729
+ }
730
+
731
+ .confirm-btn:hover {
732
+ background: #16a34a;
733
+ }
734
+
735
+ /* Hide number input spinners */
736
+ .lyric-meta-input::-webkit-outer-spin-button,
737
+ .lyric-meta-input::-webkit-inner-spin-button {
738
+ -webkit-appearance: none;
739
+ margin: 0;
740
+ }
741
+
742
+ .lyric-meta-input[type=number] {
743
+ -moz-appearance: textfield;
744
+ }
745
+
746
+ .lyric-empty {
747
+ padding: 16px;
748
+ color: var(--text-muted);
749
+ text-align: center;
750
+ }
751
+
752
+ .audio-track {
753
+ display: grid;
754
+ grid-template-columns: 80px 1fr;
755
+ gap: 12px;
756
+ align-items: center;
757
+ padding: 12px 14px;
758
+ border-radius: 12px;
759
+ border: 1px solid var(--border-soft);
760
+ background: var(--panel-soft);
761
+ margin-bottom: 12px;
762
+ flex-shrink: 0;
763
+ }
764
+
765
+ .audio-track-label {
766
+ font-size: 12px;
767
+ text-transform: uppercase;
768
+ letter-spacing: 1px;
769
+ color: var(--text-muted);
770
+ }
771
+
772
+ .audio-wave {
773
+ width: 100%;
774
+ height: 80px;
775
+ min-height: 80px;
776
+ }
777
+
778
+ :root {
779
+ --text-primary: #e9eef7;
780
+ --text-muted: rgba(233, 238, 247, 0.7);
781
+ --panel: rgba(13, 16, 28, 0.8);
782
+ --panel-strong: rgba(16, 21, 35, 0.95);
783
+ --panel-soft: rgba(255, 255, 255, 0.03);
784
+ --border-subtle: rgba(255, 255, 255, 0.08);
785
+ --border-soft: rgba(255, 255, 255, 0.12);
786
+ --input-bg: rgba(255, 255, 255, 0.06);
787
+ --grid-bg: rgba(14, 18, 30, 0.9);
788
+ --grid-line-minor: rgba(233, 238, 247, 0.08);
789
+ --grid-line-major: rgba(233, 238, 247, 0.16);
790
+ --accent: #48e4c2;
791
+ --accent-strong: #4b64bc;
792
+ --note-text: #0b1122;
793
+ --button-ghost-bg: rgba(233, 238, 247, 0.18);
794
+ --button-ghost-text: #ffffff;
795
+ --button-soft-bg: rgba(255, 255, 255, 0.14);
796
+ --button-soft-text: #ffffff;
797
+ --button-primary-text: #0b1122;
798
+ --shadow-panel: 0 18px 40px rgba(0, 0, 0, 0.32);
799
+ }
800
+
801
+ :root[data-theme='light'] {
802
+ --text-primary: #1b2238;
803
+ --text-muted: rgba(27, 34, 56, 0.7);
804
+ --panel: rgba(255, 255, 255, 0.9);
805
+ --panel-strong: rgba(250, 252, 255, 0.98);
806
+ --panel-soft: rgba(15, 23, 42, 0.04);
807
+ --border-subtle: rgba(15, 23, 42, 0.12);
808
+ --border-soft: rgba(15, 23, 42, 0.16);
809
+ --input-bg: rgba(15, 23, 42, 0.06);
810
+ --grid-bg: rgba(248, 250, 255, 0.95);
811
+ --grid-line-minor: rgba(15, 23, 42, 0.12);
812
+ --grid-line-major: rgba(15, 23, 42, 0.24);
813
+ --accent: #3f8cff;
814
+ --accent-strong: #4b64bc;
815
+ --note-text: #ffffff;
816
+ --button-ghost-bg: rgba(15, 23, 42, 0.06);
817
+ --button-ghost-text: #1b2238;
818
+ --button-soft-bg: rgba(15, 23, 42, 0.06);
819
+ --button-soft-text: #1b2238;
820
+ --button-primary-text: #0b1122;
821
+ --shadow-panel: 0 18px 40px rgba(15, 23, 42, 0.15);
822
+ }
823
+
824
+ .sr-only {
825
+ position: absolute;
826
+ width: 1px;
827
+ height: 1px;
828
+ padding: 0;
829
+ margin: -1px;
830
+ overflow: hidden;
831
+ clip: rect(0, 0, 0, 0);
832
+ white-space: nowrap;
833
+ border: 0;
834
+ }
preprocess/tools/midi_editor/src/App.tsx ADDED
@@ -0,0 +1,675 @@
1
+ import { useCallback, useEffect, useMemo, useRef, useState } from 'react'
2
+ import * as Tone from 'tone'
3
+ import { PianoRoll } from './components/PianoRoll'
4
+ import { LyricTable } from './components/LyricTable'
5
+ import { AudioTrack } from './components/AudioTrack'
6
+ import { useMidiStore } from './store/useMidiStore'
7
+ import { exportMidi, importMidiFile } from './lib/midi'
8
+ import type { TimeSignature } from './types'
9
+ import type { Lang } from './i18n'
10
+ import { getTranslations } from './i18n'
11
+ import { BASE_GRID_SECOND_WIDTH, BASE_ROW_HEIGHT, LOW_NOTE, HIGH_NOTE } from './constants'
12
+ import './App.css'
13
+
14
+ type PlayEvent = {
15
+ time: number
16
+ midi: number
17
+ duration: number
18
+ velocity: number
19
+ }
20
+
21
+ function App() {
22
+ const {
23
+ notes,
24
+ tempo,
25
+ timeSignature,
26
+ selectedId,
27
+ playhead,
28
+ ppq,
29
+ addNote,
30
+ updateNote,
31
+ removeNote,
32
+ setNotes,
33
+ setTempo,
34
+ setTimeSignature,
35
+ setPpq,
36
+ select,
37
+ setPlayhead,
38
+ } = useMidiStore()
39
+
40
+ const [lang, setLang] = useState<Lang>('zh')
41
+ const t = getTranslations(lang)
42
+
43
+ const [status, setStatus] = useState(t.ready)
44
+ const [isPlaying, setIsPlaying] = useState(false)
45
+ const [theme, setTheme] = useState<'dark' | 'light'>('light')
46
+ const [audioUrl, setAudioUrl] = useState<string | null>(null)
47
+ const [audioDuration, setAudioDuration] = useState(0)
48
+ const [midiVolume, setMidiVolume] = useState(80) // 0-100
49
+ const [audioVolume, setAudioVolume] = useState(80) // 0-100
50
+ const [horizontalZoom, setHorizontalZoom] = useState(1)
51
+ const [verticalZoom, setVerticalZoom] = useState(1)
52
+ const [focusLyricId, setFocusLyricId] = useState<string | null>(null)
53
+ // Selection range for loop playback (in seconds)
54
+ const [selectionStart, setSelectionStart] = useState<number | null>(null)
55
+ const [selectionEnd, setSelectionEnd] = useState<number | null>(null)
56
+ const [isSelectingRange, setIsSelectingRange] = useState(false)
57
+ const fileInputRef = useRef<HTMLInputElement | null>(null)
58
+ const audioInputRef = useRef<HTMLInputElement | null>(null)
59
+ const audioRef = useRef<HTMLAudioElement | null>(null)
60
+ const partRef = useRef<Tone.Part<PlayEvent> | null>(null)
61
+ const synthRef = useRef<Tone.PolySynth | null>(null)
62
+ const rafRef = useRef<number | null>(null)
63
+ const audioScrollRef = useRef<HTMLDivElement | null>(null)
64
+
65
+ useEffect(() => {
66
+ return () => {
67
+ stopPlayback()
68
+ synthRef.current?.dispose()
69
+ }
70
+ }, [])
71
+
72
+ useEffect(() => {
73
+ document.documentElement.dataset.theme = theme
74
+ }, [theme])
75
+
76
+ // Update status text when language changes
77
+ useEffect(() => {
78
+ setStatus(t.ready)
79
+ }, [lang])
80
+
81
+ // Sync audio volume - also trigger when audioUrl changes (new audio loaded)
82
+ useEffect(() => {
83
+ if (audioRef.current) {
84
+ audioRef.current.volume = audioVolume / 100
85
+ }
86
+ }, [audioVolume, audioUrl])
87
+
88
+ // Sync MIDI synth volume
89
+ useEffect(() => {
90
+ if (synthRef.current) {
91
+ // Convert 0-100 to dB scale (-60 to 0)
92
+ const dbValue = midiVolume === 0 ? -Infinity : (midiVolume / 100) * 60 - 60
93
+ synthRef.current.volume.value = dbValue
94
+ }
95
+ }, [midiVolume])
96
+
97
+ useEffect(() => {
98
+ if (!audioUrl) return
99
+ return () => {
100
+ URL.revokeObjectURL(audioUrl)
101
+ }
102
+ }, [audioUrl])
103
+
104
+ const ensureSynth = async () => {
105
+ await Tone.start()
106
+ if (!synthRef.current) {
107
+ synthRef.current = new Tone.PolySynth(Tone.Synth).toDestination()
108
+ // Apply current volume
109
+ const dbValue = midiVolume === 0 ? -Infinity : (midiVolume / 100) * 60 - 60
110
+ synthRef.current.volume.value = dbValue
111
+ }
112
+ }
113
+
114
+ const playPreviewNote = useCallback(async (midi: number) => {
115
+ await ensureSynth()
116
+ const frequency = Tone.Frequency(midi, 'midi').toFrequency()
117
+ synthRef.current?.triggerAttackRelease(frequency, '8n', Tone.now(), 0.7)
118
+ }, [midiVolume])
119
+
120
+ useEffect(() => {
121
+ const onKeyDown = (event: KeyboardEvent) => {
122
+ if (!selectedId) return
123
+ const target = event.target as HTMLElement | null
124
+ if (target && ['INPUT', 'TEXTAREA'].includes(target.tagName)) return
125
+
126
+ // Delete note
127
+ if (event.key === 'Backspace' || event.key === 'Delete') {
128
+ event.preventDefault()
129
+ removeNote(selectedId)
130
+ select(null)
131
+ return
132
+ }
133
+
134
+ // Cmd/Ctrl + Up/Down to adjust pitch
135
+ const isCmdOrCtrl = event.metaKey || event.ctrlKey
136
+ if (isCmdOrCtrl && (event.key === 'ArrowUp' || event.key === 'ArrowDown')) {
137
+ event.preventDefault()
138
+ const selectedNote = notes.find(n => n.id === selectedId)
139
+ if (!selectedNote) return
140
+
141
+ const delta = event.key === 'ArrowUp' ? 1 : -1
142
+ const newMidi = Math.max(LOW_NOTE, Math.min(HIGH_NOTE, selectedNote.midi + delta))
143
+
144
+ if (newMidi !== selectedNote.midi) {
145
+ updateNote(selectedId, { midi: newMidi })
146
+ playPreviewNote(newMidi)
147
+ }
148
+ }
149
+ }
150
+ window.addEventListener('keydown', onKeyDown)
151
+ return () => window.removeEventListener('keydown', onKeyDown)
152
+ }, [selectedId, notes, removeNote, select, updateNote, playPreviewNote])
153
+
154
+ const noteEvents = useMemo<PlayEvent[]>(
155
+ () =>
156
+ notes.map((note) => ({
157
+ time: (60 / tempo) * note.start,
158
+ duration: (60 / tempo) * note.duration,
159
+ midi: note.midi,
160
+ velocity: note.velocity,
161
+ })),
162
+ [notes, tempo],
163
+ )
164
+
165
+ const beatToSeconds = (beat: number) => beat * (60 / tempo)
166
+ const secondsToBeat = (seconds: number) => seconds / (60 / tempo)
167
+ const seekBySeconds = (deltaSeconds: number) => {
168
+ const maxNoteEnd = notes.reduce((acc, n) => Math.max(acc, n.start + n.duration), 0)
169
+ const maxBeat = Math.max(secondsToBeat(audioDuration), maxNoteEnd)
170
+ const nextSeconds = Math.max(0, Math.min(beatToSeconds(maxBeat), beatToSeconds(playhead) + deltaSeconds))
171
+ seekToBeat(secondsToBeat(nextSeconds))
172
+ }
173
+
174
+ const gridSecondWidth = BASE_GRID_SECOND_WIDTH * horizontalZoom
175
+ const rowHeight = BASE_ROW_HEIGHT * verticalZoom
176
+
177
+ // Calculate MIDI content width to sync with audio track
178
+ const midiContentWidth = useMemo(() => {
179
+ const noteEndSeconds = notes.reduce((acc, n) => {
180
+ const endBeat = n.start + n.duration
181
+ return Math.max(acc, beatToSeconds(endBeat))
182
+ }, 8)
183
+ const maxSeconds = Math.max(noteEndSeconds + 10, audioDuration + 10, 30)
184
+ return maxSeconds * gridSecondWidth
185
+ }, [notes, audioDuration, gridSecondWidth, beatToSeconds])
186
+
187
+ const seekToBeat = (beat: number) => {
188
+ setPlayhead(beat)
189
+ Tone.Transport.seconds = beatToSeconds(beat)
190
+ if (audioRef.current) {
191
+ audioRef.current.currentTime = beatToSeconds(beat)
192
+ }
193
+ }
194
+
195
+ const schedulePlayback = async () => {
196
+ if (!notes.length && !audioUrl) return
197
+ await ensureSynth()
198
+ partRef.current?.dispose()
199
+ Tone.Transport.cancel()
200
+ Tone.Transport.stop()
201
+ Tone.Transport.bpm.value = tempo
202
+
203
+ // Determine playback range
204
+ const hasSelection = selectionStart !== null && selectionEnd !== null && selectionEnd > selectionStart
205
+ const startSeconds = hasSelection ? selectionStart : beatToSeconds(playhead)
206
+ const endSeconds = hasSelection ? selectionEnd : null
207
+
208
+ Tone.Transport.seconds = startSeconds
209
+
210
+ // Filter notes within selection range if applicable
211
+ const filteredEvents = hasSelection
212
+ ? noteEvents.filter(e => e.time >= startSeconds && e.time < endSeconds!)
213
+ : noteEvents
214
+
215
+ if (filteredEvents.length) {
216
+ partRef.current = new Tone.Part((time, event) => {
217
+ if (midiVolume === 0) return
218
+ const frequency = Tone.Frequency(event.midi, 'midi').toFrequency()
219
+ synthRef.current?.triggerAttackRelease(frequency, event.duration, time, event.velocity)
220
+ }, filteredEvents)
221
+ partRef.current.start(0)
222
+ }
223
+ Tone.Transport.start()
224
+ if (audioRef.current && audioUrl) {
225
+ audioRef.current.currentTime = startSeconds
226
+ if (audioVolume > 0) {
227
+ audioRef.current.play().catch(() => null)
228
+ }
229
+ }
230
+ setIsPlaying(true)
231
+ setStatus(hasSelection ? t.selectionPlayback : t.playing)
232
+
233
+ const tick = () => {
234
+ const seconds =
235
+ audioRef.current && audioUrl && !audioRef.current.paused
236
+ ? audioRef.current.currentTime
237
+ : Tone.Transport.seconds
238
+
239
+ // Stop at selection end
240
+ if (endSeconds !== null && seconds >= endSeconds) {
241
+ pausePlayback()
242
+ seekToBeat(secondsToBeat(selectionStart!))
243
+ setStatus(t.selectionDone)
244
+ return
245
+ }
246
+
247
+ const beat = seconds / (60 / tempo)
248
+ setPlayhead(beat)
249
+ rafRef.current = requestAnimationFrame(tick)
250
+ }
251
+ rafRef.current = requestAnimationFrame(tick)
252
+ }
253
+
254
+ const stopPlayback = () => {
255
+ Tone.Transport.stop()
256
+ Tone.Transport.cancel()
257
+ partRef.current?.dispose()
258
+ partRef.current = null
259
+ setIsPlaying(false)
260
+ setPlayhead(0)
261
+ if (audioRef.current) {
262
+ audioRef.current.pause()
263
+ audioRef.current.currentTime = 0
264
+ }
265
+ if (rafRef.current) {
266
+ cancelAnimationFrame(rafRef.current)
267
+ rafRef.current = null
268
+ }
269
+ }
270
+
271
+ const pausePlayback = () => {
272
+ Tone.Transport.stop()
273
+ partRef.current?.dispose()
274
+ partRef.current = null
275
+ setIsPlaying(false)
276
+ if (audioRef.current) {
277
+ audioRef.current.pause()
278
+ }
279
+ if (rafRef.current) {
280
+ cancelAnimationFrame(rafRef.current)
281
+ rafRef.current = null
282
+ }
283
+ }
284
+
285
+ const handlePlayToggle = async () => {
286
+ if (isPlaying) {
287
+ pausePlayback()
288
+ setStatus(t.paused)
289
+ } else {
290
+ await schedulePlayback()
291
+ }
292
+ }
293
+
294
+ const handleImportClick = () => fileInputRef.current?.click()
295
+ const handleAudioImportClick = () => audioInputRef.current?.click()
296
+
297
+ const handleFileChange = async (event: React.ChangeEvent<HTMLInputElement>) => {
298
+ const file = event.target.files?.[0]
299
+ if (!file) return
300
+
301
+ try {
302
+ const snapshot = await importMidiFile(file)
303
+ setNotes(snapshot.notes)
304
+ setTempo(snapshot.tempo)
305
+ setTimeSignature(snapshot.timeSignature as TimeSignature)
306
+ setPpq(snapshot.ppq) // Preserve original ppq for accurate export
307
+ setStatus(t.imported(file.name))
308
+ } catch (error) {
309
+ console.error(error)
310
+ setStatus(t.importFailed)
311
+ } finally {
312
+ event.target.value = ''
313
+ }
314
+ }
315
+
316
+ const handleAudioChange = (event: React.ChangeEvent<HTMLInputElement>) => {
317
+ const file = event.target.files?.[0]
318
+ if (!file) return
319
+
320
+ // Validate audio file type
321
+ const validAudioTypes = ['audio/mpeg', 'audio/wav', 'audio/ogg', 'audio/flac', 'audio/mp4', 'audio/aac', 'audio/x-m4a']
322
+ const validExtensions = ['.mp3', '.wav', '.ogg', '.flac', '.m4a', '.aac']
323
+ const fileName = file.name.toLowerCase()
324
+ const isValidType = validAudioTypes.includes(file.type) || file.type.startsWith('audio/')
325
+ const isValidExtension = validExtensions.some(ext => fileName.endsWith(ext))
326
+
327
+ if (!isValidType && !isValidExtension) {
328
+ setStatus(t.unsupportedFormat(validExtensions.join(', ')))
329
+ event.target.value = ''
330
+ return
331
+ }
332
+
333
+ const url = URL.createObjectURL(file)
334
+ setAudioUrl(url)
335
+ setStatus(t.audioImported(file.name))
336
+ event.target.value = ''
337
+ }
338
+
339
+ // Fix overlapping notes by trimming the first note to end where the second begins
340
+ // Returns the number of fixed overlaps
341
+ const fixOverlaps = (): number => {
342
+ const sortedNotes = [...notes].sort((a, b) => a.start - b.start)
343
+ let fixCount = 0
344
+
345
+ for (let i = 0; i < sortedNotes.length - 1; i++) {
346
+ const noteA = sortedNotes[i]
347
+ const noteB = sortedNotes[i + 1]
348
+ const noteAEnd = noteA.start + noteA.duration
349
+
350
+ // If noteA overlaps with noteB
351
+ if (noteAEnd > noteB.start) {
352
+ // Trim noteA to end at noteB's start
353
+ const newDuration = Math.max(0.01, noteB.start - noteA.start)
354
+ updateNote(noteA.id, { duration: newDuration })
355
+ fixCount++
356
+ }
357
+ }
358
+
359
+ return fixCount
360
+ }
361
+
362
+ // UI handler for fix overlaps button
363
+ const handleFixOverlaps = () => {
364
+ const fixCount = fixOverlaps()
365
+ if (fixCount > 0) {
366
+ setStatus(t.fixedOverlaps(fixCount))
367
+ } else {
368
+ setStatus(t.noOverlaps)
369
+ }
370
+ }
371
+
372
+ const handleExport = () => {
373
+ // Auto-fix overlaps before export
374
+ fixOverlaps()
375
+
376
+ // Get the latest notes from store (after fix, zustand set is synchronous)
377
+ const latestNotes = useMidiStore.getState().notes
378
+
379
+ const blob = exportMidi({ notes: latestNotes, tempo, timeSignature, ppq })
380
+ const url = URL.createObjectURL(blob)
381
+ const anchor = document.createElement('a')
382
+ anchor.href = url
383
+ anchor.download = 'vocal-midi.mid'
384
+ anchor.click()
385
+ URL.revokeObjectURL(url)
386
+ setStatus(t.exported)
387
+ }
388
+
389
+ const handleTranspose = (semitones: number) => {
390
+ if (semitones === 0 || !notes.length) return
391
+ for (const note of notes) {
392
+ const newMidi = Math.max(0, Math.min(127, note.midi + semitones))
393
+ updateNote(note.id, { midi: newMidi })
394
+ }
395
+ setStatus(t.transposed(semitones))
396
+ }
397
+
398
+ return (
399
+ <div className="app-shell">
400
+ <header className="topbar">
401
+ <div>
402
+ <p className="eyebrow">{t.eyebrow}</p>
403
+ <h1>{t.title}</h1>
404
+ <p className="muted">{t.subtitle}</p>
405
+ </div>
406
+ <div className="actions">
407
+ <button className="primary" onClick={handleImportClick}>
408
+ {t.importMidi}
409
+ </button>
410
+ <button className="primary" onClick={handleExport}>
411
+ {t.exportMidi}
412
+ </button>
413
+ <div className="transpose-group" title={t.transposeTooltip}>
414
+ <select
415
+ className="transpose-select"
416
+ value={0}
417
+ onChange={(e) => {
418
+ const val = Number(e.target.value)
419
+ if (val !== 0) handleTranspose(val)
420
+ e.target.value = '0'
421
+ }}
422
+ >
423
+ <option value={0}>{t.transpose}</option>
424
+ {Array.from({ length: 24 }, (_, i) => i - 12)
425
+ .filter(v => v !== 0)
426
+ .reverse()
427
+ .map(v => (
428
+ <option key={v} value={v}>
429
+ {v > 0 ? `+${v}` : v}
430
+ </option>
431
+ ))}
432
+ </select>
433
+ </div>
434
+ <button className="soft" onClick={handleFixOverlaps} title={t.fixOverlapsTooltip}>
435
+ {t.fixOverlaps}
436
+ </button>
437
+ <button className="icon-toggle" onClick={() => setTheme(theme === 'dark' ? 'light' : 'dark')}>
438
+ {theme === 'dark' ? (
439
+ <span className="icon" aria-label={t.switchToLight}>
440
+ ☀️
441
+ </span>
442
+ ) : (
443
+ <span className="icon" aria-label={t.switchToDark}>
444
+ 🌙
445
+ </span>
446
+ )}
447
+ </button>
448
+ <button
449
+ className="icon-toggle"
450
+ onClick={() => setLang(lang === 'zh' ? 'en' : 'zh')}
451
+ title={lang === 'zh' ? 'Switch to English' : '切换到中文'}
452
+ >
453
+ <span className="lang-label">{lang === 'zh' ? 'EN' : '中'}</span>
454
+ </button>
455
+ <input ref={fileInputRef} type="file" accept=".mid,.midi" className="sr-only" onChange={handleFileChange} />
456
+ </div>
457
+ </header>
458
+
459
+ <section className="audio-bar">
460
+ <div className="audio-left">
461
+ <button className="ghost" onClick={handleAudioImportClick}>
462
+ {t.importAudio}
463
+ </button>
464
+ <input
465
+ ref={audioInputRef}
466
+ type="file"
467
+ accept=".mp3,.wav,.ogg,.flac,.m4a,.aac"
468
+ className="sr-only"
469
+ onChange={handleAudioChange}
470
+ />
471
+ <span className="audio-hint">{t.audioHint}</span>
472
+ </div>
473
+ <div className="audio-right">
474
+ <div className="volume-control">
475
+ <span className="volume-label">{t.midiLabel}</span>
476
+ <input
477
+ type="range"
478
+ min={0}
479
+ max={100}
480
+ value={midiVolume}
481
+ onChange={(e) => setMidiVolume(Number(e.target.value))}
482
+ className="volume-slider"
483
+ />
484
+ <span className="volume-value">{midiVolume}%</span>
485
+ </div>
486
+ <div className="volume-control">
487
+ <span className="volume-label">{t.audioLabel}</span>
488
+ <input
489
+ type="range"
490
+ min={0}
491
+ max={100}
492
+ value={audioVolume}
493
+ onChange={(e) => setAudioVolume(Number(e.target.value))}
494
+ className="volume-slider"
495
+ />
496
+ <span className="volume-value">{audioVolume}%</span>
497
+ </div>
498
+ </div>
499
+ </section>
500
+
501
+ <section className="panel panel-split">
502
+ <div className="panel-main">
503
+ {audioUrl && (
504
+ <AudioTrack
505
+ key={audioUrl}
506
+ ref={audioScrollRef}
507
+ audioUrl={audioUrl}
508
+ muted={audioVolume === 0}
509
+ onSeek={(seconds) => seekToBeat(secondsToBeat(seconds))}
510
+ playheadSeconds={beatToSeconds(playhead)}
511
+ gridSecondWidth={gridSecondWidth}
512
+ minContentWidth={midiContentWidth}
513
+ />
514
+ )}
515
+ <PianoRoll
516
+ notes={notes}
517
+ selectedId={selectedId}
518
+ timeSignature={timeSignature}
519
+ tempo={tempo}
520
+ playhead={playhead}
521
+ selectionStart={selectionStart}
522
+ selectionEnd={selectionEnd}
523
+ onAddNote={addNote}
524
+ onSelect={select}
525
+ onUpdateNote={updateNote}
526
+ onSeek={seekToBeat}
527
+ onScroll={(left) => {
528
+ if (audioScrollRef.current) {
529
+ audioScrollRef.current.scrollLeft = left
530
+ }
531
+ }}
532
+ onZoom={(deltaH, deltaV) => {
533
+ if (deltaH !== 0) {
534
+ setHorizontalZoom(prev => Math.max(0.5, prev + deltaH))
535
+ }
536
+ if (deltaV !== 0) {
537
+ setVerticalZoom(prev => Math.max(0.6, Math.min(2.5, prev + deltaV)))
538
+ }
539
+ }}
540
+ onPlayNote={playPreviewNote}
541
+ onFocusLyric={(noteId) => {
542
+ select(noteId)
543
+ setFocusLyricId(noteId)
544
+ }}
545
+ onSelectionChange={(start, end) => {
546
+ setSelectionStart(start)
547
+ setSelectionEnd(end)
548
+ }}
549
+ isSelectingRange={isSelectingRange}
550
+ audioDuration={audioDuration}
551
+ gridSecondWidth={gridSecondWidth}
552
+ rowHeight={rowHeight}
553
+ />
554
+ </div>
555
+ <aside className="panel-side">
556
+ <div className="controls">
557
+ <div className="toggle" style={{ justifyContent: 'space-between' }}>
558
+ <span>{t.horizontalZoom}</span>
559
+ <input
560
+ type="range"
561
+ min={0.5}
562
+ max={10}
563
+ step={0.1}
564
+ value={Math.min(horizontalZoom, 10)}
565
+ onChange={(e) => setHorizontalZoom(Number(e.target.value))}
566
+ style={{ width: '140px' }}
567
+ />
568
+ <span style={{ width: 42, textAlign: 'right' }}>{horizontalZoom.toFixed(1)}x</span>
569
+ </div>
570
+ <div className="toggle" style={{ justifyContent: 'space-between' }}>
571
+ <span>{t.verticalZoom}</span>
572
+ <input
573
+ type="range"
574
+ min={0.6}
575
+ max={2.5}
576
+ step={0.1}
577
+ value={verticalZoom}
578
+ onChange={(e) => setVerticalZoom(Number(e.target.value))}
579
+ style={{ width: '140px' }}
580
+ />
581
+ <span style={{ width: 42, textAlign: 'right' }}>{verticalZoom.toFixed(1)}x</span>
582
+ </div>
583
+ <div className="transport">
584
+ <button
585
+ className="soft"
586
+ onClick={() => {
587
+ setPlayhead(0)
588
+ seekToBeat(0)
589
+ }}
590
+ title={t.goToStart}
591
+ >
592
+
593
+ </button>
594
+ <button
595
+ className="soft"
596
+ onClick={() => seekBySeconds(-2)}
597
+ title={t.back2s}
598
+ >
599
+
600
+ </button>
601
+ <button
602
+ className="primary"
603
+ onClick={handlePlayToggle}
604
+ disabled={!notes.length && !audioUrl}
605
+ title={isPlaying ? t.pause : (selectionStart !== null && selectionEnd !== null ? t.playSelection : t.play)}
606
+ >
607
+ {isPlaying ? '⏸' : '▶'}
608
+ </button>
609
+ <button
610
+ className="soft"
611
+ onClick={() => seekBySeconds(2)}
612
+ title={t.forward2s}
613
+ >
614
+
615
+ </button>
616
+ <button
617
+ className="soft"
618
+ onClick={() => {
619
+ const maxNoteEnd = notes.reduce((acc, n) => Math.max(acc, n.start + n.duration), 0)
620
+ seekToBeat(Math.max(secondsToBeat(audioDuration), maxNoteEnd))
621
+ }}
622
+ title={t.goToEnd}
623
+ >
624
+
625
+ </button>
626
+ <span className="transport-divider" />
627
+ <button
628
+ className={`soft selection-btn ${isSelectingRange ? 'active' : ''}`}
629
+ onClick={() => {
630
+ if (isSelectingRange) {
631
+ // Exiting selection mode - auto clear selection
632
+ setIsSelectingRange(false)
633
+ setSelectionStart(null)
634
+ setSelectionEnd(null)
635
+ } else {
636
+ setIsSelectingRange(true)
637
+ }
638
+ }}
639
+ title={isSelectingRange ? t.exitSelectMode : t.setRangeTooltip}
640
+ >
641
+ {isSelectingRange ? `📍 ${t.selectingRange}` : `📍 ${t.setRange}`}
642
+ </button>
643
+ </div>
644
+ <div className="status">{status}</div>
645
+ </div>
646
+ <div className="lyric-container">
647
+ <LyricTable
648
+ notes={notes}
649
+ selectedId={selectedId}
650
+ tempo={tempo}
651
+ focusLyricId={focusLyricId}
652
+ lang={lang}
653
+ onSelect={select}
654
+ onUpdate={updateNote}
655
+ onFocusHandled={() => setFocusLyricId(null)}
656
+ />
657
+ </div>
658
+ </aside>
659
+ </section>
660
+ <audio
661
+ ref={audioRef}
662
+ src={audioUrl ?? undefined}
663
+ preload="auto"
664
+ className="sr-only"
665
+ onLoadedMetadata={(e) => {
666
+ setAudioDuration(e.currentTarget.duration)
667
+ // Ensure volume is set when audio loads
668
+ e.currentTarget.volume = audioVolume / 100
669
+ }}
670
+ />
671
+ </div>
672
+ )
673
+ }
674
+
675
+ export default App
preprocess/tools/midi_editor/src/components/AudioTrack.tsx ADDED
@@ -0,0 +1,182 @@
1
+ import { useEffect, useRef, forwardRef, useState } from 'react'
2
+ import WaveSurfer from 'wavesurfer.js'
3
+ import { PITCH_WIDTH } from '../constants'
4
+
5
+ export type AudioTrackProps = {
6
+ audioUrl: string | null
7
+ muted: boolean
8
+ onSeek: (seconds: number) => void
9
+ mediaElement?: HTMLAudioElement | null
10
+ playheadSeconds: number
11
+ gridSecondWidth: number
12
+ minContentWidth?: number // Minimum width to match MIDI editor area
13
+ }
14
+
15
+ export const AudioTrack = forwardRef<HTMLDivElement, AudioTrackProps>(
16
+ ({ audioUrl, muted, onSeek, playheadSeconds, gridSecondWidth, minContentWidth = 0 }, ref) => {
17
+ const containerRef = useRef<HTMLDivElement | null>(null)
18
+ const waveRef = useRef<WaveSurfer | null>(null)
19
+ const [waveWidth, setWaveWidth] = useState(0)
20
+
21
+ useEffect(() => {
22
+ if (!containerRef.current) return
23
+ if (!audioUrl) {
24
+ try {
25
+ waveRef.current?.destroy()
26
+ } catch {
27
+ // ignore teardown errors
28
+ }
29
+ waveRef.current = null
30
+ setWaveWidth(0)
31
+ return
32
+ }
33
+
34
+ let cancelled = false
35
+
36
+ // Clean up existing instance
37
+ if (waveRef.current) {
38
+ try {
39
+ waveRef.current.destroy()
40
+ } catch {
41
+ // ignore teardown errors
42
+ }
43
+ }
44
+
45
+ waveRef.current = WaveSurfer.create({
46
+ container: containerRef.current,
47
+ waveColor: '#4b64bc',
48
+ progressColor: '#4b64bc',
49
+ cursorColor: 'transparent',
50
+ barWidth: 2,
51
+ barGap: 2,
52
+ height: 60,
53
+ normalize: true,
54
+ minPxPerSec: gridSecondWidth,
55
+ interact: false,
56
+ hideScrollbar: true,
57
+ autoScroll: false,
58
+ })
59
+
60
+ waveRef.current.load(audioUrl).catch(() => null)
61
+ waveRef.current.on('error', () => null)
62
+
63
+ waveRef.current.on('ready', () => {
64
+ if (cancelled || !waveRef.current) return
65
+ const duration = waveRef.current.getDuration()
66
+ const requiredWidth = duration * gridSecondWidth
67
+ setWaveWidth(requiredWidth)
68
+ })
69
+
70
+ return () => {
71
+ cancelled = true
72
+ try {
73
+ waveRef.current?.destroy()
74
+ } catch {
75
+ // ignore teardown errors
76
+ }
77
+ waveRef.current = null
78
+ }
79
+ }, [audioUrl, gridSecondWidth])
80
+
81
+ useEffect(() => {
82
+ if (!waveRef.current) return
83
+ waveRef.current.setOptions({
84
+ waveColor: muted ? '#9aa6b2' : '#4b64bc',
85
+ progressColor: muted ? '#c0c9d4' : '#4b64bc',
86
+ })
87
+ }, [muted])
88
+
89
+ if (!audioUrl) return null
90
+
91
+ // Content width should be at least as wide as MIDI editor
92
+ const contentWidth = Math.max(waveWidth, minContentWidth)
93
+
94
+ return (
95
+ <div
96
+ className="audio-track-row"
97
+ style={{
98
+ display: 'flex',
99
+ borderBottom: '1px solid var(--border-soft)',
100
+ height: '70px',
101
+ flexShrink: 0
102
+ }}
103
+ >
104
+ <div
105
+ className="audio-gutter"
106
+ style={{
107
+ width: PITCH_WIDTH,
108
+ flexShrink: 0,
109
+ background: 'var(--panel-strong)',
110
+ borderRight: '1px solid var(--border-subtle)',
111
+ display: 'flex',
112
+ alignItems: 'center',
113
+ justifyContent: 'center',
114
+ fontSize: '11px',
115
+ color: 'var(--text-muted)',
116
+ fontWeight: 600,
117
+ }}
118
+ >
119
+ AUDIO
120
+ </div>
121
+
122
+ {/* Scroll Mask - Controlled by parent via ref */}
123
+ <div
124
+ ref={ref}
125
+ className="audio-scroll-mask"
126
+ style={{
127
+ flex: 1,
128
+ overflow: 'hidden',
129
+ position: 'relative',
130
+ background: 'var(--panel-soft)',
131
+ }}
132
+ onClick={(e) => {
133
+ const rect = e.currentTarget.getBoundingClientRect()
134
+ const scrollMask = e.currentTarget as HTMLDivElement
135
+ const x = e.clientX - rect.left + scrollMask.scrollLeft
136
+ const seconds = x / gridSecondWidth
137
+ onSeek(seconds)
138
+ }}
139
+ >
140
+ {/* Container that matches MIDI editor width */}
141
+ <div
142
+ className="audio-content"
143
+ style={{
144
+ width: contentWidth > 0 ? contentWidth : '100%',
145
+ height: '100%',
146
+ position: 'relative'
147
+ }}
148
+ >
149
+ {/* WaveSurfer container - only as wide as audio */}
150
+ <div
151
+ ref={containerRef}
152
+ className="wave-container"
153
+ style={{
154
+ width: waveWidth > 0 ? waveWidth : '100%',
155
+ height: '100%',
156
+ position: 'absolute',
157
+ left: 0,
158
+ top: 0
159
+ }}
160
+ />
161
+
162
+ {/* Custom Playhead */}
163
+ <div
164
+ className="audio-playhead"
165
+ style={{
166
+ position: 'absolute',
167
+ top: 0,
168
+ bottom: 0,
169
+ width: '2px',
170
+ background: '#ff7043',
171
+ boxShadow: '0 0 12px rgba(255, 112, 67, 0.6)',
172
+ left: playheadSeconds * gridSecondWidth,
173
+ zIndex: 10,
174
+ pointerEvents: 'none',
175
+ }}
176
+ />
177
+ </div>
178
+ </div>
179
+ </div>
180
+ )
181
+ }
182
+ )
preprocess/tools/midi_editor/src/components/LyricTable.tsx ADDED
@@ -0,0 +1,301 @@
1
+ import { useEffect, useMemo, useRef, useState } from 'react'
2
+ import type { NoteEvent } from '../types'
3
+ import type { Lang } from '../i18n'
4
+ import { getTranslations, tokenizeLyrics } from '../i18n'
5
+
6
+ export type LyricTableProps = {
7
+ notes: NoteEvent[]
8
+ selectedId: string | null
9
+ tempo: number
10
+ focusLyricId: string | null
11
+ lang: Lang
12
+ onSelect: (id: string | null) => void
13
+ onUpdate: (id: string, patch: Partial<NoteEvent>) => void
14
+ onScrollToNote?: (noteId: string) => void
15
+ onFocusHandled?: () => void
16
+ }
17
+
18
+ const formatSeconds = (beats: number, tempo: number) => {
19
+ const seconds = beats * (60 / tempo)
20
+ return Number.parseFloat(seconds.toFixed(2))
21
+ }
22
+
23
+ const secondsToBeats = (seconds: number, tempo: number) => {
24
+ return seconds * (tempo / 60)
25
+ }
26
+
27
+ // Editable cell with confirmation
28
+ function EditableCell({
29
+ value,
30
+ noteId,
31
+ field,
32
+ tempo,
33
+ onConfirm,
34
+ confirmTitle,
35
+ type = 'number',
36
+ min,
37
+ step
38
+ }: {
39
+ value: number
40
+ noteId: string
41
+ field: 'midi' | 'start' | 'end'
42
+ tempo: number
43
+ onConfirm: (noteId: string, field: string, value: number) => void
44
+ confirmTitle?: string
45
+ type?: string
46
+ min?: number
47
+ step?: number
48
+ }) {
49
+ const displayValue = field === 'midi' ? value : formatSeconds(value, tempo)
50
+ const [localValue, setLocalValue] = useState(String(displayValue))
51
+ const [isDirty, setIsDirty] = useState(false)
52
+ const inputRef = useRef<HTMLInputElement>(null)
53
+
54
+ // Sync with external value when it changes (and not dirty)
55
+ useEffect(() => {
56
+ if (!isDirty) {
57
+ setLocalValue(String(displayValue))
58
+ }
59
+ }, [displayValue, isDirty])
60
+
61
+ const handleChange = (e: React.ChangeEvent<HTMLInputElement>) => {
62
+ setLocalValue(e.target.value)
63
+ setIsDirty(true)
64
+ }
65
+
66
+ const handleConfirm = () => {
67
+ const parsed = parseFloat(localValue)
68
+ if (!isNaN(parsed)) {
69
+ if (field === 'midi') {
70
+ if (parsed >= 0 && parsed <= 127) {
71
+ onConfirm(noteId, field, Math.round(parsed))
72
+ }
73
+ } else {
74
+ if (parsed >= 0) {
75
+ onConfirm(noteId, field, secondsToBeats(parsed, tempo))
76
+ }
77
+ }
78
+ }
79
+ setIsDirty(false)
80
+ }
81
+
82
+ const handleKeyDown = (e: React.KeyboardEvent) => {
83
+ if (e.key === 'Enter') {
84
+ e.preventDefault()
85
+ handleConfirm()
86
+ inputRef.current?.blur()
87
+ } else if (e.key === 'Escape') {
88
+ setLocalValue(String(displayValue))
89
+ setIsDirty(false)
90
+ inputRef.current?.blur()
91
+ }
92
+ }
93
+
94
+ const handleBlur = () => {
95
+ if (isDirty) {
96
+ // Reset to original on blur without confirm
97
+ setLocalValue(String(displayValue))
98
+ setIsDirty(false)
99
+ }
100
+ }
101
+
102
+ return (
103
+ <div className="editable-cell">
104
+ <input
105
+ ref={inputRef}
106
+ className={`lyric-meta-input ${isDirty ? 'lyric-meta-dirty' : ''}`}
107
+ type={type}
108
+ min={min}
109
+ step={step}
110
+ value={localValue}
111
+ onChange={handleChange}
112
+ onKeyDown={handleKeyDown}
113
+ onBlur={handleBlur}
114
+ onClick={(e) => e.stopPropagation()}
115
+ />
116
+ {isDirty && (
117
+ <button
118
+ className="confirm-btn"
119
+ onMouseDown={(e) => {
120
+ e.preventDefault() // Prevent input blur
121
+ e.stopPropagation()
122
+ }}
123
+ onClick={(e) => {
124
+ e.stopPropagation()
125
+ handleConfirm()
126
+ }}
127
+ title={confirmTitle}
128
+ >
129
+
130
+ </button>
131
+ )}
132
+ </div>
133
+ )
134
+ }
135
+
136
+ export function LyricTable({ notes, selectedId, tempo, focusLyricId, lang, onSelect, onUpdate, onScrollToNote, onFocusHandled }: LyricTableProps) {
137
+ const t = getTranslations(lang)
138
+ const listRef = useRef<HTMLDivElement | null>(null)
139
+ const inputRefs = useRef<Map<string, HTMLInputElement>>(new Map())
140
+ const sorted = useMemo(() => [...notes].sort((a, b) => a.start - b.start), [notes])
141
+
142
+ // Scroll to selected note (no auto-focus on single click)
143
+ useEffect(() => {
144
+ if (!selectedId || !listRef.current) return
145
+
146
+ const target = listRef.current.querySelector<HTMLDivElement>(`[data-note-id="${selectedId}"]`)
147
+ if (target) {
148
+ target.scrollIntoView({ block: 'nearest', behavior: 'smooth' })
149
+ }
150
+ }, [selectedId])
151
+
152
+ // Focus lyric input when requested (double-click on note or click on list row)
153
+ useEffect(() => {
154
+ if (!focusLyricId) return
155
+
156
+ const input = inputRefs.current.get(focusLyricId)
157
+ if (input) {
158
+ setTimeout(() => {
159
+ input.focus()
160
+ input.select()
161
+ }, 50)
162
+ }
163
+ onFocusHandled?.()
164
+ }, [focusLyricId, onFocusHandled])
165
+
166
+ // Fill lyrics from selected note onwards
167
+ // Uses smart tokenizer: CJK chars -> one per note, English words -> one per note
168
+ const handleBulkFill = (bulkText: string) => {
169
+ if (!sorted.length) return
170
+ const tokens = tokenizeLyrics(bulkText)
171
+ if (!tokens.length) return
172
+
173
+ let startIndex = 0
174
+ if (selectedId) {
175
+ const selectedIndex = sorted.findIndex(n => n.id === selectedId)
176
+ if (selectedIndex >= 0) {
177
+ startIndex = selectedIndex
178
+ }
179
+ }
180
+
181
+ let tokenIndex = 0
182
+ for (let i = startIndex; i < sorted.length && tokenIndex < tokens.length; i++) {
183
+ onUpdate(sorted[i].id, { lyric: tokens[tokenIndex] })
184
+ tokenIndex++
185
+ }
186
+ }
187
+
188
+ const handleRowClick = (noteId: string) => {
189
+ onSelect(noteId)
190
+ onScrollToNote?.(noteId)
191
+ }
192
+
193
+ const handleFieldConfirm = (noteId: string, field: string, value: number) => {
194
+ const note = notes.find(n => n.id === noteId)
195
+ if (!note) return
196
+
197
+ if (field === 'midi') {
198
+ onUpdate(noteId, { midi: value })
199
+ } else if (field === 'start') {
200
+ // Keep END the same, adjust duration accordingly
201
+ const currentEnd = note.start + note.duration
202
+ const newDuration = Math.max(0.01, currentEnd - value)
203
+ onUpdate(noteId, { start: value, duration: newDuration })
204
+ } else if (field === 'end') {
205
+ // End changed, update duration
206
+ const newDuration = Math.max(0.01, value - note.start)
207
+ onUpdate(noteId, { duration: newDuration })
208
+ }
209
+ }
210
+
211
+ return (
212
+ <div className="lyric-card">
213
+ <div className="lyric-bulk">
214
+ <textarea
215
+ className="lyric-bulk-input"
216
+ rows={2}
217
+ placeholder={selectedId ? t.fillPlaceholderSelected : t.fillPlaceholderDefault}
218
+ onKeyDown={(e) => {
219
+ if (e.key === 'Enter' && !e.shiftKey) {
220
+ e.preventDefault()
221
+ handleBulkFill(e.currentTarget.value)
222
+ }
223
+ }}
224
+ />
225
+ <button
226
+ className="soft"
227
+ type="button"
228
+ onClick={(e) => {
229
+ const textarea = e.currentTarget.previousElementSibling as HTMLTextAreaElement
230
+ handleBulkFill(textarea.value)
231
+ }}
232
+ >
233
+ {t.fillButton.split('\n').map((line, i) => (
234
+ <span key={i}>{line}{i === 0 && <br/>}</span>
235
+ ))}
236
+ </button>
237
+ </div>
238
+ <div className="lyric-header" style={{ flexShrink: 0 }}>
239
+ <div>LYRIC</div>
240
+ <div>PITCH</div>
241
+ <div>START</div>
242
+ <div>END</div>
243
+ </div>
244
+ <div className="lyric-list" ref={listRef}>
245
+ {sorted.map((note) => (
246
+ <div
247
+ key={note.id}
248
+ className={`lyric-row ${selectedId === note.id ? 'lyric-row-active' : ''}`}
249
+ data-note-id={note.id}
250
+ onClick={() => handleRowClick(note.id)}
251
+ >
252
+ <input
253
+ ref={(el) => {
254
+ if (el) {
255
+ inputRefs.current.set(note.id, el)
256
+ } else {
257
+ inputRefs.current.delete(note.id)
258
+ }
259
+ }}
260
+ className="lyric-input"
261
+ value={note.lyric}
262
+ placeholder={t.lyricPlaceholder}
263
+ onChange={(event) => onUpdate(note.id, { lyric: event.target.value })}
264
+ onClick={(e) => e.stopPropagation()}
265
+ />
266
+ <EditableCell
267
+ value={note.midi}
268
+ noteId={note.id}
269
+ field="midi"
270
+ tempo={tempo}
271
+ onConfirm={handleFieldConfirm}
272
+ confirmTitle={t.confirmEdit}
273
+ min={0}
274
+ />
275
+ <EditableCell
276
+ value={note.start}
277
+ noteId={note.id}
278
+ field="start"
279
+ tempo={tempo}
280
+ onConfirm={handleFieldConfirm}
281
+ confirmTitle={t.confirmEdit}
282
+ min={0}
283
+ step={0.01}
284
+ />
285
+ <EditableCell
286
+ value={note.start + note.duration}
287
+ noteId={note.id}
288
+ field="end"
289
+ tempo={tempo}
290
+ onConfirm={handleFieldConfirm}
291
+ confirmTitle={t.confirmEdit}
292
+ min={0}
293
+ step={0.01}
294
+ />
295
+ </div>
296
+ ))}
297
+ {sorted.length === 0 && <div className="lyric-empty">{t.emptyHint}</div>}
298
+ </div>
299
+ </div>
300
+ )
301
+ }
preprocess/tools/midi_editor/src/components/PianoRoll.tsx ADDED
@@ -0,0 +1,704 @@
+ import { useEffect, useMemo, useRef, useState, useCallback, memo } from 'react'
+ import type React from 'react'
+ import type { NoteEvent, TimeSignature } from '../types'
+ import { PITCH_WIDTH, LOW_NOTE, HIGH_NOTE } from '../constants'
+
+ const midiToName = (midi: number) => {
+ const names = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
+ const octave = Math.floor(midi / 12) - 1
+ return `${names[midi % 12]}${octave}`
+ }
+
+ // Memoized note component to prevent unnecessary re-renders
+ const NoteChip = memo(function NoteChip({
+ note,
+ left,
+ top,
+ width,
+ height,
+ fontSize,
+ isSelected,
+ isOverlapping,
+ onPointerDown,
+ onDoubleClick,
+ }: {
+ note: NoteEvent
+ left: number
+ top: number
+ width: number
+ height: number
+ fontSize: number
+ isSelected: boolean
+ isOverlapping: boolean
+ onPointerDown: (event: React.PointerEvent<HTMLDivElement>, mode: 'move' | 'resize-start' | 'resize-end') => void
+ onDoubleClick: (event: React.MouseEvent<HTMLDivElement>) => void
+ }) {
+ return (
+ <div
+ className={`note-chip ${isSelected ? 'note-active' : ''} ${isOverlapping ? 'note-overlap' : ''}`}
+ style={{
+ left,
+ top: top + 1,
+ width,
+ height,
+ willChange: 'transform', // GPU acceleration hint
+ }}
+ onPointerDown={(e) => onPointerDown(e, 'move')}
+ onDoubleClick={onDoubleClick}
+ >
+ <div className="note-label" style={{ fontSize }}>
+ <span>{note.lyric || '\u00a0'}</span>
+ </div>
+ <div className="note-handle start" onPointerDown={(e) => { e.stopPropagation(); onPointerDown(e, 'resize-start') }} />
+ <div className="note-handle end" onPointerDown={(e) => { e.stopPropagation(); onPointerDown(e, 'resize-end') }} />
+ </div>
+ )
+ })
+
+ // Dynamic snap based on zoom level - higher zoom = finer snap
+ const getSnapSeconds = (gridSecondWidth: number) => {
+ // At base width (80px/s), snap is 0.1s
+ // At 2x zoom (160px/s), snap is 0.05s
+ // At 4x zoom (320px/s), snap is 0.025s
+ // At 8x zoom (640px/s), snap is 0.01s
+ const baseSnap = 0.1
+ const zoomFactor = gridSecondWidth / 80
+ return Math.max(0.01, baseSnap / zoomFactor)
+ }
+
+ const snapSeconds = (value: number, gridSecondWidth: number) => {
+ const snap = getSnapSeconds(gridSecondWidth)
+ return Math.max(0, Math.round(value / snap) * snap)
+ }
+
+ export type PianoRollProps = {
+ notes: NoteEvent[]
+ selectedId: string | null
+ timeSignature: TimeSignature
+ tempo: number
+ playhead: number // in beats
+ selectionStart: number | null // in seconds
+ selectionEnd: number | null // in seconds
+ onAddNote: (note: Partial<NoteEvent>) => NoteEvent
+ onUpdateNote: (id: string, patch: Partial<NoteEvent>) => void
+ onSelect: (id: string | null) => void
+ onSeek: (beat: number) => void
+ onScroll?: (left: number) => void
+ onZoom?: (deltaH: number, deltaV: number) => void
+ onPlayNote?: (midi: number) => void
+ onFocusLyric?: (noteId: string) => void
+ onSelectionChange?: (start: number | null, end: number | null) => void
+ isSelectingRange?: boolean
+ audioDuration?: number
+ gridSecondWidth: number
+ rowHeight: number
+ }
+
+ export function PianoRoll({
+ notes,
+ selectedId,
+ timeSignature: _timeSignature,
+ tempo,
+ playhead,
+ selectionStart,
+ selectionEnd,
+ onAddNote,
+ onSelect,
+ onUpdateNote,
+ onSeek,
+ onScroll,
+ onZoom,
+ onPlayNote,
+ onFocusLyric,
+ onSelectionChange,
+ isSelectingRange = false,
+ audioDuration = 0,
+ gridSecondWidth,
+ rowHeight
+ }: PianoRollProps) {
+ const scrollContainerRef = useRef<HTMLDivElement | null>(null)
+ const rulerScrollRef = useRef<HTMLDivElement | null>(null)
+ const [scrollTop, setScrollTop] = useState(0)
+ const [scrollLeft, setScrollLeft] = useState(0)
+ const [viewportWidth, setViewportWidth] = useState(800)
+ const [viewportHeight, setViewportHeight] = useState(400)
+ const dragRef = useRef<{
+ id: string
+ mode: 'move' | 'resize-start' | 'resize-end'
+ originX: number
+ originY: number
+ startSeconds: number
+ durationSeconds: number
+ midi: number
+ lastMidi?: number // Track last midi for pitch change sound
+ } | null>(null)
+
+ // Selection drag state
+ const selectionDragRef = useRef<{
+ startX: number
+ startSeconds: number
+ } | null>(null)
+
+ // Store callbacks in refs to avoid stale closures in event handlers
+ const onPlayNoteRef = useRef(onPlayNote)
+ const onUpdateNoteRef = useRef(onUpdateNote)
+
+ useEffect(() => {
+ onPlayNoteRef.current = onPlayNote
+ onUpdateNoteRef.current = onUpdateNote
+ }, [onPlayNote, onUpdateNote])
+
+ // Conversion helpers
+ const beatToSeconds = useCallback((beat: number) => beat * (60 / tempo), [tempo])
+ const secondsToBeat = useCallback((seconds: number) => seconds / (60 / tempo), [tempo])
+
+ // Calculate dimensions
+ const totalRows = HIGH_NOTE - LOW_NOTE + 1
+ const contentHeight = totalRows * rowHeight
+ const [containerWidth, setContainerWidth] = useState(1200)
+
+ // Track container size
+ useEffect(() => {
+ const container = scrollContainerRef.current
+ if (!container) return
+
+ const observer = new ResizeObserver((entries) => {
+ for (const entry of entries) {
+ setContainerWidth(entry.contentRect.width)
+ setViewportWidth(entry.contentRect.width)
+ setViewportHeight(entry.contentRect.height)
+ }
+ })
+ observer.observe(container)
+ return () => observer.disconnect()
+ }, [])
+
+ const maxSeconds = useMemo(() => {
+ const noteEndSeconds = notes.reduce((acc, n) => {
+ const endBeat = n.start + n.duration
+ return Math.max(acc, beatToSeconds(endBeat))
+ }, 8)
+ // Ensure grid extends at least 2x the visible area for smoother scrolling
+ const minSecondsForView = (containerWidth / gridSecondWidth) * 2
+ return Math.max(noteEndSeconds + 10, audioDuration + 10, minSecondsForView, 30)
+ }, [notes, audioDuration, beatToSeconds, containerWidth, gridSecondWidth])
+
+ const contentWidth = maxSeconds * gridSecondWidth
+
+ // Drag handlers - use refs to avoid stale closure issues
+ const handlePointerMove = useCallback((event: PointerEvent) => {
+ const drag = dragRef.current
+ if (!drag) return
+
+ const dxSeconds = (event.clientX - drag.originX) / gridSecondWidth
+ const dy = (event.clientY - drag.originY) / rowHeight
+
+ if (drag.mode === 'move') {
+ const nextSeconds = snapSeconds(drag.startSeconds + dxSeconds, gridSecondWidth)
+ const nextMidi = Math.min(HIGH_NOTE, Math.max(LOW_NOTE, Math.round(drag.midi - dy)))
+
+ // Play sound when pitch changes
+ if (nextMidi !== drag.lastMidi && onPlayNoteRef.current) {
+ onPlayNoteRef.current(nextMidi)
+ drag.lastMidi = nextMidi
+ }
+
+ onUpdateNoteRef.current(drag.id, {
+ start: secondsToBeat(nextSeconds),
+ midi: nextMidi
+ })
+ }
+
+ if (drag.mode === 'resize-start') {
+ const nextSeconds = snapSeconds(drag.startSeconds + dxSeconds, gridSecondWidth)
+ const delta = drag.startSeconds - nextSeconds
+ const nextDurationSeconds = Math.max(0.05, drag.durationSeconds + delta)
+ onUpdateNoteRef.current(drag.id, {
+ start: secondsToBeat(nextSeconds),
+ duration: secondsToBeat(nextDurationSeconds)
+ })
+ }
+
+ if (drag.mode === 'resize-end') {
+ const nextDurationSeconds = Math.max(0.05, snapSeconds(drag.durationSeconds + dxSeconds, gridSecondWidth))
+ onUpdateNoteRef.current(drag.id, { duration: secondsToBeat(nextDurationSeconds) })
+ }
+ }, [gridSecondWidth, rowHeight, secondsToBeat])
+
+ const handlePointerUp = useCallback(() => {
+ dragRef.current = null
+ window.removeEventListener('pointermove', handlePointerMove)
+ window.removeEventListener('pointerup', handlePointerUp)
+ }, [handlePointerMove])
+
+ useEffect(() => {
+ return () => {
+ window.removeEventListener('pointermove', handlePointerMove)
+ window.removeEventListener('pointerup', handlePointerUp)
+ }
+ }, [handlePointerMove, handlePointerUp])
+
+ // Scroll sync
+ useEffect(() => {
+ const container = scrollContainerRef.current
+ const ruler = rulerScrollRef.current
+ if (!container || !ruler) return
+
+ const handleScroll = () => {
+ ruler.scrollLeft = container.scrollLeft
+ setScrollTop(container.scrollTop)
+ setScrollLeft(container.scrollLeft)
+ if (onScroll) onScroll(container.scrollLeft)
+ }
+
+ container.addEventListener('scroll', handleScroll)
+ return () => container.removeEventListener('scroll', handleScroll)
+ }, [onScroll])
+
+ // Zoom support via wheel/trackpad
+ // Mac: Cmd+scroll (horizontal zoom), Cmd+Shift+scroll (vertical zoom), or two-finger pinch
+ // Windows/Linux: Ctrl+scroll (horizontal zoom), Ctrl+Shift+scroll (vertical zoom)
+ useEffect(() => {
+ const container = scrollContainerRef.current
+ if (!container || !onZoom) return
+
+ const handleWheel = (e: WheelEvent) => {
+ // Ctrl (Windows/Linux/pinch) or Cmd (Mac) triggers zoom
+ const isZoomTrigger = e.ctrlKey || e.metaKey
+
+ if (isZoomTrigger) {
+ e.preventDefault()
+ e.stopPropagation()
+
+ // Use deltaY for zoom amount, normalize for different input methods
+ // Pinch gestures typically have smaller delta values
+ let delta = -e.deltaY
+ if (Math.abs(delta) > 10) {
+ // Likely a mouse wheel, scale down
+ delta = delta * 0.01
+ } else {
+ // Likely a trackpad pinch, scale appropriately
+ delta = delta * 0.05
+ }
+
+ // Shift or Alt/Option for vertical zoom, otherwise horizontal
+ if (e.shiftKey || e.altKey) {
+ onZoom(0, delta)
+ } else {
+ onZoom(delta, 0)
+ }
+ }
+ }
+
+ container.addEventListener('wheel', handleWheel, { passive: false })
+ return () => container.removeEventListener('wheel', handleWheel)
+ }, [onZoom])
+
+ // Playhead auto-scroll
+ useEffect(() => {
+ if (!scrollContainerRef.current) return
+ const container = scrollContainerRef.current
+ const playheadX = beatToSeconds(playhead) * gridSecondWidth
+ const viewStart = container.scrollLeft
+ const viewEnd = viewStart + container.clientWidth
+
+ if (playheadX > viewEnd) {
+ container.scrollLeft = playheadX
+ } else if (playheadX < viewStart) {
+ container.scrollLeft = playheadX
+ }
+ }, [playhead, gridSecondWidth, beatToSeconds])
+
+ // Selection auto-scroll
+ useEffect(() => {
+ if (!scrollContainerRef.current || !selectedId) return
+ const note = notes.find((n) => n.id === selectedId)
+ if (!note) return
+ const container = scrollContainerRef.current
+ const noteX = beatToSeconds(note.start) * gridSecondWidth
+ const noteY = (HIGH_NOTE - note.midi) * rowHeight
+
+ const viewStart = container.scrollLeft
+ const viewEnd = viewStart + container.clientWidth
+ if (noteX < viewStart + 50 || noteX > viewEnd - 50) {
+ container.scrollLeft = Math.max(0, noteX - container.clientWidth * 0.35)
+ }
+
+ const viewTop = container.scrollTop
+ const viewBottom = viewTop + container.clientHeight
+ if (noteY < viewTop || noteY > viewBottom - rowHeight) {
+ container.scrollTop = Math.max(0, noteY - container.clientHeight * 0.4)
+ }
+ }, [selectedId, notes, gridSecondWidth, rowHeight, beatToSeconds])
+
+ const handleGridDoubleClick = (event: React.MouseEvent<HTMLDivElement>) => {
+ // Only add note if clicking on empty space (not on a note)
+ const target = event.target as HTMLElement
+ if (target.closest('.note-chip')) return
+
+ if (!scrollContainerRef.current) return
+ const container = scrollContainerRef.current
+ const rect = container.getBoundingClientRect()
+ const x = event.clientX - rect.left + container.scrollLeft
+ const y = event.clientY - rect.top + container.scrollTop
+
+ const seconds = snapSeconds(x / gridSecondWidth, gridSecondWidth)
+ const pitch = Math.min(HIGH_NOTE, Math.max(LOW_NOTE, HIGH_NOTE - Math.floor(y / rowHeight)))
+
+ const created = onAddNote({
+ start: secondsToBeat(seconds),
+ midi: pitch,
+ duration: secondsToBeat(0.5),
+ lyric: ''
+ })
+ onSelect(created.id)
+ }
+
+ const startDrag = (
+ event: React.PointerEvent<HTMLDivElement>,
+ note: NoteEvent,
+ mode: 'move' | 'resize-start' | 'resize-end',
+ ) => {
+ event.preventDefault()
+ event.stopPropagation()
+ dragRef.current = {
+ id: note.id,
+ mode,
+ originX: event.clientX,
+ originY: event.clientY,
+ startSeconds: beatToSeconds(note.start),
+ durationSeconds: beatToSeconds(note.duration),
+ midi: note.midi,
+ lastMidi: note.midi, // Initialize last midi
+ }
+ window.addEventListener('pointermove', handlePointerMove)
+ window.addEventListener('pointerup', handlePointerUp)
+ onSelect(note.id)
+
+ // Play sound when clicking/selecting note
+ if (onPlayNote) {
+ onPlayNote(note.midi)
+ }
+ }
+
+ // Second-based ruler labels
+ const secondLabels = useMemo(() => {
+ const labels = [] as Array<{ left: number; label: string }>
+ const totalSeconds = Math.ceil(maxSeconds)
+ for (let s = 0; s <= totalSeconds; s += 1) {
+ labels.push({ left: s * gridSecondWidth, label: `${s}s` })
+ }
+ return labels
+ }, [maxSeconds, gridSecondWidth])
+
+ // Piano keys
+ const pitchRows = useMemo(() => {
+ const rows = [] as Array<{ midi: number; isBlack: boolean; label: string; isC: boolean }>
+ const black = new Set([1, 3, 6, 8, 10])
+ for (let p = HIGH_NOTE; p >= LOW_NOTE; p -= 1) {
+ const name = midiToName(p)
+ const isC = p % 12 === 0
+ rows.push({ midi: p, isBlack: black.has(p % 12), label: name, isC })
+ }
+ return rows
+ }, [])
+
+ // Detect overlapping notes using optimized sweep line algorithm
+ const overlappingNoteIds = useMemo(() => {
+ if (notes.length < 2) return new Set<string>()
+
+ const overlapping = new Set<string>()
+ const sortedNotes = [...notes].sort((a, b) => a.start - b.start)
+ const EPSILON = 0.05 // Tolerance for floating point comparison
+
+ // Use a sliding window approach - more efficient for typical music data
+ // Active notes: notes that haven't ended yet
+ const activeNotes: typeof sortedNotes = []
+
+ for (const note of sortedNotes) {
+ // Remove notes that have ended before current note starts
+ while (activeNotes.length > 0) {
+ const firstActive = activeNotes[0]
+ const firstActiveEnd = firstActive.start + firstActive.duration
+ if (firstActiveEnd <= note.start + EPSILON) {
+ activeNotes.shift()
+ } else {
+ break
+ }
+ }
+
+ // Check overlap with remaining active notes
+ for (const activeNote of activeNotes) {
+ const activeEnd = activeNote.start + activeNote.duration
+ if (note.start < activeEnd - EPSILON) {
+ overlapping.add(activeNote.id)
+ overlapping.add(note.id)
+ }
+ }
+
+ // Add current note to active set (maintain sorted order by end time)
+ const noteEnd = note.start + note.duration
+ let insertIndex = activeNotes.length
+ for (let i = 0; i < activeNotes.length; i++) {
+ const aEnd = activeNotes[i].start + activeNotes[i].duration
+ if (noteEnd < aEnd) {
+ insertIndex = i
+ break
+ }
+ }
+ activeNotes.splice(insertIndex, 0, note)
+ }
+ return overlapping
+ }, [notes])
+
+ // Calculate visible area with buffer for smooth scrolling
+ const BUFFER_PX = 200 // Render notes slightly outside viewport for smooth scrolling
+ const visibleArea = useMemo(() => {
+ return {
+ left: Math.max(0, scrollLeft - BUFFER_PX),
+ right: scrollLeft + viewportWidth + BUFFER_PX,
+ top: Math.max(0, scrollTop - BUFFER_PX),
+ bottom: scrollTop + viewportHeight + BUFFER_PX,
+ }
+ }, [scrollLeft, scrollTop, viewportWidth, viewportHeight])
+
+ // Filter notes to only render visible ones (virtualization)
+ const visibleNotes = useMemo(() => {
+ return notes.filter(note => {
+ const noteSeconds = beatToSeconds(note.start)
+ const noteDurationSeconds = beatToSeconds(note.duration)
+ const noteLeft = noteSeconds * gridSecondWidth
+ const noteRight = noteLeft + noteDurationSeconds * gridSecondWidth
+ const noteTop = (HIGH_NOTE - note.midi) * rowHeight
+ const noteBottom = noteTop + rowHeight
+
+ // Check if note intersects with visible area
+ const horizontallyVisible = noteRight >= visibleArea.left && noteLeft <= visibleArea.right
+ const verticallyVisible = noteBottom >= visibleArea.top && noteTop <= visibleArea.bottom
+
+ return horizontallyVisible && verticallyVisible
+ })
+ }, [notes, visibleArea, gridSecondWidth, rowHeight, beatToSeconds])
+
+ // Calculate visible grid lines (virtualization)
+ const visibleGridLines = useMemo(() => {
+ const startSecond = Math.max(0, Math.floor(visibleArea.left / gridSecondWidth) - 1)
+ const endSecond = Math.ceil(visibleArea.right / gridSecondWidth) + 1
+ const startRow = Math.max(0, Math.floor(visibleArea.top / rowHeight) - 1)
+ const endRow = Math.min(totalRows, Math.ceil(visibleArea.bottom / rowHeight) + 1)
+
+ return {
+ horizontalLines: Array.from({ length: endRow - startRow + 1 }, (_, i) => startRow + i),
+ verticalLines: Array.from({ length: endSecond - startSecond + 1 }, (_, i) => startSecond + i),
+ }
+ }, [visibleArea, gridSecondWidth, rowHeight, totalRows])
+
+ const playheadSeconds = beatToSeconds(playhead)
+
+ // Selection drag handlers
+ const handleRulerPointerDown = (event: React.PointerEvent<HTMLDivElement>) => {
+ if (!isSelectingRange) {
+ // Normal click to seek
+ const rect = event.currentTarget.getBoundingClientRect()
+ const x = event.clientX - rect.left + (rulerScrollRef.current?.scrollLeft ?? 0)
+ const seconds = x / gridSecondWidth
+ onSeek(secondsToBeat(seconds))
+ return
+ }
+
+ // Start selection drag
+ const rect = event.currentTarget.getBoundingClientRect()
+ const x = event.clientX - rect.left + (rulerScrollRef.current?.scrollLeft ?? 0)
+ const seconds = Math.max(0, x / gridSecondWidth)
+
+ selectionDragRef.current = {
+ startX: event.clientX,
+ startSeconds: seconds,
+ }
+
+ onSelectionChange?.(seconds, seconds)
+
+ const handleSelectionMove = (e: PointerEvent) => {
+ if (!selectionDragRef.current) return
+ const currentX = e.clientX - rect.left + (rulerScrollRef.current?.scrollLeft ?? 0)
+ const currentSeconds = Math.max(0, currentX / gridSecondWidth)
+ const start = Math.min(selectionDragRef.current.startSeconds, currentSeconds)
+ const end = Math.max(selectionDragRef.current.startSeconds, currentSeconds)
+ onSelectionChange?.(start, end)
+ }
+
+ const handleSelectionUp = () => {
+ selectionDragRef.current = null
+ window.removeEventListener('pointermove', handleSelectionMove)
+ window.removeEventListener('pointerup', handleSelectionUp)
+ }
+
+ window.addEventListener('pointermove', handleSelectionMove)
+ window.addEventListener('pointerup', handleSelectionUp)
+ }
+
+ return (
+ <div className="piano-shell">
+ {/* Ruler */}
+ <div className="ruler-shell">
+ <div className="ruler-spacer" style={{ width: PITCH_WIDTH, flexShrink: 0 }} />
+ <div
+ ref={rulerScrollRef}
+ className={`ruler-scroll ${isSelectingRange ? 'selecting' : ''}`}
+ onPointerDown={handleRulerPointerDown}
+ >
+ <div className="ruler" style={{ width: contentWidth }}>
+ {secondLabels.map((mark) => (
+ <div key={mark.left} className="measure-mark" style={{ left: mark.left }}>
+ <span>{mark.label}</span>
+ </div>
+ ))}
+ {/* Selection range indicator */}
+ {selectionStart !== null && selectionEnd !== null && selectionEnd > selectionStart && (
+ <div
+ className="selection-range"
+ style={{
+ left: selectionStart * gridSecondWidth,
+ width: (selectionEnd - selectionStart) * gridSecondWidth
+ }}
+ />
+ )}
+ {/* Ruler playhead indicator */}
+ <div
+ className="ruler-playhead"
+ style={{ left: playheadSeconds * gridSecondWidth }}
+ />
+ </div>
+ </div>
+ </div>
+
+ {/* Main content area */}
+ <div className="roll-body">
+ {/* Piano keys - synced with vertical scroll */}
+ <div className="pitch-rail" style={{ width: PITCH_WIDTH }}>
+ <div
+ className="pitch-rail-inner"
+ style={{
+ transform: `translateY(${-scrollTop}px)`,
+ height: contentHeight
+ }}
+ >
+ {pitchRows.map((pitch) => (
+ <div
+ key={pitch.midi}
+ className={`pitch-cell ${pitch.isBlack ? 'pitch-black' : 'pitch-white'} ${pitch.isC ? 'pitch-c' : ''}`}
+ style={{ height: rowHeight, cursor: 'pointer' }}
+ onClick={() => onPlayNote?.(pitch.midi)}
+ onMouseDown={(e) => e.preventDefault()}
+ >
+ <span className="pitch-label">{pitch.label}</span>
+ </div>
+ ))}
+ </div>
+ </div>
+
+ {/* Scrollable grid area */}
+ <div
+ ref={scrollContainerRef}
+ className="roll-grid"
+ onDoubleClick={handleGridDoubleClick}
+ >
+ <div
+ className="grid-content"
+ style={{
+ width: contentWidth,
+ height: contentHeight,
+ position: 'relative'
+ }}
+ >
+ {/* SVG Grid - virtualized for performance */}
+ <svg
+ className="grid-svg"
+ width={contentWidth}
+ height={contentHeight}
+ style={{ position: 'absolute', top: 0, left: 0, pointerEvents: 'none' }}
+ >
+ {/* Horizontal lines (pitch rows) - only visible ones */}
+ {visibleGridLines.horizontalLines.map(i => (
+ <line
+ key={`h-${i}`}
+ x1={visibleArea.left}
+ y1={i * rowHeight}
+ x2={visibleArea.right}
+ y2={i * rowHeight}
+ stroke="var(--grid-line-minor)"
+ strokeWidth={1}
+ />
+ ))}
+ {/* Vertical lines (seconds) - only visible ones */}
+ {visibleGridLines.verticalLines.map(i => (
+ <line
+ key={`v-${i}`}
+ x1={i * gridSecondWidth}
+ y1={visibleArea.top}
+ x2={i * gridSecondWidth}
+ y2={visibleArea.bottom}
+ stroke="var(--grid-line-minor)"
+ strokeWidth={1}
+ />
+ ))}
+ </svg>
+
+ {/* Selection range in grid */}
+ {selectionStart !== null && selectionEnd !== null && selectionEnd > selectionStart && (
+ <div
+ className="grid-selection-range"
+ style={{
+ left: selectionStart * gridSecondWidth,
+ width: (selectionEnd - selectionStart) * gridSecondWidth,
+ height: contentHeight
+ }}
+ />
+ )}
+
+ {/* Playhead */}
+ <div
+ className="playhead"
+ style={{
+ left: playheadSeconds * gridSecondWidth,
+ height: contentHeight
+ }}
+ />
+
+ {/* Notes - virtualized: only render visible notes */}
+ {visibleNotes.map((note) => {
+ const noteSeconds = beatToSeconds(note.start)
+ const noteDurationSeconds = beatToSeconds(note.duration)
+ const left = noteSeconds * gridSecondWidth
+ const top = (HIGH_NOTE - note.midi) * rowHeight
+ const noteWidthPx = Math.max(noteDurationSeconds * gridSecondWidth, 4)
+ const noteHeight = rowHeight - 2
+ const isOverlapping = overlappingNoteIds.has(note.id)
+ // Dynamic font size based on row height (base: 12px at 20px row height)
+ const fontSize = Math.max(10, Math.min(24, rowHeight * 0.6))
+
+ return (
+ <NoteChip
+ key={note.id}
+ note={note}
+ left={left}
+ top={top}
+ width={noteWidthPx}
+ height={noteHeight}
+ fontSize={fontSize}
+ isSelected={selectedId === note.id}
+ isOverlapping={isOverlapping}
+ onPointerDown={(event, mode) => startDrag(event, note, mode)}
+ onDoubleClick={(event) => {
+ event.stopPropagation()
+ onFocusLyric?.(note.id)
+ }}
+ />
+ )
+ })}
+ </div>
+ </div>
+ </div>
+ </div>
+ )
+ }
preprocess/tools/midi_editor/src/constants.ts ADDED
@@ -0,0 +1,8 @@
+ // Base values used for scaling; actual runtime values are derived in components
+ export const BASE_GRID_SECOND_WIDTH = 80
+ export const BASE_ROW_HEIGHT = 20
+ export const PITCH_WIDTH = 60
+ // C-1 to C8 range (MIDI note numbers)
+ // LOW_NOTE = 0 to support SP markers (pitch=0) in some MIDI files
+ export const LOW_NOTE = 0 // C-1 (also supports pitch=0 for SP markers)
+ export const HIGH_NOTE = 108 // C8
preprocess/tools/midi_editor/src/i18n.ts ADDED
@@ -0,0 +1,196 @@
+ export type Lang = 'zh' | 'en'
+
+ const zh = {
+ // Header
+ eyebrow: '歌声 MIDI 编辑器',
+ title: 'SoulX-Singer MIDI Editor',
+ subtitle: '导入、拖拽、实时修改歌词并导出标准 MIDI。',
+ switchToLight: '切换到亮色',
+ switchToDark: '切换到暗色',
+ importJson: '导入 JSON',
+ exportJson: '导出 JSON',
+ importMidi: '导入 MIDI',
+ exportMidi: '导出 MIDI',
+ transpose: '移调',
+ transposeTooltip: '整体升降调:所有音符的音高同步改变',
+ transposed: (n: number) => `已移调 ${n > 0 ? '+' : ''}${n} 半音`,
+ fixOverlaps: '消除重叠',
+ fixOverlapsTooltip: '自动消除重叠:将重叠音符的音尾提前到下一个音的音头',
+ jsonImported: (name: string) => `已从 JSON 载入 ${name}`,
+ jsonImportFailed: 'JSON 导入失败,请确认文件格式正确',
+ jsonExported: '已导出 META JSON 文件',
+
+ // Audio bar
+ importAudio: '对齐音频导入',
+ audioHint: '导入后显示音频波形并与 MIDI 同步走带',
+ midiLabel: 'MIDI',
+ audioLabel: '音频',
+
+ // Controls
+ horizontalZoom: '水平缩放',
+ verticalZoom: '垂直缩放',
+ goToStart: '回到开头',
+ back2s: '后退 2 秒',
+ pause: '暂停',
+ playSelection: '播放选区',
+ play: '播放',
+ forward2s: '前进 2 秒',
+ goToEnd: '回到结尾',
+ selectingRange: '选区中',
+ setRange: '设选区',
+ exitSelectMode: '退出选区模式(并清除选区)',
+ setRangeTooltip: '设置选区:在时间轴上拖拽选择播放范围',
+
+ // Status
+ ready: '准备就绪',
+ selectionPlayback: '选区回放中...',
+ playing: '正在回放...',
+ selectionDone: '选区播放完毕',
+ paused: '已暂停',
+ imported: (name: string) => `已载入 ${name}`,
+ importFailed: '导入失败,请确认文件合法',
+ audioImported: (name: string) => `已载入音频 ${name}`,
+ unsupportedFormat: (exts: string) => `不支持的文件格式,请选择音频文件(${exts})`,
+ fixedOverlaps: (count: number) => `已修复 ${count} 个重叠音符`,
+ noOverlaps: '没有检测到重叠音符',
+ exported: '已导出包含歌词的 MIDI 文件',
+
+ // Lyric table
+ fillPlaceholderSelected: '从选中音符开始按词/字填充',
+ fillPlaceholderDefault: '输入歌词,点击按词/字填充',
+ fillButton: '按词\n填充',
+ lyricPlaceholder: '输入歌词',
+ emptyHint: '导入或双击钢琴卷帘以添加音符',
+ confirmEdit: '确认修改 (Enter)',
+ }
+
+ const en: typeof zh = {
+ // Header
+ eyebrow: 'Vocal MIDI Editor',
+ title: 'SoulX-Singer MIDI Editor',
+ subtitle: 'Import, drag, edit lyrics in real-time, and export standard MIDI.',
+ switchToLight: 'Switch to light',
+ switchToDark: 'Switch to dark',
+ importJson: 'Import JSON',
+ exportJson: 'Export JSON',
+ importMidi: 'Import MIDI',
+ exportMidi: 'Export MIDI',
+ transpose: 'Transpose',
+ transposeTooltip: 'Transpose all notes up or down by semitones',
+ transposed: (n: number) => `Transposed ${n > 0 ? '+' : ''}${n} semitone(s)`,
+ fixOverlaps: 'Fix Overlaps',
+ fixOverlapsTooltip: 'Auto fix overlaps: trim note end to the start of the next note',
+ jsonImported: (name: string) => `Loaded from JSON ${name}`,
+ jsonImportFailed: 'JSON import failed, please check the file format',
+ jsonExported: 'Exported META JSON file',
+
+ // Audio bar
+ importAudio: 'Import Audio',
+ audioHint: 'Display audio waveform synced with MIDI transport',
+ midiLabel: 'MIDI',
+ audioLabel: 'Audio',
+
+ // Controls
+ horizontalZoom: 'H-Zoom',
+ verticalZoom: 'V-Zoom',
+ goToStart: 'Go to start',
+ back2s: 'Back 2s',
+ pause: 'Pause',
+ playSelection: 'Play selection',
+ play: 'Play',
+ forward2s: 'Forward 2s',
+ goToEnd: 'Go to end',
+ selectingRange: 'Selecting',
+ setRange: 'Select',
+ exitSelectMode: 'Exit selection mode (and clear selection)',
+ setRangeTooltip: 'Set selection: drag on the timeline to select playback range',
+
+ // Status
+ ready: 'Ready',
+ selectionPlayback: 'Playing selection...',
+ playing: 'Playing...',
+ selectionDone: 'Selection playback done',
+ paused: 'Paused',
+ imported: (name: string) => `Loaded ${name}`,
+ importFailed: 'Import failed, please check the file',
+ audioImported: (name: string) => `Loaded audio ${name}`,
+ unsupportedFormat: (exts: string) => `Unsupported format, please select an audio file (${exts})`,
+ fixedOverlaps: (count: number) => `Fixed ${count} overlapping note(s)`,
+ noOverlaps: 'No overlapping notes detected',
+ exported: 'Exported MIDI file with lyrics',
+
+ // Lyric table
+ fillPlaceholderSelected: 'Fill words from selected note',
+ fillPlaceholderDefault: 'Enter lyrics, click fill button',
+ fillButton: 'Fill\nWords',
+ lyricPlaceholder: 'Type lyric',
+ emptyHint: 'Import or double-click piano roll to add notes',
+ confirmEdit: 'Confirm (Enter)',
+ }
+
+ const translations: Record<Lang, typeof zh> = { zh, en }
+
+ export type Translations = typeof zh
+
+ export function getTranslations(lang: Lang): Translations {
+ return translations[lang]
+ }
+
+ // Smart tokenizer for lyrics: CJK characters are individual tokens, Latin words are grouped
+ function isCJK(char: string): boolean {
+ const code = char.codePointAt(0) || 0
+ return (
+ (code >= 0x4E00 && code <= 0x9FFF) || // CJK Unified Ideographs
+ (code >= 0x3400 && code <= 0x4DBF) || // CJK Extension A
+ (code >= 0x20000 && code <= 0x2A6DF) || // CJK Extension B
+ (code >= 0x3040 && code <= 0x309F) || // Hiragana
+ (code >= 0x30A0 && code <= 0x30FF) || // Katakana
+ (code >= 0xAC00 && code <= 0xD7AF) // Hangul Syllables
+ )
+ }
+
+ /**
+ * Tokenize lyrics text for note filling.
+ * - CJK characters: each character becomes one token (one per note)
+ * - Latin/English words: each space-separated word becomes one token (one per note)
+ * - Mixed text is handled correctly
+ *
+ * Examples:
+ * "你好世界" -> ["你", "好", "世", "界"]
+ * "hello world" -> ["hello", "world"]
+ * "I love 你" -> ["I", "love", "你"]
+ * "something wrong" -> ["something", "wrong"]
+ */
+ export function tokenizeLyrics(text: string): string[] {
+ const tokens: string[] = []
+ const cleaned = text.trim()
+ if (!cleaned) return tokens
+
+ let i = 0
+ while (i < cleaned.length) {
+ const char = cleaned[i]
+
+ // Skip whitespace
+ if (/\s/.test(char)) {
+ i++
+ continue
+ }
+
+ // CJK character - each is a separate token
+ if (isCJK(char)) {
+ tokens.push(char)
+ i++
+ continue
+ }
+
+ // Latin/number/other - collect until whitespace or CJK
+ let word = ''
+ while (i < cleaned.length && !/\s/.test(cleaned[i]) && !isCJK(cleaned[i])) {
+ word += cleaned[i]
+ i++
+ }
+ if (word) tokens.push(word)
+ }
+
+ return tokens
+ }
preprocess/tools/midi_editor/src/index.css ADDED
@@ -0,0 +1,37 @@
+ @tailwind base;
+ @tailwind components;
+ @tailwind utilities;
+
+ :root {
+ font-family: 'Space Grotesk', 'IBM Plex Sans', system-ui, sans-serif;
+ color: var(--text-primary);
+ background: radial-gradient(circle at 20% 20%, rgba(72, 228, 194, 0.08), transparent 35%),
+ radial-gradient(circle at 80% 0%, rgba(75, 100, 188, 0.24), transparent 40%),
+ #0f1528;
+ text-rendering: optimizeLegibility;
+ -webkit-font-smoothing: antialiased;
+ }
+
+ :root[data-theme='light'] {
+ background: radial-gradient(circle at 20% 20%, rgba(63, 140, 255, 0.08), transparent 35%),
+ radial-gradient(circle at 80% 0%, rgba(75, 100, 188, 0.14), transparent 40%),
+ #f5f7fb;
+ }
+
+ * {
+ box-sizing: border-box;
+ }
+
+ body {
+ margin: 0;
+ min-height: 100vh;
+ background: transparent;
+ }
+
+ #root {
+ min-height: 100vh;
+ }
+
+ a {
+ color: inherit;
+ }
preprocess/tools/midi_editor/src/lib/midi.ts ADDED
@@ -0,0 +1,224 @@
+ import { Midi } from '@tonejs/midi'
+ import { writeMidi } from 'midi-file'
+ import type { MidiData, MidiEvent } from 'midi-file'
+ import type { NoteEvent, ProjectSnapshot, TimeSignature } from '../types'
+
+ const DEFAULT_SIGNATURE: TimeSignature = [4, 4]
+
+ // Decode UTF-8 byte string (latin1 encoded) to proper Unicode string
+ // This matches: text.encode("latin1").decode("utf-8") in Python
+ function decodeUtf8ByteString(byteString: string): string {
+ try {
+ const bytes = new Uint8Array(byteString.length)
+ for (let i = 0; i < byteString.length; i++) {
+ bytes[i] = byteString.charCodeAt(i)
+ }
+ return new TextDecoder('utf-8').decode(bytes)
+ } catch {
+ return byteString
+ }
+ }
+
+ // Encode Unicode string to UTF-8 byte string (latin1 encoding)
+ // This matches: text.encode("utf-8").decode("latin1") in Python
+ function encodeUtf8ByteString(text: string): string {
+ const bytes = new TextEncoder().encode(text)
+ let output = ''
+ bytes.forEach((b) => {
+ output += String.fromCharCode(b)
+ })
+ return output
+ }
+
+ export async function importMidiFile(file: File): Promise<ProjectSnapshot> {
+ const buffer = await file.arrayBuffer()
+ return parseMidiBuffer(buffer)
+ }
+
+ export async function parseMidiBuffer(buffer: ArrayBuffer): Promise<ProjectSnapshot> {
+ const midi = new Midi(buffer)
+ const tempo = midi.header.tempos[0]?.bpm ?? 120
+ const timeSignature = (midi.header.timeSignatures[0]?.timeSignature as TimeSignature | undefined) ?? DEFAULT_SIGNATURE
+
+ // Merge notes from all tracks and sort by ticks then by midi (for stable ordering)
+ const allNotes = midi.tracks
+ .flatMap(t => t.notes)
+ .sort((a, b) => a.ticks - b.ticks || a.midi - b.midi)
+
+ // Get lyrics from header.meta and sort by ticks
+ const lyricEvents = midi.header.meta
+ .filter((event) => event.type === 'lyrics')
+ .sort((a, b) => a.ticks - b.ticks)
+
+ // Match lyrics to notes by tick position
+ // Each lyric should be consumed by exactly one note at the same tick
+ const lyricsByTick = new Map<number, string[]>()
+ for (const event of lyricEvents) {
+ const existing = lyricsByTick.get(event.ticks) || []
+ existing.push(decodeUtf8ByteString(event.text))
+ lyricsByTick.set(event.ticks, existing)
+ }
+
+ // Track which lyrics have been used at each tick position
+ const usedLyricIndices = new Map<number, number>()
+
+ const notes: NoteEvent[] = allNotes.map((note, index) => {
+ const beat = note.ticks / midi.header.ppq
+ const durationBeats = note.durationTicks / midi.header.ppq
+
+ let lyric = ''
+
+ // First try exact tick match
+ const lyricsAtTick = lyricsByTick.get(note.ticks)
+ if (lyricsAtTick && lyricsAtTick.length > 0) {
+ const usedIndex = usedLyricIndices.get(note.ticks) || 0
+ if (usedIndex < lyricsAtTick.length) {
+ lyric = lyricsAtTick[usedIndex]
+ usedLyricIndices.set(note.ticks, usedIndex + 1)
+ }
+ }
+
+ // If no exact match, try nearby ticks (within small tolerance)
+ if (!lyric) {
+ const tolerance = midi.header.ppq / 100 // Very small tolerance
+ for (const [tick, lyrics] of lyricsByTick.entries()) {
+ if (Math.abs(tick - note.ticks) <= tolerance) {
+ const usedIndex = usedLyricIndices.get(tick) || 0
+ if (usedIndex < lyrics.length) {
+ lyric = lyrics[usedIndex]
+ usedLyricIndices.set(tick, usedIndex + 1)
+ break
+ }
+ }
+ }
+ }
+
+ return {
+ id: `${index}-${note.midi}-${Math.round(note.ticks)}`,
+ midi: note.midi,
+ start: beat,
+ duration: Math.max(durationBeats, 0.0625),
+ velocity: note.velocity,
+ lyric,
+ }
+ })
+
+ return { tempo, timeSignature, notes, ppq: midi.header.ppq }
+ }
+
+ // Used to add absoluteTime property for sorting
+ type WithAbsoluteTime<T> = T & { absoluteTime: number }
+
+ export function exportMidi(snapshot: ProjectSnapshot): Blob {
+ const ppq = snapshot.ppq ?? 480 // Use original ppq if available, otherwise default to 480
+ const microsecondsPerBeat = Math.round(60000000 / snapshot.tempo) // Convert BPM to microseconds per beat
+
+ // Sort notes by start time, then by midi for stable ordering
+ const sortedNotes = [...snapshot.notes].sort((a, b) => a.start - b.start || a.midi - b.midi)
+
+ // Build events for a single track containing both lyrics and notes
+ // Event order at same tick: note_off (0) < lyrics (1) < note_on (2)
+ // This matches meta.py's tg2midi implementation
+ const events: Array<WithAbsoluteTime<MidiEvent>> = []
+
+ // Add all note events and their corresponding lyrics
+ sortedNotes.forEach((note) => {
+ const startTicks = Math.round(note.start * ppq)
+ const endTicks = Math.round((note.start + note.duration) * ppq)
+ const velocity = Math.round(note.velocity * 127)
+
+ // Add lyric event at the same tick as note_on (but will be sorted before it)
+ const lyricText = note.lyric ?? ''
+ const encodedLyric = encodeUtf8ByteString(lyricText)
+
+ // Lyric event - sort key 1 (after note_off, before note_on)
+ events.push({
+ absoluteTime: startTicks,
+ deltaTime: 0,
+ meta: true,
+ type: 'lyrics',
+ text: encodedLyric,
+ _sortKey: 1,
+ } as WithAbsoluteTime<MidiEvent> & { _sortKey: number })
+
+ // Note on event - sort key 2 (after lyrics)
+ events.push({
+ absoluteTime: startTicks,
+ deltaTime: 0,
+ type: 'noteOn',
+ channel: 0,
+ noteNumber: note.midi,
+ velocity: velocity,
+ _sortKey: 2,
+ } as WithAbsoluteTime<MidiEvent> & { _sortKey: number })
+
+ // Note off event - sort key 0 (before everything at same tick)
+ events.push({
+ absoluteTime: endTicks,
+ deltaTime: 0,
+ type: 'noteOff',
+ channel: 0,
+ noteNumber: note.midi,
+ velocity: 0,
+ _sortKey: 0,
+ } as WithAbsoluteTime<MidiEvent> & { _sortKey: number })
+ })
+
+ // Sort events by absoluteTime, then by _sortKey
+ events.sort((a, b) => {
+ const aKey = (a as { _sortKey?: number })._sortKey ?? 1
+ const bKey = (b as { _sortKey?: number })._sortKey ?? 1
+ return a.absoluteTime - b.absoluteTime || aKey - bKey
+ })
+
+ // Convert absolute time to delta time
+ let lastTick = 0
+ events.forEach(event => {
+ event.deltaTime = event.absoluteTime - lastTick
+ lastTick = event.absoluteTime
+ delete (event as { absoluteTime?: number }).absoluteTime
+ delete (event as { _sortKey?: number })._sortKey
+ })
+
+ // Build the MIDI track with header events
+ const track: MidiEvent[] = [
+ // Set tempo
+ {
+ deltaTime: 0,
+ meta: true,
+ type: 'setTempo',
+ microsecondsPerBeat: microsecondsPerBeat,
+ },
+ // Time signature
+ {
+ deltaTime: 0,
+ meta: true,
+ type: 'timeSignature',
+ numerator: snapshot.timeSignature[0],
+ denominator: snapshot.timeSignature[1],
+ metronome: 24,
+ thirtyseconds: 8,
+ },
+ // All note and lyric events
+ ...events,
+ // End of track
+ {
+ deltaTime: 0,
+ meta: true,
+ type: 'endOfTrack',
+ },
+ ]
+
+ // Build MIDI data structure
+ const midiData: MidiData = {
+ header: {
+ format: 0, // Single track format (type 0)
+ numTracks: 1,
+ ticksPerBeat: ppq,
+ },
+ tracks: [track],
+ }
+
+ const bytes = writeMidi(midiData)
+ return new Blob([new Uint8Array(bytes)], { type: 'audio/midi' })
+ }
preprocess/tools/midi_editor/src/main.tsx ADDED
@@ -0,0 +1,10 @@
+ import { StrictMode } from 'react'
+ import { createRoot } from 'react-dom/client'
+ import './index.css'
+ import App from './App.tsx'
+
+ createRoot(document.getElementById('root')!).render(
+ <StrictMode>
+ <App />
+ </StrictMode>,
+ )
preprocess/tools/midi_editor/src/store/useMidiStore.ts ADDED
@@ -0,0 +1,78 @@
+ import { nanoid } from 'nanoid'
+ import { create } from 'zustand'
+ import type { NoteEvent, TimeSignature } from '../types'
+
+ const clamp = (value: number, min: number, max: number) =>
+ Math.min(Math.max(value, min), max)
+
+ export type MidiStore = {
+ tempo: number
+ timeSignature: TimeSignature
+ notes: NoteEvent[]
+ selectedId: string | null
+ playhead: number
+ ppq: number | undefined // Ticks per quarter note (for preserving original MIDI timing)
+ addNote: (partial?: Partial<NoteEvent>) => NoteEvent
+ updateNote: (id: string, partial: Partial<NoteEvent>) => void
+ removeNote: (id: string) => void
+ setNotes: (notes: NoteEvent[]) => void
+ setTempo: (tempo: number) => void
+ setTimeSignature: (sig: TimeSignature) => void
+ setPpq: (ppq: number | undefined) => void
+ select: (id: string | null) => void
+ setLyric: (id: string, lyric: string) => void
+ setPlayhead: (beat: number) => void
+ clear: () => void
+ }
+
+ const defaultNotes: NoteEvent[] = [
+ { id: nanoid(), midi: 64, start: 0, duration: 1.5, velocity: 0.9, lyric: 'la' },
+ { id: nanoid(), midi: 67, start: 1.5, duration: 1.5, velocity: 0.85, lyric: 'na' },
+ { id: nanoid(), midi: 69, start: 3, duration: 2, velocity: 0.8, lyric: 'ah' },
+ ]
+
+ export const useMidiStore = create<MidiStore>((set) => ({
+ tempo: 110,
+ timeSignature: [4, 4],
+ notes: defaultNotes,
+ selectedId: null,
+ playhead: 0,
+ ppq: undefined,
+ addNote: (partial = {}) => {
+ const note: NoteEvent = {
+ id: nanoid(),
+ midi: partial.midi ?? 64,
+ start: partial.start ?? 0,
+ duration: partial.duration ?? 1,
+ velocity: clamp(partial.velocity ?? 0.85, 0, 1),
+ lyric: partial.lyric ?? '',
+ }
+ set((state) => ({ notes: [...state.notes, note] }))
+ return note
+ },
+ updateNote: (id, partial) => {
+ set((state) => ({
+ notes: state.notes.map((note) =>
+ note.id === id
+ ? {
+ ...note,
+ ...partial,
+ duration: Math.max(partial.duration ?? note.duration, 0.0625),
+ }
+ : note,
+ ),
+ }))
+ },
+ removeNote: (id) => set((state) => ({ notes: state.notes.filter((n) => n.id !== id) })),
+ setNotes: (notes) => set(() => ({ notes })),
+ setTempo: (tempo) => set(() => ({ tempo: clamp(tempo, 30, 240) })),
+ setTimeSignature: (sig) => set(() => ({ timeSignature: sig })),
+ setPpq: (ppq) => set(() => ({ ppq })),
+ select: (id) => set(() => ({ selectedId: id })),
+ setLyric: (id, lyric) =>
+ set((state) => ({
+ notes: state.notes.map((note) => (note.id === id ? { ...note, lyric } : note)),
+ })),
+ setPlayhead: (beat) => set(() => ({ playhead: Math.max(beat, 0) })),
+ clear: () => set(() => ({ notes: [], selectedId: null })),
+ }))
preprocess/tools/midi_editor/src/types.ts ADDED
@@ -0,0 +1,17 @@
+ export type NoteEvent = {
+ id: string
+ midi: number
+ start: number // in beats
+ duration: number // in beats
+ velocity: number
+ lyric: string
+ }
+
+ export type TimeSignature = [number, number]
+
+ export type ProjectSnapshot = {
+ tempo: number
+ timeSignature: TimeSignature
+ notes: NoteEvent[]
+ ppq?: number // Ticks per quarter note (for preserving original MIDI timing)
+ }
preprocess/tools/midi_editor/tailwind.config.js ADDED
@@ -0,0 +1,33 @@
+ /** @type {import('tailwindcss').Config} */
+ export default {
+ content: ['./index.html', './src/**/*.{ts,tsx,js,jsx}'],
+ theme: {
+ extend: {
+ fontFamily: {
+ display: ['"Space Grotesk"', '"IBM Plex Sans"', 'system-ui', 'sans-serif'],
+ mono: ['"JetBrains Mono"', 'ui-monospace', 'SFMono-Regular', 'monospace'],
+ },
+ colors: {
+ ink: {
+ 50: '#f4f7fb',
+ 100: '#dfe7f5',
+ 200: '#beceec',
+ 300: '#95addf',
+ 400: '#6a87ce',
+ 500: '#4b64bc',
+ 600: '#3b4ea7',
+ 700: '#32418a',
+ 800: '#2c376f',
+ 900: '#262f5c',
+ },
+ ember: '#ff7043',
+ mint: '#48e4c2',
+ },
+ boxShadow: {
+ panel: '0 14px 35px rgba(0, 0, 0, 0.25)',
+ },
+ },
+ },
+ plugins: [],
+ }
+
preprocess/tools/midi_editor/tsconfig.app.json ADDED
@@ -0,0 +1,28 @@
+ {
+ "compilerOptions": {
+ "tsBuildInfoFile": "./node_modules/.tmp/tsconfig.app.tsbuildinfo",
+ "target": "ES2022",
+ "useDefineForClassFields": true,
+ "lib": ["ES2022", "DOM", "DOM.Iterable"],
+ "module": "ESNext",
+ "types": ["vite/client"],
+ "skipLibCheck": true,
+
+ /* Bundler mode */
+ "moduleResolution": "bundler",
+ "allowImportingTsExtensions": true,
+ "verbatimModuleSyntax": true,
+ "moduleDetection": "force",
+ "noEmit": true,
+ "jsx": "react-jsx",
+
+ /* Linting */
+ "strict": true,
+ "noUnusedLocals": true,
+ "noUnusedParameters": true,
+ "erasableSyntaxOnly": true,
+ "noFallthroughCasesInSwitch": true,
+ "noUncheckedSideEffectImports": true
+ },
+ "include": ["src"]
+ }
preprocess/tools/midi_editor/tsconfig.json ADDED
@@ -0,0 +1,7 @@
+ {
+ "files": [],
+ "references": [
+ { "path": "./tsconfig.app.json" },
+ { "path": "./tsconfig.node.json" }
+ ]
+ }
preprocess/tools/midi_editor/tsconfig.node.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "compilerOptions": {
+ "tsBuildInfoFile": "./node_modules/.tmp/tsconfig.node.tsbuildinfo",
+ "target": "ES2023",
+ "lib": ["ES2023"],
+ "module": "ESNext",
+ "types": ["node"],
+ "skipLibCheck": true,
+
+ /* Bundler mode */
+ "moduleResolution": "bundler",
+ "allowImportingTsExtensions": true,
+ "verbatimModuleSyntax": true,
+ "moduleDetection": "force",
+ "noEmit": true,
+
+ /* Linting */
+ "strict": true,
+ "noUnusedLocals": true,
+ "noUnusedParameters": true,
+ "erasableSyntaxOnly": true,
+ "noFallthroughCasesInSwitch": true,
+ "noUncheckedSideEffectImports": true
+ },
+ "include": ["vite.config.ts"]
+ }
preprocess/tools/midi_editor/vite.config.ts ADDED
@@ -0,0 +1,7 @@
1
+ import { defineConfig } from 'vite'
2
+ import react from '@vitejs/plugin-react'
3
+
4
+ // https://vite.dev/config/
5
+ export default defineConfig({
6
+ plugins: [react()],
7
+ })
preprocess/tools/midi_parser.py ADDED
@@ -0,0 +1,598 @@
1
+ """
2
+ SoulX-Singer MIDI <-> metadata converter.
3
+
4
+ Converts between SoulX-Singer-style metadata JSON (with note_text, note_dur,
5
+ note_pitch, note_type per segment) and standard MIDI files. Uses an internal
6
+ Note dataclass (start_s, note_dur, note_text, note_pitch, note_type) as the
7
+ intermediate representation.
8
+ """
9
+ import os
10
+ import json
11
+ import shutil
12
+ from dataclasses import dataclass
13
+ from typing import Any, List, Tuple, Union
14
+
15
+ import librosa
16
+ import mido
17
+ from soundfile import write
18
+
19
+ from .f0_extraction import F0Extractor
20
+ from .g2p import g2p_transform
21
+
22
+
23
+ # Audio, MIDI and segmentation constants
24
+ SAMPLE_RATE = 44100 # Audio sample rate for any wav cuts during midi2meta
25
+ MIDI_TICKS_PER_BEAT = 500 # The number of MIDI ticks per beat; affects the time resolution of MIDI output and conversion accuracy.
26
+ MIDI_TEMPO = 500000 # Microseconds per beat (120 BPM)
27
+ MIDI_TIME_SIGNATURE = (4, 4) # Default time signature; not critical for conversion but included in MIDI output.
28
+ MIDI_VELOCITY = 64 # Default velocity for note_on events; not critical for conversion but required for MIDI format.
29
+ END_EXTENSION_SEC = 0.4 # Extend each segment end by this much silence (sec) to give the model more context
30
+ MAX_GAP_SEC = 2.0 # Gap threshold to split segments in midi2meta (sec)
31
+ MAX_SEGMENT_DUR_SUM_SEC = 60.0 # Max total duration sum of notes in a single metadata segment before splitting into multiple segments (sec)
32
+ SILENCE_THRESHOLD_SEC = 0.2 # Threshold to insert explicit <SP> note for long silences between notes in midi2notes (sec)
33
+
34
+
35
+ @dataclass
36
+ class Note:
37
+ """Single note: text, duration (seconds), pitch (MIDI), type. start_s is absolute start time in seconds (for ordering / MIDI)."""
38
+ start_s: float
39
+ note_dur: float
40
+ note_text: str
41
+ note_pitch: int
42
+ note_type: int
43
+
44
+ @property
45
+ def end_s(self) -> float:
46
+ return self.start_s + self.note_dur
47
+
48
+
49
+ def _seconds_to_ticks(seconds: float, ticks_per_beat: int, tempo: int) -> int:
50
+ """Convert seconds to MIDI ticks based on tempo and ticks per beat."""
51
+ return int(round(seconds * ticks_per_beat * 1_000_000 / tempo))
52
+
53
+
54
+ def _append_segment_to_meta(
55
+ meta_data: List[dict],
56
+ meta_path_str: str,
57
+ cut_wavs_output_dir: str | None,
58
+ vocal_file: str | None,
59
+ language: str,
60
+ audio_data: Any | None,
61
+ pitch_extractor: F0Extractor | None,
62
+ note_start: List[float],
63
+ note_end: List[float],
64
+ note_text: List[Any],
65
+ note_pitch: List[Any],
66
+ note_type: List[Any],
67
+ note_dur: List[float],
68
+ ) -> None:
69
+ """Helper function for midi2meta to append the current segment (accumulated in note_*) to meta_data list, with optional wav cut and pitch extraction."""
70
+ if not all((note_start, note_end, note_text, note_pitch, note_type, note_dur)):
71
+ return
72
+
73
+ base_name = os.path.splitext(os.path.basename(meta_path_str))[0]
74
+ item_name = f"{base_name}_{len(meta_data)}"
75
+ wav_fn = None
76
+ if cut_wavs_output_dir and vocal_file and audio_data is not None:
77
+ wav_fn = os.path.join(cut_wavs_output_dir, f"{item_name}.wav")
78
+ end_pad = int(END_EXTENSION_SEC * SAMPLE_RATE)
79
+ start_sample = max(0, int(note_start[0] * SAMPLE_RATE))
80
+ end_sample = min(len(audio_data), int(note_end[-1] * SAMPLE_RATE) + end_pad)
81
+
82
+ end_pad_dur = (end_sample / SAMPLE_RATE - note_end[-1]) if end_sample > int(note_end[-1] * SAMPLE_RATE) else 0.0
83
+ if end_pad_dur > 0:
84
+ note_dur = note_dur + [end_pad_dur]
85
+ note_text = note_text + ["<SP>"]
86
+ note_pitch = note_pitch + [0]
87
+ note_type = note_type + [1]
88
+ start_ms = int(start_sample / SAMPLE_RATE * 1000)
89
+ end_ms = int(end_sample / SAMPLE_RATE * 1000)
90
+ write(wav_fn, audio_data[start_sample:end_sample], SAMPLE_RATE)
91
+ else:
92
+ start_ms = int(note_start[0] * 1000)
93
+ end_ms = int(note_end[-1] * 1000)
94
+
95
+ if pitch_extractor is not None:
96
+ if not wav_fn or not os.path.isfile(wav_fn):
97
+ raise FileNotFoundError(f"Segment wav file not found: {wav_fn}")
98
+ f0 = pitch_extractor.process(wav_fn)
99
+ else:
100
+ f0 = []
101
+
102
+ note_text_list = list(note_text)
103
+ note_pitch_list = list(note_pitch)
104
+ note_type_list = list(note_type)
105
+ note_dur_list = list(note_dur)
106
+
107
+ meta_data.append(
108
+ {
109
+ "index": item_name,
110
+ "language": language,
111
+ "time": [start_ms, end_ms],
112
+ "duration": " ".join(str(round(x, 2)) for x in note_dur_list),
113
+ "text": " ".join(note_text_list),
114
+ "phoneme": " ".join(g2p_transform(note_text_list, language)),
115
+ "note_pitch": " ".join(str(x) for x in note_pitch_list),
116
+ "note_type": " ".join(str(x) for x in note_type_list),
117
+ "f0": " ".join(str(round(float(x), 1)) for x in f0),
118
+ }
119
+ )
120
+
121
+
122
+ def meta2notes(meta_path: str) -> List[Note]:
123
+ """Parse SoulX-Singer metadata JSON into a flat list of Note (absolute start_s)."""
124
+ with open(meta_path, "r", encoding="utf-8") as f:
125
+ segments = json.load(f)
126
+ if not isinstance(segments, list):
127
+ raise ValueError(f"Metadata must be a list of segments, got {type(segments).__name__}")
128
+ if not segments:
129
+ raise ValueError("Metadata has no segments.")
130
+
131
+ notes: List[Note] = []
132
+ for seg in segments:
133
+ offset_s = seg["time"][0] / 1000
134
+ words = [str(x).replace("<AP>", "<SP>") for x in seg["text"].split()]
135
+ word_durs = [float(x) for x in seg["duration"].split()]
136
+ pitches = [int(x) for x in seg["note_pitch"].split()]
137
+ types = [int(x) if words[i] != "<SP>" else 1 for i, x in enumerate(seg["note_type"].split())]
138
+ if len(words) != len(word_durs) or len(word_durs) != len(pitches) or len(pitches) != len(types):
139
+ raise ValueError(
140
+ f"Length mismatch in segment {seg.get('item_name', '?')}: "
141
+ "note_text, note_dur, note_pitch, note_type must have same length"
142
+ )
143
+ current_s = offset_s
144
+ for text, dur, pitch, type_ in zip(words, word_durs, pitches, types):
145
+ notes.append(
146
+ Note(
147
+ start_s=current_s,
148
+ note_dur=float(dur),
149
+ note_text=str(text),
150
+ note_pitch=int(pitch),
151
+ note_type=int(type_),
152
+ )
153
+ )
154
+ current_s += float(dur)
155
+ return notes
156
+
157
+
158
+ def notes2meta(
159
+ notes: List[Note],
160
+ meta_path: str,
161
+ vocal_file: str | None,
162
+ language: str,
163
+ pitch_extractor: F0Extractor | None,
164
+ ) -> None:
165
+ """Write SoulX-Singer metadata JSON from a list of Note (segmenting + wav cuts)."""
166
+ meta_path_str = str(meta_path)
167
+
168
+ cut_wavs_output_dir = None
169
+ if vocal_file:
170
+ cut_wavs_output_dir = os.path.join(os.path.dirname(vocal_file), "cut_wavs_tmp")
171
+ os.makedirs(cut_wavs_output_dir, exist_ok=True)
172
+
173
+ note_text: List[Any] = []
174
+ note_pitch: List[Any] = []
175
+ note_type: List[Any] = []
176
+ note_dur: List[float] = []
177
+ note_start: List[float] = []
178
+ note_end: List[float] = []
179
+ meta_data: List[dict] = []
180
+ audio_data = None
181
+ if vocal_file:
182
+ audio_data, _ = librosa.load(vocal_file, sr=SAMPLE_RATE, mono=True)
183
+ dur_sum = 0.0
184
+
185
+ def flush_current_segment() -> None:
186
+ nonlocal dur_sum
187
+ _append_segment_to_meta(
188
+ meta_data,
189
+ meta_path_str,
190
+ cut_wavs_output_dir,
191
+ vocal_file,
192
+ language,
193
+ audio_data,
194
+ pitch_extractor,
195
+ note_start,
196
+ note_end,
197
+ note_text,
198
+ note_pitch,
199
+ note_type,
200
+ note_dur,
201
+ )
202
+ note_text.clear()
203
+ note_pitch.clear()
204
+ note_type.clear()
205
+ note_dur.clear()
206
+ note_start.clear()
207
+ note_end.clear()
208
+ dur_sum = 0.0
209
+
210
+ def append_note(start: float, end: float, text: str, pitch: int, type_: int) -> None:
211
+ nonlocal dur_sum
212
+ duration = end - start
213
+ if duration <= 0:
214
+ return
215
+
216
+ if len(note_text) > 0 and text == "<SP>" and note_text[-1] == "<SP>":
217
+ note_dur[-1] += duration
218
+ note_end[-1] = end
219
+ else:
220
+ note_text.append(text)
221
+ note_pitch.append(pitch)
222
+ note_type.append(type_)
223
+ note_dur.append(duration)
224
+ note_start.append(start)
225
+ note_end.append(end)
226
+ dur_sum += duration
227
+
228
+ for note in notes:
229
+ start = float(note.start_s)
230
+ end = float(note.end_s)
231
+ text = note.note_text
232
+ pitch = note.note_pitch
233
+ type_ = note.note_type
234
+
235
+ if text == "" or pitch == "" or type_ == "":
236
+ append_note(start, end, "<SP>", 0, 1)
237
+ continue
238
+
239
+ # cut the segment when it ends with a long <SP> note
240
+ if (
241
+ len(note_text) > 0
242
+ and note_text[-1] == "<SP>"
243
+ and note_dur[-1] > MAX_GAP_SEC
244
+ ):
245
+ note_text.pop()
246
+ note_pitch.pop()
247
+ note_type.pop()
248
+ note_dur.pop()
249
+ note_start.pop()
250
+ note_end.pop()
251
+
252
+ dur_sum = sum(note_dur)
253
+ flush_current_segment()
254
+
255
+ # cut the segment if adding the current note would exceed the max duration sum threshold
256
+ if dur_sum + (end - start) > MAX_SEGMENT_DUR_SUM_SEC and len(note_text) > 0:
257
+ flush_current_segment()
258
+
259
+ append_note(start, end, text, int(pitch), int(type_))
260
+
261
+ if note_text:
262
+ flush_current_segment()
263
+
264
+ with open(meta_path_str, "w", encoding="utf-8") as f:
265
+ json.dump(meta_data, f, ensure_ascii=False, indent=2)
266
+
267
+ if cut_wavs_output_dir:
268
+ try:
269
+ shutil.rmtree(cut_wavs_output_dir, ignore_errors=True)
270
+ except Exception:
271
+ pass
272
+
273
+
274
+ def notes2midi(
275
+ notes: List[Note],
276
+ midi_path: str,
277
+ ) -> None:
278
+ """Write MIDI file from a list of Note."""
279
+ if not notes:
280
+ raise ValueError("Empty note list.")
281
+
282
+ events: List[Tuple[int, int, Union[mido.Message, mido.MetaMessage]]] = []
283
+ for n in notes:
284
+ start_s = n.start_s
285
+ end_s = n.end_s
286
+ if end_s <= start_s:
287
+ continue
288
+
289
+ start_ticks = _seconds_to_ticks(
290
+ start_s, MIDI_TICKS_PER_BEAT, MIDI_TEMPO
291
+ )
292
+ end_ticks = _seconds_to_ticks(
293
+ end_s, MIDI_TICKS_PER_BEAT, MIDI_TEMPO
294
+ )
295
+ if end_ticks <= start_ticks:
296
+ end_ticks = start_ticks + 1
297
+
298
+ lyric = n.note_text
299
+ # Some DAWs store lyric text as latin1-compatible bytes; keep best-effort round-trip.
300
+ try:
301
+ lyric = lyric.encode("utf-8").decode("latin1")
302
+ except (UnicodeEncodeError, UnicodeDecodeError):
303
+ pass
304
+ if n.note_type == 3:
305
+ lyric = "-"
306
+
307
+ events.append(
308
+ (start_ticks, 1, mido.MetaMessage("lyrics", text=lyric, time=0))
309
+ )
310
+ events.append(
311
+ (
312
+ start_ticks,
313
+ 2,
314
+ mido.Message(
315
+ "note_on",
316
+ note=n.note_pitch,
317
+ velocity=MIDI_VELOCITY,
318
+ time=0,
319
+ ),
320
+ )
321
+ )
322
+ events.append(
323
+ (
324
+ end_ticks,
325
+ 0,
326
+ mido.Message("note_off", note=n.note_pitch, velocity=0, time=0),
327
+ )
328
+ )
329
+
330
+ events.sort(key=lambda x: (x[0], x[1]))
331
+
332
+ mid = mido.MidiFile(ticks_per_beat=MIDI_TICKS_PER_BEAT)
333
+ track = mido.MidiTrack()
334
+ mid.tracks.append(track)
335
+
336
+ track.append(mido.MetaMessage("set_tempo", tempo=MIDI_TEMPO, time=0))
337
+ track.append(
338
+ mido.MetaMessage(
339
+ "time_signature",
340
+ numerator=MIDI_TIME_SIGNATURE[0],
341
+ denominator=MIDI_TIME_SIGNATURE[1],
342
+ time=0,
343
+ )
344
+ )
345
+
346
+ last_tick = 0
347
+ for tick, _, msg in events:
348
+ msg.time = max(0, tick - last_tick)
349
+ track.append(msg)
350
+ last_tick = tick
351
+
352
+ track.append(mido.MetaMessage("end_of_track", time=0))
353
+ mid.save(midi_path)
354
+
355
+
356
+ def midi2notes(midi_path: str) -> List[Note]:
357
+ """Parse MIDI file into a list of Note."""
358
+ mid = mido.MidiFile(midi_path)
359
+ ticks_per_beat = mid.ticks_per_beat
360
+ tempo = 500000
361
+
362
+ raw_notes: List[dict] = []
363
+ lyrics: List[Tuple[int, str]] = []
364
+
365
+ for track in mid.tracks:
366
+ abs_ticks = 0
367
+ active = {}
368
+ for msg in track:
369
+ abs_ticks += msg.time
370
+ if msg.type == "set_tempo":
371
+ tempo = msg.tempo
372
+ elif msg.type == "lyrics":
373
+ text = msg.text
374
+ try:
375
+ text = text.encode("latin1").decode("utf-8")
376
+ except Exception:
377
+ pass
378
+ lyrics.append((abs_ticks, text))
379
+ elif msg.type == "note_on":
380
+ key = (msg.channel, msg.note)
381
+ if msg.velocity > 0:
382
+ active[key] = (abs_ticks, msg.velocity)
383
+ else:
384
+ if key in active:
385
+ start_ticks, vel = active.pop(key)
386
+ raw_notes.append(
387
+ {
388
+ "midi": msg.note,
389
+ "start_ticks": start_ticks,
390
+ "duration_ticks": abs_ticks - start_ticks,
391
+ "velocity": vel,
392
+ "lyric": "",
393
+ }
394
+ )
395
+ elif msg.type == "note_off":
396
+ key = (msg.channel, msg.note)
397
+ if key in active:
398
+ start_ticks, vel = active.pop(key)
399
+ raw_notes.append(
400
+ {
401
+ "midi": msg.note,
402
+ "start_ticks": start_ticks,
403
+ "duration_ticks": abs_ticks - start_ticks,
404
+ "velocity": vel,
405
+ "lyric": "",
406
+ }
407
+ )
408
+
409
+ if not raw_notes:
410
+ raise ValueError("No notes found in MIDI file")
411
+
412
+ for n in raw_notes:
413
+ n["end_ticks"] = n["start_ticks"] + n["duration_ticks"]
414
+
415
+ raw_notes.sort(key=lambda n: n["start_ticks"])
416
+ lyrics.sort(key=lambda x: x[0])
417
+
418
+ trimmed = []
419
+ # Remove/trim overlaps so generated notes are strictly non-overlapping in tick domain.
420
+ for note in raw_notes:
421
+ while trimmed:
422
+ prev = trimmed[-1]
423
+ if note["start_ticks"] < prev["end_ticks"]:
424
+ prev["end_ticks"] = note["start_ticks"]
425
+ prev["duration_ticks"] = prev["end_ticks"] - prev["start_ticks"]
426
+ if prev["duration_ticks"] <= 0:
427
+ trimmed.pop()
428
+ continue
429
+ break
430
+ trimmed.append(note)
431
+ raw_notes = trimmed
432
+
433
+ tolerance = ticks_per_beat // 100
434
+ # Attach lyrics near note_on positions with a small tick tolerance.
435
+ lyric_idx = 0
436
+ for note in raw_notes:
437
+ while lyric_idx < len(lyrics) and lyrics[lyric_idx][0] < note["start_ticks"] - tolerance:
438
+ lyric_idx += 1
439
+ if lyric_idx < len(lyrics):
440
+ lyric_ticks, lyric_text = lyrics[lyric_idx]
441
+ if abs(lyric_ticks - note["start_ticks"]) <= tolerance:
442
+ note["lyric"] = lyric_text
443
+ lyric_idx += 1
444
+
445
+ def ticks_to_seconds(ticks: int) -> float:
446
+ return (ticks / ticks_per_beat) * (tempo / 1_000_000)
447
+
448
+ result: List[Note] = []
449
+ prev_end_s = 0.0
450
+ for idx, n in enumerate(raw_notes):
451
+ start_s = ticks_to_seconds(n["start_ticks"])
452
+ end_s = ticks_to_seconds(n["end_ticks"])
453
+ if prev_end_s > start_s:
454
+ start_s = prev_end_s
455
+ dur_s = end_s - start_s
456
+ if dur_s <= 0:
457
+ continue
458
+
459
+ lyric = n.get("lyric", "")
460
+ # SoulX-Singer convention mapping from lyric token to note_type/text.
461
+ if not lyric:
462
+ note_type = 2
463
+ text = "啦"
464
+ elif lyric == "<SP>":
465
+ note_type = 1
466
+ text = "<SP>"
467
+ elif lyric == "-":
468
+ note_type = 3
469
+ text = raw_notes[idx - 1].get("lyric", "-") if idx > 0 else "-"
470
+ else:
471
+ note_type = 2
472
+ text = lyric
473
+
474
+ if start_s - prev_end_s > SILENCE_THRESHOLD_SEC:
475
+ # Explicitly represent long gaps as <SP> notes.
476
+ result.append(
477
+ Note(
478
+ start_s=prev_end_s,
479
+ note_dur=start_s - prev_end_s,
480
+ note_text="<SP>",
481
+ note_pitch=0,
482
+ note_type=1,
483
+ )
484
+ )
485
+ else:
486
+ if len(result) > 0:
487
+ result[-1].note_dur = start_s - result[-1].start_s
488
+
489
+ result.append(
490
+ Note(
491
+ start_s=start_s,
492
+ note_dur=dur_s,
493
+ note_text=text,
494
+ note_pitch=n["midi"],
495
+ note_type=note_type,
496
+ )
497
+ )
498
+ prev_end_s = end_s
499
+
500
+ return result
501
+
502
+
503
+ class MidiParser:
504
+ def __init__(
505
+ self,
506
+ rmvpe_model_path: str,
507
+ device: str = "cuda",
508
+ ) -> None:
509
+ self.rmvpe_model_path = rmvpe_model_path
510
+ self.device = device
511
+ self.pitch_extractor: F0Extractor | None = None
512
+
513
+ def _get_pitch_extractor(self) -> F0Extractor:
514
+ if self.pitch_extractor is None:
515
+ self.pitch_extractor = F0Extractor(
516
+ self.rmvpe_model_path,
517
+ device=self.device,
518
+ verbose=False,
519
+ )
520
+ return self.pitch_extractor
521
+
522
+ def midi2meta(
523
+ self,
524
+ midi_path: str,
525
+ meta_path: str,
526
+ vocal_file: str | None = None,
527
+ language: str = "Mandarin",
528
+ ) -> None:
529
+ meta_dir = os.path.dirname(meta_path)
530
+ if meta_dir:
531
+ os.makedirs(meta_dir, exist_ok=True)
532
+
533
+ notes = midi2notes(midi_path)
534
+ pitch_extractor = self._get_pitch_extractor() if vocal_file else None
535
+ notes2meta(
536
+ notes,
537
+ meta_path,
538
+ vocal_file,
539
+ language,
540
+ pitch_extractor=pitch_extractor,
541
+ )
542
+ print(f"Saved Meta to {meta_path}")
543
+
544
+ def meta2midi(self, meta_path: str, midi_path: str) -> None:
545
+ notes = meta2notes(meta_path)
546
+ notes2midi(notes, midi_path)
547
+ print(f"Saved MIDI to {midi_path}")
548
+
549
+ if __name__ == "__main__":
550
+ import argparse
551
+
552
+ parser = argparse.ArgumentParser(
553
+ description="Convert SoulX-Singer metadata JSON <-> MIDI."
554
+ )
555
+ parser.add_argument("--meta", type=str, help="Path to metadata JSON")
556
+ parser.add_argument("--midi", type=str, help="Path to MIDI file")
557
+ parser.add_argument("--vocal", type=str, default=None, help="Path to vocal wav (optional for midi2meta)")
558
+ parser.add_argument("--language", type=str, default="Mandarin", help="Lyric language for metadata phoneme conversion (default: Mandarin)")
559
+ parser.add_argument(
560
+ "--meta2midi",
561
+ action="store_true",
562
+ help="Convert meta -> midi (requires --meta and --midi)",
563
+ )
564
+ parser.add_argument(
565
+ "--midi2meta",
566
+ action="store_true",
567
+ help="Convert midi -> meta (requires --midi and --meta; --vocal is optional)",
568
+ )
569
+ parser.add_argument(
570
+ "--rmvpe_model_path",
571
+ type=str,
572
+ help="Path to RMVPE model",
573
+ default="pretrained_models/SoulX-Singer-Preprocess/rmvpe/rmvpe.pt",
574
+ )
575
+ parser.add_argument(
576
+ "--device",
577
+ type=str,
578
+ help="Device to use for RMVPE",
579
+ default="cuda",
580
+ )
581
+ args = parser.parse_args()
582
+ midi_parser = MidiParser(
583
+ rmvpe_model_path=args.rmvpe_model_path,
584
+ device=args.device,
585
+ )
586
+
587
+ if args.meta2midi:
588
+ if not args.meta or not args.midi:
589
+ parser.error("--meta2midi requires --meta and --midi")
590
+ midi_parser.meta2midi(args.meta, args.midi)
591
+ elif args.midi2meta:
592
+ if not args.midi or not args.meta:
593
+ parser.error(
594
+ "--midi2meta requires --midi and --meta"
595
+ )
596
+ midi_parser.midi2meta(args.midi, args.meta, args.vocal, args.language)
597
+ else:
598
+ parser.print_help()
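A minimal usage sketch of the converters above (paths are placeholders; it assumes the repo root is on `PYTHONPATH` so `preprocess` imports as a package). With `MIDI_TICKS_PER_BEAT = 500` and `MIDI_TEMPO = 500000` (120 BPM), `_seconds_to_ticks` maps one second to 1000 ticks.

```python
# Sketch: programmatic meta <-> MIDI conversion (paths below are placeholders).
from preprocess.tools.midi_parser import MidiParser, Note, notes2midi

# Build a tiny note list by hand and write it as a MIDI file (no models needed).
notes = [
    Note(start_s=0.0, note_dur=0.5, note_text="<SP>", note_pitch=0, note_type=1),
    Note(start_s=0.5, note_dur=0.4, note_text="你", note_pitch=62, note_type=2),
    Note(start_s=0.9, note_dur=0.6, note_text="好", note_pitch=64, note_type=2),
]
notes2midi(notes, "example_out.mid")

# MIDI -> metadata; without a vocal file no wav cutting or f0 extraction happens,
# so the RMVPE checkpoint path is only stored, not loaded.
parser = MidiParser(
    rmvpe_model_path="pretrained_models/SoulX-Singer-Preprocess/rmvpe/rmvpe.pt",
    device="cuda",
)
parser.midi2meta("example_out.mid", "example_meta.json", vocal_file=None, language="Mandarin")
```

The same round-trip should be reachable from the command line via `python -m preprocess.tools.midi_parser --midi2meta --midi in.mid --meta out.json`, provided the directories are importable as packages.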
preprocess/tools/note_transcription/__init__.py ADDED
File without changes
preprocess/tools/note_transcription/model.py ADDED
@@ -0,0 +1,531 @@
1
+ # https://github.com/RickyL-2000/ROSVOT
2
+ import math
3
+ import sys
4
+ import traceback
5
+ import json
6
+ import time
7
+ from pathlib import Path
8
+ from typing import Any, Dict, Optional
9
+
10
+ import librosa
11
+ import numpy as np
12
+ import torch
13
+ import matplotlib.pyplot as plt
14
+
15
+ from .utils.os_utils import safe_path
16
+ from .utils.commons.hparams import set_hparams
17
+ from .utils.commons.ckpt_utils import load_ckpt
18
+ from .utils.commons.dataset_utils import pad_or_cut_xd
19
+ from .utils.audio.mel import MelNet
20
+ from .utils.audio.pitch_utils import (
21
+ norm_interp_f0,
22
+ denorm_f0,
23
+ f0_to_coarse,
24
+ boundary2Interval,
25
+ save_midi,
26
+ midi_to_hz,
27
+ )
28
+ from .utils.rosvot_utils import (
29
+ get_mel_len,
30
+ align_word,
31
+ regulate_real_note_itv,
32
+ regulate_ill_slur,
33
+ bd_to_durs,
34
+ )
35
+ from .modules.pe.rmvpe import RMVPE
36
+ from .modules.rosvot.rosvot import MidiExtractor, WordbdExtractor
37
+
38
+
39
+ @torch.no_grad()
40
+ def infer_sample(
41
+ item: Dict[str, Any],
42
+ hparams: Dict[str, Any],
43
+ models: Dict[str, Any],
44
+ device: torch.device,
45
+ *,
46
+ save_dir: Optional[str] = None,
47
+ apply_rwbd: Optional[bool] = None,
48
+ # outputs
49
+ save_plot: bool = False,
50
+ no_save_midi: bool = True,
51
+ no_save_npy: bool = True,
52
+ verbose: bool = False,
53
+ ) -> Dict[str, Any]:
54
+ if "item_name" not in item or "wav_fn" not in item:
55
+ raise ValueError('item must contain keys: "item_name" and "wav_fn"')
56
+
57
+ item_name = item["item_name"]
58
+ wav_src = item["wav_fn"]
59
+
60
+ # Decide RWBD usage
61
+ if apply_rwbd is None:
62
+ apply_rwbd_ = ("word_durs" not in item)
63
+ else:
64
+ apply_rwbd_ = bool(apply_rwbd)
65
+
66
+ # Models
67
+ model = models["model"]
68
+ mel_net = models["mel_net"]
69
+ pe = models.get("pe")
70
+ wbd_predictor = models.get("wbd_predictor")
71
+
72
+ if wbd_predictor is None and apply_rwbd_:
73
+ raise ValueError("apply_rwbd is True but wbd_predictor model is not provided in models")
74
+
75
+ # ---- Prepare Data ----
76
+ if isinstance(wav_src, str):
77
+ wav, _ = librosa.core.load(wav_src, sr=hparams["audio_sample_rate"])
78
+ else:
79
+ wav = wav_src
80
+ if not isinstance(wav, np.ndarray):
81
+ wav = np.asarray(wav)
82
+ wav = wav.astype(np.float32)
83
+
84
+ # Calculate timestamps and alignment lengths
85
+ wav_len_samples = wav.shape[-1]
86
+ mel_len = get_mel_len(wav_len_samples, hparams["hop_size"])
87
+
88
+ # Word boundary preparation
89
+ mel2word = None
90
+ word_durs_filtered = None
91
+
92
+ if not apply_rwbd_:
93
+ if "word_durs" not in item:
94
+ raise ValueError('apply_rwbd=False but item has no "word_durs"')
95
+
96
+ wd_raw = list(item["word_durs"])
97
+ min_word_dur = hparams.get("min_word_dur", 20) / 1000
98
+ word_durs_filtered = []
99
+
100
+ for i, wd in enumerate(wd_raw):
101
+ if wd < min_word_dur:
102
+ if i == 0 and len(wd_raw) > 1:
103
+ wd_raw[i + 1] += wd
104
+ elif len(word_durs_filtered) > 0:
105
+ word_durs_filtered[-1] += wd
106
+ else:
107
+ word_durs_filtered.append(wd)
108
+
109
+ mel2word, _ = align_word(word_durs_filtered, mel_len, hparams["hop_size"], hparams["audio_sample_rate"])
110
+ mel2word = np.asarray(mel2word)
111
+ if mel2word.size > 0 and mel2word[0] == 0:
112
+ mel2word = mel2word + 1
113
+
114
+ mel2word_len = int(np.sum(mel2word > 0))
115
+ real_len = min(mel_len, mel2word_len)
116
+ else:
117
+ real_len = min(mel_len, hparams["max_frames"])
118
+
119
+ T = math.ceil(min(real_len, hparams["max_frames"]) / hparams["frames_multiple"]) * hparams["frames_multiple"]
120
+
121
+ # ---- Input Tensors & Padding ----
122
+ target_samples = T * hparams["hop_size"]
123
+ wav_t = torch.from_numpy(wav).float().to(device).unsqueeze(0) # [1, L]
124
+ if wav_t.shape[-1] < target_samples:
125
+ wav_t = pad_or_cut_xd(wav_t, target_samples, 1)
126
+
127
+ # ---- Pitch Extraction ----
128
+ if pe is not None:
129
+ f0s, uvs = pe.get_pitch_batch(
130
+ wav_t,
131
+ sample_rate=hparams["audio_sample_rate"],
132
+ hop_size=hparams["hop_size"],
133
+ lengths=[real_len],
134
+ fmax=hparams["f0_max"],
135
+ fmin=hparams["f0_min"],
136
+ )
137
+ f0_1d, uv_1d = norm_interp_f0(f0s[0][:T])
138
+ f0_t = pad_or_cut_xd(torch.FloatTensor(f0_1d).to(device), T, 0).unsqueeze(0)
139
+ uv_t = pad_or_cut_xd(torch.FloatTensor(uv_1d).to(device), T, 0).long().unsqueeze(0)
140
+ pitch_coarse = f0_to_coarse(denorm_f0(f0_t, uv_t)).to(device)
141
+ f0_np = denorm_f0(f0_t, uv_t)[0].detach().cpu().numpy()[:real_len]
142
+ else:
143
+ f0_t = uv_t = pitch_coarse = None
144
+ f0_np = None
145
+
146
+ # ---- Mel Extraction ----
147
+ mel = mel_net(wav_t) # [1, T_padded, C]
148
+ mel = pad_or_cut_xd(mel, T, 1)
149
+
150
+ # Construct non-padding mask
151
+ mel_nonpadding_mask = torch.zeros(1, T, device=device)
152
+ mel_nonpadding_mask[:, :real_len] = 1.0
153
+
154
+ # Apply mask to mel (zero out padding)
155
+ mel = (mel.transpose(1, 2) * mel_nonpadding_mask.unsqueeze(1)).transpose(1, 2)
156
+ # Re-calculate non_padding bool mask
157
+ mel_nonpadding = mel.abs().sum(-1) > 0
158
+
159
+ # ---- Word Boundary ----
160
+ word_durs_used = None
161
+ if apply_rwbd_:
162
+ mel_input = mel[:, :, : hparams.get("wbd_use_mel_bins", 80)]
163
+ wbd_outputs = wbd_predictor(
164
+ mel=mel_input,
165
+ pitch=pitch_coarse,
166
+ uv=uv_t,
167
+ non_padding=mel_nonpadding,
168
+ train=False,
169
+ )
170
+ word_bd = wbd_outputs["word_bd_pred"] # [1, T]
171
+ else:
172
+ # Construct word_bd from provided durs
173
+ mel2word_t = pad_or_cut_xd(torch.LongTensor(mel2word).to(device), T, 0)
174
+ word_bd = torch.zeros_like(mel2word_t)
175
+ # Vectorized check
176
+ word_bd[1:] = (mel2word_t[1:] != mel2word_t[:-1]).long()
177
+ word_bd[real_len:] = 0
178
+ word_bd = word_bd.unsqueeze(0) # [1, T]
179
+
180
+ word_durs_used = np.array(word_durs_filtered)
181
+
182
+ # ---- Main Inference ----
183
+ mel_input = mel[:, :, : hparams.get("use_mel_bins", 80)]
184
+ outputs = model(
185
+ mel=mel_input,
186
+ word_bd=word_bd,
187
+ pitch=pitch_coarse,
188
+ uv=uv_t,
189
+ non_padding=mel_nonpadding,
190
+ train=False,
191
+ )
192
+
193
+ note_lengths = outputs["note_lengths"].detach().cpu().numpy()
194
+ note_bd_pred = outputs["note_bd_pred"][0].detach().cpu().numpy()[:real_len]
195
+ note_pred = outputs["note_pred"][0].detach().cpu().numpy()[: note_lengths[0]]
196
+ note_bd_logits = torch.sigmoid(outputs["note_bd_logits"])[0].detach().cpu().numpy()[:real_len]
197
+
198
+ if note_pred.shape == (0,):
199
+ if verbose:
200
+ print(f"skip {item_name}: no notes detected")
201
+ return {
202
+ "item_name": item_name,
203
+ "pitches": [],
204
+ "note_durs": [],
205
+ "note2words": None,
206
+ }
207
+
208
+ # ---- Post-Processing & Regulation ----
209
+ note_itv_pred = boundary2Interval(note_bd_pred)
210
+ note2words = None
211
+
212
+ if apply_rwbd_:
213
+ word_bd_np = outputs['word_bd_pred'][0].detach().cpu().numpy()[:real_len]
214
+ word_durs_derived = np.array(bd_to_durs(word_bd_np)) * hparams['hop_size'] / hparams['audio_sample_rate']
215
+ word_durs_for_reg = word_durs_derived
216
+ word_bd_for_reg = word_bd_np
217
+ else:
218
+ word_bd_for_reg = word_bd[0].detach().cpu().numpy()[:real_len]
219
+ word_durs_for_reg = word_durs_used
220
+
221
+ should_regulate = hparams.get("infer_regulate_real_note_itv", True) and (not apply_rwbd_)
222
+
223
+ if should_regulate and (word_durs_for_reg is not None):
224
+ try:
225
+ note_itv_pred_secs, note2words = regulate_real_note_itv(
226
+ note_itv_pred,
227
+ note_bd_pred,
228
+ word_bd_for_reg,
229
+ word_durs_for_reg,
230
+ hparams["hop_size"],
231
+ hparams["audio_sample_rate"],
232
+ )
233
+ note_pred, note_itv_pred_secs, note2words = regulate_ill_slur(note_pred, note_itv_pred_secs, note2words)
234
+ except Exception as err:
235
+ if verbose:
236
+ _, exc_value, exc_tb = sys.exc_info()
237
+ tb = traceback.extract_tb(exc_tb)[-1]
238
+ print(f"postprocess failed: {err}: {exc_value} in {tb[0]}:{tb[1]} '{tb[2]}' in {tb[3]}")
239
+ # Fallback
240
+ note_itv_pred_secs = note_itv_pred * hparams["hop_size"] / hparams["audio_sample_rate"]
241
+ note2words = None
242
+ else:
243
+ note_itv_pred_secs = note_itv_pred * hparams["hop_size"] / hparams["audio_sample_rate"]
244
+
245
+ # ---- Output ----
246
+ note_durs = [float((itv[1] - itv[0])) for itv in note_itv_pred_secs]
247
+
248
+ out = {
249
+ "item_name": item_name,
250
+ "pitches": note_pred.tolist(),
251
+ "note_durs": note_durs,
252
+ "note2words": note2words.tolist() if note2words is not None else None,
253
+ }
254
+
255
+ # ---- Saving ----
256
+ if save_dir is not None:
257
+ save_dir_path = Path(save_dir)
258
+ save_dir_path.mkdir(parents=True, exist_ok=True)
259
+ fn = str(item_name)
260
+
261
+ if not no_save_midi:
262
+ save_midi(note_pred, note_itv_pred_secs, safe_path(save_dir_path / "midi" / f"{fn}.mid"))
263
+
264
+ if not no_save_npy:
265
+ np.save(safe_path(save_dir_path / "npy" / f"[note]{fn}.npy"), out, allow_pickle=True)
266
+
267
+ if save_plot:
268
+ fig = plt.figure()
269
+ if f0_np is not None:
270
+ plt.plot(f0_np, color="red", label="f0")
271
+
272
+ midi_pred = np.zeros(note_bd_pred.shape[0], dtype=np.float32)
273
+ itvs = np.round(note_itv_pred_secs * hparams["audio_sample_rate"] / hparams["hop_size"]).astype(int)
274
+ for i, itv in enumerate(itvs):
275
+ midi_pred[itv[0] : itv[1]] = note_pred[i]
276
+ plt.plot(midi_to_hz(midi_pred), color="blue", label="pred midi")
277
+ plt.plot(note_bd_logits * 100, color="green", label="note bd logits x100")
278
+ plt.legend()
279
+ plt.tight_layout()
280
+ plt.savefig(safe_path(save_dir_path / "plot" / f"[MIDI]{fn}.png"), format="png")
281
+ plt.close(fig)
282
+
283
+ return out
284
+
285
+
286
+ def load_rosvot_models(ckpt, config="", wbd_ckpt="", wbd_config="", device="cuda:0", verbose=False, thr=0.85):
287
+ """
288
+ Load models once to reuse across multiple items.
289
+ """
290
+ dev = torch.device(device)
291
+
292
+ # 1. Hparams
293
+ config_path = Path(ckpt).with_name("config.yaml") if config == "" else config
294
+ pe_ckpt = Path(ckpt).parent.parent / "rmvpe/model.pt"
295
+ hparams = set_hparams(
296
+ config=config_path,
297
+ print_hparams=verbose,
298
+ hparams_str=f"note_bd_threshold={thr}",
299
+ )
300
+
301
+ # 2. Main Model
302
+ model = MidiExtractor(hparams)
303
+ load_ckpt(model, ckpt, verbose=verbose)
304
+ model.eval().to(dev)
305
+
306
+ # 3. MelNet
307
+ mel_net = MelNet(hparams)
308
+ mel_net.to(dev)
309
+
310
+ # 4. Pitch Extractor
311
+ pe = None
312
+ if hparams.get("use_pitch_embed", False):
313
+ pe = RMVPE(pe_ckpt, device=dev)
314
+
315
+ # 5. Word Boundary Predictor (optional but we load if ckpt provided or needed)
316
+ wbd_predictor = None
317
+ if wbd_ckpt:
318
+ wbd_config_path = Path(wbd_ckpt).with_name("config.yaml") if wbd_config == "" else wbd_config
319
+ wbd_hparams = set_hparams(
320
+ config=wbd_config_path,
321
+ print_hparams=False,
322
+ hparams_str="",
323
+ )
324
+ hparams.update({
325
+ "wbd_use_mel_bins": wbd_hparams["use_mel_bins"],
326
+ "min_word_dur": wbd_hparams["min_word_dur"],
327
+ })
328
+ wbd_predictor = WordbdExtractor(wbd_hparams)
329
+ load_ckpt(wbd_predictor, wbd_ckpt, verbose=verbose)
330
+ wbd_predictor.eval().to(dev)
331
+
332
+ models = {
333
+ "model": model,
334
+ "mel_net": mel_net,
335
+ "pe": pe,
336
+ "wbd_predictor": wbd_predictor
337
+ }
338
+ return hparams, models
339
+
340
+
341
+ class NoteTranscriber:
342
+ """Note transcription wrapper based on ROSVOT.
343
+
344
+ Loads ROSVOT and optional RWBD models once in ``__init__`` and
345
+ exposes a :py:meth:`process` API that turns an item dict into
346
+ aligned note metadata for downstream SVS.
347
+ """
348
+
349
+ def __init__(
350
+ self,
351
+ rosvot_model_path: str,
352
+ rwbd_model_path: str,
353
+ *,
354
+ rosvot_config_path: str = "",
355
+ rwbd_config_path: str = "",
356
+ device: str = "cuda:0",
357
+ thr: float = 0.85,
358
+ verbose: bool = True,
359
+ ):
360
+ """Initialize the note transcriber.
361
+
362
+ Args:
363
+ rosvot_model_path: Path to the main ROSVOT checkpoint.
364
+ rosvot_config_path: Optional config YAML path for ROSVOT.
365
+ rwbd_model_path: Path to the word-boundary (RWBD) checkpoint.
366
+ rwbd_config_path: Optional config YAML path for RWBD.
367
+ device: Torch device string, e.g. ``"cuda:0"`` / ``"cpu"``.
368
+ thr: Note boundary threshold.
369
+ verbose: Whether to print verbose logs.
370
+ """
371
+ self.verbose = verbose
372
+ self.device = torch.device(device)
373
+ self.hparams, self.models = load_rosvot_models(
374
+ ckpt=rosvot_model_path,
375
+ config=rosvot_config_path,
376
+ wbd_ckpt=rwbd_model_path,
377
+ wbd_config=rwbd_config_path,
378
+ device=device,
379
+ verbose=verbose,
380
+ thr=thr,
381
+ )
382
+
383
+ if self.verbose:
384
+ print(
385
+ "[note transcription] init success:",
386
+ f"device={self.device}",
387
+ f"rosvot_model_path={rosvot_model_path}",
388
+ f"rwbd_model_path={rwbd_model_path if rwbd_model_path else 'None'}",
389
+ f"thr={thr}",
390
+ )
391
+
392
+ def process(
393
+ self,
394
+ item: Dict[str, Any],
395
+ *,
396
+ segment_info: Optional[Dict[str, Any]] = None,
397
+ save_dir: Optional[str] = None,
398
+ apply_rwbd: Optional[bool] = None,
399
+ save_plot: bool = False,
400
+ no_save_midi: bool = True,
401
+ no_save_npy: bool = True,
402
+ verbose: Optional[bool] = None,
403
+ ) -> Dict[str, Any]:
404
+ """Run ROSVOT on a single item and post-process outputs.
405
+
406
+ Args:
407
+ item: Input metadata dict with at least ``item_name`` and ``wav_fn``.
408
+ segment_info: Optional segment metadata for sliced audio.
409
+ save_dir: Optional directory for debug artifacts (plots, midis).
410
+ apply_rwbd: Whether to run RWBD-based word boundary refinement.
411
+ save_plot: Whether to save diagnostic plots.
412
+ no_save_midi: If True, skip saving midi.
413
+ no_save_npy: If True, skip saving numpy intermediates.
414
+ verbose: Override instance-level verbose flag for this call.
415
+
416
+ Returns:
417
+ Dict with aligned note information for downstream SVS.
418
+ """
419
+ v = self.verbose if verbose is None else verbose
420
+ if v:
421
+ item_name = item.get("item_name", "")
422
+ wav_fn = item.get("wav_fn", "")
423
+ print(f"[note transcription] process: start: item_name={item_name} wav_fn={wav_fn}")
424
+ t0 = time.time()
425
+
426
+ rosvot_out = infer_sample(
427
+ item,
428
+ self.hparams,
429
+ self.models,
430
+ device=self.device,
431
+ save_dir=save_dir,
432
+ apply_rwbd=apply_rwbd,
433
+ save_plot=save_plot,
434
+ no_save_midi=no_save_midi,
435
+ no_save_npy=no_save_npy,
436
+ verbose=v,
437
+ )
438
+
439
+ out = self.post_process(
440
+ metadata=item,
441
+ segment_info=segment_info,
442
+ rosvot_out=rosvot_out,
443
+ )
444
+
445
+ if v:
446
+ dt = time.time() - t0
447
+ print(
448
+ "[note transcription] process: done:",
449
+ f"item_name={out.get('item_name','')}",
450
+ f"n_notes={len(out.get('note_pitch', []) or [])}",
451
+ f"time={dt:.3f}s",
452
+ )
453
+
454
+ return out
455
+
456
+ @staticmethod
457
+ def _normalize_note2words(note2words: list[int]) -> list[int]:
458
+ if not note2words:
459
+ return []
460
+ normalized = [note2words[0]]
461
+ for idx in range(1, len(note2words)):
462
+ if note2words[idx] < normalized[-1]:
463
+ normalized.append(normalized[-1])
464
+ else:
465
+ normalized.append(note2words[idx])
466
+ return normalized
467
+
468
+ @staticmethod
469
+ def _build_ep_types(note2words: list[int], align_words: list[str]) -> list[int]:
470
+ ep_types: list[int] = []
471
+ prev = -1
472
+ for i, w in zip(note2words, align_words):
473
+ if w == "<SP>":
474
+ ep_types.append(1)
475
+ else:
476
+ ep_types.append(2 if i != prev else 3)
477
+ prev = i
478
+ return ep_types
479
+
480
+ def post_process(
481
+ self,
482
+ *,
483
+ metadata: Dict[str, Any],
484
+ segment_info: Dict[str, Any],
485
+ rosvot_out: Dict[str, Any],
486
+ ) -> Dict[str, Any]:
487
+ """Build aligned note metadata using ROSVOT outputs."""
488
+ note2words_raw = rosvot_out.get("note2words") or []
489
+ note2words = self._normalize_note2words(note2words_raw)
490
+ align_words = [
491
+ metadata["words"][idx - 1]
492
+ for idx in note2words_raw
493
+ if 0 < idx <= len(metadata["words"])
494
+ ]
495
+ ep_types = self._build_ep_types(note2words, align_words) if align_words else []
496
+
497
+ return {
498
+ "item_name": rosvot_out.get("item_name", "") if not segment_info else segment_info["item_name"],
499
+ "wav_fn": metadata.get("wav_fn", "") if not segment_info else segment_info["wav_fn"],
500
+ "origin_wav_fn": metadata.get("origin_wav_fn", "") if not segment_info else segment_info["origin_wav_fn"],
501
+ "start_time_ms": "" if not segment_info else segment_info["start_time_ms"],
502
+ "end_time_ms": "" if not segment_info else segment_info["end_time_ms"],
503
+ "language": metadata.get("language", ""),
504
+ "note_text": align_words,
505
+ "note_dur": rosvot_out.get("note_durs", []),
506
+ "note_type": ep_types,
507
+ "note_pitch": rosvot_out.get("pitches", []),
508
+ }
509
+
510
+ if __name__ == "__main__":
511
+
512
+ item = {
513
+ 'item_name': 'vocal_0',
514
+ 'wav_fn': 'example/audio/zh_prompt.mp3',
515
+ 'start_time_ms': 320,
516
+ 'end_time_ms': 10687,
517
+ 'origin_wav_fn': 'example/audio/zh_prompt.mp3',
518
+ 'duration': 10367,
519
+ 'words': ['<SP>', '除', '了', '想', '你', '<SP>', '除', '了', '爱', '你', '<SP>', '我', '什', '么', '什', '么', '都', '愿', '意'],
520
+ 'word_durs': [0.21, 0.36, 0.26, 0.7000000000000001, 0.96, 0.3800000000000001, 0.43999999999999995, 0.3799999999999999, 0.6400000000000001, 0.9600000000000002, 1.1199999999999999, 0.28000000000000025, 0.3799999999999999, 0.3199999999999994, 0.3200000000000003, 0.3799999999999999, 0.3200000000000003, 0.5, 1.457981859410431],
521
+ 'language': 'Mandarin'
522
+ }
523
+
524
+ m = NoteTranscriber(
525
+ rosvot_model_path="pretrained_models/SoulX-Singer-Preprocess/rosvot/rosvot/model.pt",
526
+ rwbd_model_path="pretrained_models/SoulX-Singer-Preprocess/rosvot/rwbd/model.pt",
527
+ device="cuda"
528
+ )
529
+ out = m.process(item, segment_info=item)
530
+
531
+ print(out)
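The `note_type` convention produced by `post_process` (1 = `<SP>` silence, 2 = first note of a word, 3 = slurred continuation of the same word) comes straight from `_build_ep_types`; a small illustrative check, assuming the package and its torch dependencies are importable:

```python
# note_type convention: 1 = <SP>, 2 = word onset, 3 = slur continuation of the same word.
from preprocess.tools.note_transcription.model import NoteTranscriber

note2words = NoteTranscriber._normalize_note2words([1, 2, 2, 3, 3])
align_words = ["<SP>", "除", "除", "了", "了"]
print(NoteTranscriber._build_ep_types(note2words, align_words))
# -> [1, 2, 3, 2, 3]
```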
preprocess/tools/note_transcription/modules/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """ROSVOT model submodules."""
preprocess/tools/note_transcription/modules/commons/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Common ROSVOT layers and utilities."""
preprocess/tools/note_transcription/modules/commons/conformer/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Conformer layers for ROSVOT."""
preprocess/tools/note_transcription/modules/commons/conformer/conformer.py ADDED
@@ -0,0 +1,96 @@
1
+ from torch import nn
2
+ from .espnet_positional_embedding import RelPositionalEncoding, ScaledPositionalEncoding, PositionalEncoding
3
+ from .espnet_transformer_attn import RelPositionMultiHeadedAttention, MultiHeadedAttention
4
+ from .layers import Swish, ConvolutionModule, EncoderLayer, MultiLayeredConv1d
5
+ from ..layers import Embedding
6
+
7
+
8
+ class ConformerLayers(nn.Module):
9
+ def __init__(self, hidden_size, num_layers, kernel_size=9, dropout=0.0, num_heads=4,
10
+ use_last_norm=True, save_hidden=False):
11
+ super().__init__()
12
+ self.use_last_norm = use_last_norm
13
+ self.layers = nn.ModuleList()
14
+ positionwise_layer = MultiLayeredConv1d
15
+ positionwise_layer_args = (hidden_size, hidden_size * 4, 1, dropout)
16
+ self.pos_embed = RelPositionalEncoding(hidden_size, dropout)
17
+ self.encoder_layers = nn.ModuleList([EncoderLayer(
18
+ hidden_size,
19
+ RelPositionMultiHeadedAttention(num_heads, hidden_size, 0.0),
20
+ positionwise_layer(*positionwise_layer_args),
21
+ positionwise_layer(*positionwise_layer_args),
22
+ ConvolutionModule(hidden_size, kernel_size, Swish()),
23
+ dropout,
24
+ ) for _ in range(num_layers)])
25
+ if self.use_last_norm:
26
+ self.layer_norm = nn.LayerNorm(hidden_size)
27
+ else:
28
+ self.layer_norm = nn.Linear(hidden_size, hidden_size)
29
+ self.save_hidden = save_hidden
30
+ if save_hidden:
31
+ self.hiddens = []
32
+
33
+ def forward(self, x, padding_mask=None):
34
+ """
35
+
36
+ :param x: [B, T, H]
37
+ :param padding_mask: [B, T]
38
+ :return: [B, T, H]
39
+ """
40
+ self.hiddens = []
41
+ nonpadding_mask = x.abs().sum(-1) > 0
42
+ x = self.pos_embed(x)
43
+ for l in self.encoder_layers:
44
+ x, mask = l(x, nonpadding_mask[:, None, :])
45
+ if self.save_hidden:
46
+ self.hiddens.append(x[0])
47
+ x = x[0]
48
+ x = self.layer_norm(x) * nonpadding_mask.float()[:, :, None]
49
+ return x
50
+
51
+ class FastConformerLayers(ConformerLayers):
52
+ def __init__(self, hidden_size, num_layers, kernel_size=9, dropout=0.0, num_heads=4,
53
+ use_last_norm=True, save_hidden=False):
54
+ super(ConformerLayers, self).__init__()
55
+ self.use_last_norm = use_last_norm
56
+ self.layers = nn.ModuleList()
57
+ positionwise_layer = MultiLayeredConv1d
58
+ positionwise_layer_args = (hidden_size, hidden_size * 4, 1, dropout)
59
+ self.pos_embed = PositionalEncoding(hidden_size, dropout)
60
+ self.encoder_layers = nn.ModuleList([EncoderLayer(
61
+ hidden_size,
62
+ MultiHeadedAttention(num_heads, hidden_size, 0.0, flash=True),
63
+ positionwise_layer(*positionwise_layer_args),
64
+ positionwise_layer(*positionwise_layer_args),
65
+ ConvolutionModule(hidden_size, kernel_size, Swish()),
66
+ dropout,
67
+ ) for _ in range(num_layers)])
68
+ if self.use_last_norm:
69
+ self.layer_norm = nn.LayerNorm(hidden_size)
70
+ else:
71
+ self.layer_norm = nn.Linear(hidden_size, hidden_size)
72
+ self.save_hidden = save_hidden
73
+ if save_hidden:
74
+ self.hiddens = []
75
+
76
+ class ConformerEncoder(ConformerLayers):
77
+ def __init__(self, hidden_size, dict_size, num_layers=None):
78
+ conformer_enc_kernel_size = 9
79
+ super().__init__(hidden_size, num_layers, conformer_enc_kernel_size)
80
+ self.embed = Embedding(dict_size, hidden_size, padding_idx=0)
81
+
82
+ def forward(self, x):
83
+ """
84
+
85
+ :param x: source token ids [B, T]
86
+ :return: [B x T x C]
87
+ """
88
+ x = self.embed(x) # [B, T, H]
89
+ x = super(ConformerEncoder, self).forward(x)
90
+ return x
91
+
92
+
93
+ class ConformerDecoder(ConformerLayers):
94
+ def __init__(self, hidden_size, num_layers):
95
+ conformer_dec_kernel_size = 9
96
+ super().__init__(hidden_size, num_layers, conformer_dec_kernel_size)
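Shape-wise, `ConformerLayers` maps `[B, T, H]` to `[B, T, H]` and derives its padding mask from all-zero feature rows; a minimal sketch, assuming the relative imports of this package resolve (e.g. run from the repo root):

```python
import torch
from preprocess.tools.note_transcription.modules.commons.conformer.conformer import ConformerLayers

layers = ConformerLayers(hidden_size=128, num_layers=2, kernel_size=9, num_heads=4)
x = torch.randn(2, 50, 128)   # [B, T, H]
x[:, 40:] = 0.0               # all-zero rows are treated as padding frames
with torch.no_grad():
    y = layers(x)
print(y.shape)                # torch.Size([2, 50, 128])
```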
preprocess/tools/note_transcription/modules/commons/conformer/espnet_positional_embedding.py ADDED
@@ -0,0 +1,113 @@
1
+ import math
2
+ import torch
3
+
4
+
5
+ class PositionalEncoding(torch.nn.Module):
6
+ """Positional encoding.
7
+ Args:
8
+ d_model (int): Embedding dimension.
9
+ dropout_rate (float): Dropout rate.
10
+ max_len (int): Maximum input length.
11
+ reverse (bool): Whether to reverse the input position.
12
+ """
13
+
14
+ def __init__(self, d_model, dropout_rate, max_len=5000, reverse=False):
15
+ """Construct an PositionalEncoding object."""
16
+ super(PositionalEncoding, self).__init__()
17
+ self.d_model = d_model
18
+ self.reverse = reverse
19
+ self.xscale = math.sqrt(self.d_model)
20
+ self.dropout = torch.nn.Dropout(p=dropout_rate)
21
+ self.pe = None
22
+ self.extend_pe(torch.tensor(0.0).expand(1, max_len))
23
+
24
+ def extend_pe(self, x):
25
+ """Reset the positional encodings."""
26
+ if self.pe is not None:
27
+ if self.pe.size(1) >= x.size(1):
28
+ if self.pe.dtype != x.dtype or self.pe.device != x.device:
29
+ self.pe = self.pe.to(dtype=x.dtype, device=x.device)
30
+ return
31
+ pe = torch.zeros(x.size(1), self.d_model)
32
+ if self.reverse:
33
+ position = torch.arange(
34
+ x.size(1) - 1, -1, -1.0, dtype=torch.float32
35
+ ).unsqueeze(1)
36
+ else:
37
+ position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)
38
+ div_term = torch.exp(
39
+ torch.arange(0, self.d_model, 2, dtype=torch.float32)
40
+ * -(math.log(10000.0) / self.d_model)
41
+ )
42
+ pe[:, 0::2] = torch.sin(position * div_term)
43
+ pe[:, 1::2] = torch.cos(position * div_term)
44
+ pe = pe.unsqueeze(0)
45
+ self.pe = pe.to(device=x.device, dtype=x.dtype)
46
+
47
+ def forward(self, x: torch.Tensor):
48
+ """Add positional encoding.
49
+ Args:
50
+ x (torch.Tensor): Input tensor (batch, time, `*`).
51
+ Returns:
52
+ torch.Tensor: Encoded tensor (batch, time, `*`).
53
+ """
54
+ self.extend_pe(x)
55
+ x = x * self.xscale + self.pe[:, : x.size(1)]
56
+ return self.dropout(x)
57
+
58
+
59
+ class ScaledPositionalEncoding(PositionalEncoding):
60
+ """Scaled positional encoding module.
61
+ See Sec. 3.2 https://arxiv.org/abs/1809.08895
62
+ Args:
63
+ d_model (int): Embedding dimension.
64
+ dropout_rate (float): Dropout rate.
65
+ max_len (int): Maximum input length.
66
+ """
67
+
68
+ def __init__(self, d_model, dropout_rate, max_len=5000):
69
+ """Initialize class."""
70
+ super().__init__(d_model=d_model, dropout_rate=dropout_rate, max_len=max_len)
71
+ self.alpha = torch.nn.Parameter(torch.tensor(1.0))
72
+
73
+ def reset_parameters(self):
74
+ """Reset parameters."""
75
+ self.alpha.data = torch.tensor(1.0)
76
+
77
+ def forward(self, x):
78
+ """Add positional encoding.
79
+ Args:
80
+ x (torch.Tensor): Input tensor (batch, time, `*`).
81
+ Returns:
82
+ torch.Tensor: Encoded tensor (batch, time, `*`).
83
+ """
84
+ self.extend_pe(x)
85
+ x = x + self.alpha * self.pe[:, : x.size(1)]
86
+ return self.dropout(x)
87
+
88
+
89
+ class RelPositionalEncoding(PositionalEncoding):
90
+ """Relative positional encoding module.
91
+ See : Appendix B in https://arxiv.org/abs/1901.02860
92
+ Args:
93
+ d_model (int): Embedding dimension.
94
+ dropout_rate (float): Dropout rate.
95
+ max_len (int): Maximum input length.
96
+ """
97
+
98
+ def __init__(self, d_model, dropout_rate, max_len=5000):
99
+ """Initialize class."""
100
+ super().__init__(d_model, dropout_rate, max_len, reverse=True)
101
+
102
+ def forward(self, x):
103
+ """Compute positional encoding.
104
+ Args:
105
+ x (torch.Tensor): Input tensor (batch, time, `*`).
106
+ Returns:
107
+ torch.Tensor: Encoded tensor (batch, time, `*`).
108
+ torch.Tensor: Positional embedding tensor (1, time, `*`).
109
+ """
110
+ self.extend_pe(x)
111
+ x = x * self.xscale
112
+ pos_emb = self.pe[:, : x.size(1)]
113
+ return self.dropout(x), self.dropout(pos_emb)
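The table built in `extend_pe` is the standard sinusoidal encoding, `PE[pos, 2i] = sin(pos / 10000^{2i/d})` and `PE[pos, 2i+1] = cos(pos / 10000^{2i/d})`; the same computation in isolation:

```python
import math
import torch

d_model, max_len = 8, 4
position = torch.arange(0, max_len, dtype=torch.float32).unsqueeze(1)   # [T, 1]
div_term = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model)
)                                                                        # [d_model / 2]
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
print(pe[0])  # position 0: sin terms are 0, cos terms are 1
```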
preprocess/tools/note_transcription/modules/commons/conformer/espnet_transformer_attn.py ADDED
@@ -0,0 +1,198 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+
4
+ # Copyright 2019 Shigeki Karita
5
+ # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
6
+
7
+ """Multi-Head Attention layer definition."""
8
+
9
+ from packaging import version
10
+ import math
11
+
12
+ import numpy
13
+ import torch
14
+ from torch import nn
15
+
16
+
17
+ class MultiHeadedAttention(nn.Module):
18
+ """Multi-Head Attention layer.
19
+ Args:
20
+ n_head (int): The number of heads.
21
+ n_feat (int): The number of features.
22
+ dropout_rate (float): Dropout rate.
23
+ """
24
+
25
+ def __init__(self, n_head, n_feat, dropout_rate, flash=False):
26
+ """Construct an MultiHeadedAttention object."""
27
+ super(MultiHeadedAttention, self).__init__()
28
+ assert n_feat % n_head == 0
29
+ # We assume d_v always equals d_k
30
+ self.d_k = n_feat // n_head
31
+ self.h = n_head
32
+ self.linear_q = nn.Linear(n_feat, n_feat)
33
+ self.linear_k = nn.Linear(n_feat, n_feat)
34
+ self.linear_v = nn.Linear(n_feat, n_feat)
35
+ self.linear_out = nn.Linear(n_feat, n_feat)
36
+ self.attn = None
37
+ self.dropout = nn.Dropout(p=dropout_rate)
38
+ self.dropout_rate = dropout_rate
39
+ self.flash = flash
40
+
41
+ def forward_qkv(self, query, key, value):
42
+ """Transform query, key and value.
43
+ Args:
44
+ query (torch.Tensor): Query tensor (#batch, time1, size).
45
+ key (torch.Tensor): Key tensor (#batch, time2, size).
46
+ value (torch.Tensor): Value tensor (#batch, time2, size).
47
+ Returns:
48
+ torch.Tensor: Transformed query tensor (#batch, n_head, time1, d_k).
49
+ torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).
50
+ torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
51
+ """
52
+ n_batch = query.size(0)
53
+ q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
54
+ k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
55
+ v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
56
+ q = q.transpose(1, 2) # (batch, head, time1, d_k)
57
+ k = k.transpose(1, 2) # (batch, head, time2, d_k)
58
+ v = v.transpose(1, 2) # (batch, head, time2, d_k)
59
+
60
+ return q, k, v
61
+
62
+ def forward_attention(self, value, scores, mask):
63
+ """Compute attention context vector.
64
+ Args:
65
+ value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k).
66
+ scores (torch.Tensor): Attention score (#batch, n_head, time1, time2).
67
+ mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2).
68
+ Returns:
69
+ torch.Tensor: Transformed value (#batch, time1, d_model)
70
+ weighted by the attention score (#batch, time1, time2).
71
+ """
72
+ n_batch = value.size(0)
73
+ if mask is not None:
74
+ mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
75
+ min_value = float(
76
+ numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min
77
+ )
78
+ scores = scores.masked_fill(mask, min_value)
79
+ self.attn = torch.softmax(scores, dim=-1).masked_fill(
80
+ mask, 0.0
81
+ ) # (batch, head, time1, time2)
82
+ else:
83
+ self.attn = torch.softmax(scores, dim=-1) # (batch, head, time1, time2)
84
+
85
+ p_attn = self.dropout(self.attn)
86
+ x = torch.matmul(p_attn, value) # (batch, head, time1, d_k)
87
+ x = (
88
+ x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
89
+ ) # (batch, time1, d_model)
90
+
91
+ return self.linear_out(x) # (batch, time1, d_model)
92
+
93
+ def forward(self, query, key, value, mask):
94
+ """Compute scaled dot product attention.
95
+ Args:
96
+ query (torch.Tensor): Query tensor (#batch, time1, size).
97
+ key (torch.Tensor): Key tensor (#batch, time2, size).
98
+ value (torch.Tensor): Value tensor (#batch, time2, size).
99
+ mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
100
+ (#batch, time1, time2).
101
+ Returns:
102
+ torch.Tensor: Output tensor (#batch, time1, d_model).
103
+ """
104
+ q, k, v = self.forward_qkv(query, key, value)
105
+ if version.parse(torch.__version__) >= version.parse("2.0") and self.flash:
106
+ n_batch = value.size(0)
107
+ x = torch.nn.functional.scaled_dot_product_attention(
108
+ q, k, v, attn_mask=mask.unsqueeze(1) if mask is not None else None, dropout_p=self.dropout_rate)
109
+ x = (
110
+ x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
111
+ ) # (batch, time1, d_model)
112
+ return self.linear_out(x)
113
+ else:
114
+ scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
115
+ return self.forward_attention(v, scores, mask)
116
+
117
+
118
+ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
119
+ """Multi-Head Attention layer with relative position encoding.
120
+ Paper: https://arxiv.org/abs/1901.02860
121
+ Args:
122
+ n_head (int): The number of heads.
123
+ n_feat (int): The number of features.
124
+ dropout_rate (float): Dropout rate.
125
+ """
126
+
127
+ def __init__(self, n_head, n_feat, dropout_rate):
128
+ """Construct an RelPositionMultiHeadedAttention object."""
129
+ super().__init__(n_head, n_feat, dropout_rate)
130
+ # linear transformation for positional encoding
131
+ self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)
132
+ # these two learnable bias are used in matrix c and matrix d
133
+ # as described in https://arxiv.org/abs/1901.02860 Section 3.3
134
+ self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))
135
+ self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))
136
+ torch.nn.init.xavier_uniform_(self.pos_bias_u)
137
+ torch.nn.init.xavier_uniform_(self.pos_bias_v)
138
+
139
+ def rel_shift(self, x, zero_triu=False):
140
+ """Compute relative positinal encoding.
141
+ Args:
142
+ x (torch.Tensor): Input tensor (batch, time, size).
143
+ zero_triu (bool): If true, return the lower triangular part of the matrix.
144
+ Returns:
145
+ torch.Tensor: Output tensor.
146
+ """
147
+ zero_pad = torch.zeros((*x.size()[:3], 1), device=x.device, dtype=x.dtype)
148
+ x_padded = torch.cat([zero_pad, x], dim=-1)
149
+
150
+ x_padded = x_padded.view(*x.size()[:2], x.size(3) + 1, x.size(2))
151
+ x = x_padded[:, :, 1:].view_as(x)
152
+
153
+ if zero_triu:
154
+ ones = torch.ones((x.size(2), x.size(3)))
155
+ x = x * torch.tril(ones, x.size(3) - x.size(2))[None, None, :, :]
156
+
157
+ return x
158
+
159
+ def forward(self, query, key, value, pos_emb, mask):
160
+ """Compute 'Scaled Dot Product Attention' with rel. positional encoding.
161
+ Args:
162
+ query (torch.Tensor): Query tensor (#batch, time1, size).
163
+ key (torch.Tensor): Key tensor (#batch, time2, size).
164
+ value (torch.Tensor): Value tensor (#batch, time2, size).
165
+ pos_emb (torch.Tensor): Positional embedding tensor (#batch, time2, size).
166
+ mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
167
+ (#batch, time1, time2).
168
+ Returns:
169
+ torch.Tensor: Output tensor (#batch, time1, d_model).
170
+ """
171
+ q, k, v = self.forward_qkv(query, key, value)
172
+ q = q.transpose(1, 2) # (batch, time1, head, d_k)
173
+
174
+ n_batch_pos = pos_emb.size(0)
175
+ p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
176
+ p = p.transpose(1, 2) # (batch, head, time1, d_k)
177
+
178
+ # (batch, head, time1, d_k)
179
+ q_with_bias_u = (q + self.pos_bias_u).transpose(1, 2)
180
+ # (batch, head, time1, d_k)
181
+ q_with_bias_v = (q + self.pos_bias_v).transpose(1, 2)
182
+
183
+ # compute attention score
184
+ # first compute matrix a and matrix c
185
+ # as described in https://arxiv.org/abs/1901.02860 Section 3.3
186
+ # (batch, head, time1, time2)
187
+ matrix_ac = torch.matmul(q_with_bias_u, k.transpose(-2, -1))
188
+
189
+ # compute matrix b and matrix d
190
+ # (batch, head, time1, time2)
191
+ matrix_bd = torch.matmul(q_with_bias_v, p.transpose(-2, -1))
192
+ matrix_bd = self.rel_shift(matrix_bd)
193
+
194
+ scores = (matrix_ac + matrix_bd) / math.sqrt(
195
+ self.d_k
196
+ ) # (batch, head, time1, time2)
197
+
198
+ return self.forward_attention(v, scores, mask)
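When `flash=True` and PyTorch >= 2.0, the forward pass above delegates the core computation to `torch.nn.functional.scaled_dot_product_attention` over per-head tensors; a standalone sketch of that underlying call (here with a boolean mask where `True` means attend):

```python
import torch
import torch.nn.functional as F

batch, heads, time, d_k = 2, 4, 10, 16
q = torch.randn(batch, heads, time, d_k)
k = torch.randn(batch, heads, time, d_k)
v = torch.randn(batch, heads, time, d_k)
mask = torch.ones(batch, time, time, dtype=torch.bool)   # (#batch, time1, time2), True = attend

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask.unsqueeze(1))
print(out.shape)  # torch.Size([2, 4, 10, 16])
```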
preprocess/tools/note_transcription/modules/commons/conformer/layers.py ADDED
@@ -0,0 +1,260 @@
1
+ from torch import nn
2
+ import torch
3
+
4
+ from ..layers import LayerNorm
5
+
6
+
7
+ class ConvolutionModule(nn.Module):
8
+ """ConvolutionModule in Conformer model.
9
+ Args:
10
+ channels (int): The number of channels of conv layers.
11
+ kernel_size (int): Kernel size of conv layers.
12
+ """
13
+
14
+ def __init__(self, channels, kernel_size, activation=nn.ReLU(), bias=True):
15
+ """Construct a ConvolutionModule object."""
16
+ super(ConvolutionModule, self).__init__()
17
+ # kernel_size should be an odd number for 'SAME' padding
18
+ assert (kernel_size - 1) % 2 == 0
19
+
20
+ self.pointwise_conv1 = nn.Conv1d(
21
+ channels,
22
+ 2 * channels,
23
+ kernel_size=1,
24
+ stride=1,
25
+ padding=0,
26
+ bias=bias,
27
+ )
28
+ self.depthwise_conv = nn.Conv1d(
29
+ channels,
30
+ channels,
31
+ kernel_size,
32
+ stride=1,
33
+ padding=(kernel_size - 1) // 2,
34
+ groups=channels,
35
+ bias=bias,
36
+ )
37
+ self.norm = nn.BatchNorm1d(channels)
38
+ self.pointwise_conv2 = nn.Conv1d(
39
+ channels,
40
+ channels,
41
+ kernel_size=1,
42
+ stride=1,
43
+ padding=0,
44
+ bias=bias,
45
+ )
46
+ self.activation = activation
47
+
48
+ def forward(self, x):
49
+ """Compute convolution module.
50
+ Args:
51
+ x (torch.Tensor): Input tensor (#batch, time, channels).
52
+ Returns:
53
+ torch.Tensor: Output tensor (#batch, time, channels).
54
+ """
55
+ # exchange the temporal dimension and the feature dimension
56
+ x = x.transpose(1, 2)
57
+
58
+ # GLU mechanism
59
+ x = self.pointwise_conv1(x) # (batch, 2*channel, time)
60
+ x = nn.functional.glu(x, dim=1) # (batch, channel, time)
61
+
62
+ # 1D Depthwise Conv
63
+ x = self.depthwise_conv(x)
64
+ x = self.activation(self.norm(x))
65
+
66
+ x = self.pointwise_conv2(x)
67
+
68
+ return x.transpose(1, 2)
69
+
70
+
71
+ class MultiLayeredConv1d(torch.nn.Module):
72
+ """Multi-layered conv1d for Transformer block.
73
+ This is a module of multi-layered conv1d designed
74
+ to replace positionwise feed-forward network
75
+ in Transformer block, which is introduced in
76
+ `FastSpeech: Fast, Robust and Controllable Text to Speech`_.
77
+ .. _`FastSpeech: Fast, Robust and Controllable Text to Speech`:
78
+ https://arxiv.org/pdf/1905.09263.pdf
79
+ """
80
+
81
+ def __init__(self, in_chans, hidden_chans, kernel_size, dropout_rate):
82
+ """Initialize MultiLayeredConv1d module.
83
+ Args:
84
+ in_chans (int): Number of input channels.
85
+ hidden_chans (int): Number of hidden channels.
86
+ kernel_size (int): Kernel size of conv1d.
87
+ dropout_rate (float): Dropout rate.
88
+ """
89
+ super(MultiLayeredConv1d, self).__init__()
90
+ self.w_1 = torch.nn.Conv1d(
91
+ in_chans,
92
+ hidden_chans,
93
+ kernel_size,
94
+ stride=1,
95
+ padding=(kernel_size - 1) // 2,
96
+ )
97
+ self.w_2 = torch.nn.Conv1d(
98
+ hidden_chans,
99
+ in_chans,
100
+ kernel_size,
101
+ stride=1,
102
+ padding=(kernel_size - 1) // 2,
103
+ )
104
+ self.dropout = torch.nn.Dropout(dropout_rate)
105
+
106
+ def forward(self, x):
107
+ """Calculate forward propagation.
108
+ Args:
109
+ x (torch.Tensor): Batch of input tensors (B, T, in_chans).
110
+ Returns:
111
+ torch.Tensor: Batch of output tensors (B, T, in_chans).
112
+ """
113
+ x = torch.relu(self.w_1(x.transpose(-1, 1))).transpose(-1, 1)
114
+ return self.w_2(self.dropout(x).transpose(-1, 1)).transpose(-1, 1)
115
+
116
+
117
+ class Swish(torch.nn.Module):
118
+ """Construct a Swish object."""
119
+
120
+ def forward(self, x):
121
+ """Return Swish activation function."""
122
+ return x * torch.sigmoid(x)
123
+
124
+
125
+ class EncoderLayer(nn.Module):
126
+ """Encoder layer module.
127
+ Args:
128
+ size (int): Input dimension.
129
+ self_attn (torch.nn.Module): Self-attention module instance.
130
+ `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance
131
+ can be used as the argument.
132
+ feed_forward (torch.nn.Module): Feed-forward module instance.
133
+ `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
134
+ can be used as the argument.
135
+ feed_forward_macaron (torch.nn.Module): Additional feed-forward module instance.
136
+ `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
137
+ can be used as the argument.
138
+ conv_module (torch.nn.Module): Convolution module instance.
139
+ `ConvolutionModule` instance can be used as the argument.
140
+ dropout_rate (float): Dropout rate.
141
+ normalize_before (bool): Whether to use layer_norm before the first block.
142
+ concat_after (bool): Whether to concat attention layer's input and output.
143
+ if True, additional linear will be applied.
144
+ i.e. x -> x + linear(concat(x, att(x)))
145
+ if False, no additional linear will be applied. i.e. x -> x + att(x)
146
+ """
147
+
148
+ def __init__(
149
+ self,
150
+ size,
151
+ self_attn,
152
+ feed_forward,
153
+ feed_forward_macaron,
154
+ conv_module,
155
+ dropout_rate,
156
+ normalize_before=True,
157
+ concat_after=False,
158
+ ):
159
+ """Construct an EncoderLayer object."""
160
+ super(EncoderLayer, self).__init__()
161
+ self.self_attn = self_attn
162
+ self.feed_forward = feed_forward
163
+ self.feed_forward_macaron = feed_forward_macaron
164
+ self.conv_module = conv_module
165
+ self.norm_ff = LayerNorm(size) # for the FNN module
166
+ self.norm_mha = LayerNorm(size) # for the MHA module
167
+ if feed_forward_macaron is not None:
168
+ self.norm_ff_macaron = LayerNorm(size)
169
+ self.ff_scale = 0.5
170
+ else:
171
+ self.ff_scale = 1.0
172
+ if self.conv_module is not None:
173
+ self.norm_conv = LayerNorm(size) # for the CNN module
174
+ self.norm_final = LayerNorm(size) # for the final output of the block
175
+ self.dropout = nn.Dropout(dropout_rate)
176
+ self.size = size
177
+ self.normalize_before = normalize_before
178
+ self.concat_after = concat_after
179
+ if self.concat_after:
180
+ self.concat_linear = nn.Linear(size + size, size)
181
+
182
+ def forward(self, x_input, mask, cache=None):
183
+ """Compute encoded features.
184
+ Args:
185
+ x_input (Union[Tuple, torch.Tensor]): Input tensor w/ or w/o pos emb.
186
+ - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)].
187
+ - w/o pos emb: Tensor (#batch, time, size).
188
+ mask (torch.Tensor): Mask tensor for the input (#batch, time).
189
+ cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
190
+ Returns:
191
+ torch.Tensor: Output tensor (#batch, time, size).
192
+ torch.Tensor: Mask tensor (#batch, time).
193
+ """
194
+ if isinstance(x_input, tuple):
195
+ x, pos_emb = x_input[0], x_input[1]
196
+ else:
197
+ x, pos_emb = x_input, None
198
+
199
+ # whether to use macaron style
200
+ if self.feed_forward_macaron is not None:
201
+ residual = x
202
+ if self.normalize_before:
203
+ x = self.norm_ff_macaron(x)
204
+ x = residual + self.ff_scale * self.dropout(self.feed_forward_macaron(x))
205
+ if not self.normalize_before:
206
+ x = self.norm_ff_macaron(x)
207
+
208
+ # multi-headed self-attention module
209
+ residual = x
210
+ if self.normalize_before:
211
+ x = self.norm_mha(x)
212
+
213
+ if cache is None:
214
+ x_q = x
215
+ else:
216
+ assert cache.shape == (x.shape[0], x.shape[1] - 1, self.size)
217
+ x_q = x[:, -1:, :]
218
+ residual = residual[:, -1:, :]
219
+ mask = None if mask is None else mask[:, -1:, :]
220
+
221
+ if pos_emb is not None:
222
+ x_att = self.self_attn(x_q, x, x, pos_emb, mask)
223
+ else:
224
+ x_att = self.self_attn(x_q, x, x, mask)
225
+
226
+ if self.concat_after:
227
+ x_concat = torch.cat((x, x_att), dim=-1)
228
+ x = residual + self.concat_linear(x_concat)
229
+ else:
230
+ x = residual + self.dropout(x_att)
231
+ if not self.normalize_before:
232
+ x = self.norm_mha(x)
233
+
234
+ # convolution module
235
+ if self.conv_module is not None:
236
+ residual = x
237
+ if self.normalize_before:
238
+ x = self.norm_conv(x)
239
+ x = residual + self.dropout(self.conv_module(x))
240
+ if not self.normalize_before:
241
+ x = self.norm_conv(x)
242
+
243
+ # feed forward module
244
+ residual = x
245
+ if self.normalize_before:
246
+ x = self.norm_ff(x)
247
+ x = residual + self.ff_scale * self.dropout(self.feed_forward(x))
248
+ if not self.normalize_before:
249
+ x = self.norm_ff(x)
250
+
251
+ if self.conv_module is not None:
252
+ x = self.norm_final(x)
253
+
254
+ if cache is not None:
255
+ x = torch.cat([cache, x], dim=1)
256
+
257
+ if pos_emb is not None:
258
+ return (x, pos_emb), mask
259
+
260
+ return x, mask
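ConvolutionModule preserves the sequence length (pointwise expansion, GLU, depthwise convolution with 'SAME' padding, pointwise projection), so it drops into a residual branch of the Conformer block. A small shape check, sketched here as an illustration only (the import path mirrors the file path and assumes the repository root is importable):

import torch
from preprocess.tools.note_transcription.modules.commons.conformer.layers import ConvolutionModule

conv = ConvolutionModule(channels=256, kernel_size=31)  # kernel size must be odd
x = torch.randn(2, 100, 256)  # (batch, time, channels)
y = conv(x)
print(y.shape)                # expected: torch.Size([2, 100, 256]) -- time and channels unchanged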
preprocess/tools/note_transcription/modules/commons/conv.py ADDED
@@ -0,0 +1,175 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+ import torch.nn.functional as F
5
+
6
+ from .layers import LayerNorm, Embedding
7
+
8
+ class LambdaLayer(nn.Module):
9
+ def __init__(self, lambd):
10
+ super(LambdaLayer, self).__init__()
11
+ self.lambd = lambd
12
+
13
+ def forward(self, x):
14
+ return self.lambd(x)
15
+
16
+ def init_weights_func(m):
17
+ classname = m.__class__.__name__
18
+ if classname.find("Conv1d") != -1:
19
+ torch.nn.init.xavier_uniform_(m.weight)
20
+
21
+ def get_norm_builder(norm_type, channels, ln_eps=1e-6):
22
+ if norm_type == 'bn':
23
+ norm_builder = lambda: nn.BatchNorm1d(channels)
24
+ elif norm_type == 'in':
25
+ norm_builder = lambda: nn.InstanceNorm1d(channels, affine=True)
26
+ elif norm_type == 'gn':
27
+ norm_builder = lambda: nn.GroupNorm(8, channels)
28
+ elif norm_type == 'ln':
29
+ norm_builder = lambda: LayerNorm(channels, dim=1, eps=ln_eps)
30
+ else:
31
+ norm_builder = lambda: nn.Identity()
32
+ return norm_builder
33
+
34
+ def get_act_builder(act_type):
35
+ if act_type == 'gelu':
36
+ act_builder = lambda: nn.GELU()
37
+ elif act_type == 'relu':
38
+ act_builder = lambda: nn.ReLU(inplace=True)
39
+ elif act_type == 'leakyrelu':
40
+ act_builder = lambda: nn.LeakyReLU(negative_slope=0.01, inplace=True)
41
+ elif act_type == 'swish':
42
+ act_builder = lambda: nn.SiLU(inplace=True)
43
+ else:
44
+ act_builder = lambda: nn.Identity()
45
+ return act_builder
46
+
47
+ class ResidualBlock(nn.Module):
48
+ """Applies norm -> conv -> activation -> conv blocks n times with residual connections."""
49
+
50
+ def __init__(self, channels, kernel_size, dilation, n=2, norm_type='bn', dropout=0.0,
51
+ c_multiple=2, ln_eps=1e-12, act_type='gelu'):
52
+ super(ResidualBlock, self).__init__()
53
+
54
+ norm_builder = get_norm_builder(norm_type, channels, ln_eps)
55
+ act_builder = get_act_builder(act_type)
56
+
57
+ self.blocks = [
58
+ nn.Sequential(
59
+ norm_builder(),
60
+ nn.Conv1d(channels, c_multiple * channels, kernel_size, dilation=dilation,
61
+ padding=(dilation * (kernel_size - 1)) // 2),
62
+ LambdaLayer(lambda x: x * kernel_size ** -0.5),
63
+ act_builder(),
64
+ nn.Conv1d(c_multiple * channels, channels, 1, dilation=dilation),
65
+ )
66
+ for i in range(n)
67
+ ]
68
+
69
+ self.blocks = nn.ModuleList(self.blocks)
70
+ self.dropout = dropout
71
+
72
+ def forward(self, x):
73
+ nonpadding = (x.abs().sum(1) > 0).float()[:, None, :]
74
+ for b in self.blocks:
75
+ x_ = b(x)
76
+ if self.dropout > 0 and self.training:
77
+ x_ = F.dropout(x_, self.dropout, training=self.training)
78
+ x = x + x_
79
+ x = x * nonpadding
80
+ return x
81
+
82
+
83
+ class ConvBlocks(nn.Module):
84
+ """Decodes the expanded phoneme encoding into spectrograms"""
85
+
86
+ def __init__(self, hidden_size, out_dims, dilations, kernel_size,
87
+ norm_type='ln', layers_in_block=2, c_multiple=2,
88
+ dropout=0.0, ln_eps=1e-5,
89
+ init_weights=True, is_BTC=True, num_layers=None, post_net_kernel=3, act_type='gelu'):
90
+ super(ConvBlocks, self).__init__()
91
+ self.is_BTC = is_BTC
92
+ if num_layers is not None:
93
+ dilations = [1] * num_layers
94
+ self.res_blocks = nn.Sequential(
95
+ *[ResidualBlock(hidden_size, kernel_size, d,
96
+ n=layers_in_block, norm_type=norm_type, c_multiple=c_multiple,
97
+ dropout=dropout, ln_eps=ln_eps, act_type=act_type)
98
+ for d in dilations],
99
+ )
100
+ norm = get_norm_builder(norm_type, hidden_size, ln_eps)()
101
+ self.last_norm = norm
102
+ self.post_net1 = nn.Conv1d(hidden_size, out_dims, kernel_size=post_net_kernel,
103
+ padding=post_net_kernel // 2)
104
+ if init_weights:
105
+ self.apply(init_weights_func)
106
+
107
+ def forward(self, x, nonpadding=None):
108
+ """
109
+
110
+ :param x: [B, T, H]
111
+ :return: [B, T, H]
112
+ """
113
+ if self.is_BTC:
114
+ x = x.transpose(1, 2)
115
+ if nonpadding is None:
116
+ nonpadding = (x.abs().sum(1) > 0).float()[:, None, :]
117
+ elif self.is_BTC:
118
+ nonpadding = nonpadding.transpose(1, 2)
119
+ x = self.res_blocks(x) * nonpadding
120
+ x = self.last_norm(x) * nonpadding
121
+ x = self.post_net1(x) * nonpadding
122
+ if self.is_BTC:
123
+ x = x.transpose(1, 2)
124
+ return x
125
+
126
+
127
+ class TextConvEncoder(ConvBlocks):
128
+ def __init__(self, dict_size, hidden_size, out_dims, dilations, kernel_size,
129
+ norm_type='ln', layers_in_block=2, c_multiple=2,
130
+ dropout=0.0, ln_eps=1e-5, init_weights=True, num_layers=None, post_net_kernel=3):
131
+ super().__init__(hidden_size, out_dims, dilations, kernel_size,
132
+ norm_type, layers_in_block, c_multiple,
133
+ dropout, ln_eps, init_weights, num_layers=num_layers,
134
+ post_net_kernel=post_net_kernel)
135
+ self.embed_tokens = Embedding(dict_size, hidden_size, 0)
136
+ self.embed_scale = math.sqrt(hidden_size)
137
+
138
+ def forward(self, txt_tokens):
139
+ """
140
+
141
+ :param txt_tokens: [B, T]
142
+ :return: {
143
+ 'encoder_out': [B x T x C]
144
+ }
145
+ """
146
+ x = self.embed_scale * self.embed_tokens(txt_tokens)
147
+ return super().forward(x)
148
+
149
+
150
+ class ConditionalConvBlocks(ConvBlocks):
151
+ def __init__(self, hidden_size, c_cond, c_out, dilations, kernel_size,
152
+ norm_type='ln', layers_in_block=2, c_multiple=2,
153
+ dropout=0.0, ln_eps=1e-5, init_weights=True, is_BTC=True, num_layers=None):
154
+ super().__init__(hidden_size, c_out, dilations, kernel_size,
155
+ norm_type, layers_in_block, c_multiple,
156
+ dropout, ln_eps, init_weights, is_BTC=False, num_layers=num_layers)
157
+ self.g_prenet = nn.Conv1d(c_cond, hidden_size, 3, padding=1)
158
+ self.is_BTC_ = is_BTC
159
+ if init_weights:
160
+ self.g_prenet.apply(init_weights_func)
161
+
162
+ def forward(self, x, cond, nonpadding=None):
163
+ if self.is_BTC_:
164
+ x = x.transpose(1, 2)
165
+ cond = cond.transpose(1, 2)
166
+ if nonpadding is not None:
167
+ nonpadding = nonpadding.transpose(1, 2)
168
+ if nonpadding is None:
169
+ nonpadding = x.abs().sum(1)[:, None]
170
+ x = x + self.g_prenet(cond)
171
+ x = x * nonpadding
172
+ x = super(ConditionalConvBlocks, self).forward(x) # input needs to be BTC
173
+ if self.is_BTC_:
174
+ x = x.transpose(1, 2)
175
+ return x
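ConvBlocks stacks dilated residual convolution blocks, treats all-zero feature frames as padding, and projects to `out_dims` with a final convolution. A hypothetical usage sketch with arbitrary hyperparameters (not values taken from this repository):

import torch
from preprocess.tools.note_transcription.modules.commons.conv import ConvBlocks

net = ConvBlocks(hidden_size=192, out_dims=80, dilations=[1, 2, 4, 8], kernel_size=5)
x = torch.randn(2, 120, 192)  # (batch, time, hidden); all-zero frames would be masked out
y = net(x)
print(y.shape)                # expected: torch.Size([2, 120, 80])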
preprocess/tools/note_transcription/modules/commons/layers.py ADDED
@@ -0,0 +1,85 @@
1
+ import torch
2
+ from torch import nn
3
+ from torch.autograd import Function
4
+
5
+ class LayerNorm(torch.nn.LayerNorm):
6
+ """Layer normalization module.
7
+ :param int nout: output dim size
8
+ :param int dim: dimension to be normalized
9
+ """
10
+
11
+ def __init__(self, nout, dim=-1, eps=1e-5):
12
+ """Construct a LayerNorm object."""
13
+ super(LayerNorm, self).__init__(nout, eps=eps)
14
+ self.dim = dim
15
+
16
+ def forward(self, x):
17
+ """Apply layer normalization.
18
+ :param torch.Tensor x: input tensor
19
+ :return: layer normalized tensor
20
+ :rtype torch.Tensor
21
+ """
22
+ if self.dim == -1:
23
+ return super(LayerNorm, self).forward(x)
24
+ return super(LayerNorm, self).forward(x.transpose(1, -1)).transpose(1, -1)
25
+
26
+
27
+ class Reshape(nn.Module):
28
+ def __init__(self, *args):
29
+ super(Reshape, self).__init__()
30
+ self.shape = args
31
+
32
+ def forward(self, x):
33
+ return x.view(self.shape)
34
+
35
+
36
+ class Permute(nn.Module):
37
+ def __init__(self, *args):
38
+ super(Permute, self).__init__()
39
+ self.args = args
40
+
41
+ def forward(self, x):
42
+ return x.permute(self.args)
43
+
44
+
45
+ def Linear(in_features, out_features, bias=True, init_type='xavier'):
46
+ m = nn.Linear(in_features, out_features, bias)
47
+ if init_type == 'xavier':
48
+ nn.init.xavier_uniform_(m.weight)
49
+ elif init_type == 'kaiming':
50
+ nn.init.kaiming_normal_(m.weight, mode='fan_in')
51
+ if bias:
52
+ nn.init.constant_(m.bias, 0.)
53
+ return m
54
+
55
+
56
+ def Embedding(num_embeddings, embedding_dim, padding_idx=None, init_type='normal'):
57
+ m = nn.Embedding(num_embeddings, embedding_dim, padding_idx=padding_idx)
58
+ if init_type == 'normal':
59
+ nn.init.normal_(m.weight, mean=0, std=embedding_dim ** -0.5)
60
+ elif init_type == 'kaiming':
61
+ nn.init.kaiming_normal_(m.weight, mode='fan_in')
62
+ if padding_idx is not None:
63
+ nn.init.constant_(m.weight[padding_idx], 0)
64
+ return m
65
+
66
+
67
+ class GradientReverseFunction(Function):
68
+ @staticmethod
69
+ def forward(ctx, input, coeff=1.):
70
+ ctx.coeff = coeff
71
+ output = input * 1.0
72
+ return output
73
+
74
+ @staticmethod
75
+ def backward(ctx, grad_output):
76
+ return grad_output.neg() * ctx.coeff, None
77
+
78
+
79
+ class GRL(nn.Module):
80
+ def __init__(self):
81
+ super(GRL, self).__init__()
82
+
83
+ def forward(self, *input):
84
+ return GradientReverseFunction.apply(*input)
85
+
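GRL is a gradient reversal layer: the forward pass is the identity while the backward pass multiplies incoming gradients by `-coeff`, the usual trick for adversarial (e.g. domain- or speaker-confusion) branches. A quick illustrative check (import path assumed from the file location):

import torch
from preprocess.tools.note_transcription.modules.commons.layers import GRL

grl = GRL()
x = torch.ones(3, 4, requires_grad=True)
y = grl(x, 0.5)        # identity in the forward pass
y.sum().backward()
print(x.grad)          # every entry is -0.5: gradients are flipped and scaled by coeff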
preprocess/tools/note_transcription/modules/commons/rel_transformer.py ADDED
@@ -0,0 +1,378 @@
1
+ import math
2
+ import torch
3
+ from torch import nn
4
+ from torch.nn import functional as F
5
+
6
+ from .layers import Embedding
7
+
8
+
9
+ def convert_pad_shape(pad_shape):
10
+ l = pad_shape[::-1]
11
+ pad_shape = [item for sublist in l for item in sublist]
12
+ return pad_shape
13
+
14
+
15
+ def shift_1d(x):
16
+ x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1]
17
+ return x
18
+
19
+
20
+ def sequence_mask(length, max_length=None):
21
+ if max_length is None:
22
+ max_length = length.max()
23
+ x = torch.arange(max_length, dtype=length.dtype, device=length.device)
24
+ return x.unsqueeze(0) < length.unsqueeze(1)
25
+
26
+
27
+ class Encoder(nn.Module):
28
+ def __init__(self, hidden_channels, filter_channels, n_heads, n_layers, kernel_size=1, p_dropout=0.,
29
+ window_size=None, block_length=None, pre_ln=False, **kwargs):
30
+ super().__init__()
31
+ self.hidden_channels = hidden_channels
32
+ self.filter_channels = filter_channels
33
+ self.n_heads = n_heads
34
+ self.n_layers = n_layers
35
+ self.kernel_size = kernel_size
36
+ self.p_dropout = p_dropout
37
+ self.window_size = window_size
38
+ self.block_length = block_length
39
+ self.pre_ln = pre_ln
40
+
41
+ self.drop = nn.Dropout(p_dropout)
42
+ self.attn_layers = nn.ModuleList()
43
+ self.norm_layers_1 = nn.ModuleList()
44
+ self.ffn_layers = nn.ModuleList()
45
+ self.norm_layers_2 = nn.ModuleList()
46
+ for i in range(self.n_layers):
47
+ self.attn_layers.append(
48
+ MultiHeadAttention(hidden_channels, hidden_channels, n_heads, window_size=window_size,
49
+ p_dropout=p_dropout, block_length=block_length))
50
+ self.norm_layers_1.append(LayerNorm(hidden_channels))
51
+ self.ffn_layers.append(
52
+ FFN(hidden_channels, hidden_channels, filter_channels, kernel_size, p_dropout=p_dropout))
53
+ self.norm_layers_2.append(LayerNorm(hidden_channels))
54
+ if pre_ln:
55
+ self.last_ln = LayerNorm(hidden_channels)
56
+
57
+ def forward(self, x, x_mask):
58
+ attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
59
+ for i in range(self.n_layers):
60
+ x = x * x_mask
61
+ x_ = x
62
+ if self.pre_ln:
63
+ x = self.norm_layers_1[i](x)
64
+ y = self.attn_layers[i](x, x, attn_mask)
65
+ y = self.drop(y)
66
+ x = x_ + y
67
+ if not self.pre_ln:
68
+ x = self.norm_layers_1[i](x)
69
+
70
+ x_ = x
71
+ if self.pre_ln:
72
+ x = self.norm_layers_2[i](x)
73
+ y = self.ffn_layers[i](x, x_mask)
74
+ y = self.drop(y)
75
+ x = x_ + y
76
+ if not self.pre_ln:
77
+ x = self.norm_layers_2[i](x)
78
+ if self.pre_ln:
79
+ x = self.last_ln(x)
80
+ x = x * x_mask
81
+ return x
82
+
83
+
84
+ class MultiHeadAttention(nn.Module):
85
+ def __init__(self, channels, out_channels, n_heads, window_size=None, heads_share=True, p_dropout=0.,
86
+ block_length=None, proximal_bias=False, proximal_init=False):
87
+ super().__init__()
88
+ assert channels % n_heads == 0
89
+
90
+ self.channels = channels
91
+ self.out_channels = out_channels
92
+ self.n_heads = n_heads
93
+ self.window_size = window_size
94
+ self.heads_share = heads_share
95
+ self.block_length = block_length
96
+ self.proximal_bias = proximal_bias
97
+ self.p_dropout = p_dropout
98
+ self.attn = None
99
+
100
+ self.k_channels = channels // n_heads
101
+ self.conv_q = nn.Conv1d(channels, channels, 1)
102
+ self.conv_k = nn.Conv1d(channels, channels, 1)
103
+ self.conv_v = nn.Conv1d(channels, channels, 1)
104
+ if window_size is not None:
105
+ n_heads_rel = 1 if heads_share else n_heads
106
+ rel_stddev = self.k_channels ** -0.5
107
+ self.emb_rel_k = nn.Parameter(torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev)
108
+ self.emb_rel_v = nn.Parameter(torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels) * rel_stddev)
109
+ self.conv_o = nn.Conv1d(channels, out_channels, 1)
110
+ self.drop = nn.Dropout(p_dropout)
111
+
112
+ nn.init.xavier_uniform_(self.conv_q.weight)
113
+ nn.init.xavier_uniform_(self.conv_k.weight)
114
+ if proximal_init:
115
+ self.conv_k.weight.data.copy_(self.conv_q.weight.data)
116
+ self.conv_k.bias.data.copy_(self.conv_q.bias.data)
117
+ nn.init.xavier_uniform_(self.conv_v.weight)
118
+
119
+ def forward(self, x, c, attn_mask=None):
120
+ q = self.conv_q(x)
121
+ k = self.conv_k(c)
122
+ v = self.conv_v(c)
123
+
124
+ x, self.attn = self.attention(q, k, v, mask=attn_mask)
125
+
126
+ x = self.conv_o(x)
127
+ return x
128
+
129
+ def attention(self, query, key, value, mask=None):
130
+ # reshape [b, d, t] -> [b, n_h, t, d_k]
131
+ b, d, t_s, t_t = (*key.size(), query.size(2))
132
+ query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3)
133
+ key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
134
+ value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
135
+
136
+ scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.k_channels)
137
+ if self.window_size is not None:
138
+ assert t_s == t_t, "Relative attention is only available for self-attention."
139
+ key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s)
140
+ rel_logits = self._matmul_with_relative_keys(query, key_relative_embeddings)
141
+ rel_logits = self._relative_position_to_absolute_position(rel_logits)
142
+ scores_local = rel_logits / math.sqrt(self.k_channels)
143
+ scores = scores + scores_local
144
+ if self.proximal_bias:
145
+ assert t_s == t_t, "Proximal bias is only available for self-attention."
146
+ scores = scores + self._attention_bias_proximal(t_s).to(device=scores.device, dtype=scores.dtype)
147
+ if mask is not None:
148
+ scores = scores.masked_fill(mask == 0, -1e4)
149
+ if self.block_length is not None:
150
+ block_mask = torch.ones_like(scores).triu(-self.block_length).tril(self.block_length)
151
+ scores = scores * block_mask + -1e4 * (1 - block_mask)
152
+ p_attn = F.softmax(scores, dim=-1) # [b, n_h, t_t, t_s]
153
+ p_attn = self.drop(p_attn)
154
+ output = torch.matmul(p_attn, value)
155
+ if self.window_size is not None:
156
+ relative_weights = self._absolute_position_to_relative_position(p_attn)
157
+ value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, t_s)
158
+ output = output + self._matmul_with_relative_values(relative_weights, value_relative_embeddings)
159
+ output = output.transpose(2, 3).contiguous().view(b, d, t_t) # [b, n_h, t_t, d_k] -> [b, d, t_t]
160
+ return output, p_attn
161
+
162
+ def _matmul_with_relative_values(self, x, y):
163
+ """
164
+ x: [b, h, l, m]
165
+ y: [h or 1, m, d]
166
+ ret: [b, h, l, d]
167
+ """
168
+ ret = torch.matmul(x, y.unsqueeze(0))
169
+ return ret
170
+
171
+ def _matmul_with_relative_keys(self, x, y):
172
+ """
173
+ x: [b, h, l, d]
174
+ y: [h or 1, m, d]
175
+ ret: [b, h, l, m]
176
+ """
177
+ ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1))
178
+ return ret
179
+
180
+ def _get_relative_embeddings(self, relative_embeddings, length):
181
+ max_relative_position = 2 * self.window_size + 1
182
+ # Pad first before slice to avoid using cond ops.
183
+ pad_length = max(length - (self.window_size + 1), 0)
184
+ slice_start_position = max((self.window_size + 1) - length, 0)
185
+ slice_end_position = slice_start_position + 2 * length - 1
186
+ if pad_length > 0:
187
+ padded_relative_embeddings = F.pad(
188
+ relative_embeddings,
189
+ convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]))
190
+ else:
191
+ padded_relative_embeddings = relative_embeddings
192
+ used_relative_embeddings = padded_relative_embeddings[:, slice_start_position:slice_end_position]
193
+ return used_relative_embeddings
194
+
195
+ def _relative_position_to_absolute_position(self, x):
196
+ """
197
+ x: [b, h, l, 2*l-1]
198
+ ret: [b, h, l, l]
199
+ """
200
+ batch, heads, length, _ = x.size()
201
+ # Concat columns of pad to shift from relative to absolute indexing.
202
+ x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]]))
203
+
204
+ # Concat extra elements so to add up to shape (len+1, 2*len-1).
205
+ x_flat = x.view([batch, heads, length * 2 * length])
206
+ x_flat = F.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [0, length - 1]]))
207
+
208
+ # Reshape and slice out the padded elements.
209
+ x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[:, :, :length, length - 1:]
210
+ return x_final
211
+
212
+ def _absolute_position_to_relative_position(self, x):
213
+ """
214
+ x: [b, h, l, l]
215
+ ret: [b, h, l, 2*l-1]
216
+ """
217
+ batch, heads, length, _ = x.size()
218
+ # pad along column
219
+ x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]]))
220
+ x_flat = x.view([batch, heads, length ** 2 + length * (length - 1)])
221
+ # add 0's in the beginning that will skew the elements after reshape
222
+ x_flat = F.pad(x_flat, convert_pad_shape([[0, 0], [0, 0], [length, 0]]))
223
+ x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:]
224
+ return x_final
225
+
226
+ def _attention_bias_proximal(self, length):
227
+ """Bias for self-attention to encourage attention to close positions.
228
+ Args:
229
+ length: an integer scalar.
230
+ Returns:
231
+ a Tensor with shape [1, 1, length, length]
232
+ """
233
+ r = torch.arange(length, dtype=torch.float32)
234
+ diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1)
235
+ return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0)
236
+
237
+
238
+ class FFN(nn.Module):
239
+ def __init__(self, in_channels, out_channels, filter_channels, kernel_size, p_dropout=0., activation=None):
240
+ super().__init__()
241
+ self.in_channels = in_channels
242
+ self.out_channels = out_channels
243
+ self.filter_channels = filter_channels
244
+ self.kernel_size = kernel_size
245
+ self.p_dropout = p_dropout
246
+ self.activation = activation
247
+
248
+ self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size, padding=kernel_size // 2)
249
+ self.conv_2 = nn.Conv1d(filter_channels, out_channels, 1)
250
+ self.drop = nn.Dropout(p_dropout)
251
+
252
+ def forward(self, x, x_mask):
253
+ x = self.conv_1(x * x_mask)
254
+ if self.activation == "gelu":
255
+ x = x * torch.sigmoid(1.702 * x)
256
+ else:
257
+ x = torch.relu(x)
258
+ x = self.drop(x)
259
+ x = self.conv_2(x * x_mask)
260
+ return x * x_mask
261
+
262
+
263
+ class LayerNorm(nn.Module):
264
+ def __init__(self, channels, eps=1e-4):
265
+ super().__init__()
266
+ self.channels = channels
267
+ self.eps = eps
268
+
269
+ self.gamma = nn.Parameter(torch.ones(channels))
270
+ self.beta = nn.Parameter(torch.zeros(channels))
271
+
272
+ def forward(self, x):
273
+ n_dims = len(x.shape)
274
+ mean = torch.mean(x, 1, keepdim=True)
275
+ variance = torch.mean((x - mean) ** 2, 1, keepdim=True)
276
+
277
+ x = (x - mean) * torch.rsqrt(variance + self.eps)
278
+
279
+ shape = [1, -1] + [1] * (n_dims - 2)
280
+ x = x * self.gamma.view(*shape) + self.beta.view(*shape)
281
+ return x
282
+
283
+
284
+ class ConvReluNorm(nn.Module):
285
+ def __init__(self, in_channels, hidden_channels, out_channels, kernel_size, n_layers, p_dropout):
286
+ super().__init__()
287
+ self.in_channels = in_channels
288
+ self.hidden_channels = hidden_channels
289
+ self.out_channels = out_channels
290
+ self.kernel_size = kernel_size
291
+ self.n_layers = n_layers
292
+ self.p_dropout = p_dropout
293
+ assert n_layers > 1, "Number of layers should be larger than 1."
294
+
295
+ self.conv_layers = nn.ModuleList()
296
+ self.norm_layers = nn.ModuleList()
297
+ self.conv_layers.append(nn.Conv1d(in_channels, hidden_channels, kernel_size, padding=kernel_size // 2))
298
+ self.norm_layers.append(LayerNorm(hidden_channels))
299
+ self.relu_drop = nn.Sequential(
300
+ nn.ReLU(),
301
+ nn.Dropout(p_dropout))
302
+ for _ in range(n_layers - 1):
303
+ self.conv_layers.append(nn.Conv1d(hidden_channels, hidden_channels, kernel_size, padding=kernel_size // 2))
304
+ self.norm_layers.append(LayerNorm(hidden_channels))
305
+ self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
306
+ self.proj.weight.data.zero_()
307
+ self.proj.bias.data.zero_()
308
+
309
+ def forward(self, x, x_mask):
310
+ x_org = x
311
+ for i in range(self.n_layers):
312
+ x = self.conv_layers[i](x * x_mask)
313
+ x = self.norm_layers[i](x)
314
+ x = self.relu_drop(x)
315
+ x = x_org + self.proj(x)
316
+ return x * x_mask
317
+
318
+
319
+ class RelTransformerEncoder(nn.Module):
320
+ def __init__(self,
321
+ n_vocab,
322
+ out_channels,
323
+ hidden_channels,
324
+ filter_channels,
325
+ n_heads,
326
+ n_layers,
327
+ kernel_size,
328
+ p_dropout=0.0,
329
+ window_size=4,
330
+ block_length=None,
331
+ prenet=True,
332
+ pre_ln=True,
333
+ ):
334
+
335
+ super().__init__()
336
+
337
+ self.n_vocab = n_vocab
338
+ self.out_channels = out_channels
339
+ self.hidden_channels = hidden_channels
340
+ self.filter_channels = filter_channels
341
+ self.n_heads = n_heads
342
+ self.n_layers = n_layers
343
+ self.kernel_size = kernel_size
344
+ self.p_dropout = p_dropout
345
+ self.window_size = window_size
346
+ self.block_length = block_length
347
+ self.prenet = prenet
348
+ if n_vocab > 0:
349
+ self.emb = Embedding(n_vocab, hidden_channels, padding_idx=0)
350
+
351
+ if prenet:
352
+ self.pre = ConvReluNorm(hidden_channels, hidden_channels, hidden_channels,
353
+ kernel_size=5, n_layers=3, p_dropout=0)
354
+ self.encoder = Encoder(
355
+ hidden_channels,
356
+ filter_channels,
357
+ n_heads,
358
+ n_layers,
359
+ kernel_size,
360
+ p_dropout,
361
+ window_size=window_size,
362
+ block_length=block_length,
363
+ pre_ln=pre_ln,
364
+ )
365
+
366
+ def forward(self, x, x_mask=None):
367
+ if self.n_vocab > 0:
368
+ x_lengths = (x > 0).long().sum(-1)
369
+ x = self.emb(x) * math.sqrt(self.hidden_channels) # [b, t, h]
370
+ else:
371
+ x_lengths = (x.abs().sum(-1) > 0).long().sum(-1)
372
+ x = torch.transpose(x, 1, -1) # [b, h, t]
373
+ x_mask = torch.unsqueeze(sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
374
+
375
+ if self.prenet:
376
+ x = self.pre(x, x_mask)
377
+ x = self.encoder(x, x_mask)
378
+ return x.transpose(1, 2)
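RelTransformerEncoder embeds token IDs (index 0 is padding), infers the mask from non-zero tokens, optionally applies a convolutional prenet, and runs self-attention with windowed relative position embeddings. A hedged usage sketch; the hyperparameters below are placeholders, not values used by this repository:

import torch
from preprocess.tools.note_transcription.modules.commons.rel_transformer import RelTransformerEncoder

enc = RelTransformerEncoder(
    n_vocab=100, out_channels=192, hidden_channels=192, filter_channels=768,
    n_heads=2, n_layers=4, kernel_size=3,
)
tokens = torch.randint(1, 100, (2, 30))  # 0 is reserved for padding
h = enc(tokens)
print(h.shape)                           # expected: torch.Size([2, 30, 192])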
preprocess/tools/note_transcription/modules/commons/rnn.py ADDED
@@ -0,0 +1,261 @@
1
+ import torch
2
+ from torch import nn
3
+ import torch.nn.functional as F
4
+
5
+
6
+ class PreNet(nn.Module):
7
+ def __init__(self, in_dims, fc1_dims=256, fc2_dims=128, dropout=0.5):
8
+ super().__init__()
9
+ self.fc1 = nn.Linear(in_dims, fc1_dims)
10
+ self.fc2 = nn.Linear(fc1_dims, fc2_dims)
11
+ self.p = dropout
12
+
13
+ def forward(self, x):
14
+ x = self.fc1(x)
15
+ x = F.relu(x)
16
+ x = F.dropout(x, self.p, training=self.training)
17
+ x = self.fc2(x)
18
+ x = F.relu(x)
19
+ x = F.dropout(x, self.p, training=self.training)
20
+ return x
21
+
22
+
23
+ class HighwayNetwork(nn.Module):
24
+ def __init__(self, size):
25
+ super().__init__()
26
+ self.W1 = nn.Linear(size, size)
27
+ self.W2 = nn.Linear(size, size)
28
+ self.W1.bias.data.fill_(0.)
29
+
30
+ def forward(self, x):
31
+ x1 = self.W1(x)
32
+ x2 = self.W2(x)
33
+ g = torch.sigmoid(x2)
34
+ y = g * F.relu(x1) + (1. - g) * x
35
+ return y
36
+
37
+
38
+ class BatchNormConv(nn.Module):
39
+ def __init__(self, in_channels, out_channels, kernel, relu=True):
40
+ super().__init__()
41
+ self.conv = nn.Conv1d(in_channels, out_channels, kernel, stride=1, padding=kernel // 2, bias=False)
42
+ self.bnorm = nn.BatchNorm1d(out_channels)
43
+ self.relu = relu
44
+
45
+ def forward(self, x):
46
+ x = self.conv(x)
47
+ x = F.relu(x) if self.relu is True else x
48
+ return self.bnorm(x)
49
+
50
+
51
+ class ConvNorm(torch.nn.Module):
52
+ def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
53
+ padding=None, dilation=1, bias=True, w_init_gain='linear'):
54
+ super(ConvNorm, self).__init__()
55
+ if padding is None:
56
+ assert (kernel_size % 2 == 1)
57
+ padding = int(dilation * (kernel_size - 1) / 2)
58
+
59
+ self.conv = torch.nn.Conv1d(in_channels, out_channels,
60
+ kernel_size=kernel_size, stride=stride,
61
+ padding=padding, dilation=dilation,
62
+ bias=bias)
63
+
64
+ torch.nn.init.xavier_uniform_(
65
+ self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
66
+
67
+ def forward(self, signal):
68
+ conv_signal = self.conv(signal)
69
+ return conv_signal
70
+
71
+
72
+ class CBHG(nn.Module):
73
+ def __init__(self, K, in_channels, channels, proj_channels, num_highways):
74
+ super().__init__()
75
+
76
+ # List of all rnns to call `flatten_parameters()` on
77
+ self._to_flatten = []
78
+
79
+ self.bank_kernels = [i for i in range(1, K + 1)]
80
+ self.conv1d_bank = nn.ModuleList()
81
+ for k in self.bank_kernels:
82
+ conv = BatchNormConv(in_channels, channels, k)
83
+ self.conv1d_bank.append(conv)
84
+
85
+ self.maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
86
+
87
+ self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
88
+ self.conv_project2 = BatchNormConv(proj_channels[0], proj_channels[1], 3, relu=False)
89
+
90
+ # Fix the highway input if necessary
91
+ if proj_channels[-1] != channels:
92
+ self.highway_mismatch = True
93
+ self.pre_highway = nn.Linear(proj_channels[-1], channels, bias=False)
94
+ else:
95
+ self.highway_mismatch = False
96
+
97
+ self.highways = nn.ModuleList()
98
+ for i in range(num_highways):
99
+ hn = HighwayNetwork(channels)
100
+ self.highways.append(hn)
101
+
102
+ self.rnn = nn.GRU(channels, channels, batch_first=True, bidirectional=True)
103
+ self._to_flatten.append(self.rnn)
104
+
105
+ # Avoid fragmentation of RNN parameters and associated warning
106
+ self._flatten_parameters()
107
+
108
+ def forward(self, x):
109
+ # Although we `_flatten_parameters()` on init, when using DataParallel
110
+ # the model gets replicated, making it no longer guaranteed that the
111
+ # weights are contiguous in GPU memory. Hence, we must call it again
112
+ self._flatten_parameters()
113
+
114
+ # Save these for later
115
+ residual = x
116
+ seq_len = x.size(-1)
117
+ conv_bank = []
118
+
119
+ # Convolution Bank
120
+ for conv in self.conv1d_bank:
121
+ c = conv(x) # Convolution
122
+ conv_bank.append(c[:, :, :seq_len])
123
+
124
+ # Stack along the channel axis
125
+ conv_bank = torch.cat(conv_bank, dim=1)
126
+
127
+ # dump the last padding to fit residual
128
+ x = self.maxpool(conv_bank)[:, :, :seq_len]
129
+
130
+ # Conv1d projections
131
+ x = self.conv_project1(x)
132
+ x = self.conv_project2(x)
133
+
134
+ # Residual Connect
135
+ x = x + residual
136
+
137
+ # Through the highways
138
+ x = x.transpose(1, 2)
139
+ if self.highway_mismatch is True:
140
+ x = self.pre_highway(x)
141
+ for h in self.highways:
142
+ x = h(x)
143
+
144
+ # And then the RNN
145
+ x, _ = self.rnn(x)
146
+ return x
147
+
148
+ def _flatten_parameters(self):
149
+ """Calls `flatten_parameters` on all the rnns used by the WaveRNN. Used
150
+ to improve efficiency and avoid PyTorch yelling at us."""
151
+ [m.flatten_parameters() for m in self._to_flatten]
152
+
153
+
154
+ class TacotronEncoder(nn.Module):
155
+ def __init__(self, embed_dims, num_chars, cbhg_channels, K, num_highways, dropout):
156
+ super().__init__()
157
+ self.embedding = nn.Embedding(num_chars, embed_dims)
158
+ self.pre_net = PreNet(embed_dims, embed_dims, embed_dims, dropout=dropout)
159
+ self.cbhg = CBHG(K=K, in_channels=cbhg_channels, channels=cbhg_channels,
160
+ proj_channels=[cbhg_channels, cbhg_channels],
161
+ num_highways=num_highways)
162
+ self.proj_out = nn.Linear(cbhg_channels * 2, cbhg_channels)
163
+
164
+ def forward(self, x):
165
+ x = self.embedding(x)
166
+ x = self.pre_net(x)
167
+ x.transpose_(1, 2)
168
+ x = self.cbhg(x)
169
+ x = self.proj_out(x)
170
+ return x
171
+
172
+
173
+ class RNNEncoder(nn.Module):
174
+ def __init__(self, num_chars, embedding_dim, n_convolutions=3, kernel_size=5):
175
+ super(RNNEncoder, self).__init__()
176
+ self.embedding = nn.Embedding(num_chars, embedding_dim, padding_idx=0)
177
+ convolutions = []
178
+ for _ in range(n_convolutions):
179
+ conv_layer = nn.Sequential(
180
+ ConvNorm(embedding_dim,
181
+ embedding_dim,
182
+ kernel_size=kernel_size, stride=1,
183
+ padding=int((kernel_size - 1) / 2),
184
+ dilation=1, w_init_gain='relu'),
185
+ nn.BatchNorm1d(embedding_dim))
186
+ convolutions.append(conv_layer)
187
+ self.convolutions = nn.ModuleList(convolutions)
188
+
189
+ self.lstm = nn.LSTM(embedding_dim, int(embedding_dim / 2), 1,
190
+ batch_first=True, bidirectional=True)
191
+
192
+ def forward(self, x):
193
+ input_lengths = (x > 0).sum(-1)
194
+ input_lengths = input_lengths.cpu().numpy()
195
+
196
+ x = self.embedding(x)
197
+ x = x.transpose(1, 2) # [B, H, T]
198
+ for conv in self.convolutions:
199
+ x = F.dropout(F.relu(conv(x)), 0.5, self.training) + x
200
+ x = x.transpose(1, 2) # [B, T, H]
201
+
202
+ # PyTorch tensors are not reversible, hence the conversion
203
+ x = nn.utils.rnn.pack_padded_sequence(x, input_lengths, batch_first=True, enforce_sorted=False)
204
+
205
+ self.lstm.flatten_parameters()
206
+ outputs, _ = self.lstm(x)
207
+ outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
208
+
209
+ return outputs
210
+
211
+
212
+ class DecoderRNN(torch.nn.Module):
213
+ def __init__(self, hidden_size, decoder_rnn_dim, dropout):
214
+ super(DecoderRNN, self).__init__()
215
+ self.in_conv1d = nn.Sequential(
216
+ torch.nn.Conv1d(
217
+ in_channels=hidden_size,
218
+ out_channels=hidden_size,
219
+ kernel_size=9, padding=4,
220
+ ),
221
+ torch.nn.ReLU(),
222
+ torch.nn.Conv1d(
223
+ in_channels=hidden_size,
224
+ out_channels=hidden_size,
225
+ kernel_size=9, padding=4,
226
+ ),
227
+ )
228
+ self.ln = nn.LayerNorm(hidden_size)
229
+ if decoder_rnn_dim == 0:
230
+ decoder_rnn_dim = hidden_size * 2
231
+ self.rnn = torch.nn.LSTM(
232
+ input_size=hidden_size,
233
+ hidden_size=decoder_rnn_dim,
234
+ num_layers=1,
235
+ batch_first=True,
236
+ bidirectional=True,
237
+ dropout=dropout
238
+ )
239
+ self.rnn.flatten_parameters()
240
+ self.conv1d = torch.nn.Conv1d(
241
+ in_channels=decoder_rnn_dim * 2,
242
+ out_channels=hidden_size,
243
+ kernel_size=3,
244
+ padding=1,
245
+ )
246
+
247
+ def forward(self, x):
248
+ input_masks = x.abs().sum(-1).ne(0).data[:, :, None]
249
+ input_lengths = input_masks.sum([-1, -2])
250
+ input_lengths = input_lengths.cpu().numpy()
251
+
252
+ x = self.in_conv1d(x.transpose(1, 2)).transpose(1, 2)
253
+ x = self.ln(x)
254
+ x = nn.utils.rnn.pack_padded_sequence(x, input_lengths, batch_first=True, enforce_sorted=False)
255
+ self.rnn.flatten_parameters()
256
+ x, _ = self.rnn(x) # [B, T, C]
257
+ x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True)
258
+ x = x * input_masks
259
+ pre_mel = self.conv1d(x.transpose(1, 2)).transpose(1, 2) # [B, T, C]
260
+ pre_mel = pre_mel * input_masks
261
+ return pre_mel
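TacotronEncoder chains an embedding, a PreNet, the CBHG block (convolution bank, projections, highway layers, bidirectional GRU) and a projection back to `cbhg_channels`. A rough usage sketch with placeholder hyperparameters (the import path mirrors the file path and is an assumption):

import torch
from preprocess.tools.note_transcription.modules.commons.rnn import TacotronEncoder

enc = TacotronEncoder(embed_dims=256, num_chars=100, cbhg_channels=256,
                      K=8, num_highways=4, dropout=0.5)
tokens = torch.randint(0, 100, (2, 40))  # (batch, time) token IDs
h = enc(tokens)
print(h.shape)                           # expected: torch.Size([2, 40, 256])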
preprocess/tools/note_transcription/modules/commons/transformer.py ADDED
@@ -0,0 +1,751 @@
1
+ import math
2
+ import torch
3
+ from torch import nn
4
+ from torch.nn import Parameter, Linear
5
+ from .layers import LayerNorm, Embedding
6
+ from ...utils.nn.seq_utils import (
7
+ get_incremental_state,
8
+ set_incremental_state,
9
+ softmax,
10
+ make_positions,
11
+ )
12
+ import torch.nn.functional as F
13
+
14
+ DEFAULT_MAX_SOURCE_POSITIONS = 2000
15
+ DEFAULT_MAX_TARGET_POSITIONS = 2000
16
+
17
+
18
+ class SinusoidalPositionalEmbedding(nn.Module):
19
+ """This module produces sinusoidal positional embeddings of any length.
20
+
21
+ Padding symbols are ignored.
22
+ """
23
+
24
+ def __init__(self, embedding_dim, padding_idx, init_size=1024):
25
+ super().__init__()
26
+ self.embedding_dim = embedding_dim
27
+ self.padding_idx = padding_idx
28
+ self.weights = SinusoidalPositionalEmbedding.get_embedding(
29
+ init_size,
30
+ embedding_dim,
31
+ padding_idx,
32
+ )
33
+ self.register_buffer('_float_tensor', torch.FloatTensor(1))
34
+
35
+ @staticmethod
36
+ def get_embedding(num_embeddings, embedding_dim, padding_idx=None):
37
+ """Build sinusoidal embeddings.
38
+
39
+ This matches the implementation in tensor2tensor, but differs slightly
40
+ from the description in Section 3.5 of "Attention Is All You Need".
41
+ """
42
+ half_dim = embedding_dim // 2
43
+ emb = math.log(10000) / (half_dim - 1)
44
+ emb = torch.exp(torch.arange(half_dim, dtype=torch.float) * -emb)
45
+ emb = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * emb.unsqueeze(0)
46
+ emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1).view(num_embeddings, -1)
47
+ if embedding_dim % 2 == 1:
48
+ # zero pad
49
+ emb = torch.cat([emb, torch.zeros(num_embeddings, 1)], dim=1)
50
+ if padding_idx is not None:
51
+ emb[padding_idx, :] = 0
52
+ return emb
53
+
54
+ def forward(self, input, incremental_state=None, timestep=None, positions=None, **kwargs):
55
+ """Input is expected to be of size [bsz x seqlen]."""
56
+ bsz, seq_len = input.shape[:2]
57
+ max_pos = self.padding_idx + 1 + seq_len
58
+ if self.weights is None or max_pos > self.weights.size(0):
59
+ # recompute/expand embeddings if needed
60
+ self.weights = SinusoidalPositionalEmbedding.get_embedding(
61
+ max_pos,
62
+ self.embedding_dim,
63
+ self.padding_idx,
64
+ )
65
+ self.weights = self.weights.to(self._float_tensor)
66
+
67
+ if incremental_state is not None:
68
+ # positions is the same for every token when decoding a single step
69
+ pos = timestep.view(-1)[0] + 1 if timestep is not None else seq_len
70
+ return self.weights[self.padding_idx + pos, :].expand(bsz, 1, -1)
71
+
72
+ positions = make_positions(input, self.padding_idx) if positions is None else positions
73
+ return self.weights.index_select(0, positions.view(-1)).view(bsz, seq_len, -1).detach()
74
+
75
+ def max_positions(self):
76
+ """Maximum number of supported positions."""
77
+ return int(1e5) # an arbitrary large number
78
+
79
+
80
+ class TransformerFFNLayer(nn.Module):
81
+ def __init__(self, hidden_size, filter_size, padding="SAME", kernel_size=1, dropout=0., act='gelu'):
82
+ super().__init__()
83
+ self.kernel_size = kernel_size
84
+ self.dropout = dropout
85
+ self.act = act
86
+ if padding == 'SAME':
87
+ self.ffn_1 = nn.Conv1d(hidden_size, filter_size, kernel_size, padding=kernel_size // 2)
88
+ elif padding == 'LEFT':
89
+ self.ffn_1 = nn.Sequential(
90
+ nn.ConstantPad1d((kernel_size - 1, 0), 0.0),
91
+ nn.Conv1d(hidden_size, filter_size, kernel_size)
92
+ )
93
+ self.ffn_2 = Linear(filter_size, hidden_size)
94
+
95
+ def forward(self, x, incremental_state=None):
96
+ # x: T x B x C
97
+ if incremental_state is not None:
98
+ saved_state = self._get_input_buffer(incremental_state)
99
+ if 'prev_input' in saved_state:
100
+ prev_input = saved_state['prev_input']
101
+ x = torch.cat((prev_input, x), dim=0)
102
+ x = x[-self.kernel_size:]
103
+ saved_state['prev_input'] = x
104
+ self._set_input_buffer(incremental_state, saved_state)
105
+
106
+ x = self.ffn_1(x.permute(1, 2, 0)).permute(2, 0, 1)
107
+ x = x * self.kernel_size ** -0.5
108
+
109
+ if incremental_state is not None:
110
+ x = x[-1:]
111
+ if self.act == 'gelu':
112
+ x = F.gelu(x)
113
+ if self.act == 'relu':
114
+ x = F.relu(x)
115
+ x = F.dropout(x, self.dropout, training=self.training)
116
+ x = self.ffn_2(x)
117
+ return x
118
+
119
+ def _get_input_buffer(self, incremental_state):
120
+ return get_incremental_state(
121
+ self,
122
+ incremental_state,
123
+ 'f',
124
+ ) or {}
125
+
126
+ def _set_input_buffer(self, incremental_state, buffer):
127
+ set_incremental_state(
128
+ self,
129
+ incremental_state,
130
+ 'f',
131
+ buffer,
132
+ )
133
+
134
+ def clear_buffer(self, incremental_state):
135
+ if incremental_state is not None:
136
+ saved_state = self._get_input_buffer(incremental_state)
137
+ if 'prev_input' in saved_state:
138
+ del saved_state['prev_input']
139
+ self._set_input_buffer(incremental_state, saved_state)
140
+
141
+
142
+ class MultiheadAttention(nn.Module):
143
+ def __init__(self, embed_dim, num_heads, kdim=None, vdim=None, dropout=0., bias=True,
144
+ add_bias_kv=False, add_zero_attn=False, self_attention=False,
145
+ encoder_decoder_attention=False):
146
+ super().__init__()
147
+ self.embed_dim = embed_dim
148
+ self.kdim = kdim if kdim is not None else embed_dim
149
+ self.vdim = vdim if vdim is not None else embed_dim
150
+ self.qkv_same_dim = self.kdim == embed_dim and self.vdim == embed_dim
151
+
152
+ self.num_heads = num_heads
153
+ self.dropout = dropout
154
+ self.head_dim = embed_dim // num_heads
155
+ assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
156
+ self.scaling = self.head_dim ** -0.5
157
+
158
+ self.self_attention = self_attention
159
+ self.encoder_decoder_attention = encoder_decoder_attention
160
+
161
+ assert not self.self_attention or self.qkv_same_dim, 'Self-attention requires query, key and ' \
162
+ 'value to be of the same size'
163
+
164
+ if self.qkv_same_dim:
165
+ self.in_proj_weight = Parameter(torch.Tensor(3 * embed_dim, embed_dim))
166
+ else:
167
+ self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim))
168
+ self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim))
169
+ self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
170
+
171
+ if bias:
172
+ self.in_proj_bias = Parameter(torch.Tensor(3 * embed_dim))
173
+ else:
174
+ self.register_parameter('in_proj_bias', None)
175
+
176
+ self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
177
+
178
+ if add_bias_kv:
179
+ self.bias_k = Parameter(torch.Tensor(1, 1, embed_dim))
180
+ self.bias_v = Parameter(torch.Tensor(1, 1, embed_dim))
181
+ else:
182
+ self.bias_k = self.bias_v = None
183
+
184
+ self.add_zero_attn = add_zero_attn
185
+
186
+ self.reset_parameters()
187
+
188
+ self.enable_torch_version = False
189
+ if hasattr(F, "multi_head_attention_forward"):
190
+ self.enable_torch_version = True
191
+ else:
192
+ self.enable_torch_version = False
193
+ self.last_attn_probs = None
194
+
195
+ def reset_parameters(self):
196
+ if self.qkv_same_dim:
197
+ nn.init.xavier_uniform_(self.in_proj_weight)
198
+ else:
199
+ nn.init.xavier_uniform_(self.k_proj_weight)
200
+ nn.init.xavier_uniform_(self.v_proj_weight)
201
+ nn.init.xavier_uniform_(self.q_proj_weight)
202
+
203
+ nn.init.xavier_uniform_(self.out_proj.weight)
204
+ if self.in_proj_bias is not None:
205
+ nn.init.constant_(self.in_proj_bias, 0.)
206
+ nn.init.constant_(self.out_proj.bias, 0.)
207
+ if self.bias_k is not None:
208
+ nn.init.xavier_normal_(self.bias_k)
209
+ if self.bias_v is not None:
210
+ nn.init.xavier_normal_(self.bias_v)
211
+
212
+ def forward(
213
+ self,
214
+ query, key, value,
215
+ key_padding_mask=None,
216
+ incremental_state=None,
217
+ need_weights=True,
218
+ static_kv=False,
219
+ attn_mask=None,
220
+ before_softmax=False,
221
+ need_head_weights=False,
222
+ enc_dec_attn_constraint_mask=None,
223
+ reset_attn_weight=None
224
+ ):
225
+ """Input shape: Time x Batch x Channel
226
+
227
+ Args:
228
+ key_padding_mask (ByteTensor, optional): mask to exclude
229
+ keys that are pads, of shape `(batch, src_len)`, where
230
+ padding elements are indicated by 1s.
231
+ need_weights (bool, optional): return the attention weights,
232
+ averaged over heads (default: False).
233
+ attn_mask (ByteTensor, optional): typically used to
234
+ implement causal attention, where the mask prevents the
235
+ attention from looking forward in time (default: None).
236
+ before_softmax (bool, optional): return the raw attention
237
+ weights and values before the attention softmax.
238
+ need_head_weights (bool, optional): return the attention
239
+ weights for each head. Implies *need_weights*. Default:
240
+ return the average attention weights over all heads.
241
+ """
242
+ if need_head_weights:
243
+ need_weights = True
244
+
245
+ tgt_len, bsz, embed_dim = query.size()
246
+ assert embed_dim == self.embed_dim
247
+ assert list(query.size()) == [tgt_len, bsz, embed_dim]
248
+ if self.enable_torch_version and incremental_state is None and not static_kv and reset_attn_weight is None:
249
+ if self.qkv_same_dim:
250
+ return F.multi_head_attention_forward(query, key, value,
251
+ self.embed_dim, self.num_heads,
252
+ self.in_proj_weight,
253
+ self.in_proj_bias, self.bias_k, self.bias_v,
254
+ self.add_zero_attn, self.dropout,
255
+ self.out_proj.weight, self.out_proj.bias,
256
+ self.training, key_padding_mask, need_weights,
257
+ attn_mask)
258
+ else:
259
+                return F.multi_head_attention_forward(query, key, value,
+                                                      self.embed_dim, self.num_heads,
+                                                      torch.empty([0]),
+                                                      self.in_proj_bias, self.bias_k, self.bias_v,
+                                                      self.add_zero_attn, self.dropout,
+                                                      self.out_proj.weight, self.out_proj.bias,
+                                                      self.training, key_padding_mask, need_weights,
+                                                      attn_mask, use_separate_proj_weight=True,
+                                                      q_proj_weight=self.q_proj_weight,
+                                                      k_proj_weight=self.k_proj_weight,
+                                                      v_proj_weight=self.v_proj_weight)
+
+        if incremental_state is not None:
+            saved_state = self._get_input_buffer(incremental_state)
+            if 'prev_key' in saved_state:
+                # previous time steps are cached - no need to recompute
+                # key and value if they are static
+                if static_kv:
+                    assert self.encoder_decoder_attention and not self.self_attention
+                    key = value = None
+        else:
+            saved_state = None
+
+        if self.self_attention:
+            # self-attention
+            q, k, v = self.in_proj_qkv(query)
+        elif self.encoder_decoder_attention:
+            # encoder-decoder attention
+            q = self.in_proj_q(query)
+            if key is None:
+                assert value is None
+                k = v = None
+            else:
+                k = self.in_proj_k(key)
+                v = self.in_proj_v(key)
+
+        else:
+            q = self.in_proj_q(query)
+            k = self.in_proj_k(key)
+            v = self.in_proj_v(value)
+        q *= self.scaling
+
+        if self.bias_k is not None:
+            assert self.bias_v is not None
+            k = torch.cat([k, self.bias_k.repeat(1, bsz, 1)])
+            v = torch.cat([v, self.bias_v.repeat(1, bsz, 1)])
+            if attn_mask is not None:
+                attn_mask = torch.cat([attn_mask, attn_mask.new_zeros(attn_mask.size(0), 1)], dim=1)
+            if key_padding_mask is not None:
+                key_padding_mask = torch.cat(
+                    [key_padding_mask, key_padding_mask.new_zeros(key_padding_mask.size(0), 1)], dim=1)
+
+        q = q.contiguous().view(tgt_len, bsz * self.num_heads, self.head_dim).transpose(0, 1)
+        if k is not None:
+            k = k.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)
+        if v is not None:
+            v = v.contiguous().view(-1, bsz * self.num_heads, self.head_dim).transpose(0, 1)
+
+        if saved_state is not None:
+            # saved states are stored with shape (bsz, num_heads, seq_len, head_dim)
+            if 'prev_key' in saved_state:
+                prev_key = saved_state['prev_key'].view(bsz * self.num_heads, -1, self.head_dim)
+                if static_kv:
+                    k = prev_key
+                else:
+                    k = torch.cat((prev_key, k), dim=1)
+            if 'prev_value' in saved_state:
+                prev_value = saved_state['prev_value'].view(bsz * self.num_heads, -1, self.head_dim)
+                if static_kv:
+                    v = prev_value
+                else:
+                    v = torch.cat((prev_value, v), dim=1)
+            if 'prev_key_padding_mask' in saved_state and saved_state['prev_key_padding_mask'] is not None:
+                prev_key_padding_mask = saved_state['prev_key_padding_mask']
+                if static_kv:
+                    key_padding_mask = prev_key_padding_mask
+                else:
+                    key_padding_mask = torch.cat((prev_key_padding_mask, key_padding_mask), dim=1)
+
+            saved_state['prev_key'] = k.view(bsz, self.num_heads, -1, self.head_dim)
+            saved_state['prev_value'] = v.view(bsz, self.num_heads, -1, self.head_dim)
+            saved_state['prev_key_padding_mask'] = key_padding_mask
+
+            self._set_input_buffer(incremental_state, saved_state)
+
+        src_len = k.size(1)
+
+        # This is part of a workaround to get around fork/join parallelism
+        # not supporting Optional types.
+        if key_padding_mask is not None and key_padding_mask.shape == torch.Size([]):
+            key_padding_mask = None
+
+        if key_padding_mask is not None:
+            assert key_padding_mask.size(0) == bsz
+            assert key_padding_mask.size(1) == src_len
+
+        if self.add_zero_attn:
+            src_len += 1
+            k = torch.cat([k, k.new_zeros((k.size(0), 1) + k.size()[2:])], dim=1)
+            v = torch.cat([v, v.new_zeros((v.size(0), 1) + v.size()[2:])], dim=1)
+            if attn_mask is not None:
+                attn_mask = torch.cat([attn_mask, attn_mask.new_zeros(attn_mask.size(0), 1)], dim=1)
+            if key_padding_mask is not None:
+                key_padding_mask = torch.cat(
+                    [key_padding_mask, torch.zeros(key_padding_mask.size(0), 1).type_as(key_padding_mask)], dim=1)
+
+        attn_weights = torch.bmm(q, k.transpose(1, 2))
+        attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz)
+
+        assert list(attn_weights.size()) == [bsz * self.num_heads, tgt_len, src_len]
+
+        if attn_mask is not None:
+            if len(attn_mask.shape) == 2:
+                attn_mask = attn_mask.unsqueeze(0)
+            elif len(attn_mask.shape) == 3:
+                attn_mask = attn_mask[:, None].repeat([1, self.num_heads, 1, 1]).reshape(
+                    bsz * self.num_heads, tgt_len, src_len)
+            attn_weights = attn_weights + attn_mask
+
+        if enc_dec_attn_constraint_mask is not None:  # bs x head x L_kv
+            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+            attn_weights = attn_weights.masked_fill(
+                enc_dec_attn_constraint_mask.unsqueeze(2).bool(),
+                -1e8,
+            )
+            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
+        if key_padding_mask is not None:
+            # don't attend to padding symbols
+            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+            attn_weights = attn_weights.masked_fill(
+                key_padding_mask.unsqueeze(1).unsqueeze(2),
+                -1e8,
+            )
+            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
+        attn_logits = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+
+        if before_softmax:
+            return attn_weights, v
+
+        attn_weights_float = softmax(attn_weights, dim=-1)
+        attn_weights = attn_weights_float.type_as(attn_weights)
+        attn_probs = F.dropout(attn_weights_float.type_as(attn_weights), p=self.dropout, training=self.training)
+
+        if reset_attn_weight is not None:
+            if reset_attn_weight:
+                self.last_attn_probs = attn_probs.detach()
+            else:
+                assert self.last_attn_probs is not None
+                attn_probs = self.last_attn_probs
+        attn = torch.bmm(attn_probs, v)
+        assert list(attn.size()) == [bsz * self.num_heads, tgt_len, self.head_dim]
+        attn = attn.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
+        attn = self.out_proj(attn)
+
+        if need_weights:
+            attn_weights = attn_weights_float.view(bsz, self.num_heads, tgt_len, src_len).transpose(1, 0)
+            if not need_head_weights:
+                # average attention weights over heads
+                attn_weights = attn_weights.mean(dim=0)
+        else:
+            attn_weights = None
+
+        return attn, (attn_weights, attn_logits)
+
+    def in_proj_qkv(self, query):
+        return self._in_proj(query).chunk(3, dim=-1)
+
+    def in_proj_q(self, query):
+        if self.qkv_same_dim:
+            return self._in_proj(query, end=self.embed_dim)
+        else:
+            bias = self.in_proj_bias
+            if bias is not None:
+                bias = bias[:self.embed_dim]
+            return F.linear(query, self.q_proj_weight, bias)
+
+    def in_proj_k(self, key):
+        if self.qkv_same_dim:
+            return self._in_proj(key, start=self.embed_dim, end=2 * self.embed_dim)
+        else:
+            weight = self.k_proj_weight
+            bias = self.in_proj_bias
+            if bias is not None:
+                bias = bias[self.embed_dim:2 * self.embed_dim]
+            return F.linear(key, weight, bias)
+
+    def in_proj_v(self, value):
+        if self.qkv_same_dim:
+            return self._in_proj(value, start=2 * self.embed_dim)
+        else:
+            weight = self.v_proj_weight
+            bias = self.in_proj_bias
+            if bias is not None:
+                bias = bias[2 * self.embed_dim:]
+            return F.linear(value, weight, bias)
+
+    def _in_proj(self, input, start=0, end=None):
+        weight = self.in_proj_weight
+        bias = self.in_proj_bias
+        weight = weight[start:end, :]
+        if bias is not None:
+            bias = bias[start:end]
+        return F.linear(input, weight, bias)
+
+    def _get_input_buffer(self, incremental_state):
+        return get_incremental_state(
+            self,
+            incremental_state,
+            'attn_state',
+        ) or {}
+
+    def _set_input_buffer(self, incremental_state, buffer):
+        set_incremental_state(
+            self,
+            incremental_state,
+            'attn_state',
+            buffer,
+        )
+
+    def apply_sparse_mask(self, attn_weights, tgt_len, src_len, bsz):
+        return attn_weights
+
+    def clear_buffer(self, incremental_state=None):
+        if incremental_state is not None:
+            saved_state = self._get_input_buffer(incremental_state)
+            if 'prev_key' in saved_state:
+                del saved_state['prev_key']
+            if 'prev_value' in saved_state:
+                del saved_state['prev_value']
+            self._set_input_buffer(incremental_state, saved_state)
+
+
+class EncSALayer(nn.Module):
+    def __init__(self, c, num_heads, dropout, attention_dropout=0.1,
+                 relu_dropout=0.1, kernel_size=9, padding='SAME', act='gelu'):
+        super().__init__()
+        self.c = c
+        self.dropout = dropout
+        self.num_heads = num_heads
+        if num_heads > 0:
+            self.layer_norm1 = LayerNorm(c)
+            self.self_attn = MultiheadAttention(
+                self.c, num_heads, self_attention=True, dropout=attention_dropout, bias=False)
+        self.layer_norm2 = LayerNorm(c)
+        self.ffn = TransformerFFNLayer(
+            c, 4 * c, kernel_size=kernel_size, dropout=relu_dropout, padding=padding, act=act)
+
+    def forward(self, x, encoder_padding_mask=None, **kwargs):
+        layer_norm_training = kwargs.get('layer_norm_training', None)
+        if layer_norm_training is not None:
+            self.layer_norm1.training = layer_norm_training
+            self.layer_norm2.training = layer_norm_training
+        if self.num_heads > 0:
+            residual = x
+            x = self.layer_norm1(x)
+            x, _, = self.self_attn(
+                query=x,
+                key=x,
+                value=x,
+                key_padding_mask=encoder_padding_mask
+            )
+            x = F.dropout(x, self.dropout, training=self.training)
+            x = residual + x
+            x = x * (1 - encoder_padding_mask.float()).transpose(0, 1)[..., None]
+
+        residual = x
+        x = self.layer_norm2(x)
+        x = self.ffn(x)
+        x = F.dropout(x, self.dropout, training=self.training)
+        x = residual + x
+        x = x * (1 - encoder_padding_mask.float()).transpose(0, 1)[..., None]
+        return x
+
+
+class DecSALayer(nn.Module):
+    def __init__(self, c, num_heads, dropout, attention_dropout=0.1, relu_dropout=0.1,
+                 kernel_size=9, act='gelu'):
+        super().__init__()
+        self.c = c
+        self.dropout = dropout
+        self.layer_norm1 = LayerNorm(c)
+        self.self_attn = MultiheadAttention(
+            c, num_heads, self_attention=True, dropout=attention_dropout, bias=False
+        )
+        self.layer_norm2 = LayerNorm(c)
+        self.encoder_attn = MultiheadAttention(
+            c, num_heads, encoder_decoder_attention=True, dropout=attention_dropout, bias=False,
+        )
+        self.layer_norm3 = LayerNorm(c)
+        self.ffn = TransformerFFNLayer(
+            c, 4 * c, padding='LEFT', kernel_size=kernel_size, dropout=relu_dropout, act=act)
+
+    def forward(
+            self,
+            x,
+            encoder_out=None,
+            encoder_padding_mask=None,
+            incremental_state=None,
+            self_attn_mask=None,
+            self_attn_padding_mask=None,
+            attn_out=None,
+            reset_attn_weight=None,
+            **kwargs,
+    ):
+        layer_norm_training = kwargs.get('layer_norm_training', None)
+        if layer_norm_training is not None:
+            self.layer_norm1.training = layer_norm_training
+            self.layer_norm2.training = layer_norm_training
+            self.layer_norm3.training = layer_norm_training
+        residual = x
+        x = self.layer_norm1(x)
+        x, _ = self.self_attn(
+            query=x,
+            key=x,
+            value=x,
+            key_padding_mask=self_attn_padding_mask,
+            incremental_state=incremental_state,
+            attn_mask=self_attn_mask
+        )
+        x = F.dropout(x, self.dropout, training=self.training)
+        x = residual + x
+
+        attn_logits = None
+        if encoder_out is not None or attn_out is not None:
+            residual = x
+            x = self.layer_norm2(x)
+        if encoder_out is not None:
+            x, attn = self.encoder_attn(
+                query=x,
+                key=encoder_out,
+                value=encoder_out,
+                key_padding_mask=encoder_padding_mask,
+                incremental_state=incremental_state,
+                static_kv=True,
+                enc_dec_attn_constraint_mask=get_incremental_state(self, incremental_state,
+                                                                   'enc_dec_attn_constraint_mask'),
+                reset_attn_weight=reset_attn_weight
+            )
+            attn_logits = attn[1]
+        elif attn_out is not None:
+            x = self.encoder_attn.in_proj_v(attn_out)
+        if encoder_out is not None or attn_out is not None:
+            x = F.dropout(x, self.dropout, training=self.training)
+            x = residual + x
+
+        residual = x
+        x = self.layer_norm3(x)
+        x = self.ffn(x, incremental_state=incremental_state)
+        x = F.dropout(x, self.dropout, training=self.training)
+        x = residual + x
+        return x, attn_logits
+
+    def clear_buffer(self, input, encoder_out=None, encoder_padding_mask=None, incremental_state=None):
+        self.encoder_attn.clear_buffer(incremental_state)
+        self.ffn.clear_buffer(incremental_state)
+
+    def set_buffer(self, name, tensor, incremental_state):
+        return set_incremental_state(self, incremental_state, name, tensor)
+
+
+class TransformerEncoderLayer(nn.Module):
+    def __init__(self, hidden_size, dropout, kernel_size=9, num_heads=2):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.dropout = dropout
+        self.num_heads = num_heads
+        self.op = EncSALayer(
+            hidden_size, num_heads, dropout=dropout,
+            attention_dropout=0.0, relu_dropout=dropout,
+            kernel_size=kernel_size)
+
+    def forward(self, x, **kwargs):
+        return self.op(x, **kwargs)
+
+
+class TransformerDecoderLayer(nn.Module):
+    def __init__(self, hidden_size, dropout, kernel_size=9, num_heads=2):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.dropout = dropout
+        self.num_heads = num_heads
+        self.op = DecSALayer(
+            hidden_size, num_heads, dropout=dropout,
+            attention_dropout=0.0, relu_dropout=dropout,
+            kernel_size=kernel_size)
+
+    def forward(self, x, **kwargs):
+        return self.op(x, **kwargs)
+
+    def clear_buffer(self, *args):
+        return self.op.clear_buffer(*args)
+
+    def set_buffer(self, *args):
+        return self.op.set_buffer(*args)
+
+
+class FFTBlocks(nn.Module):
+    def __init__(self, hidden_size, num_layers, ffn_kernel_size=9, dropout=0.0,
+                 num_heads=2, use_pos_embed=True, use_last_norm=True,
+                 use_pos_embed_alpha=True):
+        super().__init__()
+        self.num_layers = num_layers
+        embed_dim = self.hidden_size = hidden_size
+        self.dropout = dropout
+        self.use_pos_embed = use_pos_embed
+        self.use_last_norm = use_last_norm
+        if use_pos_embed:
+            self.max_source_positions = DEFAULT_MAX_TARGET_POSITIONS
+            self.padding_idx = 0
+            self.pos_embed_alpha = nn.Parameter(torch.Tensor([1])) if use_pos_embed_alpha else 1
+            self.embed_positions = SinusoidalPositionalEmbedding(
+                embed_dim, self.padding_idx, init_size=DEFAULT_MAX_TARGET_POSITIONS,
+            )
+
+        self.layers = nn.ModuleList([])
+        self.layers.extend([
+            TransformerEncoderLayer(self.hidden_size, self.dropout,
+                                    kernel_size=ffn_kernel_size, num_heads=num_heads)
+            for _ in range(self.num_layers)
+        ])
+        if self.use_last_norm:
+            self.layer_norm = nn.LayerNorm(embed_dim)
+        else:
+            self.layer_norm = None
+
+    def forward(self, x, padding_mask=None, attn_mask=None, return_hiddens=False):
+        """
+        :param x: [B, T, C]
+        :param padding_mask: [B, T]
+        :return: [B, T, C] or [L, B, T, C]
+        """
+        padding_mask = x.abs().sum(-1).eq(0).data if padding_mask is None else padding_mask
+        nonpadding_mask_TB = 1 - padding_mask.transpose(0, 1).float()[:, :, None]  # [T, B, 1]
+        if self.use_pos_embed:
+            positions = self.pos_embed_alpha * self.embed_positions(x[..., 0])
+            x = x + positions
+            x = F.dropout(x, p=self.dropout, training=self.training)
+        # B x T x C -> T x B x C
+        x = x.transpose(0, 1) * nonpadding_mask_TB
+        hiddens = []
+        for layer in self.layers:
+            x = layer(x, encoder_padding_mask=padding_mask, attn_mask=attn_mask) * nonpadding_mask_TB
+            hiddens.append(x)
+        if self.use_last_norm:
+            x = self.layer_norm(x) * nonpadding_mask_TB
+        if return_hiddens:
+            x = torch.stack(hiddens, 0)  # [L, T, B, C]
+            x = x.transpose(1, 2)  # [L, B, T, C]
+        else:
+            x = x.transpose(0, 1)  # [B, T, C]
+        return x
+
+
+class FastSpeechEncoder(FFTBlocks):
+    def __init__(self, dict_size, hidden_size=256, num_layers=4, kernel_size=9, num_heads=2,
+                 dropout=0.0):
+        super().__init__(hidden_size, num_layers, kernel_size, num_heads=num_heads,
+                         use_pos_embed=False, dropout=dropout)  # use_pos_embed_alpha for compatibility
+        self.embed_tokens = Embedding(dict_size, hidden_size, 0)
+        self.embed_scale = math.sqrt(hidden_size)
+        self.padding_idx = 0
+        self.embed_positions = SinusoidalPositionalEmbedding(
+            hidden_size, self.padding_idx, init_size=DEFAULT_MAX_TARGET_POSITIONS,
+        )
+
+    def forward(self, txt_tokens, attn_mask=None):
+        """
+
+        :param txt_tokens: [B, T]
+        :return: {
+            'encoder_out': [B x T x C]
+        }
+        """
+        encoder_padding_mask = txt_tokens.eq(self.padding_idx).data
+        x = self.forward_embedding(txt_tokens)  # [B, T, H]
+        if self.num_layers > 0:
+            x = super(FastSpeechEncoder, self).forward(x, encoder_padding_mask, attn_mask=attn_mask)
+        return x
+
+    def forward_embedding(self, txt_tokens):
+        # embed tokens and positions
+        x = self.embed_scale * self.embed_tokens(txt_tokens)
+        positions = self.embed_positions(txt_tokens)
+        x = x + positions
+        x = F.dropout(x, p=self.dropout, training=self.training)
+        return x
+
+
+class FastSpeechDecoder(FFTBlocks):
+    def __init__(self, hidden_size=256, num_layers=4, kernel_size=9, num_heads=2):
+        super().__init__(hidden_size, num_layers, kernel_size, num_heads=num_heads)
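A minimal usage sketch of the text encoder defined in this file (not part of the diff). It assumes the module imports as preprocess.tools.note_transcription.modules.commons.transformer and that the rest of the file's helpers (Embedding, SinusoidalPositionalEmbedding, etc.) are present as uploaded; the vocabulary size and shapes are illustrative only.

    # hypothetical import path, inferred from the repository layout
    import torch
    from preprocess.tools.note_transcription.modules.commons.transformer import FastSpeechEncoder

    encoder = FastSpeechEncoder(dict_size=100, hidden_size=256, num_layers=4, num_heads=2)
    txt_tokens = torch.randint(1, 100, (2, 37))  # [B, T]; index 0 is the padding symbol
    hidden = encoder(txt_tokens)                 # [B, T, 256], padded positions zeroed out
    print(hidden.shape)

FFTBlocks can likewise be stacked directly on frame-level features (e.g. mel or MIDI embeddings) via its forward(x, padding_mask) interface, which is how the note-transcription model presumably consumes it.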
preprocess/tools/note_transcription/modules/commons/wavenet.py ADDED
@@ -0,0 +1,109 @@
+import torch
+from torch import nn
+from packaging import version
+
+def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
+    n_channels_int = n_channels[0]
+    in_act = input_a + input_b
+    t_act = torch.tanh(in_act[:, :n_channels_int, :])
+    s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
+    acts = t_act * s_act
+    return acts
+
+jit_fused_add_tanh_sigmoid_multiply = fused_add_tanh_sigmoid_multiply
+
+def script_function():
+    if version.parse(torch.__version__) >= version.parse('2.0'):
+        global jit_fused_add_tanh_sigmoid_multiply
+        jit_fused_add_tanh_sigmoid_multiply = torch.jit.script(fused_add_tanh_sigmoid_multiply)
+
+
+class WN(torch.nn.Module):
+    def __init__(self, hidden_size, kernel_size, dilation_rate, n_layers, c_cond=0,
+                 p_dropout=0, share_cond_layers=False, is_BTC=False):
+        super(WN, self).__init__()
+        assert (kernel_size % 2 == 1)
+        assert (hidden_size % 2 == 0)
+        self.is_BTC = is_BTC
+        self.hidden_size = hidden_size
+        self.kernel_size = kernel_size
+        self.dilation_rate = dilation_rate
+        self.n_layers = n_layers
+        self.gin_channels = c_cond
+        self.p_dropout = p_dropout
+        self.share_cond_layers = share_cond_layers
+
+        self.in_layers = torch.nn.ModuleList()
+        self.res_skip_layers = torch.nn.ModuleList()
+        self.drop = nn.Dropout(p_dropout)
+
+        if c_cond != 0 and not share_cond_layers:
+            cond_layer = torch.nn.Conv1d(c_cond, 2 * hidden_size * n_layers, 1)
+            self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
+
+        for i in range(n_layers):
+            dilation = dilation_rate ** i
+            padding = int((kernel_size * dilation - dilation) / 2)
+            in_layer = torch.nn.Conv1d(hidden_size, 2 * hidden_size, kernel_size,
+                                       dilation=dilation, padding=padding)
+            in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
+            self.in_layers.append(in_layer)
+
+            # last one is not necessary
+            if i < n_layers - 1:
+                res_skip_channels = 2 * hidden_size
+            else:
+                res_skip_channels = hidden_size
+
+            res_skip_layer = torch.nn.Conv1d(hidden_size, res_skip_channels, 1)
+            res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
+            self.res_skip_layers.append(res_skip_layer)
+
+        script_function()
+
+    def forward(self, x, nonpadding=None, cond=None):
+        if self.is_BTC:
+            x = x.transpose(1, 2)
+            cond = cond.transpose(1, 2) if cond is not None else None
+            nonpadding = nonpadding.transpose(1, 2) if nonpadding is not None else None
+        if nonpadding is None:
+            nonpadding = 1
+        output = torch.zeros_like(x)
+        n_channels_tensor = torch.IntTensor([self.hidden_size])
+
+        if cond is not None and not self.share_cond_layers:
+            cond = self.cond_layer(cond)
+
+        for i in range(self.n_layers):
+            x_in = self.in_layers[i](x)
+            x_in = self.drop(x_in)
+            if cond is not None:
+                cond_offset = i * 2 * self.hidden_size
+                cond_l = cond[:, cond_offset:cond_offset + 2 * self.hidden_size, :]
+            else:
+                cond_l = torch.zeros_like(x_in)
+
+            if version.parse(torch.__version__) >= version.parse('2.0'):
+                acts = jit_fused_add_tanh_sigmoid_multiply(x_in, cond_l, n_channels_tensor)
+            else:
+                acts = fused_add_tanh_sigmoid_multiply(x_in, cond_l, n_channels_tensor)
+
+            res_skip_acts = self.res_skip_layers[i](acts)
+            if i < self.n_layers - 1:
+                x = (x + res_skip_acts[:, :self.hidden_size, :]) * nonpadding
+                output = output + res_skip_acts[:, self.hidden_size:, :]
+            else:
+                output = output + res_skip_acts
+        output = output * nonpadding
+        if self.is_BTC:
+            output = output.transpose(1, 2)
+        return output
+
+    def remove_weight_norm(self):
+        def remove_weight_norm(m):
+            try:
+                nn.utils.remove_weight_norm(m)
+            except ValueError:  # this module didn't have weight norm
+                return
+
+        self.apply(remove_weight_norm)
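A small sketch of how this WaveNet-style block might be exercised (not part of the diff). The import path is inferred from the repository layout and the shapes are illustrative; the only requirements encoded in the module itself are an even hidden_size and an odd kernel_size.

    # hypothetical import path, inferred from the repository layout
    import torch
    from preprocess.tools.note_transcription.modules.commons.wavenet import WN

    wn = WN(hidden_size=128, kernel_size=3, dilation_rate=2, n_layers=4, c_cond=64, is_BTC=True)
    x = torch.randn(2, 100, 128)        # [B, T, C] because is_BTC=True
    cond = torch.randn(2, 100, 64)      # frame-level conditioning, projected by cond_layer
    nonpadding = torch.ones(2, 100, 1)  # 1 for real frames, 0 for padding
    y = wn(x, nonpadding=nonpadding, cond=cond)  # [B, T, C]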
preprocess/tools/note_transcription/modules/pe/__init__.py ADDED
@@ -0,0 +1 @@
+"""Pitch extractor modules for ROSVOT."""
preprocess/tools/note_transcription/modules/pe/rmvpe/__init__.py ADDED
@@ -0,0 +1,6 @@
+from .constants import *
+from .model import E2E0
+from .utils import to_local_average_f0, to_viterbi_f0
+from .inference import RMVPE
+from .spec import MelSpectrogram
+from .extractor import extract