Approximetal committed on
Commit 34fb334 · verified · 1 Parent(s): 4d882a6

Upload folder using huggingface_hub

This view is limited to 50 files because the commit contains too many changes.

Files changed (50)
  1. .gitattributes +1 -0
  2. README.md +116 -9
  3. app.py +19 -0
  4. apt.txt +3 -0
  5. denoised_audio.wav +3 -0
  6. inference_gradio.py +603 -0
  7. lemas_tts/__init__.py +6 -0
  8. lemas_tts/api.py +252 -0
  9. lemas_tts/configs/multilingual_grl.yaml +78 -0
  10. lemas_tts/configs/multilingual_prosody.yaml +78 -0
  11. lemas_tts/infer/frontend.py +251 -0
  12. lemas_tts/infer/infer_cli.py +386 -0
  13. lemas_tts/infer/text_norm/__init__.py +0 -0
  14. lemas_tts/infer/text_norm/cn_tn.py +824 -0
  15. lemas_tts/infer/text_norm/en_tn.py +178 -0
  16. lemas_tts/infer/text_norm/gp2py.py +148 -0
  17. lemas_tts/infer/text_norm/id_tn.py +275 -0
  18. lemas_tts/infer/text_norm/jieba_dict.txt +0 -0
  19. lemas_tts/infer/text_norm/pinyin-lexicon-r.txt +4120 -0
  20. lemas_tts/infer/text_norm/symbols.py +419 -0
  21. lemas_tts/infer/text_norm/tokenizer.py +219 -0
  22. lemas_tts/infer/text_norm/txt2pinyin.py +225 -0
  23. lemas_tts/infer/utils_infer.py +651 -0
  24. lemas_tts/model/backbones/README.md +20 -0
  25. lemas_tts/model/backbones/dit.py +254 -0
  26. lemas_tts/model/backbones/ecapa_tdnn.py +931 -0
  27. lemas_tts/model/backbones/mmdit.py +189 -0
  28. lemas_tts/model/backbones/prosody_encoder.py +433 -0
  29. lemas_tts/model/backbones/unett.py +250 -0
  30. lemas_tts/model/cfm.py +899 -0
  31. lemas_tts/model/modules.py +802 -0
  32. lemas_tts/model/utils.py +190 -0
  33. lemas_tts/scripts/inference_gradio.py +584 -0
  34. requirements.txt +182 -0
  35. uvr5/gui_data/constants.py +1147 -0
  36. uvr5/lib_v5/mdxnet.py +140 -0
  37. uvr5/lib_v5/mixer.ckpt +3 -0
  38. uvr5/lib_v5/modules.py +74 -0
  39. uvr5/lib_v5/pyrb.py +92 -0
  40. uvr5/lib_v5/spec_utils.py +703 -0
  41. uvr5/lib_v5/vr_network/__init__.py +1 -0
  42. uvr5/lib_v5/vr_network/layers.py +143 -0
  43. uvr5/lib_v5/vr_network/layers_new.py +126 -0
  44. uvr5/lib_v5/vr_network/model_param_init.py +59 -0
  45. uvr5/lib_v5/vr_network/modelparams/1band_sr16000_hl512.json +19 -0
  46. uvr5/lib_v5/vr_network/modelparams/1band_sr32000_hl512.json +19 -0
  47. uvr5/lib_v5/vr_network/modelparams/1band_sr33075_hl384.json +19 -0
  48. uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl1024.json +19 -0
  49. uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl256.json +19 -0
  50. uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl512.json +19 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ denoised_audio.wav filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,12 +1,119 @@
+ # LEMAS-TTS Gradio Demo (Hugging Face Space)
+
+ This folder is a **clean, inference-only** version of LEMAS-TTS, organized for easy deployment on **Hugging Face Spaces**.
+
+ It keeps only:
+ - the inference models & configs (`lemas_tts`)
+ - pretrained checkpoints and vocab (`pretrained_models`)
+ - the bundled UVR5 denoiser (`uvr5`)
+ - a Gradio web UI (`inference_gradio.py`, `app.py`)
+
+ ---
+
+ ## 1. Features
+
+ - Zero-shot TTS: clone a voice from a reference audio clip plus its transcript (reference text)
+ - Multilingual text input (Chinese / English / ES / IT / PT / DE, etc.)
+ - Optional UVR5-based denoising of the reference audio
+ - Two custom LEMAS checkpoints:
+   - `multilingual_prosody_custom`
+   - `multilingual_acc_grl_custom`
+
+ ---
+
+ ## 2. Project Structure
+
+ ```text
+ LEMAS-TTS_gradio/
+   app.py                  # HF Space entrypoint (Gradio Blocks)
+   inference_gradio.py     # Full Gradio UI & logic
+   requirements.txt        # Minimal runtime dependencies
+
+   lemas_tts/              # Core LEMAS-TTS package (inference only)
+     api.py                # F5TTS API (used by the UI)
+     configs/              # Model configs (F5TTS / E2TTS)
+     infer/                # Inference utilities & text frontend
+     model/                # DiT backbone, utils, etc.
+
+   pretrained_models/      # All local assets needed for inference
+     ckpts/
+       F5TTS_v1_Base_vocos_custom_multilingual_prosody/model_2698000.pt
+       F5TTS_v1_Base_vocos_custom_multilingual_acc_grl/model_2680000.pt
+       prosody_encoder/...
+       vocos-mel-24khz/...
+     data/
+       multilingual_prosody_custom/vocab.txt
+       multilingual_acc_grl_custom/vocab.txt
+       test_examples/*.wav # Demo audios used in the UI
+     uvr5/
+       models/MDX_Net_Models/model_data/*.onnx, *.json
+
+   uvr5/                   # Bundled UVR5 implementation for denoising
+ ```
+
+ `lemas_tts.api.TTS` automatically resolves `pretrained_models/` based on the repo layout, so no extra path configuration is required.
+
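As a quick, untested sketch of driving that API directly (the checkpoint and vocab filenames are the ones in the tree above; `model` must name one of the yaml files under `lemas_tts/configs/`, and the espeak-ng assets configured in `inference_gradio.py` must be reachable):

```python
from lemas_tts.api import TTS, PRETRAINED_ROOT, CKPTS_ROOT

# Paths follow the tree above; adjust if your checkout differs.
tts = TTS(
    model="multilingual_grl",
    ckpt_file=str(CKPTS_ROOT / "F5TTS_v1_Base_vocos_custom_multilingual_acc_grl" / "model_2680000.pt"),
    vocab_file=str(PRETRAINED_ROOT / "data" / "multilingual_acc_grl_custom" / "vocab.txt"),
)
wav, sr, spec = tts.infer(
    ref_file=str(PRETRAINED_ROOT / "data" / "test_examples" / "en.wav"),
    ref_text="em, #1 I have a list of YouTubers, and I'm gonna be going to their houses and raiding them by.",
    gen_text="我有一份 YouTuber 名单,我打算去他们家,对他们进行突袭。",
    file_wave="out.wav",  # generated audio is written here
)
```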
+ ---
+
+ ## 3. How to Run Locally
+
+ ```bash
+ cd LEMAS-TTS_gradio
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ Then open the printed URL (default `http://127.0.0.1:7860`) in your browser.
+
  ---
- title: LEMAS TTS
- emoji: 🔥
- colorFrom: red
- colorTo: indigo
- sdk: gradio
- sdk_version: 6.2.0
- app_file: app.py
- pinned: false
+
+ ## 4. Hugging Face Space Setup
+
+ 1. Create a new Space (type: **Gradio**).
+ 2. Upload the contents of `LEMAS-TTS_gradio/` to the Space repo:
+    - `app.py`
+    - `inference_gradio.py`
+    - `requirements.txt`
+    - `lemas_tts/`
+    - `pretrained_models/`
+    - `uvr5/`
+ 3. In the Space settings, choose a GPU hardware profile (the model is heavy).
+ 4. The Space will automatically run `app.py` and launch the Gradio Blocks named `app`.
+
+ No extra arguments are needed; all paths inside the repo are relative.
+
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ## 5. Usage Tips
+
+ - **Reference Text** should roughly match the reference audio in both content and language for the best voice cloning.
+ - **Denoise**:
+   - Turn it on if the reference audio is noisy; denoising runs UVR5 on the CPU.
+   - Turn it off if the reference is already clean (this saves time).
+ - **Seed**:
+   - `-1` → random seed
+   - any other integer → reproducible output (see the sketch below)
+
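A minimal sketch of how the seed field is interpreted, mirroring `infer` in `inference_gradio.py` and `TTS.infer` in `lemas_tts/api.py`:

```python
import random
import sys

def resolve_seed(seed):
    """-1 in the UI means 'random'; the API then draws a fresh seed itself."""
    if seed == -1:    # UI convention: -1 -> random
        seed = None
    if seed is None:  # API convention: None -> draw a random seed
        seed = random.randint(0, sys.maxsize)
    return seed       # the API passes this to seed_everything() and reports it back
```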
98
+
99
+ ## 6. 中文说明(简要)
100
+
101
+ 这个目录是专门为 **Hugging Face Space** 打包的 **推理版 LEMAS-TTS**:
102
+
103
+ - 只保留推理相关代码(`lemas_tts`)、预训练模型(`pretrained_models`)和 UVR5 去噪模块(`uvr5`)
104
+ - Gradio 入口为 `app.py`,内部调用 `inference_gradio.py` 里的 `app`(一个 `gr.Blocks` 界面)
105
+ - `pretrained_models/` 下已经包含:
106
+ - 自定义多语种 prosody / accent GRL 的 finetune 权重
107
+ - vocoder(`vocos-mel-24khz`)
108
+ - prosody encoder
109
+ - 以及示例语音 `test_examples/*.wav`
110
+
111
+ 在本地或 Space 中运行步骤:
112
+
113
+ ```bash
114
+ pip install -r requirements.txt
115
+ python app.py
116
+ ```
117
+
118
+ 然后在浏览器中打开提示的链接即可使用零样本 TTS Demo。
119
+
app.py ADDED
@@ -0,0 +1,19 @@
+ """
+ Gradio entrypoint for Hugging Face Spaces.
+
+ This file simply re-exports the `app` Blocks defined in `inference_gradio.py`
+ so that Spaces can discover and launch it.
+ """
+
+ import gradio as gr  # noqa: F401
+
+ from inference_gradio import app as _app
+
+ # Expose as both `app` and `demo` for maximum compatibility
+ app = _app
+ demo = _app
+
+
+ if __name__ == "__main__":
+     app.queue(api_open=True).launch()
+
apt.txt ADDED
@@ -0,0 +1,3 @@
+ ffmpeg
+ espeak-ng
+ espeak
denoised_audio.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e7d715f294233999f56d51424b1ab8e9d28aed5dfb0821b427c2fb4f4abaa3aa
+ size 1386548
inference_gradio.py ADDED
@@ -0,0 +1,603 @@
+ import gc
+ import os
+ import platform
+ import psutil
+ import tempfile
+ from glob import glob
+ import traceback
+ import click
+ import gradio as gr
+ import torch
+ import torchaudio
+ import soundfile as sf
+ from pathlib import Path
+
+ from cached_path import cached_path
+
+ from lemas_tts.api import TTS, PRETRAINED_ROOT, CKPTS_ROOT
+
+ # Global variables
+ tts_api = None
+ last_checkpoint = ""
+ last_device = ""
+ last_ema = None
+
+ # Device detection
+ device = (
+     "cuda"
+     if torch.cuda.is_available()
+     else "xpu"
+     if torch.xpu.is_available()
+     else "mps"
+     if torch.backends.mps.is_available()
+     else "cpu"
+ )
+
+ REPO_ROOT = Path(__file__).resolve().parent
+
+ # HF location for pretrained assets (used as a fallback when local files are missing)
+ HF_PRETRAINED_ROOT = "hf://LEMAS-Project/LEMAS-TTS/pretrained_models"
+
+ # 1) Point to the libespeak-ng.so bundled in this repo
+ ESPEAK_LIB = PRETRAINED_ROOT / "espeak-ng-lib" / "libespeak-ng.so"
+ os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = str(ESPEAK_LIB)
+
+ # 2) Point to the espeak-ng-data bundled in this repo
+ ESPEAK_DATA_DIR = PRETRAINED_ROOT / "espeak-ng-data"
+ os.environ["ESPEAK_DATA_PATH"] = str(ESPEAK_DATA_DIR)
+ os.environ["ESPEAKNG_DATA_PATH"] = str(ESPEAK_DATA_DIR)
+
+
+ class UVR5:
+     """Small wrapper around the bundled uvr5 implementation for denoising."""
+
+     def __init__(self, model_dir: Path, code_dir: Path):
+         self.model = self.load_model(str(model_dir), str(code_dir))
+
+     def load_model(self, model_dir: str, code_dir: str):
+         import sys
+         import json
+
+         if code_dir not in sys.path:
+             sys.path.append(code_dir)
+
+         from multiprocess_cuda_infer import ModelData, Inference
+
+         model_path = os.path.join(model_dir, "Kim_Vocal_1.onnx")
+         config_path = os.path.join(model_dir, "MDX-Net-Kim-Vocal1.json")
+         configs = json.loads(open(config_path, "r", encoding="utf-8").read())
+         model_data = ModelData(
+             model_path=model_path,
+             audio_path=model_dir,
+             result_path=model_dir,
+             device="cpu",
+             process_method="MDX-Net",
+             base_dir=model_dir,  # keep base_dir and model_dir the same (paths under `pretrained_models`)
+             **configs,
+         )
+
+         uvr5_model = Inference(model_data, "cpu")
+         uvr5_model.load_model(model_path, 1)
+         return uvr5_model
+
+     def denoise(self, audio_info):
+         print("denoise UVR5: ", audio_info)
+         input_audio = load_wav(audio_info, sr=44100, channel=2)
+         output_audio = self.model.demix_base({0: input_audio.squeeze()}, is_match_mix=False)
+         return output_audio.squeeze().T.numpy(), 44100
+
+
+ denoise_model = UVR5(
+     model_dir=PRETRAINED_ROOT / "uvr5",
+     code_dir=REPO_ROOT / "uvr5",
+ )
+
+
+ def load_wav(audio_info, sr=16000, channel=1):
+     print("load audio:", audio_info)
+     audio, raw_sr = torchaudio.load(audio_info)
+     audio = audio.T if len(audio.shape) > 1 and audio.shape[1] == 2 else audio
+     audio = audio / torch.max(torch.abs(audio))
+     audio = audio.squeeze().float()
+     if channel == 1 and len(audio.shape) == 2:  # stereo to mono
+         audio = audio.mean(dim=0, keepdim=True)
+     elif channel == 2 and len(audio.shape) == 1:
+         audio = torch.stack((audio, audio))  # mono to stereo
+     if raw_sr != sr:
+         audio = torchaudio.functional.resample(audio.squeeze(), raw_sr, sr)
+     audio = torch.clip(audio, -0.999, 0.999).squeeze()
+     return audio
+
+
+ def denoise(audio_info):
+     save_path = "./denoised_audio.wav"
+     denoised_audio, sr = denoise_model.denoise(audio_info)
+     sf.write(save_path, denoised_audio, sr, format="wav", subtype="PCM_24")
+     print("save denoised audio:", save_path)
+     return save_path
+
+
+ def cancel_denoise(audio_info):
+     return audio_info
+
+
+ def get_checkpoints_project(project_name=None, is_gradio=True):
+     """Get available checkpoint files"""
+     checkpoint_dir = [str(CKPTS_ROOT)]
+     # Remote ckpt locations on HF (used if local ckpts are not present)
+     remote_ckpts = {
+         "multilingual_grl": f"{HF_PRETRAINED_ROOT}/ckpts/multilingual_grl/multilingual_grl.safetensors",
+         "multilingual_prosody": f"{HF_PRETRAINED_ROOT}/ckpts/multilingual_prosody/multilingual_prosody.safetensors",
+     }
+
+     if project_name is None:
+         # Look for checkpoints in the local directory
+         files_checkpoints = []
+         for path in checkpoint_dir:
+             if os.path.isdir(path):
+                 files_checkpoints.extend(glob(os.path.join(path, "**/*.pt"), recursive=True))
+                 files_checkpoints.extend(glob(os.path.join(path, "**/*.safetensors"), recursive=True))
+                 break
+         # Fallback: use HF ckpts
+         if not files_checkpoints:
+             files_checkpoints = list(remote_ckpts.values())
+     else:
+         if os.path.isdir(checkpoint_dir[0]):
+             files_checkpoints = glob(os.path.join(checkpoint_dir[0], project_name, "*.pt"))
+             files_checkpoints.extend(glob(os.path.join(checkpoint_dir[0], project_name, "*.safetensors")))
+         else:
+             ckpt = remote_ckpts.get(project_name)
+             files_checkpoints = [ckpt] if ckpt is not None else []
+     print("files_checkpoints:", project_name, files_checkpoints)
+     # Separate pretrained and regular checkpoints
+     pretrained_checkpoints = [f for f in files_checkpoints if "pretrained_" in os.path.basename(f)]
+     regular_checkpoints = [
+         f
+         for f in files_checkpoints
+         if "pretrained_" not in os.path.basename(f) and "model_last.pt" not in os.path.basename(f)
+     ]
+     last_checkpoint = [f for f in files_checkpoints if "model_last.pt" in os.path.basename(f)]
+
+     # Sort regular checkpoints by number
+     try:
+         regular_checkpoints = sorted(
+             regular_checkpoints, key=lambda x: int(os.path.basename(x).split("_")[1].split(".")[0])
+         )
+     except (IndexError, ValueError):
+         regular_checkpoints = sorted(regular_checkpoints)
+
+     # Combine in order: pretrained, regular, last
+     files_checkpoints = pretrained_checkpoints + regular_checkpoints + last_checkpoint
+
+     select_checkpoint = None if not files_checkpoints else files_checkpoints[-1]
+
+     if is_gradio:
+         return gr.update(choices=files_checkpoints, value=select_checkpoint)
+
+     return files_checkpoints, select_checkpoint
+
+
+ def get_available_projects():
+     """Get available project names from the data directory"""
+     data_paths = [
+         str(PRETRAINED_ROOT / "data"),
+     ]
+
+     project_list = []
+     for data_path in data_paths:
+         if os.path.isdir(data_path):
+             for folder in os.listdir(data_path):
+                 path_folder = os.path.join(data_path, folder)
+                 if "test" not in folder:
+                     project_list.append(folder)
+             break
+     # Fallback: if no local data dir, default to known HF projects
+     if not project_list:
+         project_list = ["multilingual_grl", "multilingual_prosody"]
+     project_list.sort()
+     print("project_list:", project_list)
+     return project_list
+
+
+ def infer(
+     project, file_checkpoint, exp_name, ref_text, ref_audio, denoise_audio, gen_text, nfe_step, use_ema,
+     separate_langs, frontend, speed, cfg_strength, use_acc_grl, ref_ratio, no_ref_audio, sway_sampling_coef,
+     use_prosody_encoder, seed,
+ ):
+     global last_checkpoint, last_device, tts_api, last_ema
+
+     # Resolve checkpoint path (local or HF)
+     ckpt_path = file_checkpoint
+     if isinstance(ckpt_path, str) and ckpt_path.startswith("hf://"):
+         try:
+             ckpt_resolved = str(cached_path(ckpt_path))
+         except Exception as e:
+             traceback.print_exc()
+             return None, f"Error downloading checkpoint: {str(e)}", ""
+     else:
+         ckpt_resolved = ckpt_path
+
+     if not os.path.isfile(ckpt_resolved):
+         return None, "Checkpoint not found!", ""
+
+     if denoise_audio:
+         ref_audio = denoise_audio
+
+     device_test = device  # Use the global device
+
+     if last_checkpoint != ckpt_resolved or last_device != device_test or last_ema != use_ema or tts_api is None:
+         if last_checkpoint != ckpt_resolved:
+             last_checkpoint = ckpt_resolved
+
+         if last_device != device_test:
+             last_device = device_test
+
+         if last_ema != use_ema:
+             last_ema = use_ema
+
+         # Automatically enable the prosody encoder when using the prosody checkpoint
+         use_prosody_encoder = True if "prosody" in str(ckpt_resolved) else False
+
+         # Resolve vocab file (local or HF)
+         local_vocab = Path(PRETRAINED_ROOT) / "data" / project / "vocab.txt"
+         if local_vocab.is_file():
+             vocab_file = str(local_vocab)
+         else:
+             remote_vocab_map = {
+                 "multilingual_grl": f"{HF_PRETRAINED_ROOT}/data/multilingual_grl/vocab.txt",
+                 "multilingual_prosody": f"{HF_PRETRAINED_ROOT}/data/multilingual_prosody/vocab.txt",
+             }
+             remote_vocab = remote_vocab_map.get(project)
+             if remote_vocab is None:
+                 return None, "Vocab file not found!", ""
+             try:
+                 vocab_file = str(cached_path(remote_vocab))
+             except Exception as e:
+                 traceback.print_exc()
+                 return None, f"Error downloading vocab: {str(e)}", ""
+
+         # Resolve prosody encoder config & weights
+         local_prosody_cfg = CKPTS_ROOT / "prosody_encoder" / "pretssel_cfg.json"
+         local_prosody_ckpt = CKPTS_ROOT / "prosody_encoder" / "prosody_encoder_UnitY2.pt"
+         if local_prosody_cfg.is_file():
+             prosody_cfg_path = str(local_prosody_cfg)
+         else:
+             prosody_cfg_path = str(
+                 cached_path(f"{HF_PRETRAINED_ROOT}/ckpts/prosody_encoder/pretssel_cfg.json")
+             )
+         if local_prosody_ckpt.is_file():
+             prosody_ckpt_path = str(local_prosody_ckpt)
+         else:
+             prosody_ckpt_path = str(
+                 cached_path(f"{HF_PRETRAINED_ROOT}/ckpts/prosody_encoder/prosody_encoder_UnitY2.pt")
+             )
+
+         try:
+             tts_api = TTS(
+                 model=exp_name,
+                 ckpt_file=ckpt_resolved,
+                 vocab_file=vocab_file,
+                 device=device_test,
+                 use_ema=use_ema,
+                 frontend=frontend,
+                 use_prosody_encoder=use_prosody_encoder,
+                 prosody_cfg_path=prosody_cfg_path,
+                 prosody_ckpt_path=prosody_ckpt_path,
+             )
+         except Exception as e:
+             traceback.print_exc()
+             return None, f"Error loading model: {str(e)}", ""
+
+         print("Model loaded >>", device_test, file_checkpoint, use_ema)
+
+     if seed == -1:  # -1 used for random
+         seed = None
+
+     try:
+         with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
+             tts_api.infer(
+                 ref_file=ref_audio,
+                 ref_text=ref_text.strip(),
+                 gen_text=gen_text.strip(),
+                 nfe_step=nfe_step,
+                 separate_langs=separate_langs,
+                 speed=speed,
+                 cfg_strength=cfg_strength,
+                 sway_sampling_coef=sway_sampling_coef,
+                 use_acc_grl=use_acc_grl,
+                 ref_ratio=ref_ratio,
+                 no_ref_audio=no_ref_audio,
+                 use_prosody_encoder=use_prosody_encoder,
+                 file_wave=f.name,
+                 seed=seed,
+             )
+         return f.name, f"Device: {tts_api.device}", str(tts_api.seed)
+     except Exception as e:
+         traceback.print_exc()
+         return None, f"Inference error: {str(e)}", ""
+
+
+ def get_gpu_stats():
+     """Get GPU statistics"""
+     gpu_stats = ""
+
+     if torch.cuda.is_available():
+         gpu_count = torch.cuda.device_count()
+         for i in range(gpu_count):
+             gpu_name = torch.cuda.get_device_name(i)
+             gpu_properties = torch.cuda.get_device_properties(i)
+             total_memory = gpu_properties.total_memory / (1024**3)  # in GB
+             allocated_memory = torch.cuda.memory_allocated(i) / (1024**2)  # in MB
+             reserved_memory = torch.cuda.memory_reserved(i) / (1024**2)  # in MB
+
+             gpu_stats += (
+                 f"GPU {i} Name: {gpu_name}\n"
+                 f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
+                 f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
+                 f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
+             )
+     elif torch.xpu.is_available():
+         gpu_count = torch.xpu.device_count()
+         for i in range(gpu_count):
+             gpu_name = torch.xpu.get_device_name(i)
+             gpu_properties = torch.xpu.get_device_properties(i)
+             total_memory = gpu_properties.total_memory / (1024**3)  # in GB
+             allocated_memory = torch.xpu.memory_allocated(i) / (1024**2)  # in MB
+             reserved_memory = torch.xpu.memory_reserved(i) / (1024**2)  # in MB
+
+             gpu_stats += (
+                 f"GPU {i} Name: {gpu_name}\n"
+                 f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
+                 f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
+                 f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
+             )
+     elif torch.backends.mps.is_available():
+         gpu_count = 1
+         gpu_stats += "MPS GPU\n"
+         total_memory = psutil.virtual_memory().total / (
+             1024**3
+         )  # Total system memory (MPS doesn't have its own memory)
+         allocated_memory = 0
+         reserved_memory = 0
+
+         gpu_stats += (
+             f"Total system memory: {total_memory:.2f} GB\n"
+             f"Allocated GPU memory (MPS): {allocated_memory:.2f} MB\n"
+             f"Reserved GPU memory (MPS): {reserved_memory:.2f} MB\n"
+         )
+
+     else:
+         gpu_stats = "No GPU available"
+
+     return gpu_stats
+
+
+ def get_cpu_stats():
+     """Get CPU statistics"""
+     cpu_usage = psutil.cpu_percent(interval=1)
+     memory_info = psutil.virtual_memory()
+     memory_used = memory_info.used / (1024**2)
+     memory_total = memory_info.total / (1024**2)
+     memory_percent = memory_info.percent
+
+     pid = os.getpid()
+     process = psutil.Process(pid)
+     nice_value = process.nice()
+
+     cpu_stats = (
+         f"CPU Usage: {cpu_usage:.2f}%\n"
+         f"System Memory: {memory_used:.2f} MB used / {memory_total:.2f} MB total ({memory_percent}% used)\n"
+         f"Process Priority (Nice value): {nice_value}"
+     )
+
+     return cpu_stats
+
+
+ def get_combined_stats():
+     """Get combined system stats"""
+     gpu_stats = get_gpu_stats()
+     cpu_stats = get_cpu_stats()
+     combined_stats = f"### GPU Stats\n{gpu_stats}\n\n### CPU Stats\n{cpu_stats}"
+     return combined_stats
+
+
+ # Create Gradio interface
+ with gr.Blocks(title="LEMAS-TTS Inference") as app:
+     gr.Markdown(
+         """
+         # Zero-Shot TTS
+
+         Set seed to -1 for random generation.
+         """
+     )
+     with gr.Accordion("Model configuration", open=False):
+         # Model configuration
+         with gr.Row():
+             exp_name = gr.Radio(
+                 label="Model",
+                 choices=["multilingual_grl", "multilingual_prosody"],
+                 value="multilingual_grl",
+                 visible=False,
+             )
+         # Project selection
+         available_projects = get_available_projects()
+
+         # Get initial checkpoints
+         list_checkpoints, checkpoint_select = get_checkpoints_project(available_projects[0] if available_projects else None, False)
+
+         with gr.Row():
+             with gr.Column(scale=1):
+                 # load_models_btn = gr.Button(value="Load models")
+                 cm_project = gr.Dropdown(
+                     choices=available_projects,
+                     value=available_projects[0] if available_projects else None,
+                     label="Project",
+                     allow_custom_value=True,
+                     scale=4,
+                 )
+
+             with gr.Column(scale=5):
+                 cm_checkpoint = gr.Dropdown(
+                     choices=list_checkpoints, value=checkpoint_select, label="Checkpoints", allow_custom_value=True  # scale=4,
+                 )
+                 bt_checkpoint_refresh = gr.Button("Refresh", scale=1)
+
+         with gr.Row():
+             ch_use_ema = gr.Checkbox(label="Use EMA", visible=False, value=True, scale=2, info="Turning this off at an early training stage might give better results")
+             frontend = gr.Radio(label="Frontend", visible=False, choices=["phone", "char", "bpe"], value="phone", scale=3)
+             separate_langs = gr.Checkbox(label="Separate Languages", visible=False, value=True, scale=2, info="separate language tokens")
+
+         # Inference parameters
+         with gr.Row():
+             nfe_step = gr.Number(label="NFE Step", scale=1, value=64)
+             speed = gr.Slider(label="Speed", scale=3, value=1.0, minimum=0.5, maximum=1.5, step=0.1)
+             cfg_strength = gr.Slider(label="CFG Strength", scale=2, value=5.0, minimum=0.0, maximum=10.0, step=1)
+             sway_sampling_coef = gr.Slider(label="Sway Sampling Coef", scale=2, value=3, minimum=2, maximum=5, step=0.1)
+             ref_ratio = gr.Slider(label="Ref Ratio", scale=2, value=1.0, minimum=0.0, maximum=1.0, step=0.1)
+             no_ref_audio = gr.Checkbox(label="No Reference Audio", visible=False, value=False, scale=1, info="No mel condition")
+             use_acc_grl = gr.Checkbox(label="Use accent grl condition", visible=False, value=True, scale=1, info="Use accent grl condition")
+             use_prosody_encoder = gr.Checkbox(label="Use prosody encoder", visible=False, value=False, scale=1, info="Use prosody encoder")
+             seed = gr.Number(label="Random Seed", scale=1, value=-1, minimum=-1)
+
+     # Input fields
+     ref_text = gr.Textbox(label="Reference Text", placeholder="Enter the text for the reference audio...")
+     ref_audio = gr.Audio(label="Reference Audio", type="filepath", interactive=True, show_download_button=True, editable=True)
+
+     with gr.Accordion("Denoise audio (Optional / Recommended)", open=True):
+         with gr.Row():
+             denoise_btn = gr.Button(value="Denoise")
+             cancel_btn = gr.Button(value="Cancel Denoise")
+         denoise_audio = gr.Audio(label="Denoised Audio", value=None, type="filepath", interactive=True, show_download_button=True, editable=True)
+
+     gen_text = gr.Textbox(label="Text to Generate", placeholder="Enter the text you want to generate...")
+
+     # Inference button and outputs
+     with gr.Row():
+         txt_info_gpu = gr.Textbox("", label="Device Info")
+         seed_info = gr.Textbox(label="Used Random Seed")
+         check_button_infer = gr.Button("Generate Audio", variant="primary")
+
+     gen_audio = gr.Audio(label="Generated Audio", type="filepath", interactive=True, show_download_button=True, editable=True)
+
+     # Examples
+     def _resolve_example(name: str) -> str:
+         local = PRETRAINED_ROOT / "data" / "test_examples" / name
+         if local.is_file():
+             return str(local)
+         remote_map = {
+             "en.wav": f"{HF_PRETRAINED_ROOT}/data/test_examples/en.wav",
+             "es.wav": f"{HF_PRETRAINED_ROOT}/data/test_examples/es.wav",
+             "pt.wav": f"{HF_PRETRAINED_ROOT}/data/test_examples/pt.wav",
+         }
+         url = remote_map.get(name)
+         return str(cached_path(url)) if url is not None else ""
+
+     examples = gr.Examples(
+         examples=[
+             [
+                 "em, #1 I have a list of YouTubers, and I'm gonna be going to their houses and raiding them by.",
+                 _resolve_example("en.wav"),
+                 "我有一份 YouTuber 名单,我打算去他们家,对他们进行突袭。",
+             ],
+             [
+                 "Te voy a dar un tip #1 que le copia a John Rockefeller, uno de los empresarios más picudos de la historia.",
+                 _resolve_example("es.wav"),
+                 "我要给你一个从历史上最精明的商人之一约翰·洛克菲勒那里抄来的秘诀。",
+             ],
+             [
+                 "Nova, #1 dia 25 desse mês vai rolar operação the last Frontier.",
+                 _resolve_example("pt.wav"),
+                 "新消息,本月二十五日,'最后的边疆行动'将启动。",
+             ],
+         ],
+         inputs=[
+             ref_text,
+             ref_audio,
+             gen_text,
+         ],
+         outputs=[gen_audio, txt_info_gpu, seed_info],
+         fn=infer,
+         cache_examples=False,
+     )
+
+     # System Info section at the bottom
+     gr.Markdown("---")
+     gr.Markdown("## System Information")
+     with gr.Accordion("Update System Stats", open=False):
+         update_button = gr.Button("Update System Stats", scale=1)
+         output_box = gr.Textbox(label="GPU and CPU Information", lines=5, scale=5)
+
+     def update_stats():
+         return get_combined_stats()
+
+     denoise_btn.click(fn=denoise, inputs=[ref_audio], outputs=[denoise_audio])
+
+     cancel_btn.click(fn=cancel_denoise, inputs=[ref_audio], outputs=[denoise_audio])
+
+     # Event handlers
+     check_button_infer.click(
+         fn=infer,
+         inputs=[
+             cm_project,
+             cm_checkpoint,
+             exp_name,
+             ref_text,
+             ref_audio,
+             denoise_audio,
+             gen_text,
+             nfe_step,
+             ch_use_ema,
+             separate_langs,
+             frontend,
+             speed,
+             cfg_strength,
+             use_acc_grl,
+             ref_ratio,
+             no_ref_audio,
+             sway_sampling_coef,
+             use_prosody_encoder,
+             seed,
+         ],
+         outputs=[gen_audio, txt_info_gpu, seed_info],
+     )
+
+     bt_checkpoint_refresh.click(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
+     cm_project.change(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
+
+     ref_audio.change(
+         fn=lambda x: None,
+         inputs=[ref_audio],
+         outputs=[denoise_audio],
+     )
+
+     update_button.click(fn=update_stats, outputs=output_box)
+
+     # Auto-load system stats on startup
+     app.load(fn=update_stats, outputs=output_box)
+
+
+ @click.command()
+ @click.option("--port", "-p", default=7860, type=int, help="Port to run the app on")
+ @click.option("--host", "-H", default="0.0.0.0", help="Host to run the app on")
+ @click.option(
+     "--share",
+     "-s",
+     default=False,
+     is_flag=True,
+     help="Share the app via Gradio share link",
+ )
+ @click.option("--api", "-a", default=True, is_flag=True, help="Allow API access")
+ def main(port, host, share, api):
+     global app
+     print("Starting LEMAS-TTS Inference Interface...")
+     print(f"Device: {device}")
+     app.queue(api_open=api).launch(
+         server_name=host,
+         server_port=port,
+         share=share,
+         show_api=api,
+         allowed_paths=[str(PRETRAINED_ROOT / "data")],
+     )
+
+
+ if __name__ == "__main__":
+     main()
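The numeric sort in `get_checkpoints_project` above assumes filenames of the form `model_<updates>.pt`. A quick illustration of the sort key (the second filename is hypothetical, added to show where a plain lexicographic sort would go wrong):

```python
import os

names = ["model_2698000.pt", "model_900000.pt"]  # second name is hypothetical
key = lambda x: int(os.path.basename(x).split("_")[1].split(".")[0])

print(sorted(names))           # ['model_2698000.pt', 'model_900000.pt']  (lexicographic, wrong order)
print(sorted(names, key=key))  # ['model_900000.pt', 'model_2698000.pt']  (sorted by update count)
```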
lemas_tts/__init__.py ADDED
@@ -0,0 +1,6 @@
+ from .api import TTS
+
+ __all__ = ["TTS"]
+
+ __version__ = "0.1.0"
+
lemas_tts/api.py ADDED
@@ -0,0 +1,252 @@
+ import random
+ import sys
+ from pathlib import Path
+ import re, regex
+ import soundfile as sf
+ import tqdm
+ from cached_path import cached_path
+ from hydra.utils import get_class
+ from omegaconf import OmegaConf
+
+ from lemas_tts.infer.utils_infer import (
+     load_model,
+     load_vocoder,
+     transcribe,
+     preprocess_ref_audio_text,
+     infer_process,
+     remove_silence_for_generated_wav,
+     save_spectrogram,
+ )
+ from lemas_tts.model.utils import seed_everything
+ from lemas_tts.model.backbones.dit import DiT
+
+
+ # Resolve repository layout so we can find pretrained assets (ckpts, vocoder, etc.)
+ THIS_FILE = Path(__file__).resolve()
+
+
+ def _find_repo_root(start: Path) -> Path:
+     """Locate the repo root by looking for a `pretrained_models` folder upwards."""
+     for p in [start, *start.parents]:
+         if (p / "pretrained_models").is_dir():
+             return p
+     cwd = Path.cwd()
+     if (cwd / "pretrained_models").is_dir():
+         return cwd
+     return start
+
+
+ REPO_ROOT = _find_repo_root(THIS_FILE)
+ # Local pretrained root (used when running from a repo / Space that bundles weights)
+ PRETRAINED_ROOT = REPO_ROOT / "pretrained_models"
+ # Remote pretrained root on Hugging Face Hub (fallback when local files are absent)
+ HF_PRETRAINED_ROOT = "hf://LEMAS-Project/LEMAS-TTS/pretrained_models"
+ CKPTS_ROOT = PRETRAINED_ROOT / "ckpts"
+
+
+ class TTS:
+     def __init__(
+         self,
+         model="multilingual",
+         ckpt_file="",
+         vocab_file="",
+         ode_method="euler",
+         use_ema=False,
+         vocoder_local_path=str(CKPTS_ROOT / "vocos-mel-24khz"),
+         use_prosody_encoder=False,
+         prosody_cfg_path="",
+         prosody_ckpt_path="",
+         device=None,
+         hf_cache_dir=None,
+         frontend="phone",
+     ):
+         # Load model architecture config from the bundled yaml
+         config_dir = THIS_FILE.parent / "configs"
+         model_cfg = OmegaConf.load(config_dir / f"{model}.yaml")
+         # model_cls = get_class(f"lemas_tts.model.dit.{model_cfg.model.backbone}")
+         model_arc = model_cfg.model.arch
+
+         self.mel_spec_type = model_cfg.model.mel_spec.mel_spec_type
+         self.target_sample_rate = model_cfg.model.mel_spec.target_sample_rate
+
+         self.ode_method = ode_method
+         self.use_ema = use_ema
+         self.langs = {"cmn": "zh", "zh": "zh", "en": "en-us", "it": "it", "es": "es", "pt": "pt-br", "fr": "fr-fr", "de": "de", "ru": "ru", "id": "id", "vi": "vi", "th": "th"}
+
+         if device is not None:
+             self.device = device
+         else:
+             import torch
+
+             self.device = (
+                 "cuda"
+                 if torch.cuda.is_available()
+                 else "xpu"
+                 if torch.xpu.is_available()
+                 else "mps"
+                 if torch.backends.mps.is_available()
+                 else "cpu"
+             )
+
+         # Load models
+         # Prefer the local vocoder directory if it exists; otherwise let `load_vocoder`
+         # fall back to downloading from the default HF repo (charactr/vocos-mel-24khz).
+         vocoder_is_local = False
+         if vocoder_local_path is not None:
+             try:
+                 vocoder_is_local = Path(vocoder_local_path).is_dir()
+             except TypeError:
+                 vocoder_is_local = False
+
+         self.vocoder = load_vocoder(
+             self.mel_spec_type, vocoder_is_local, vocoder_local_path, self.device, hf_cache_dir
+         )
+         # self.vocoder = load_vocoder(vocoder_name="vocos", is_local=True, local_path=vocoder_local_path, device=self.device)
+         if frontend is not None:
+             from lemas_tts.infer.frontend import TextNorm
+
+             self.frontend = TextNorm(dtype=frontend)
+         else:
+             self.frontend = None
+
+         self.ema_model = load_model(
+             DiT, model_arc, ckpt_file, self.mel_spec_type, vocab_file, self.ode_method, self.use_ema, self.device,
+             use_prosody_encoder=use_prosody_encoder, prosody_cfg_path=prosody_cfg_path, prosody_ckpt_path=prosody_ckpt_path,
+         )
+
+     def transcribe(self, ref_audio, language=None):
+         return transcribe(ref_audio, language)
+
+     def export_wav(self, wav, file_wave, remove_silence=False):
+         sf.write(file_wave, wav, self.target_sample_rate)
+
+         if remove_silence:
+             remove_silence_for_generated_wav(file_wave)
+
+     def export_spectrogram(self, spec, file_spec):
+         save_spectrogram(spec, file_spec)
+
+     def infer(
+         self,
+         ref_file,
+         ref_text,
+         gen_text,
+         show_info=print,
+         progress=tqdm,
+         target_rms=0.1,
+         cross_fade_duration=0.15,
+         use_acc_grl=False,
+         ref_ratio=None,
+         no_ref_audio=False,
+         cfg_strength=2,
+         nfe_step=32,
+         speed=1.0,
+         sway_sampling_coef=5,
+         separate_langs=False,
+         fix_duration=None,
+         use_prosody_encoder=True,
+         file_wave=None,
+         file_spec=None,
+         seed=None,
+     ):
+         if seed is None:
+             seed = random.randint(0, sys.maxsize)
+         seed_everything(seed)
+         self.seed = seed
+
+         ref_file, ref_text = preprocess_ref_audio_text(ref_file, ref_text)
+         print("preprocess:\n", "ref_file:", ref_file, "\nref_text:", ref_text)
+         if self.frontend.dtype == "phone":
+             ref_text = self.frontend.text2phn(ref_text + ". ").replace("(cmn)", "(zh)").split("|")
+             gen_text = gen_text.split("\n")
+             gen_text = [self.frontend.text2phn(x + ". ").replace("(cmn)", "(zh)").split("|") for x in gen_text]
+
+         elif self.frontend.dtype == "char":
+             src_lang, ref_text = self.frontend.text2norm(ref_text + ". ")
+             ref_text = ["(" + src_lang.replace("cmn", "zh") + ")"] + list(ref_text)
+             gen_text = gen_text.split("\n")
+             gen_text = [self.frontend.text2norm(x + ". ") for x in gen_text]
+             gen_text = [["(" + x[0].replace("cmn", "zh") + ")"] + list(x[1]) for x in gen_text]
+         print("after frontend:\n", "ref_text:", ref_text, "\ngen_text:", gen_text)
+
+         if separate_langs:
+             ref_text = self.process_phone_list(ref_text)  # Optional
+             gen_text = [self.process_phone_list(x) for x in gen_text]
+
+         print("gen_text:", gen_text, "\nref_text:", ref_text)
+
+         wav, sr, spec = infer_process(
+             ref_file,
+             ref_text,
+             gen_text,
+             self.ema_model,
+             self.vocoder,
+             self.mel_spec_type,
+             show_info=show_info,
+             progress=progress,
+             target_rms=target_rms,
+             cross_fade_duration=cross_fade_duration,
+             nfe_step=nfe_step,
+             cfg_strength=cfg_strength,
+             sway_sampling_coef=sway_sampling_coef,
+             use_prosody_encoder=use_prosody_encoder,
+             use_acc_grl=use_acc_grl,
+             ref_ratio=ref_ratio,
+             no_ref_audio=no_ref_audio,
+             speed=speed,
+             fix_duration=fix_duration,
+             device=self.device,
+         )
+
+         if file_wave is not None:
+             self.export_wav(wav, file_wave, remove_silence=False)
+
+         if file_spec is not None:
+             self.export_spectrogram(spec, file_spec)
+
+         return wav, sr, spec
+
+     def process_phone_list(self, parts):
+         """(vocab756 version) Process a phone list: prefix every phone that carries
+         no language id with the currently active language id."""
+         puncs = {"#1", "#2", "#3", "#4", "_", "!", ",", ".", "?", '"', "'", "^", "。", ",", "?", "!"}
+         # parts = phn_str.split('|')
+         processed = []
+         current_lang = ""
+         for i in range(len(parts)):
+             part = parts[i]
+             if part.startswith("(") and part.endswith(")") and part[1:-1] in self.langs:
+                 # This token is a language id
+                 current_lang = part
+                 # processed.append(part)
+             elif part in puncs:  # alternative check: not bool(regex.search(r'\p{L}', part[0])), i.e. non-letter characters
+                 # Pause symbol or punctuation
+                 if len(processed) > 0 and processed[-1] == "_":
+                     processed.pop()
+                 elif len(processed) > 0 and processed[-1] in puncs and part == "_":
+                     continue
+                 processed.append(part)
+                 # if i < len(parts) - 1 and parts[i+1] != "_":
+                 #     processed.append("_")
+             elif current_lang is not None:
+                 # Not a language id and a language id is active: add the prefix
+                 processed.append(f"{current_lang}{part}")
+         return processed
+
+
+ if __name__ == "__main__":
+     f5tts = TTS()
+
+     wav, sr, spec = f5tts.infer(
+         ref_file=str((THIS_FILE.parent / "infer" / "examples" / "basic" / "basic_ref_en.wav").resolve()),
+         ref_text="some call me nature, others call me mother nature.",
+         gen_text=(
+             "I don't really care what you call me. I've been a silent spectator, watching species evolve, "
+             "empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture "
+             "you; ignore me and you shall face the consequences."
+         ),
+         file_wave=str((REPO_ROOT / "outputs" / "api_out.wav").resolve()),
+         file_spec=str((REPO_ROOT / "outputs" / "api_out.png").resolve()),
+         seed=None,
+     )
+
+     print("seed :", f5tts.seed)
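As a worked example of `process_phone_list` above (the phone spellings are invented for illustration; only the `(lang)` prefixing and pause handling mirror the code):

```python
# Hypothetical frontend output; "(en)" / "(zh)" are language-id tokens.
parts = ["(en)", "h", "ə", "l", "oʊ", "_", "#1", "(zh)", "n", "i3"]

# TTS.process_phone_list(parts) prefixes each plain phone with the active
# language id, drops the language-id tokens themselves, and pops the "_"
# that directly precedes the "#1" pause marker:
# ["(en)h", "(en)ə", "(en)l", "(en)oʊ", "#1", "(zh)n", "(zh)i3"]
```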
lemas_tts/configs/multilingual_grl.yaml ADDED
@@ -0,0 +1,78 @@
+ # compute_environment: LOCAL_MACHINE
+ # debug: false
+ # distributed_type: MULTI_GPU
+ # downcast_bf16: 'no'
+ # enable_cpu_affinity: true
+ # gpu_ids: all
+ # # machine_rank: 0
+ # # main_training_function: main
+ # mixed_precision: bf16
+ # num_machines: 1
+ # num_processes: 16
+ # # rdzv_backend: static
+ # same_network: true
+ # use_cpu: false
+
+
+ hydra:
+   run:
+     dir: exp/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+
+ datasets:
+   name: multilingual_vocab898_acc_grl_ctc_fix  # dataset name
+   batch_size_per_gpu: 40000  # frame budget per GPU; with 8 GPUs, 8 * 40000 = 320000 frames per step
+   batch_size_type: frame  # frame | sample
+   max_samples: 64  # max sequences per batch when using frame-wise batch_size; 32 for small models, 64 for base models
+   num_workers: 2
+   separate_langs: True
+
+ optim:
+   epochs: 100
+   learning_rate: 2e-5
+   num_warmup_updates: 1000  # warmup updates
+   grad_accumulation_steps: 1  # note: updates = steps / grad_accumulation_steps
+   max_grad_norm: 1.0  # gradient clipping
+   bnb_optimizer: False  # use bnb 8bit AdamW optimizer or not
+ model:
+   name: multilingual  # model name
+   tokenizer: custom  # tokenizer type
+   tokenizer_path: "pretrained_models/data/multilingual_grl/vocab.txt"  # for the 'custom' tokenizer, path to the vocab.txt to use
+   audio_dir: "pretrained_models/data/multilingual_grl"
+   use_ctc_loss: True  # whether to use ctc loss
+   use_spk_enc: False
+   use_prosody_encoder: False
+   prosody_cfg_path: "pretrained_models/ckpts/prosody_encoder/pretssel_cfg.json"  # pretssel_cfg.json
+   prosody_ckpt_path: "pretrained_models/ckpts/prosody_encoder/prosody_encoder_UnitY2.pt"  # prosody_encoder_pretssel.pt
+
+   backbone: DiT
+   arch:
+     dim: 1024
+     depth: 22
+     heads: 16
+     ff_mult: 2
+     text_dim: 512
+     text_mask_padding: True
+     qk_norm: null  # null | rms_norm
+     conv_layers: 4
+     pe_attn_head: null
+     checkpoint_activations: False  # recompute activations to save memory, at the cost of extra compute
+   mel_spec:
+     target_sample_rate: 24000
+     n_mel_channels: 100
+     hop_length: 256
+     win_length: 1024
+     n_fft: 1024
+     mel_spec_type: vocos  # vocos | bigvgan
+   vocoder:
+     is_local: True  # use local offline ckpt or not
+     # Path in the original training environment; kept here for reference only.
+     # For the open-sourced LEMAS-TTS repo, use `pretrained_models/ckpts/vocos-mel-24khz`.
+     local_path: "pretrained_models/ckpts/vocos-mel-24khz"  # local vocoder path
+
+ ckpts:
+   logger: tensorboard  # wandb | tensorboard | null
+   log_samples: True  # infer a random sample per saved checkpoint; wip, normal to fail with extra-long samples
+   save_per_updates: 1000  # save a checkpoint every N updates
+   keep_last_n_checkpoints: -1  # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+   last_per_updates: 1000  # save the "last" checkpoint every N updates
+   save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
lemas_tts/configs/multilingual_prosody.yaml ADDED
@@ -0,0 +1,78 @@
+ # compute_environment: LOCAL_MACHINE
+ # debug: false
+ # distributed_type: MULTI_GPU
+ # downcast_bf16: 'no'
+ # enable_cpu_affinity: true
+ # gpu_ids: all
+ # # machine_rank: 0
+ # # main_training_function: main
+ # mixed_precision: bf16
+ # num_machines: 1
+ # num_processes: 16
+ # # rdzv_backend: static
+ # same_network: true
+ # use_cpu: false
+
+
+ hydra:
+   run:
+     dir: exp/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}/${now:%Y-%m-%d}/${now:%H-%M-%S}
+
+ datasets:
+   name: multilingual_vocab898_acc_grl_prosody_ctc_fix  # dataset name
+   batch_size_per_gpu: 40000  # frame budget per GPU; with 8 GPUs, 8 * 40000 = 320000 frames per step
+   batch_size_type: frame  # frame | sample
+   max_samples: 64  # max sequences per batch when using frame-wise batch_size; 32 for small models, 64 for base models
+   num_workers: 2
+   separate_langs: True
+
+ optim:
+   epochs: 100
+   learning_rate: 2e-5
+   num_warmup_updates: 1000  # warmup updates
+   grad_accumulation_steps: 1  # note: updates = steps / grad_accumulation_steps
+   max_grad_norm: 1.0  # gradient clipping
+   bnb_optimizer: False  # use bnb 8bit AdamW optimizer or not
+ model:
+   name: multilingual  # model name
+   tokenizer: custom  # tokenizer type
+   tokenizer_path: "pretrained_models/data/multilingual_grl/vocab.txt"  # for the 'custom' tokenizer, path to the vocab.txt to use
+   audio_dir: "pretrained_models/data/multilingual_grl"
+   use_ctc_loss: True  # whether to use ctc loss
+   use_spk_enc: False
+   use_prosody_encoder: True
+   prosody_cfg_path: "pretrained_models/ckpts/prosody_encoder/pretssel_cfg.json"  # pretssel_cfg.json
+   prosody_ckpt_path: "pretrained_models/ckpts/prosody_encoder/prosody_encoder_UnitY2.pt"  # prosody_encoder_pretssel.pt
+
+   backbone: DiT
+   arch:
+     dim: 1024
+     depth: 22
+     heads: 16
+     ff_mult: 2
+     text_dim: 512
+     text_mask_padding: True
+     qk_norm: null  # null | rms_norm
+     conv_layers: 4
+     pe_attn_head: null
+     checkpoint_activations: False  # recompute activations to save memory, at the cost of extra compute
+   mel_spec:
+     target_sample_rate: 24000
+     n_mel_channels: 100
+     hop_length: 256
+     win_length: 1024
+     n_fft: 1024
+     mel_spec_type: vocos  # vocos | bigvgan
+   vocoder:
+     is_local: True  # use local offline ckpt or not
+     # Path in the original training environment; kept here for reference only.
+     # For the open-sourced LEMAS-TTS repo, use `pretrained_models/ckpts/vocos-mel-24khz`.
+     local_path: "pretrained_models/ckpts/vocos-mel-24khz"  # local vocoder path
+
+ ckpts:
+   logger: tensorboard  # wandb | tensorboard | null
+   log_samples: True  # infer a random sample per saved checkpoint; wip, normal to fail with extra-long samples
+   save_per_updates: 1000  # save a checkpoint every N updates
+   keep_last_n_checkpoints: -1  # -1 to keep all, 0 to not save intermediate, > 0 to keep last N checkpoints
+   last_per_updates: 1000  # save the "last" checkpoint every N updates
+   save_dir: ckpts/${model.name}_${model.mel_spec.mel_spec_type}_${model.tokenizer}_${datasets.name}
lemas_tts/infer/frontend.py ADDED
@@ -0,0 +1,251 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os, re, regex
2
+ import langid
3
+ import uroman as ur
4
+ import jieba, zhconv
5
+ from num2words import num2words
6
+
7
+ jieba.set_dictionary(dictionary_path=os.path.join(os.path.dirname(__file__) + "/../infer/text_norm/jieba_dict.txt"))
8
+ # from pypinyin.core import Pinyin
9
+ from pypinyin import pinyin, lazy_pinyin, Style
10
+
11
+ from .text_norm.txt2pinyin import _PAUSE_SYMBOL, get_phoneme_from_char_and_pinyin
12
+ from .text_norm.cn_tn import NSWNormalizer
13
+ from .text_norm.tokenizer import TextTokenizer, txt2phone
14
+ from pypinyin.contrib.tone_convert import to_initials, to_finals_tone3
15
+ from pypinyin_dict.phrase_pinyin_data import large_pinyin # large_pinyin # cc_cedict
16
+ large_pinyin.load()
17
+
18
+ class TextNorm():
19
+ def __init__(self, dtype="phone"):
20
+ # my_pinyin = Pinyin(MyConverter())
21
+ # self.pinyin_parser = my_pinyin.pinyin
22
+ cmn_lexicon = open(os.path.join(os.path.dirname(__file__)+'/../infer/text_norm/pinyin-lexicon-r.txt'),'r', encoding="utf-8").readlines()
23
+ cmn_lexicon = [x.strip().split() for x in cmn_lexicon]
24
+ self.cmn_dict = {x[0]:x[1:] for x in cmn_lexicon}
25
+ langid.set_languages(['es','pt','zh','en','de','fr','it','ru', 'vi','id','th','ja','ko','ar'])
26
+ langs = {"en":"en-us", "it":"it", "es":"es", "pt":"pt-br", "fr":"fr-fr", "de":"de", "ru":"ru", "vi":"vi", "id":"id", "th":"th", "ja":"ja", "ko":"ko"} # "zh":"cmn", "cmn":"cmn", "ar":"ar-sa"}
27
+ text_tokenizer = {}
28
+ for k,v in langs.items():
29
+ tokenizer = TextTokenizer(language=v, backend="espeak")
30
+ lang = "zh" if k == "cmn" else k
31
+ text_tokenizer[k] = (lang, tokenizer)
32
+ self.text_tokenizer = text_tokenizer
33
+ self.cn_tn = NSWNormalizer()
34
+ self.dtype = dtype
35
+
36
+ def detect_lang(self, text):
37
+ lang, _ = langid.classify(text)[0]
38
+ return lang
39
+
40
+ def sil_type(self, time_s):
41
+ if round(time_s) < 0.4:
42
+ return ""
43
+ elif round(time_s) >= 0.4 and round(time_s) < 0.8:
44
+ return "#1"
45
+ elif round(time_s) >= 0.8 and round(time_s) < 1.5:
46
+ return "#2"
47
+ elif round(time_s) >= 1.5 and round(time_s) < 3.0:
48
+ return "#3"
49
+ elif round(time_s) >= 3.0:
50
+ return "#4"
51
+
52
+
53
+ def add_sil_raw(self, sub_list, start_time, end_time, target_transcript):
54
+ txt = []
55
+ txt_list = [x["word"] for x in sub_list]
56
+ sil = self.sil_type(sub_list[0]["start"])
57
+ if len(sil) > 0:
58
+ txt.append(sil)
59
+ txt.append(txt_list[0])
60
+ for i in range(1, len(sub_list)):
61
+ if sub_list[i]["start"] >= start_time and sub_list[i]["end"] <= end_time:
62
+ txt.append(target_transcript)
63
+ target_transcript = ""
64
+ else:
65
+ sil = self.sil_type(sub_list[i]["start"] - sub_list[i-1]["end"])
66
+ if len(sil) > 0:
67
+ txt.append(sil)
68
+ txt.append(txt_list[i])
69
+ return ' '.join(txt)
70
+
71
+ def add_sil(self, sub_list, start_time, end_time, target_transcript, src_lang, tar_lang):
72
+ txts = []
73
+ txt_list = [x["word"] for x in sub_list]
74
+ sil = self.sil_type(sub_list[0]["start"])
75
+ if len(sil) > 0:
76
+ txts.append([src_lang, sil])
77
+
78
+ if sub_list[0]["start"] < start_time:
79
+ txts.append([src_lang, txt_list[0]])
80
+ for i in range(1, len(sub_list)):
81
+ if sub_list[i]["start"] >= start_time and sub_list[i]["end"] <= end_time:
82
+ txts.append([tar_lang, target_transcript])
83
+ target_transcript = ""
84
+ else:
85
+ sil = self.sil_type(sub_list[i]["start"] - sub_list[i-1]["end"])
86
+ if len(sil) > 0:
87
+ txts.append([src_lang, sil])
88
+ txts.append([src_lang, txt_list[i]])
89
+
90
+ target_txt = [txts[0]]
91
+ for txt in txts[1:]:
92
+ if txt[1] == "":
93
+ continue
94
+ if txt[0] != target_txt[-1][0]:
95
+ target_txt.append([txt[0], ""])
96
+ target_txt[-1][-1] += " " + txt[1]
97
+
98
+ return target_txt
99
+
100
+ def replace_numbers_with_words(self, sentence, lang="en"):
101
+ sentence = re.sub(r'(\d+)', r' \1 ', sentence) # add spaces around numbers
102
+
103
+ def replace_with_words(match):
104
+ num = match.group(0)
105
+ try:
106
+ return num2words(num, lang=lang) # Convert numbers to words
107
+ except:
108
+ return num # In case num2words fails (unlikely with digits but just to be safe)
109
+ return re.sub(r'\b\d+\b', replace_with_words, sentence) # Regular expression that matches numbers
110
+
111
+
112
+ def get_prompt(self, sub_list, start_time, end_time, src_lang):
113
+ txts = []
114
+ txt_list = [x["word"] for x in sub_list]
115
+
116
+ if start_time <= sub_list[0]["start"]:
117
+ sil = self.sil_type(sub_list[0]["start"])
118
+ if len(sil) > 0:
119
+ txts.append([src_lang, sil])
120
+ txts.append([src_lang, txt_list[0]])
121
+
122
+ for i in range(1, len(sub_list)):
123
+ # if sub_list[i]["start"] <= start_time and sub_list[i]["end"] <= end_time:
124
+ # txts.append([tar_lang, target_transcript])
125
+ # target_transcript = ""
126
+ if sub_list[i]["start"] >= start_time and sub_list[i]["end"] <= end_time:
127
+ sil = self.sil_type(sub_list[i]["start"] - sub_list[i-1]["end"])
128
+ if len(sil) > 0:
129
+ txts.append([src_lang, sil])
130
+ txts.append([src_lang, txt_list[i]])
131
+
132
+ target_txt = [txts[0]]
133
+ for txt in txts[1:]:
134
+ if txt[1] == "":
135
+ continue
136
+ if txt[0] != target_txt[-1][0]:
137
+ target_txt.append([txt[0], ""])
138
+ target_txt[-1][-1] += " " + txt[1]
139
+ return target_txt
140
+
141
+
142
+ def txt2pinyin(self, text):
143
+ txts, phonemes = [], []
144
+ texts = re.split(r"(#\d)", text)
145
+ print("before norm: ", texts)
146
+ for text in texts:
147
+ if text in {'#1', '#2', '#3', '#4'}:
148
+ txts.append(text)
149
+ phonemes.append(text)
150
+ continue
151
+ text = self.cn_tn.normalize(text.strip())
152
+
153
+ text_list = list(jieba.cut(text))
154
+ print("jieba cut: ", text, text_list)
155
+ for words in text_list:
156
+ if words in _PAUSE_SYMBOL:
157
+ # phonemes[-1] += _PAUSE_SYMBOL[words]
158
+ phonemes.append(_PAUSE_SYMBOL[words])
159
+ # phonemes.append('#1')
160
+ txts[-1] += words
161
+ elif re.search("[\u4e00-\u9fa5]+", words):
162
+ # pinyin = self.pinyin_parser(words, style=Style.TONE3, errors="ignore")
163
+ pinyin = lazy_pinyin(words, style=Style.TONE3, tone_sandhi=True, neutral_tone_with_five=True)
164
+ new_pinyin = []
165
+ for x in pinyin:
166
+ x = "".join(x)
167
+ if "#" not in x:
168
+ new_pinyin.append(x)
169
+ else:
170
+ phonemes.append(words)
171
+ continue
172
+ # new_pinyin = change_tone_in_bu_or_yi(words, new_pinyin) if len(words)>1 and words[-1] not in {"一","不"} else new_pinyin
173
+ phoneme = get_phoneme_from_char_and_pinyin(words, new_pinyin)
174
+ phonemes += phoneme
175
+ txts += list(words)
176
+ elif re.search(r"[a-zA-Z]", words) or re.search(r"#[1-4]", words):
177
+ phonemes.append(words.upper())
178
+ txts.append(words.upper())
179
+ # phonemes.append("#1")
180
+ # phones = " ".join(phonemes)
181
+ return txts, phonemes
182
+
183
+
184
+ def txt2pin_phns(self, text):
185
+ text = re.sub(r'(?<! )(' + r'[^\w\s]' + r')', r' \1', text)
186
+ text = re.sub(r'\s+', ' ', text).strip()
187
+
188
+ # print(text.split(" "))
189
+ res_list = []
190
+ for txt in text.split(" "):
191
+ if txt in self.cmn_dict:
192
+ # res_list += ["(zh)" + x for x in self.cmn_dict[txt]]
193
+ res_list.append("(zh)")
194
+ res_list.append(to_initials(txt, strict=False))
195
+ res_list.append(to_finals_tone3(txt, neutral_tone_with_five=True))
196
+ elif txt == '':
197
+ continue
198
+ elif txt[0] in {"#1", "#2", "#3", "#4"} or not bool(regex.search(r'\p{L}', txt[0][0])):
199
+ if len(res_list) > 0 and res_list[-1] == "_":
200
+ res_list.pop()
201
+ res_list += [txt]
202
+ continue
203
+ else:
204
+ if len(res_list) > 0 and res_list[-1] == "_":
205
+ res_list.pop()
206
+ lang = langid.classify(txt)[0]
207
+ lang = lang if lang in self.text_tokenizer else "en"
208
+ tokenizer = self.text_tokenizer[lang][1]
209
+ ipa = tokenizer.backend.phonemize([txt], separator=tokenizer.separator, strip=True, njobs=1)
210
+ phns = ipa[0] if ipa[0][0] == "(" else f"({lang})_" + ipa[0]
211
+ res_list += phns.replace("_", "|_|").split("|")
212
+
213
+ # lang = phns.split(")")[0][1:]
214
+ # phns = phns[len(lang)+3:].replace("_", "|_|")
215
+ # phns = phns.split("|")
216
+ # for i in range(len(phns)):
217
+ # if phns[i] not in {"#1", "#2", "#3", "#4", "_", ",", ".", "?", "!"}:
218
+ # phns[i] = f"({lang})" + phns[i]
219
+ # res_list += phns
220
+ res_list.append("_")
221
+ res = "|".join(res_list)
222
+ res = re.sub(r'(\|_)+', '|_', res)
223
+ return res
224
+
225
+
226
+ def text2phn(self, sentence, lang=None):
227
+ if not lang:
228
+ lang = langid.classify(sentence)[0]
229
+ if re.search("[\u4e00-\u9fa5]+", sentence):
230
+ txts, phones = self.txt2pinyin(sentence)
231
+ transcript_norm = " ".join(phones)
232
+ phones = self.txt2pin_phns(transcript_norm) # IPA mix Pinyin
233
+ else:
234
+ transcript = self.replace_numbers_with_words(sentence, lang=lang).split(' ')
235
+ transcript_norm = sentence
236
+ # All IPA
237
+ phones = txt2phone(self.text_tokenizer[lang][1], transcript_norm.strip().replace(".", ",").replace("。", ","))
238
+ phones = f"({lang})|" + phones if phones[0] != "(" else phones
239
+ return phones
240
+
241
+
242
+ def text2norm(self, sentence, lang=None):
243
+ if not lang:
244
+ lang = langid.classify(sentence)[0]
245
+ if re.search("[\u4e00-\u9fa5]+", sentence):
246
+ txts, phones = self.txt2pinyin(sentence)
247
+ transcript_norm = " ".join(phones)
248
+ else:
249
+ transcript = self.replace_numbers_with_words(sentence, lang=lang).split(' ')
250
+ transcript_norm = sentence
251
+ return (lang, transcript_norm)
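
A note on the pinyin path above: `txt2pinyin` relies on jieba word segmentation plus pypinyin's sandhi-aware conversion. A minimal standalone sketch of that core call (the `demo_g2p` wrapper is illustrative, not part of the package API):

```python
import jieba
from pypinyin import lazy_pinyin, Style

def demo_g2p(text):
    """Segment with jieba, then convert each word with tone-sandhi-aware pypinyin."""
    pinyins = []
    for word in jieba.cut(text):
        # tone_sandhi=True applies rules such as 3-3 -> 2-3 ("你好" -> ni2 hao3);
        # neutral_tone_with_five=True writes the neutral tone explicitly as tone 5.
        pinyins += lazy_pinyin(word, style=Style.TONE3,
                               tone_sandhi=True, neutral_tone_with_five=True)
    return pinyins

print(demo_g2p("你好,世界"))  # e.g. ['ni2', 'hao3', ',', 'shi4', 'jie4']
```
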
lemas_tts/infer/infer_cli.py ADDED
@@ -0,0 +1,386 @@
1
+ import argparse
2
+ import codecs
3
+ import os
4
+ import re
5
+ from datetime import datetime
6
+ from importlib.resources import files
7
+ from pathlib import Path
8
+
9
+ import numpy as np
10
+ import soundfile as sf
11
+ import tomli
12
+ from cached_path import cached_path
13
+ from hydra.utils import get_class
14
+ from omegaconf import OmegaConf
15
+
16
+ from lemas_tts.infer.utils_infer import (
17
+ mel_spec_type,
18
+ target_rms,
19
+ cross_fade_duration,
20
+ nfe_step,
21
+ cfg_strength,
22
+ sway_sampling_coef,
23
+ speed,
24
+ fix_duration,
25
+ device,
26
+ infer_process,
27
+ load_model,
28
+ load_vocoder,
29
+ preprocess_ref_audio_text,
30
+ remove_silence_for_generated_wav,
31
+ )
32
+
33
+ THIS_FILE = Path(__file__).resolve()
34
+
35
+
36
+ def _find_repo_root(start: Path) -> Path:
37
+ """Locate the repo root by looking for a `pretrained_models` folder upwards."""
38
+ for p in [start, *start.parents]:
39
+ if (p / "pretrained_models").is_dir():
40
+ return p
41
+ cwd = Path.cwd()
42
+ if (cwd / "pretrained_models").is_dir():
43
+ return cwd
44
+ return start
45
+
46
+
47
+ REPO_ROOT = _find_repo_root(THIS_FILE)
48
+ PRETRAINED_ROOT = REPO_ROOT / "pretrained_models"
49
+ CKPTS_ROOT = PRETRAINED_ROOT / "ckpts"
50
+
51
+
52
+ parser = argparse.ArgumentParser(
53
+ prog="python3 infer-cli.py",
54
+ description="Command-line interface for E2/F5 TTS with advanced batch processing.",
55
+ epilog="Specify options above to override one or more settings from config.",
56
+ )
57
+ parser.add_argument(
58
+ "-c",
59
+ "--config",
60
+ type=str,
61
+ default=os.path.join(files("lemas_tts").joinpath("infer/examples/basic"), "basic.toml"),
62
+ help="The configuration file, default see infer/examples/basic/basic.toml",
63
+ )
64
+
65
+
66
+ # Note: no default values provided here, so that defaults are read from the config file
67
+
68
+ parser.add_argument(
69
+ "-m",
70
+ "--model",
71
+ type=str,
72
+ help="The model name: F5TTS_v1_Base | F5TTS_Base | E2TTS_Base | etc.",
73
+ )
74
+ parser.add_argument(
75
+ "-mc",
76
+ "--model_cfg",
77
+ type=str,
78
+ help="The path to F5-TTS model config file .yaml",
79
+ )
80
+ parser.add_argument(
81
+ "-p",
82
+ "--ckpt_file",
83
+ type=str,
84
+ help="The path to model checkpoint .pt, leave blank to use default",
85
+ )
86
+ parser.add_argument(
87
+ "-v",
88
+ "--vocab_file",
89
+ type=str,
90
+ help="The path to vocab file .txt, leave blank to use default",
91
+ )
92
+ parser.add_argument(
93
+ "-r",
94
+ "--ref_audio",
95
+ type=str,
96
+ help="The reference audio file.",
97
+ )
98
+ parser.add_argument(
99
+ "-s",
100
+ "--ref_text",
101
+ type=str,
102
+ help="The transcript/subtitle for the reference audio",
103
+ )
104
+ parser.add_argument(
105
+ "-t",
106
+ "--gen_text",
107
+ type=str,
108
+ help="The text to make model synthesize a speech",
109
+ )
110
+ parser.add_argument(
111
+ "-f",
112
+ "--gen_file",
113
+ type=str,
114
+ help="The file with text to generate, will ignore --gen_text",
115
+ )
116
+ parser.add_argument(
117
+ "-o",
118
+ "--output_dir",
119
+ type=str,
120
+ help="The path to output folder",
121
+ )
122
+ parser.add_argument(
123
+ "-w",
124
+ "--output_file",
125
+ type=str,
126
+ help="The name of output file",
127
+ )
128
+ parser.add_argument(
129
+ "--save_chunk",
130
+ action="store_true",
131
+ help="To save each audio chunks during inference",
132
+ )
133
+ parser.add_argument(
134
+ "--remove_silence",
135
+ action="store_true",
136
+ help="To remove long silence found in output",
137
+ )
138
+ parser.add_argument(
139
+ "--load_vocoder_from_local",
140
+ action="store_true",
141
+ help="To load vocoder from local dir, default to ../checkpoints/vocos-mel-24khz",
142
+ )
143
+ parser.add_argument(
144
+ "--vocoder_name",
145
+ type=str,
146
+ choices=["vocos", "bigvgan"],
147
+ help=f"Used vocoder name: vocos | bigvgan, default {mel_spec_type}",
148
+ )
149
+ parser.add_argument(
150
+ "--target_rms",
151
+ type=float,
152
+ help=f"Target output speech loudness normalization value, default {target_rms}",
153
+ )
154
+ parser.add_argument(
155
+ "--cross_fade_duration",
156
+ type=float,
157
+ help=f"Duration of cross-fade between audio segments in seconds, default {cross_fade_duration}",
158
+ )
159
+ parser.add_argument(
160
+ "--nfe_step",
161
+ type=int,
162
+ help=f"The number of function evaluations (denoising steps), default {nfe_step}",
163
+ )
164
+ parser.add_argument(
165
+ "--cfg_strength",
166
+ type=float,
167
+ help=f"Classifier-free guidance strength, default {cfg_strength}",
168
+ )
169
+ parser.add_argument(
170
+ "--sway_sampling_coef",
171
+ type=float,
172
+ help=f"Sway Sampling coefficient, default {sway_sampling_coef}",
173
+ )
174
+ parser.add_argument(
175
+ "--speed",
176
+ type=float,
177
+ help=f"The speed of the generated audio, default {speed}",
178
+ )
179
+ parser.add_argument(
180
+ "--fix_duration",
181
+ type=float,
182
+ help=f"Fix the total duration (ref and gen audios) in seconds, default {fix_duration}",
183
+ )
184
+ parser.add_argument(
185
+ "--device",
186
+ type=str,
187
+ help="Specify the device to run on",
188
+ )
189
+ args = parser.parse_args()
190
+
191
+
192
+ # config file
193
+
194
+ config = tomli.load(open(args.config, "rb"))
195
+
196
+
197
+ # command-line interface parameters
198
+
199
+ model = args.model or config.get("model", "F5TTS_v1_Base")
200
+ ckpt_file = args.ckpt_file or config.get("ckpt_file", "")
201
+ vocab_file = args.vocab_file or config.get("vocab_file", "")
202
+
203
+ ref_audio = args.ref_audio or config.get("ref_audio", "infer/examples/basic/basic_ref_en.wav")
204
+ ref_text = (
205
+ args.ref_text
206
+ if args.ref_text is not None
207
+ else config.get("ref_text", "Some call me nature, others call me mother nature.")
208
+ )
209
+ gen_text = args.gen_text or config.get("gen_text", "Here we generate something just for a test.")
210
+ gen_file = args.gen_file or config.get("gen_file", "")
211
+
212
+ output_dir = args.output_dir or config.get("output_dir", "tests")
213
+ output_file = args.output_file or config.get(
214
+ "output_file", f"infer_cli_{datetime.now().strftime(r'%Y%m%d_%H%M%S')}.wav"
215
+ )
216
+
217
+ save_chunk = args.save_chunk or config.get("save_chunk", False)
218
+ remove_silence = args.remove_silence or config.get("remove_silence", False)
219
+ load_vocoder_from_local = args.load_vocoder_from_local or config.get("load_vocoder_from_local", False)
220
+
221
+ vocoder_name = args.vocoder_name or config.get("vocoder_name", mel_spec_type)
222
+ target_rms = args.target_rms or config.get("target_rms", target_rms)
223
+ cross_fade_duration = args.cross_fade_duration or config.get("cross_fade_duration", cross_fade_duration)
224
+ nfe_step = args.nfe_step or config.get("nfe_step", nfe_step)
225
+ cfg_strength = args.cfg_strength or config.get("cfg_strength", cfg_strength)
226
+ sway_sampling_coef = args.sway_sampling_coef or config.get("sway_sampling_coef", sway_sampling_coef)
227
+ speed = args.speed or config.get("speed", speed)
228
+ fix_duration = args.fix_duration or config.get("fix_duration", fix_duration)
229
+ device = args.device or config.get("device", device)
230
+
231
+
232
+ # patches for pip pkg user
233
+ if "infer/examples/" in ref_audio:
234
+ ref_audio = str(files("lemas_tts").joinpath(f"{ref_audio}"))
235
+ if "infer/examples/" in gen_file:
236
+ gen_file = str(files("lemas_tts").joinpath(f"{gen_file}"))
237
+ if "voices" in config:
238
+ for voice in config["voices"]:
239
+ voice_ref_audio = config["voices"][voice]["ref_audio"]
240
+ if "infer/examples/" in voice_ref_audio:
241
+ config["voices"][voice]["ref_audio"] = str(files("lemas_tts").joinpath(f"{voice_ref_audio}"))
242
+
243
+
244
+ # ignore gen_text if gen_file provided
245
+
246
+ if gen_file:
247
+ gen_text = codecs.open(gen_file, "r", "utf-8").read()
248
+
249
+
250
+ # output path
251
+
252
+ wave_path = Path(output_dir) / output_file
253
+ # spectrogram_path = Path(output_dir) / "infer_cli_out.png"
254
+ if save_chunk:
255
+ output_chunk_dir = os.path.join(output_dir, f"{Path(output_file).stem}_chunks")
256
+ if not os.path.exists(output_chunk_dir):
257
+ os.makedirs(output_chunk_dir)
258
+
259
+
260
+ # load vocoder
261
+
262
+ if vocoder_name == "vocos":
263
+ vocoder_local_path = str(CKPTS_ROOT / "vocos-mel-24khz")
264
+ elif vocoder_name == "bigvgan":
265
+ vocoder_local_path = "../checkpoints/bigvgan_v2_24khz_100band_256x"
266
+
267
+ vocoder = load_vocoder(
268
+ vocoder_name=vocoder_name, is_local=load_vocoder_from_local, local_path=vocoder_local_path, device=device
269
+ )
270
+
271
+
272
+ # load TTS model
273
+
274
+ model_cfg = OmegaConf.load(
275
+ args.model_cfg or config.get("model_cfg", str(files("lemas_tts").joinpath(f"configs/{model}.yaml")))
276
+ )
277
+ model_cls = get_class(f"lemas_tts.model.{model_cfg.model.backbone}")
278
+ model_arc = model_cfg.model.arch
279
+
280
+ repo_name, ckpt_step, ckpt_type = "F5-TTS", 1250000, "safetensors"
281
+
282
+ if model != "F5TTS_Base":
283
+ assert vocoder_name == model_cfg.model.mel_spec.mel_spec_type
284
+
285
+ # override for previous models
286
+ if model == "F5TTS_Base":
287
+ if vocoder_name == "vocos":
288
+ ckpt_step = 1200000
289
+ elif vocoder_name == "bigvgan":
290
+ model = "F5TTS_Base_bigvgan"
291
+ ckpt_type = "pt"
292
+ elif model == "E2TTS_Base":
293
+ repo_name = "E2-TTS"
294
+ ckpt_step = 1200000
295
+
296
+ if not ckpt_file:
297
+ ckpt_file = str(cached_path(f"hf://SWivid/{repo_name}/{model}/model_{ckpt_step}.{ckpt_type}"))
298
+
299
+ print(f"Using {model}...")
300
+ ema_model = load_model(
301
+ model_cls, model_arc, ckpt_file, mel_spec_type=vocoder_name, vocab_file=vocab_file, device=device
302
+ )
303
+
304
+
305
+ # inference process
306
+
307
+
308
+ def main():
309
+ main_voice = {"ref_audio": ref_audio, "ref_text": ref_text}
310
+ if "voices" not in config:
311
+ voices = {"main": main_voice}
312
+ else:
313
+ voices = config["voices"]
314
+ voices["main"] = main_voice
315
+ for voice in voices:
316
+ print("Voice:", voice)
317
+ print("ref_audio ", voices[voice]["ref_audio"])
318
+ voices[voice]["ref_audio"], voices[voice]["ref_text"] = preprocess_ref_audio_text(
319
+ voices[voice]["ref_audio"], voices[voice]["ref_text"]
320
+ )
321
+ print("ref_audio_", voices[voice]["ref_audio"], "\n\n")
322
+
323
+ generated_audio_segments = []
324
+ reg1 = r"(?=\[\w+\])"
325
+ chunks = re.split(reg1, gen_text)
326
+ reg2 = r"\[(\w+)\]"
327
+ for text in chunks:
328
+ if not text.strip():
329
+ continue
330
+ match = re.match(reg2, text)
331
+ if match:
332
+ voice = match[1]
333
+ else:
334
+ print("No voice tag found, using main.")
335
+ voice = "main"
336
+ if voice not in voices:
337
+ print(f"Voice {voice} not found, using main.")
338
+ voice = "main"
339
+ text = re.sub(reg2, "", text)
340
+ ref_audio_ = voices[voice]["ref_audio"]
341
+ ref_text_ = voices[voice]["ref_text"]
342
+ gen_text_ = text.strip()
343
+ print(f"Voice: {voice}")
344
+ audio_segment, final_sample_rate, spectrogram = infer_process(
345
+ ref_audio_,
346
+ ref_text_,
347
+ gen_text_,
348
+ ema_model,
349
+ vocoder,
350
+ mel_spec_type=vocoder_name,
351
+ target_rms=target_rms,
352
+ cross_fade_duration=cross_fade_duration,
353
+ nfe_step=nfe_step,
354
+ cfg_strength=cfg_strength,
355
+ sway_sampling_coef=sway_sampling_coef,
356
+ speed=speed,
357
+ fix_duration=fix_duration,
358
+ device=device,
359
+ )
360
+ generated_audio_segments.append(audio_segment)
361
+
362
+ if save_chunk:
363
+ if len(gen_text_) > 200:
364
+ gen_text_ = gen_text_[:200] + " ... "
365
+ sf.write(
366
+ os.path.join(output_chunk_dir, f"{len(generated_audio_segments) - 1}_{gen_text_}.wav"),
367
+ audio_segment,
368
+ final_sample_rate,
369
+ )
370
+
371
+ if generated_audio_segments:
372
+ final_wave = np.concatenate(generated_audio_segments)
373
+
374
+ if not os.path.exists(output_dir):
375
+ os.makedirs(output_dir)
376
+
377
+ with open(wave_path, "wb") as f:
378
+ sf.write(f.name, final_wave, final_sample_rate)
379
+ # Remove silence
380
+ if remove_silence:
381
+ remove_silence_for_generated_wav(f.name)
382
+ print(f.name)
383
+
384
+
385
+ if __name__ == "__main__":
386
+ main()
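
The `[voice]` handling in `main()` hinges on a zero-width lookahead split, so each tag stays attached to the chunk it introduces. A standalone sketch of that chunking:

```python
import re

gen_text = "[main] Hello there. [town] General Kenobi! [main] So uncivilized."
for chunk in re.split(r"(?=\[\w+\])", gen_text):  # split *before* each [tag]
    if not chunk.strip():
        continue
    m = re.match(r"\[(\w+)\]", chunk)
    voice = m[1] if m else "main"                 # untagged chunks use "main"
    text = re.sub(r"\[(\w+)\]", "", chunk).strip()
    print(voice, "->", text)
```
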
lemas_tts/infer/text_norm/__init__.py ADDED
File without changes
lemas_tts/infer/text_norm/cn_tn.py ADDED
@@ -0,0 +1,824 @@
1
+ #!/usr/bin/env python3
2
+ # coding=utf-8
3
+ # Authors:
4
+ # 2019.5 Zhiyang Zhou (https://github.com/Joee1995/chn_text_norm.git)
5
+ # 2019.9 Jiayu DU
6
+ #
7
+ # requirements:
8
+ # - python 3.X
9
+ # notes: python 2.X WILL fail or produce misleading results
10
+
11
+ import sys, os, argparse, codecs, string, re, unicodedata
12
+
13
+ # ================================================================================ #
14
+ # basic constant
15
+ # ================================================================================ #
16
+ CHINESE_DIGIS = u'零一二三四五六七八九'
17
+ BIG_CHINESE_DIGIS_SIMPLIFIED = u'零壹贰叁肆伍陆柒捌玖'
18
+ BIG_CHINESE_DIGIS_TRADITIONAL = u'零壹貳參肆伍陸柒捌玖'
19
+ SMALLER_BIG_CHINESE_UNITS_SIMPLIFIED = u'十百千万'
20
+ SMALLER_BIG_CHINESE_UNITS_TRADITIONAL = u'拾佰仟萬'
21
+ LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'亿兆京垓秭穰沟涧正载'
22
+ LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'億兆京垓秭穰溝澗正載'
23
+ SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'十百千万'
24
+ SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'拾佰仟萬'
25
+
26
+ ZERO_ALT = u'〇'
27
+ ONE_ALT = u'幺'
28
+ TWO_ALTS = [u'两', u'兩']
29
+
30
+ POSITIVE = [u'正', u'正']
31
+ NEGATIVE = [u'负', u'負']
32
+ POINT = [u'点', u'點']
33
+ # PLUS = [u'加', u'加']
34
+ # SIL = [u'杠', u'槓']
35
+
36
+ # Chinese numbering-system types
37
+ NUMBERING_TYPES = ['low', 'mid', 'high']
38
+
39
+ CURRENCY_NAMES = '(人民币|美元|日元|英镑|欧元|马克|法郎|加拿大元|澳元|港币|先令|芬兰马克|爱尔兰镑|' \
40
+ '里拉|荷兰盾|埃斯库多|比塞塔|印尼盾|林吉特|新西兰元|比索|卢布|新加坡元|韩元|泰铢)'
41
+ CURRENCY_UNITS = '((亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)'
42
+ COM_QUANTIFIERS = '(匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|' \
43
+ '砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|' \
44
+ '针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|' \
45
+ '毫|厘|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|' \
46
+ '盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|' \
47
+ '纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块)'
48
+
49
+ # punctuation information is based on the Zhon project (https://github.com/tsroten/zhon.git)
50
+ CHINESE_PUNC_STOP = '!?。。'
51
+ CHINESE_PUNC_NON_STOP = '"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—''‛""„‟…‧﹏'
52
+ CHINESE_PUNC_OTHER = '·〈〉-'
53
+ CHINESE_PUNC_LIST = CHINESE_PUNC_STOP + CHINESE_PUNC_NON_STOP + CHINESE_PUNC_OTHER
54
+
55
+ # ================================================================================ #
56
+ # basic class
57
+ # ================================================================================ #
58
+ class ChineseChar(object):
59
+ """
60
+ A Chinese character.
61
+ Each character has a simplified and a traditional form,
62
+ e.g. simplified = '负', traditional = '負';
63
+ it can be converted to either form.
64
+ """
65
+
66
+ def __init__(self, simplified, traditional):
67
+ self.simplified = simplified
68
+ self.traditional = traditional
69
+ #self.__repr__ = self.__str__
70
+
71
+ def __str__(self):
72
+ return self.simplified or self.traditional or None
73
+
74
+ def __repr__(self):
75
+ return self.__str__()
76
+
77
+
78
+ class ChineseNumberUnit(ChineseChar):
79
+ """
80
+ A Chinese digit/unit character.
81
+ Besides the simplified and traditional forms, each character has an extra formal ("big") variant,
82
+ e.g. '陆' and '陸'
83
+ """
84
+
85
+ def __init__(self, power, simplified, traditional, big_s, big_t):
86
+ super(ChineseNumberUnit, self).__init__(simplified, traditional)
87
+ self.power = power
88
+ self.big_s = big_s
89
+ self.big_t = big_t
90
+
91
+ def __str__(self):
92
+ return '10^{}'.format(self.power)
93
+
94
+ @classmethod
95
+ def create(cls, index, value, numbering_type=NUMBERING_TYPES[1], small_unit=False):
96
+
97
+ if small_unit:
98
+ return ChineseNumberUnit(power=index + 1,
99
+ simplified=value[0], traditional=value[1], big_s=value[1], big_t=value[1])
100
+ elif numbering_type == NUMBERING_TYPES[0]:
101
+ return ChineseNumberUnit(power=index + 8,
102
+ simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
103
+ elif numbering_type == NUMBERING_TYPES[1]:
104
+ return ChineseNumberUnit(power=(index + 2) * 4,
105
+ simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
106
+ elif numbering_type == NUMBERING_TYPES[2]:
107
+ return ChineseNumberUnit(power=pow(2, index + 3),
108
+ simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
109
+ else:
110
+ raise ValueError(
111
+ 'Counting type should be in {0} ({1} provided).'.format(NUMBERING_TYPES, numbering_type))
112
+
113
+
114
+ class ChineseNumberDigit(ChineseChar):
115
+ """
116
+ A Chinese digit character.
117
+ """
118
+
119
+ def __init__(self, value, simplified, traditional, big_s, big_t, alt_s=None, alt_t=None):
120
+ super(ChineseNumberDigit, self).__init__(simplified, traditional)
121
+ self.value = value
122
+ self.big_s = big_s
123
+ self.big_t = big_t
124
+ self.alt_s = alt_s
125
+ self.alt_t = alt_t
126
+
127
+ def __str__(self):
128
+ return str(self.value)
129
+
130
+ @classmethod
131
+ def create(cls, i, v):
132
+ return ChineseNumberDigit(i, v[0], v[1], v[2], v[3])
133
+
134
+
135
+ class ChineseMath(ChineseChar):
136
+ """
137
+ A Chinese math symbol character.
138
+ """
139
+
140
+ def __init__(self, simplified, traditional, symbol, expression=None):
141
+ super(ChineseMath, self).__init__(simplified, traditional)
142
+ self.symbol = symbol
143
+ self.expression = expression
144
+ self.big_s = simplified
145
+ self.big_t = traditional
146
+
147
+
148
+ CC, CNU, CND, CM = ChineseChar, ChineseNumberUnit, ChineseNumberDigit, ChineseMath
149
+
150
+
151
+ class NumberSystem(object):
152
+ """
153
+ The Chinese numbering system.
154
+ """
155
+ pass
156
+
157
+
158
+ class MathSymbol(object):
159
+ """
160
+ Math symbols used in the Chinese numbering system (traditional/simplified), e.g.
161
+ positive = ['正', '正']
162
+ negative = ['负', '負']
163
+ point = ['点', '點']
164
+ """
165
+
166
+ def __init__(self, positive, negative, point):
167
+ self.positive = positive
168
+ self.negative = negative
169
+ self.point = point
170
+
171
+ def __iter__(self):
172
+ for v in self.__dict__.values():
173
+ yield v
174
+
175
+
176
+ # class OtherSymbol(object):
177
+ # """
178
+ # 其他符号
179
+ # """
180
+ #
181
+ # def __init__(self, sil):
182
+ # self.sil = sil
183
+ #
184
+ # def __iter__(self):
185
+ # for v in self.__dict__.values():
186
+ # yield v
187
+
188
+
189
+ # ================================================================================ #
190
+ # basic utils
191
+ # ================================================================================ #
192
+ def create_system(numbering_type=NUMBERING_TYPES[1]):
193
+ """
194
+ Create the number system for the given numbering type (default: mid).
195
+ NUMBERING_TYPES = ['low', 'mid', 'high'] are the Chinese numbering-system types:
196
+ low: '兆' = '亿' * '十' = $10^{9}$, '京' = '兆' * '十', etc.
197
+ mid: '兆' = '亿' * '万' = $10^{12}$, '京' = '兆' * '万', etc.
198
+ high: '兆' = '亿' * '亿' = $10^{16}$, '京' = '兆' * '兆', etc.
199
+ Returns the corresponding number system.
200
+ """
201
+
202
+ # chinese number units of '亿' and larger
203
+ all_larger_units = zip(
204
+ LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED, LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL)
205
+ larger_units = [CNU.create(i, v, numbering_type, False)
206
+ for i, v in enumerate(all_larger_units)]
207
+ # chinese number units of '十, 百, 千, 万'
208
+ all_smaller_units = zip(
209
+ SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED, SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL)
210
+ smaller_units = [CNU.create(i, v, small_unit=True)
211
+ for i, v in enumerate(all_smaller_units)]
212
+ # digis
213
+ chinese_digis = zip(CHINESE_DIGIS, CHINESE_DIGIS,
214
+ BIG_CHINESE_DIGIS_SIMPLIFIED, BIG_CHINESE_DIGIS_TRADITIONAL)
215
+ digits = [CND.create(i, v) for i, v in enumerate(chinese_digis)]
216
+ digits[0].alt_s, digits[0].alt_t = ZERO_ALT, ZERO_ALT
217
+ digits[1].alt_s, digits[1].alt_t = ONE_ALT, ONE_ALT
218
+ digits[2].alt_s, digits[2].alt_t = TWO_ALTS[0], TWO_ALTS[1]
219
+
220
+ # symbols
221
+ positive_cn = CM(POSITIVE[0], POSITIVE[1], '+', lambda x: x)
222
+ negative_cn = CM(NEGATIVE[0], NEGATIVE[1], '-', lambda x: -x)
223
+ point_cn = CM(POINT[0], POINT[1], '.', lambda x,
224
+ y: float(str(x) + '.' + str(y)))
225
+ # sil_cn = CM(SIL[0], SIL[1], '-', lambda x, y: float(str(x) + '-' + str(y)))
226
+ system = NumberSystem()
227
+ system.units = smaller_units + larger_units
228
+ system.digits = digits
229
+ system.math = MathSymbol(positive_cn, negative_cn, point_cn)
230
+ # system.symbols = OtherSymbol(sil_cn)
231
+ return system
232
+
233
+
234
+ def chn2num(chinese_string, numbering_type=NUMBERING_TYPES[1]):
235
+
236
+ def get_symbol(char, system):
237
+ for u in system.units:
238
+ if char in [u.traditional, u.simplified, u.big_s, u.big_t]:
239
+ return u
240
+ for d in system.digits:
241
+ if char in [d.traditional, d.simplified, d.big_s, d.big_t, d.alt_s, d.alt_t]:
242
+ return d
243
+ for m in system.math:
244
+ if char in [m.traditional, m.simplified]:
245
+ return m
246
+
247
+ def string2symbols(chinese_string, system):
248
+ int_string, dec_string = chinese_string, ''
249
+ for p in [system.math.point.simplified, system.math.point.traditional]:
250
+ if p in chinese_string:
251
+ int_string, dec_string = chinese_string.split(p)
252
+ break
253
+ return [get_symbol(c, system) for c in int_string], \
254
+ [get_symbol(c, system) for c in dec_string]
255
+
256
+ def correct_symbols(integer_symbols, system):
257
+ """
258
+ 一百八 to 一百八十
259
+ 一亿一千三百万 to 一亿 一千万 三百万
260
+ """
261
+
262
+ if integer_symbols and isinstance(integer_symbols[0], CNU):
263
+ if integer_symbols[0].power == 1:
264
+ integer_symbols = [system.digits[1]] + integer_symbols
265
+
266
+ if len(integer_symbols) > 1:
267
+ if isinstance(integer_symbols[-1], CND) and isinstance(integer_symbols[-2], CNU):
268
+ integer_symbols.append(
269
+ CNU(integer_symbols[-2].power - 1, None, None, None, None))
270
+
271
+ result = []
272
+ unit_count = 0
273
+ for s in integer_symbols:
274
+ if isinstance(s, CND):
275
+ result.append(s)
276
+ unit_count = 0
277
+ elif isinstance(s, CNU):
278
+ current_unit = CNU(s.power, None, None, None, None)
279
+ unit_count += 1
280
+
281
+ if unit_count == 1:
282
+ result.append(current_unit)
283
+ elif unit_count > 1:
284
+ for i in range(len(result)):
285
+ if isinstance(result[-i - 1], CNU) and result[-i - 1].power < current_unit.power:
286
+ result[-i - 1] = CNU(result[-i - 1].power +
287
+ current_unit.power, None, None, None, None)
288
+ return result
289
+
290
+ def compute_value(integer_symbols):
291
+ """
292
+ Compute the value.
293
+ When current unit is larger than previous unit, current unit * all previous units will be used as all previous units.
294
+ e.g. '两千万' = 2000 * 10000 not 2000 + 10000
295
+ """
296
+ value = [0]
297
+ last_power = 0
298
+ for s in integer_symbols:
299
+ if isinstance(s, CND):
300
+ value[-1] = s.value
301
+ elif isinstance(s, CNU):
302
+ value[-1] *= pow(10, s.power)
303
+ if s.power > last_power:
304
+ value[:-1] = list(map(lambda v: v *
305
+ pow(10, s.power), value[:-1]))
306
+ last_power = s.power
307
+ value.append(0)
308
+ return sum(value)
309
+
310
+ system = create_system(numbering_type)
311
+ int_part, dec_part = string2symbols(chinese_string, system)
312
+ int_part = correct_symbols(int_part, system)
313
+ int_str = str(compute_value(int_part))
314
+ dec_str = ''.join([str(d.value) for d in dec_part])
315
+ if dec_part:
316
+ return '{0}.{1}'.format(int_str, dec_str)
317
+ else:
318
+ return int_str
319
+
320
+
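
A quick round trip through `chn2num` above and `num2chn` below (a hypothetical session, assuming the module is importable under this package path):

```python
from lemas_tts.infer.text_norm.cn_tn import chn2num, num2chn

print(chn2num('两千万'))  # '20000000' (units multiply: 2000 * 10000)
print(chn2num('一百八'))  # '180' (correct_symbols promotes the trailing 八 to 八十)
print(num2chn('10260'))   # '一万零二百六十'
print(num2chn('1.5'))     # '一点五'
```
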
321
+ def num2chn(number_string, numbering_type=NUMBERING_TYPES[1], big=False,
322
+ traditional=False, alt_zero=False, alt_one=False, alt_two=True,
323
+ use_zeros=True, use_units=True):
324
+
325
+ def get_value(value_string, use_zeros=True):
326
+
327
+ striped_string = value_string.lstrip('0')
328
+
329
+ # record nothing if all zeros
330
+ if not striped_string:
331
+ return []
332
+
333
+ # record one digits
334
+ elif len(striped_string) == 1:
335
+ if use_zeros and len(value_string) != len(striped_string):
336
+ return [system.digits[0], system.digits[int(striped_string)]]
337
+ else:
338
+ return [system.digits[int(striped_string)]]
339
+
340
+ # recursively record multiple digits
341
+ else:
342
+ result_unit = next(u for u in reversed(
343
+ system.units) if u.power < len(striped_string))
344
+ result_string = value_string[:-result_unit.power]
345
+ return get_value(result_string) + [result_unit] + get_value(striped_string[-result_unit.power:])
346
+
347
+ system = create_system(numbering_type)
348
+
349
+ int_dec = number_string.split('.')
350
+ if len(int_dec) == 1:
351
+ int_string = int_dec[0]
352
+ dec_string = ""
353
+ elif len(int_dec) == 2:
354
+ int_string = int_dec[0]
355
+ dec_string = int_dec[1]
356
+ else:
357
+ raise ValueError(
358
+ "invalid input num string with more than one dot: {}".format(number_string))
359
+
360
+ if use_units and len(int_string) > 1:
361
+ result_symbols = get_value(int_string)
362
+ else:
363
+ result_symbols = [system.digits[int(c)] for c in int_string]
364
+ dec_symbols = [system.digits[int(c)] for c in dec_string]
365
+ if dec_string:
366
+ result_symbols += [system.math.point] + dec_symbols
367
+
368
+ if alt_two:
369
+ liang = CND(2, system.digits[2].alt_s, system.digits[2].alt_t,
370
+ system.digits[2].big_s, system.digits[2].big_t)
371
+ for i, v in enumerate(result_symbols):
372
+ if isinstance(v, CND) and v.value == 2:
373
+ next_symbol = result_symbols[i +
374
+ 1] if i < len(result_symbols) - 1 else None
375
+ previous_symbol = result_symbols[i - 1] if i > 0 else None
376
+ if isinstance(next_symbol, CNU) and isinstance(previous_symbol, (CNU, type(None))):
377
+ if next_symbol.power != 1 and ((previous_symbol is None) or (previous_symbol.power != 1)):
378
+ result_symbols[i] = liang
379
+
380
+ # if big is True, '两' will not be used and `alt_two` has no impact on output
381
+ if big:
382
+ attr_name = 'big_'
383
+ if traditional:
384
+ attr_name += 't'
385
+ else:
386
+ attr_name += 's'
387
+ else:
388
+ if traditional:
389
+ attr_name = 'traditional'
390
+ else:
391
+ attr_name = 'simplified'
392
+
393
+ result = ''.join([getattr(s, attr_name) for s in result_symbols])
394
+
395
+ # if not use_zeros:
396
+ # result = result.strip(getattr(system.digits[0], attr_name))
397
+
398
+ if alt_zero:
399
+ result = result.replace(
400
+ getattr(system.digits[0], attr_name), system.digits[0].alt_s)
401
+
402
+ if alt_one:
403
+ result = result.replace(
404
+ getattr(system.digits[1], attr_name), system.digits[1].alt_s)
405
+
406
+ for i, p in enumerate(POINT):
407
+ if result.startswith(p):
408
+ return CHINESE_DIGIS[0] + result
409
+
410
+ # ^10, 11, .., 19
411
+ if len(result) >= 2 and result[1] in [SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED[0],
412
+ SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL[0]] and \
413
+ result[0] in [CHINESE_DIGIS[1], BIG_CHINESE_DIGIS_SIMPLIFIED[1], BIG_CHINESE_DIGIS_TRADITIONAL[1]]:
414
+ result = result[1:]
415
+
416
+ return result
417
+
418
+
419
+ # ================================================================================ #
420
+ # different types of rewriters
421
+ # ================================================================================ #
422
+ class Cardinal:
423
+ """
424
+ The CARDINAL class.
425
+ """
426
+
427
+ def __init__(self, cardinal=None, chntext=None):
428
+ self.cardinal = cardinal
429
+ self.chntext = chntext
430
+
431
+ def chntext2cardinal(self):
432
+ return chn2num(self.chntext)
433
+
434
+ def cardinal2chntext(self):
435
+ return num2chn(self.cardinal)
436
+
437
+ class Digit:
438
+ """
439
+ The DIGIT class.
440
+ """
441
+
442
+ def __init__(self, digit=None, chntext=None):
443
+ self.digit = digit
444
+ self.chntext = chntext
445
+
446
+ # def chntext2digit(self):
447
+ # return chn2num(self.chntext)
448
+
449
+ def digit2chntext(self):
450
+ return num2chn(self.digit, alt_two=False, use_units=False)
451
+
452
+
453
+ class TelePhone:
454
+ """
455
+ The TELEPHONE class.
456
+ """
457
+
458
+ def __init__(self, telephone=None, raw_chntext=None, chntext=None):
459
+ self.telephone = telephone
460
+ self.raw_chntext = raw_chntext
461
+ self.chntext = chntext
462
+
463
+ # def chntext2telephone(self):
464
+ # sil_parts = self.raw_chntext.split('<SIL>')
465
+ # self.telephone = '-'.join([
466
+ # str(chn2num(p)) for p in sil_parts
467
+ # ])
468
+ # return self.telephone
469
+
470
+ def telephone2chntext(self, fixed=False):
471
+
472
+ if fixed:
473
+ sil_parts = self.telephone.split('-')
474
+ self.raw_chntext = '<SIL>'.join([
475
+ num2chn(part, alt_two=False, use_units=False) for part in sil_parts
476
+ ])
477
+ self.chntext = self.raw_chntext.replace('<SIL>', '')
478
+ else:
479
+ sp_parts = self.telephone.strip('+').split()
480
+ self.raw_chntext = '<SP>'.join([
481
+ num2chn(part, alt_two=False, use_units=False) for part in sp_parts
482
+ ])
483
+ self.chntext = self.raw_chntext.replace('<SP>', '')
484
+ return self.chntext
485
+
486
+
487
+ class Fraction:
488
+ """
489
+ The FRACTION class.
490
+ """
491
+
492
+ def __init__(self, fraction=None, chntext=None):
493
+ self.fraction = fraction
494
+ self.chntext = chntext
495
+
496
+ def chntext2fraction(self):
497
+ denominator, numerator = self.chntext.split('分之')
498
+ return chn2num(numerator) + '/' + chn2num(denominator)
499
+
500
+ def fraction2chntext(self):
501
+ numerator, denominator = self.fraction.split('/')
502
+ return num2chn(denominator) + '分之' + num2chn(numerator)
503
+
504
+
505
+ class Date:
506
+ """
507
+ The DATE class.
508
+ """
509
+
510
+ def __init__(self, date=None, chntext=None):
511
+ self.date = date
512
+ self.chntext = chntext
513
+
514
+ # def chntext2date(self):
515
+ # chntext = self.chntext
516
+ # try:
517
+ # year, other = chntext.strip().split('年', maxsplit=1)
518
+ # year = Digit(chntext=year).digit2chntext() + '年'
519
+ # except ValueError:
520
+ # other = chntext
521
+ # year = ''
522
+ # if other:
523
+ # try:
524
+ # month, day = other.strip().split('月', maxsplit=1)
525
+ # month = Cardinal(chntext=month).chntext2cardinal() + '月'
526
+ # except ValueError:
527
+ # day = chntext
528
+ # month = ''
529
+ # if day:
530
+ # day = Cardinal(chntext=day[:-1]).chntext2cardinal() + day[-1]
531
+ # else:
532
+ # month = ''
533
+ # day = ''
534
+ # date = year + month + day
535
+ # self.date = date
536
+ # return self.date
537
+
538
+ def date2chntext(self):
539
+ date = self.date
540
+ try:
541
+ year, other = date.strip().split('年', 1)
542
+ year = Digit(digit=year).digit2chntext() + '年'
543
+ except ValueError:
544
+ other = date
545
+ year = ''
546
+ if other:
547
+ try:
548
+ month, day = other.strip().split('月', 1)
549
+ month = Cardinal(cardinal=month).cardinal2chntext() + '月'
550
+ except ValueError:
551
+ day = date
552
+ month = ''
553
+ if day:
554
+ day = Cardinal(cardinal=day[:-1]).cardinal2chntext() + day[-1]
555
+ else:
556
+ month = ''
557
+ day = ''
558
+ chntext = year + month + day
559
+ self.chntext = chntext
560
+ return self.chntext
561
+
562
+ class Time:
563
+ """
564
+ The TIME class.
565
+ """
566
+
567
+ def __init__(self, time=None, chntext=None):
568
+ self.time = time
569
+ self.chntext = chntext
570
+
571
+ # def chntext2money(self):
572
+ # return self.money
573
+
574
+ def time2chntext(self):
575
+ time = self.time.replace('-', '至')
576
+ pattern = re.compile(r'(\d{1,2}:\d{1,2}(:)?(\d{1,2})?)')
577
+ matchers = pattern.findall(time)
578
+ if matchers:
579
+ if len(matchers[0])>2:
580
+ time = time.replace(':', '时', 1)
581
+ time = time.replace(':', '分', 1)
582
+ self.chntext = time
583
+ return self.chntext
584
+
585
+ class Money:
586
+ """
587
+ The MONEY class.
588
+ """
589
+
590
+ def __init__(self, money=None, chntext=None):
591
+ self.money = money
592
+ self.chntext = chntext
593
+
594
+ # def chntext2money(self):
595
+ # return self.money
596
+
597
+ def money2chntext(self):
598
+ money = self.money
599
+ pattern = re.compile(r'(\d+(\.\d+)?)')
600
+ matchers = pattern.findall(money)
601
+ if matchers:
602
+ for matcher in matchers:
603
+ money = money.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext())
604
+ self.chntext = money
605
+ return self.chntext
606
+
607
+
608
+ class Percentage:
609
+ """
610
+ The PERCENTAGE class.
611
+ """
612
+
613
+ def __init__(self, percentage=None, chntext=None):
614
+ self.percentage = percentage
615
+ self.chntext = chntext
616
+
617
+ def chntext2percentage(self):
618
+ return chn2num(self.chntext.strip().strip('百分之')) + '%'
619
+
620
+ def percentage2chntext(self):
621
+ return '百分之' + num2chn(self.percentage.strip().strip('%'))
622
+
623
+
624
+ # ================================================================================ #
625
+ # NSW Normalizer
626
+ # ================================================================================ #
627
+ class NSWNormalizer:
628
+ def __init__(self):
629
+ self.raw_text = ' ' # '^' + raw_text + '$'
630
+ self.norm_text = ''
631
+
632
+ def _particular(self):
633
+ text = self.norm_text
634
+ pattern = re.compile(r"(([a-zA-Z]+)二([a-zA-Z]+))")
635
+ matchers = pattern.findall(text)
636
+ if matchers:
637
+ # print('particular')
638
+ for matcher in matchers:
639
+ text = text.replace(matcher[0], matcher[1]+'2'+matcher[2], 1)
640
+ self.norm_text = text
641
+ return self.norm_text
642
+
643
+ def normalize(self, raw_text):
644
+ self.raw_text = '^' + raw_text + '$'
645
+ text = unicodedata.normalize("NFKC", self.raw_text)
646
+ # normalize dates
647
+ pattern = re.compile(r"\D+((([089]\d|(19|20)\d{2})年)?(\d{1,2}月(\d{1,2}[日号])?)?)")
648
+ matchers = pattern.findall(text)
649
+ if matchers:
650
+ #print('date')
651
+ for matcher in matchers:
652
+ text = text.replace(matcher[0], Date(date=matcher[0]).date2chntext(), 1)
653
+
654
+ # normalize times
655
+ pattern = re.compile(r"\D+((\d{1,2}-)?\d{1,2}[时点:]((\d{1,2}-)?\d{1,2}[分:]((\d{1,2}-)?\d{1,2}秒)?)?)")
656
+ matchers = pattern.findall(text)
657
+ if matchers:
658
+ #print('time')
659
+ for matcher in matchers:
660
+ text = text.replace(matcher[0], Time(time=matcher[0]).time2chntext(), 1)
661
+
662
+ # normalize money amounts
663
+ pattern = re.compile(r"\D+((\d+(\.\d+)?)[多余几]?" + CURRENCY_UNITS + r"(\d" + CURRENCY_UNITS + r"?)?)")
664
+ matchers = pattern.findall(text)
665
+ if matchers:
666
+ #print('money')
667
+ for matcher in matchers:
668
+ text = text.replace(matcher[0], Money(money=matcher[0]).money2chntext(), 1)
669
+
670
+ # normalize landline/mobile phone numbers
671
+ # mobile
672
+ # http://www.jihaoba.com/news/show/13680
673
+ # China Mobile: 139, 138, 137, 136, 135, 134, 159, 158, 157, 150, 151, 152, 188, 187, 182, 183, 184, 178, 198
674
+ # China Unicom: 130, 131, 132, 156, 155, 186, 185, 176
675
+ # China Telecom: 133, 153, 189, 180, 181, 177
676
+ pattern = re.compile(r"\D((\+?86 ?)?1([38]\d|5[0-35-9]|7[678]|9[89])\d{8})\D")
677
+ matchers = pattern.findall(text)
678
+ if matchers:
679
+ #print('telephone')
680
+ for matcher in matchers:
681
+ text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(), 1)
682
+ # landline
683
+ pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D")
684
+ matchers = pattern.findall(text)
685
+ if matchers:
686
+ # print('fixed telephone')
687
+ for matcher in matchers:
688
+ text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(fixed=True), 1)
689
+
690
+ # normalize fractions
691
+ pattern = re.compile(r"(\d+/\d+)")
692
+ matchers = pattern.findall(text)
693
+ if matchers:
694
+ #print('fraction')
695
+ for matcher in matchers:
696
+ text = text.replace(matcher, Fraction(fraction=matcher).fraction2chntext(), 1)
697
+
698
+ # normalize percentages
699
+ text = text.replace('%', '%')
700
+ pattern = re.compile(r"(\d+(\.\d+)?%)")
701
+ matchers = pattern.findall(text)
702
+ if matchers:
703
+ #print('percentage')
704
+ for matcher in matchers:
705
+ text = text.replace(matcher[0], Percentage(percentage=matcher[0]).percentage2chntext(), 1)
706
+
707
+ # normalize number + quantifier
708
+ pattern = re.compile(r"(\d+(\.\d+)?)[多余几]?" + COM_QUANTIFIERS)
709
+ matchers = pattern.findall(text)
710
+ if matchers:
711
+ #print('cardinal+quantifier')
712
+ for matcher in matchers:
713
+ text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)
714
+
715
+ # normalize digit-string IDs
716
+ pattern = re.compile(r"(\d{2,32})")
717
+ matchers = pattern.findall(text)
718
+ if matchers:
719
+ #print('digit')
720
+ for matcher in matchers:
721
+ text = text.replace(matcher, Digit(digit=matcher).digit2chntext(), 1)
722
+
723
+ # normalize plain cardinals
724
+ pattern = re.compile(r"(\d+(\.\d+)?)")
725
+ matchers = pattern.findall(text)
726
+ if matchers:
727
+ #print('cardinal')
728
+ for matcher in matchers:
729
+ text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)
730
+
731
+ self.norm_text = text
732
+ self._particular()
733
+
734
+ return self.norm_text.lstrip('^').rstrip('$')
735
+
736
+
737
+ def nsw_test_case(raw_text):
738
+ print('I:' + raw_text)
739
+ print('O:' + NSWNormalizer().normalize(raw_text))
740
+ print('')
741
+
742
+
743
+ def nsw_test():
744
+ nsw_test_case('固话:0595-23865596或23880880。')
745
+ nsw_test_case('固话:0595-23865596或23880880。')
746
+ nsw_test_case('手机:+86 19859213959或15659451527。')
747
+ nsw_test_case('分数:32477/76391。')
748
+ nsw_test_case('百分数:80.03%。')
749
+ nsw_test_case('编号:31520181154418。')
750
+ nsw_test_case('纯数:2983.07克或12345.60米。')
751
+ nsw_test_case('日期:1999年2月20日或09年3月15号。')
752
+ nsw_test_case('金钱:12块5,34.5元,20.1万')
753
+ nsw_test_case('特殊:O2O或B2C。')
754
+ nsw_test_case('3456万吨')
755
+ nsw_test_case('2938个')
756
+ nsw_test_case('938')
757
+ nsw_test_case('今天吃了115个小笼包231个馒头')
758
+ nsw_test_case('有62%的概率')
759
+
760
+
761
+ if __name__ == '__main__':
762
+ #nsw_test()
763
+
764
+ p = argparse.ArgumentParser()
765
+ p.add_argument('ifile', help='input filename, assume utf-8 encoding')
766
+ p.add_argument('ofile', help='output filename')
767
+ p.add_argument('--to_upper', action='store_true', help='convert to upper case')
768
+ p.add_argument('--to_lower', action='store_true', help='convert to lower case')
769
+ p.add_argument('--has_key', action='store_true', help="input text has Kaldi's key as first field.")
770
+ p.add_argument('--log_interval', type=int, default=100000, help='log interval in number of processed lines')
771
+ args = p.parse_args()
772
+
773
+ ifile = codecs.open(args.ifile, 'r', 'utf8')
774
+ ofile = codecs.open(args.ofile, 'w+', 'utf8')
775
+
776
+ n = 0
777
+ for l in ifile:
778
+ key = ''
779
+ text = ''
780
+ if args.has_key:
781
+ cols = l.split(maxsplit=1)
782
+ key = cols[0]
783
+ if len(cols) == 2:
784
+ text = cols[1].strip()
785
+ else:
786
+ text = ''
787
+ else:
788
+ text = l.strip()
789
+
790
+ # cases
791
+ if args.to_upper and args.to_lower:
792
+ sys.stderr.write('cn_tn.py: to_upper OR to_lower?')
793
+ exit(1)
794
+ if args.to_upper:
795
+ text = text.upper()
796
+ if args.to_lower:
797
+ text = text.lower()
798
+
799
+ # NSW(Non-Standard-Word) normalization
800
+ text = NSWNormalizer(text).normalize()
801
+
802
+ # Punctuations removal
803
+ old_chars = CHINESE_PUNC_LIST + string.punctuation # includes all CN and EN punctuations
804
+ new_chars = ' ' * len(old_chars)
805
+ del_chars = ''
806
+ text = text.translate(str.maketrans(old_chars, new_chars, del_chars))
807
+
808
+ #
809
+ if args.has_key:
810
+ ofile.write(key + '\t' + text + '\n')
811
+ else:
812
+ if text.strip() != '': # skip empty line in pure text format(without Kaldi's utt key)
813
+ ofile.write(text + '\n')
814
+
815
+ n += 1
816
+ if n % args.log_interval == 0:
817
+ sys.stderr.write("cn_tn.py: {} lines done.\n".format(n))
818
+ sys.stderr.flush()
819
+
820
+ sys.stderr.write("cn_tn.py: {} lines done in total.\n".format(n))
821
+ sys.stderr.flush()
822
+
823
+ ifile.close()
824
+ ofile.close()
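
Usage sketch for the normalizer defined above; note that `normalize()` takes the raw text as its argument (the constructor takes none):

```python
from lemas_tts.infer.text_norm.cn_tn import NSWNormalizer

norm = NSWNormalizer()
print(norm.normalize('有62%的概率'))  # -> 有百分之六十二的概率
print(norm.normalize('3456万吨'))     # -> 三千四百五十六万吨
```
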
lemas_tts/infer/text_norm/en_tn.py ADDED
@@ -0,0 +1,178 @@
1
+ # -*- coding: utf-8 -*-
2
+ # Copyright (c) 2017 Keith Ito
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the "Software"), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+
11
+ # The above copyright notice and this permission notice shall be included in
12
+ # all copies or substantial portions of the Software.
13
+
14
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
20
+ # THE SOFTWARE.
21
+
22
+ import re
23
+ from unidecode import unidecode
24
+ import inflect
25
+
26
+ _inflect = inflect.engine()
27
+ _comma_number_re = re.compile(r"([0-9][0-9\,]+[0-9])")
28
+ _decimal_number_re = re.compile(r"([0-9]+\.[0-9]+)")
29
+ _pounds_re = re.compile(r"£([0-9\,]*[0-9]+)")
30
+ _dollars_re = re.compile(r"\$([0-9\.\,]*[0-9]+)")
31
+ _ordinal_re = re.compile(r"[0-9]+(st|nd|rd|th)")
32
+ _number_re = re.compile(r"[0-9]+")
33
+
34
+
35
+ def _remove_commas(m):
36
+ return m.group(1).replace(",", "")
37
+
38
+
39
+ def _expand_decimal_point(m):
40
+ return m.group(1).replace(".", " point ")
41
+
42
+
43
+ def _expand_dollars(m):
44
+ match = m.group(1)
45
+ parts = match.split(".")
46
+ if len(parts) > 2:
47
+ return match + " dollars" # Unexpected format
48
+ dollars = int(parts[0]) if parts[0] else 0
49
+ cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
50
+ if dollars and cents:
51
+ dollar_unit = "dollar" if dollars == 1 else "dollars"
52
+ cent_unit = "cent" if cents == 1 else "cents"
53
+ return "%s %s, %s %s" % (dollars, dollar_unit, cents, cent_unit)
54
+ elif dollars:
55
+ dollar_unit = "dollar" if dollars == 1 else "dollars"
56
+ return "%s %s" % (dollars, dollar_unit)
57
+ elif cents:
58
+ cent_unit = "cent" if cents == 1 else "cents"
59
+ return "%s %s" % (cents, cent_unit)
60
+ else:
61
+ return "zero dollars"
62
+
63
+
64
+ def _expand_ordinal(m):
65
+ return _inflect.number_to_words(m.group(0))
66
+
67
+
68
+ def _expand_number(m):
69
+ num = int(m.group(0))
70
+ if num > 1000 and num < 3000:
71
+ if num == 2000:
72
+ return "two thousand"
73
+ elif num > 2000 and num < 2010:
74
+ return "two thousand " + _inflect.number_to_words(num % 100)
75
+ elif num % 100 == 0:
76
+ return _inflect.number_to_words(num // 100) + " hundred"
77
+ else:
78
+ return _inflect.number_to_words(
79
+ num, andword="", zero="oh", group=2
80
+ ).replace(", ", " ")
81
+ else:
82
+ return _inflect.number_to_words(num, andword="")
83
+
84
+
85
+ def normalize_numbers(text):
86
+ text = re.sub(_comma_number_re, _remove_commas, text)
87
+ text = re.sub(_pounds_re, r"\1 pounds", text)
88
+ text = re.sub(_dollars_re, _expand_dollars, text)
89
+ text = re.sub(_decimal_number_re, _expand_decimal_point, text)
90
+ text = re.sub(_ordinal_re, _expand_ordinal, text)
91
+ text = re.sub(_number_re, _expand_number, text)
92
+ return text
93
+
94
+ # Regular expression matching whitespace:
95
+ _whitespace_re = re.compile(r"\s+")
96
+
97
+ # List of (regular expression, replacement) pairs for abbreviations:
98
+ _abbreviations = [
99
+ (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1])
100
+ for x in [
101
+ ("mrs", "misess"),
102
+ ("mr", "mister"),
103
+ ("dr", "doctor"),
104
+ ("st", "saint"),
105
+ ("co", "company"),
106
+ ("jr", "junior"),
107
+ ("maj", "major"),
108
+ ("gen", "general"),
109
+ ("drs", "doctors"),
110
+ ("rev", "reverend"),
111
+ ("lt", "lieutenant"),
112
+ ("hon", "honorable"),
113
+ ("sgt", "sergeant"),
114
+ ("capt", "captain"),
115
+ ("esq", "esquire"),
116
+ ("ltd", "limited"),
117
+ ("col", "colonel"),
118
+ ("ft", "fort"),
119
+ ]
120
+ ]
121
+
122
+
123
+ def expand_abbreviations(text):
124
+ for regex, replacement in _abbreviations:
125
+ text = re.sub(regex, replacement, text)
126
+ return text
127
+
128
+
129
+ def expand_numbers(text):
130
+ return normalize_numbers(text)
131
+
132
+
133
+ def lowercase(text):
134
+ return text.lower()
135
+
136
+
137
+ def collapse_whitespace(text):
138
+ return re.sub(_whitespace_re, " ", text)
139
+
140
+
141
+ def convert_to_ascii(text):
142
+ return unidecode(text)
143
+
144
+
145
+ def basic_cleaners(text):
146
+ """Basic pipeline that lowercases and collapses whitespace without transliteration."""
147
+ text = lowercase(text)
148
+ text = collapse_whitespace(text)
149
+ return text
150
+
151
+
152
+ def transliteration_cleaners(text):
153
+ """Pipeline for non-English text that transliterates to ASCII."""
154
+ text = convert_to_ascii(text)
155
+ text = lowercase(text)
156
+ text = collapse_whitespace(text)
157
+ return text
158
+
159
+
160
+ def english_cleaners(text):
161
+ """Pipeline for English text, including number and abbreviation expansion."""
162
+ text = convert_to_ascii(text)
163
+ text = lowercase(text)
164
+ text = expand_numbers(text)
165
+ text = expand_abbreviations(text)
166
+ text = collapse_whitespace(text)
167
+ return text
168
+
169
+ def read_lexicon(lex_path):
170
+ lexicon = {}
171
+ with open(lex_path) as f:
172
+ for line in f:
173
+ temp = re.split(r"\s+", line.strip("\n"))
174
+ word = temp[0]
175
+ phones = temp[1:]
176
+ if word not in lexicon:
177
+ lexicon[word] = phones
178
+ return lexicon
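
A usage sketch for the cleaners above; the expected output reflects the pipeline order (ASCII fold, lowercase, number expansion, abbreviation expansion, whitespace collapse) and should print roughly:

```python
print(english_cleaners("Dr. Smith paid $2.50 on the 3rd."))
# -> 'doctor smith paid two dollars, fifty cents on the third.'
```
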
lemas_tts/infer/text_norm/gp2py.py ADDED
@@ -0,0 +1,148 @@
1
+ import argparse
2
+ import copy
3
+ import os
4
+ from typing import List
5
+
6
+ import jieba
7
+ import pypinyin
8
+
9
+ SPECIAL_NOTES = '。?!?!.;;:,,:'
10
+
11
+
12
+ def read_vocab(file: os.PathLike) -> List[str]:
13
+ with open(file) as f:
14
+ vocab = f.read().split('\n')
15
+ vocab = [v for v in vocab if len(v) > 0 and v != '\n']
16
+ return vocab
17
+
18
+
19
+ class TextNormal:
20
+ def __init__(self,
21
+ gp_vocab_file: os.PathLike,
22
+ py_vocab_file: os.PathLike,
23
+ add_sp1=False,
24
+ fix_er=False,
25
+ add_sil=True):
26
+ if gp_vocab_file is not None:
27
+ self.gp_vocab = read_vocab(gp_vocab_file)
28
+ if py_vocab_file is not None:
29
+ self.py_vocab = read_vocab(py_vocab_file)
30
+ self.in_py_vocab = dict([(p, True) for p in self.py_vocab])
31
+ self.add_sp1 = add_sp1
32
+ self.add_sil = add_sil
33
+ self.fix_er = fix_er
34
+
35
+ # gp2idx = dict([(c, i) for i, c in enumerate(self.gp_vocab)])
36
+ # idx2gp = dict([(i, c) for i, c in enumerate(self.gp_vocab)])
37
+
38
+ def _split2sent(self, text):
39
+ new_sub = [text]
40
+ while True:
41
+ sub = copy.deepcopy(new_sub)
42
+ new_sub = []
43
+ for s in sub:
44
+ sp = False
45
+ for t in SPECIAL_NOTES:
46
+ if t in s:
47
+ new_sub += s.split(t)
48
+ sp = True
49
+ break
50
+
51
+ if not sp and len(s) > 0:
52
+ new_sub += [s]
53
+ if len(new_sub) == len(sub):
54
+ break
55
+ tokens = [a for a in text if a in SPECIAL_NOTES]
56
+
57
+ return new_sub, tokens
58
+
59
+ def _correct_tone3(self, pys: List[str]) -> List[str]:
60
+ """Fix the continuous tone3 pronunciation problem"""
61
+ for i in range(2, len(pys)):
62
+ if pys[i][-1] == '3' and pys[i - 1][-1] == '3' and pys[i - 2][-1] == '3':
63
+ pys[i - 1] = pys[i - 1][:-1] + '2' # change the middle one
64
+ for i in range(1, len(pys)):
65
+ if pys[i][-1] == '3':
66
+ if pys[i - 1][-1] == '3':
67
+ pys[i - 1] = pys[i - 1][:-1] + '2'
68
+ return pys
69
+
70
+ def _correct_tone4(self, pys: List[str]) -> List[str]:
71
+ """Fixed the problem of pronouncing 不 bu2 yao4 / bu4 neng2"""
72
+ for i in range(len(pys) - 1):
73
+ if pys[i] == 'bu4':
74
+ if pys[i + 1][-1] == '4':
75
+ pys[i] = 'bu2'
76
+ return pys
77
+
78
+ def _replace_with_sp(self, pys: List[str]) -> List[str]:
79
+ for i, p in enumerate(pys):
80
+ if p in ',,、':
81
+ pys[i] = 'sp1'
82
+ return pys
83
+
84
+ def _correct_tone5(self, pys: List[str]) -> List[str]:
85
+ for i in range(len(pys)):
86
+ if pys[i][-1] not in '1234':
87
+ pys[i] += '5'
88
+ return pys
89
+
90
+ def gp2py(self, gp_text: str) -> List[str]:
91
+
92
+ gp_sent_list, tokens = self._split2sent(gp_text)
93
+ py_sent_list = []
94
+ for sent in gp_sent_list:
95
+ pys = []
96
+ for words in list(jieba.cut(sent)):
97
+ py = pypinyin.pinyin(words, pypinyin.TONE3)
98
+ py = [p[0] for p in py]
99
+ pys += py
100
+ if self.add_sp1:
101
+ pys = self._replace_with_sp(pys)
102
+ pys = self._correct_tone3(pys)
103
+ pys = self._correct_tone4(pys)
104
+ pys = self._correct_tone5(pys)
105
+ if self.add_sil:
106
+ py_sent_list += [' '.join(['sil'] + pys + ['sil'])]
107
+ else:
108
+ py_sent_list += [' '.join(pys)]
109
+
110
+ if self.add_sil:
111
+ gp_sent_list = ['sil ' + ' '.join(list(gp)) + ' sil' for gp in gp_sent_list]
112
+ else:
113
+ gp_sent_list = [' '.join(list(gp)) for gp in gp_sent_list]
114
+
115
+ if self.fix_er:
116
+ new_py_sent_list = []
117
+ for py, gp in zip(py_sent_list, gp_sent_list):
118
+ py = self._convert_er2(py, gp)
119
+ new_py_sent_list += [py]
120
+ py_sent_list = new_py_sent_list
121
+ print(new_py_sent_list)
122
+
123
+ return py_sent_list, gp_sent_list
124
+
125
+ def _convert_er2(self, py, gp):
126
+ py2hz = dict([(p, h) for p, h in zip(py.split(), gp.split())])
127
+ py_list = py.split()
128
+ for i, p in enumerate(py_list):
129
+ if (p == 'er2' and py2hz[p] == '儿' and i > 1 and len(py_list[i - 1]) > 2 and py_list[i - 1][-1] in '1234'):
130
+
131
+ py_er = py_list[i - 1][:-1] + 'r' + py_list[i - 1][-1]
132
+
133
+ if self.in_py_vocab.get(py_er, False): # must in vocab
134
+ py_list[i - 1] = py_er
135
+ py_list[i] = 'r'
136
+ py = ' '.join(py_list)
137
+ return py
138
+
139
+
140
+ if __name__ == '__main__':
141
+ parser = argparse.ArgumentParser()
142
+ parser.add_argument('-t', '--text', type=str)
143
+ args = parser.parse_args()
144
+ text = args.text
145
+ tn = TextNormal('gp.vocab', 'py.vocab', add_sp1=True, fix_er=True)
146
+ py_list, gp_list = tn.gp2py(text)
147
+ for py, gp in zip(py_list, gp_list):
148
+ print(py + '|' + gp)
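
The third-tone handling in `_correct_tone3` is easiest to see in isolation. This standalone copy of the rule shows the two passes (the middle syllable of a 3-3-3 run first, then remaining adjacent 3-3 pairs):

```python
def correct_tone3(pys):
    pys = list(pys)
    for i in range(2, len(pys)):  # pass 1: middle syllable of a 3-3-3 run
        if pys[i][-1] == '3' and pys[i - 1][-1] == '3' and pys[i - 2][-1] == '3':
            pys[i - 1] = pys[i - 1][:-1] + '2'
    for i in range(1, len(pys)):  # pass 2: remaining adjacent 3-3 pairs
        if pys[i][-1] == '3' and pys[i - 1][-1] == '3':
            pys[i - 1] = pys[i - 1][:-1] + '2'
    return pys

print(correct_tone3(['ni3', 'hao3']))             # ['ni2', 'hao3']
print(correct_tone3(['zhan3', 'lan3', 'guan3']))  # ['zhan3', 'lan2', 'guan3']
```
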
lemas_tts/infer/text_norm/id_tn.py ADDED
@@ -0,0 +1,275 @@
1
+ # Indonesian TTS Text Normalization for YouTube subtitles
2
+ # Requirements: pip install num2words
3
+ import re
4
+ from num2words import num2words
5
+
6
+ # --- small slang map (expandable) ---
7
+ SLANG_MAP = {
8
+ "gpp": "nggak apa-apa",
9
+ "gak": "nggak", "ga": "nggak", "gk": "nggak",
10
+ "sy": "saya", "sya": "saya",
11
+ "km": "kamu",
12
+ "tp": "tapi", "tpi": "tapi",
13
+ "jd": "jadi",
14
+ "bgt": "banget",
15
+ "blm": "belum",
16
+ "trs": "terus",
17
+ "sm": "sama",
18
+ "wkwk": "wkwk", # keep as-is (laugh token) or strip later
19
+ "wkwkwk": "wkwk"
20
+ }
21
+
22
+ # emoji pattern: removes most emoji blocks
23
+ EMOJI_PATTERN = re.compile(
24
+ "["
25
+ "\U0001F600-\U0001F64F" # emoticons
26
+ "\U0001F300-\U0001F5FF" # symbols & pictographs
27
+ "\U0001F680-\U0001F6FF" # transport & map symbols
28
+ "\U0001F1E0-\U0001F1FF" # flags (iOS)
29
+ "\U00002700-\U000027BF" # dingbats
30
+ "\U000024C2-\U0001F251"
31
+ "]+", flags=re.UNICODE)
32
+
33
+ # units map
34
+ UNITS = {
35
+ "kg": "kilogram","g": "gram","km": "kilometer",
36
+ "m": "meter","cm": "sentimeter","mm": "milimeter",
37
+ "l": "liter"
38
+ }
39
+
40
+ # helper: safe num2words for Indonesian
41
+ def num_to_words_ind(num_str):
42
+ """Convert numeric string to Indonesian words.
43
+ - Handles integers and simple decimals like '1.5' (reads digits after decimal).
44
+ - Removes grouping dots in Indonesian numbers (e.g. '10.000').
45
+ """
46
+ num_str = num_str.strip()
47
+ # remove thousand separators commonly used in Indonesian (dot)
48
+ # but if decimal point (like '1,5' or '1.5'), assume '.' is decimal point (we expect '.' used)
49
+ # We'll treat commas as thousand separators too if no decimal comma present.
50
+ if re.match(r'^\d+[.,]\d+$', num_str):
51
+ # decimal number: normalize to use '.' then split
52
+ s = num_str.replace(',', '.')
53
+ left, right = s.split('.', 1)
54
+ try:
55
+ left_w = num2words(int(left), lang='id')
56
+ except Exception:  # fall back to the raw string if num2words cannot parse it
57
+ left_w = left
58
+ # read each decimal digit separately
59
+ right_w = " ".join(num2words(int(d), lang='id') for d in right if d.isdigit())
60
+ return f"{left_w} koma {right_w}"
61
+ else:
62
+ # remove non-digit separators like dots or commas used as thousand separators
63
+ cleaned = re.sub(r'[.,]', '', num_str)
64
+ try:
65
+ return num2words(int(cleaned), lang='id')
66
+ except Exception:  # fall back to the raw string if num2words cannot parse it
67
+ return num_str
68
+
+# helper: per-digit reader for phone numbers (default)
+def read_digits_per_digit(number_str, prefix_plus=False):
+    digits = re.findall(r'\d', number_str)
+    words = " ".join(num2words(int(d), lang='id') for d in digits)
+    if prefix_plus:
+        return "plus " + words
+    return words
+
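
Digits are read one by one regardless of separators; for example, under the same num2words assumption as above:

    print(read_digits_per_digit("0812-3456"))
    # 'nol delapan satu dua tiga empat lima enam'
    print(read_digits_per_digit("+62 812", prefix_plus=True))
    # 'plus enam dua delapan satu dua'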
+# noise removal rule for tokens like 'yyy6yy' or other long mixed garbage:
+def is_noise_token(tok):
+    # remove tokens that:
+    # - have length >= 4 and contain at least one digit and at least one letter
+    #   (typical ASR/keyboard noise)
+    # - or consist of a single character repeated >= 4 times (e.g. 'aaaa';
+    #   punctuation runs like '!!!!!!' are handled earlier)
+    if len(tok) < 4:
+        return False
+    if re.search(r'[A-Za-z]', tok) and re.search(r'\d', tok):
+        return True
+    if re.fullmatch(r'(.)\1{3,}', tok):  # same char repeated >= 4 times
+        return True
+    return False
+
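
A quick truth table for the rule:

    print(is_noise_token("yyy6yy"))  # True  -- letters mixed with digits, length >= 4
    print(is_noise_token("aaaa"))    # True  -- one character repeated four times
    print(is_noise_token("wkwk"))    # False -- alternating characters, no digit
    print(is_noise_token("ga"))      # False -- shorter than four characters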
+# --- New: punctuation normalization helper ---
+def punctuation_normalize(text):
+    """
+    - Replace all punctuation except . , ! ? with a comma
+    - Collapse runs of commas into a single comma
+    - Strip leading commas and ellipses
+    - Normalize the spacing after commas
+    """
+    # replace brackets, quotes, colons, semicolons, dashes, slashes, ellipses, etc. with commas
+    text = re.sub(r'[:;()\[\]{}"“”«»…—–/\\]', ',', text)
+    # collapse multiple commas into one
+    text = re.sub(r',+', ',', text)
+    # strip leading commas and ellipses
+    text = re.sub(r'^(,|\.\.\.|…)+\s*', '', text)
+    # normalize spacing around commas
+    text = re.sub(r'\s*,\s*', ', ', text)
+    # collapse extra whitespace
+    text = re.sub(r'\s+', ' ', text).strip()
+    return text
+
+
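
Tracing the helper on small inputs shows the intended effect (the stray space left before '!' is tightened later by the final cleanup in normalize_id_tts):

    print(punctuation_normalize('kata "penting": ya'))  # 'kata, penting, ya'
    print(punctuation_normalize('Halo (dunia)!'))       # 'Halo, dunia, !'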
+def normalize_id_tts(text):
+    """
+    Main normalization pipeline tailored for:
+    - Indonesian YouTube subtitles (mostly ASR/MT)
+    - TTS frontend requirements:
+        * Remove emojis
+        * Keep . , ! ? as sentence/phrase delimiters
+        * Replace other punctuation with a comma
+        * Expand numbers, percents, currency, units, times, dates
+        * Remove keyboard noise like 'yyy6yy'
+        * Keep English words as-is
+        * Keep repeated words (do not collapse)
+    """
+    if not text:
+        return text
+
+    # 1) Normalize whitespace and trim
+    text = text.strip()
+    text = re.sub(r'\s+', ' ', text)
+
+    # 2) Remove emojis
+    text = EMOJI_PATTERN.sub('', text)
+
+    # 3) Protect times and dates with placeholders, then normalize punctuation
+    #    (replaces the original PUNCT_TO_COMMA pass). Protecting first keeps the
+    #    ':' in '09:30' and the '/' in '2025/11/28' from being rewritten to commas.
+    text = re.sub(r'(\d{1,2}):(\d{2})', lambda m: f"__TIME_{m.group(1)}_{m.group(2)}__", text)
+    text = re.sub(r'(\d{1,4})[\/-](\d{1,2})[\/-](\d{1,4})', lambda m: f"__DATE_{m.group(1)}_{m.group(2)}_{m.group(3)}__", text)
+    text = punctuation_normalize(text)
+
+    # restore the time/date markers
+    text = re.sub(r'__TIME_(\d{1,2})_(\d{2})__', lambda m: f"{m.group(1)}:{m.group(2)}", text)
+    text = re.sub(r'__DATE_(\d{1,4})_(\d{1,2})_(\d{1,4})__', lambda m: f"{m.group(1)}/{m.group(2)}/{m.group(3)}", text)
+
+    # 4) Tokenize loosely by spaces and punctuation
+    tokens = re.split(r'(\s+|[,.!?])', text)  # keep delimiters
+
+    out_tokens = []
+    for tok in tokens:
+        if not tok or tok.isspace():
+            out_tokens.append(tok)
+            continue
+
+        # keep punctuation .,!? as-is
+        if tok in ['.', ',', '!', '?']:
+            out_tokens.append(tok)
+            continue
+
+        # remove any remaining emojis or control chars
+        if EMOJI_PATTERN.search(tok):
+            continue
+
+        # slang normalization
+        lower_tok = tok.lower()
+        if lower_tok in SLANG_MAP:
+            out_tokens.append(SLANG_MAP[lower_tok])
+            continue
+
+        # remove noise tokens
+        if is_noise_token(tok):
+            continue
+
+        # currency: Rp 10.000 or rp10.000
+        m = re.match(r'^(Rp|rp)\s*([0-9\.,]+)$', tok)
+        if m:
+            num = m.group(2)
+            cleaned = re.sub(r'[.,]', '', num)
+            out_tokens.append(f"{num_to_words_ind(cleaned)} rupiah")
+            continue
+
+        # percent like 30%
+        m = re.match(r'^(\d+)%$', tok)
+        if m:
+            out_tokens.append(f"{num_to_words_ind(m.group(1))} persen")
+            continue
+
+        # phone numbers +62..., 0812...
+        m = re.match(r'^\+?\d[\d\-\s]{6,}\d$', tok)
+        if m:
+            prefix_plus = tok.startswith('+')
+            out_tokens.append(read_digits_per_digit(tok, prefix_plus=prefix_plus))
+            continue
+
+        # time hh:mm
+        m = re.match(r'^(\d{1,2}):(\d{2})$', tok)
+        if m:
+            h, mi = m.group(1), m.group(2)
+            h_w = num_to_words_ind(h.lstrip('0') or '0')
+            mi_w = num_to_words_ind(mi.lstrip('0') or '0')
+            out_tokens.append(f"pukul {h_w} lewat {mi_w} menit")
+            continue
+
+        # date yyyy/mm/dd or dd/mm/yyyy
+        m = re.match(r'^(\d{1,4})\/(\d{1,2})\/(\d{1,4})$', tok)
+        if m:
+            a, b, c = m.group(1), m.group(2).zfill(2), m.group(3)
+            if len(a) == 4:
+                year, month, day = a, b, c
+            elif len(c) == 4:
+                day, month, year = a, b, c
+            else:
+                day, month, year = a, b, c
+            MONTHS = {
+                "01": "Januari", "02": "Februari", "03": "Maret", "04": "April",
+                "05": "Mei", "06": "Juni", "07": "Juli", "08": "Agustus",
+                "09": "September", "10": "Oktober", "11": "November", "12": "Desember"
+            }
+            day_w = num_to_words_ind(day.lstrip('0') or '0')
+            year_w = num_to_words_ind(year)
+            month_name = MONTHS.get(month, month)
+            out_tokens.append(f"{day_w} {month_name} {year_w}")
+            continue
+
+        # units like 30kg
+        m = re.match(r'^(\d+)\s*(kg|g|km|m|cm|mm|l)$', tok, flags=re.I)
+        if m:
+            num, unit = m.group(1), m.group(2).lower()
+            unit_word = UNITS.get(unit, unit)
+            out_tokens.append(f"{num_to_words_ind(num)} {unit_word}")
+            continue
+
+        # plain integers
+        if re.fullmatch(r'\d+', tok):
+            out_tokens.append(num_to_words_ind(tok))
+            continue
+
+        # numbers with separators
+        if re.fullmatch(r'[\d\.,]+', tok) and re.search(r'[.,]', tok):
+            out_tokens.append(num_to_words_ind(tok))
+            continue
+
+        # keep English/as-is tokens
+        out_tokens.append(tok)
+
+    normalized = "".join(out_tokens)
+
+    # final cleanup: spacing around punctuation
+    normalized = re.sub(r'\s+,', ',', normalized)
+    normalized = re.sub(r',\s*', ', ', normalized)
+    normalized = re.sub(r'\s+\.', '.', normalized)
+    normalized = re.sub(r'\s+!', '!', normalized)
+    normalized = re.sub(r'\s+\?', '?', normalized)
+    normalized = re.sub(r'\s+', ' ', normalized).strip()
+
+    # comment out the next line if you do not want everything lowercased
+    normalized = normalized.lower()
+
+    return normalized
+
+# -------------------------
+# Example usage and tests
+# -------------------------
+if __name__ == "__main__":
+    examples = [
+        "kita cek Project nadi PHP pemberi harapan palsu tuh yyy6yy 46 ini ini usernya ini di bagian user",
+        "Harga Rp 10.000, diskon 30%! Buka jam 09:30 (hari 2025/11/28).",
+        "Call +62 812-3456-7890 sekarang!",
+        "angka kecil 3.14 dan 1,234 serta 1000",
+        "[musik]",
+        "... atau mungkin juga jumlah anggota keluarga mereka."
+    ]
+    for ex in examples:
+        print("IN: ", ex)
+        print("OUT:", normalize_id_tts(ex))
+        print("-" * 60)
lemas_tts/infer/text_norm/jieba_dict.txt ADDED
The diff for this file is too large to render. See raw diff
 
lemas_tts/infer/text_norm/pinyin-lexicon-r.txt ADDED
@@ -0,0 +1,4120 @@
+a1 a1
+a2 a2
+a3 a3
+a4 a4
+a5 a5
+ai1 ai1
+ai2 ai2
+ai3 ai3
+ai4 ai4
+ai5 ai5
+an1 an1
+an2 an2
+an3 an3
+an4 an4
+an5 an5
+ang1 ang1
+ang2 ang2
+ang3 ang3
+ang4 ang4
+ang5 ang5
+ao1 ao1
+ao2 ao2
+ao3 ao3
+ao4 ao4
+ao5 ao5
+ba1 b a1
+ba2 b a2
+ba3 b a3
+ba4 b a4
+ba5 b a5
The remaining entries continue this two-column pattern — a tone-numbered pinyin syllable mapped to its initial and final phonemes, covering tones 1–5 for every syllable (`bai1 b ai1` … `zuo5 z uo5`), special spellings such as `chi1 ch iii1`, `ju1 j v1`, and `r1 er1`, and erhua variants that append an `rr` phoneme (`ar1 a1 rr` … `diur2 d iou2 rr`, where the rendered view cuts off). The diff for this file is too large to render. See raw diff.
2418
+ diur3 d iou3 rr
2419
+ diur4 d iou4 rr
2420
+ diur5 d iou5 rr
2421
+ dongr1 d ong1 rr
2422
+ dongr2 d ong2 rr
2423
+ dongr3 d ong3 rr
2424
+ dongr4 d ong4 rr
2425
+ dongr5 d ong5 rr
2426
+ dour1 d ou1 rr
2427
+ dour2 d ou2 rr
2428
+ dour3 d ou3 rr
2429
+ dour4 d ou4 rr
2430
+ dour5 d ou5 rr
2431
+ dur1 d u1 rr
2432
+ dur2 d u2 rr
2433
+ dur3 d u3 rr
2434
+ dur4 d u4 rr
2435
+ dur5 d u5 rr
2436
+ duanr1 d uan1 rr
2437
+ duanr2 d uan2 rr
2438
+ duanr3 d uan3 rr
2439
+ duanr4 d uan4 rr
2440
+ duanr5 d uan5 rr
2441
+ duir1 d uei1 rr
2442
+ duir2 d uei2 rr
2443
+ duir3 d uei3 rr
2444
+ duir4 d uei4 rr
2445
+ duir5 d uei5 rr
2446
+ dunr1 d uen1 rr
2447
+ dunr2 d uen2 rr
2448
+ dunr3 d uen3 rr
2449
+ dunr4 d uen4 rr
2450
+ dunr5 d uen5 rr
2451
+ duor1 d uo1 rr
2452
+ duor2 d uo2 rr
2453
+ duor3 d uo3 rr
2454
+ duor4 d uo4 rr
2455
+ duor5 d uo5 rr
2456
+ er1 e1 rr
2457
+ er2 e2 rr
2458
+ er3 e3 rr
2459
+ er4 e4 rr
2460
+ er5 e5 rr
2461
+ eir1 ei1 rr
2462
+ eir2 ei2 rr
2463
+ eir3 ei3 rr
2464
+ eir4 ei4 rr
2465
+ eir5 ei5 rr
2466
+ enr1 en1 rr
2467
+ enr2 en2 rr
2468
+ enr3 en3 rr
2469
+ enr4 en4 rr
2470
+ enr5 en5 rr
2471
+ engr1 eng1 rr
2472
+ engr2 eng2 rr
2473
+ engr3 eng3 rr
2474
+ engr4 eng4 rr
2475
+ engr5 eng5 rr
2476
+ far1 f a1 rr
2477
+ far2 f a2 rr
2478
+ far3 f a3 rr
2479
+ far4 f a4 rr
2480
+ far5 f a5 rr
2481
+ fanr1 f an1 rr
2482
+ fanr2 f an2 rr
2483
+ fanr3 f an3 rr
2484
+ fanr4 f an4 rr
2485
+ fanr5 f an5 rr
2486
+ fangr1 f ang1 rr
2487
+ fangr2 f ang2 rr
2488
+ fangr3 f ang3 rr
2489
+ fangr4 f ang4 rr
2490
+ fangr5 f ang5 rr
2491
+ feir1 f ei1 rr
2492
+ feir2 f ei2 rr
2493
+ feir3 f ei3 rr
2494
+ feir4 f ei4 rr
2495
+ feir5 f ei5 rr
2496
+ fenr1 f en1 rr
2497
+ fenr2 f en2 rr
2498
+ fenr3 f en3 rr
2499
+ fenr4 f en4 rr
2500
+ fenr5 f en5 rr
2501
+ fengr1 f eng1 rr
2502
+ fengr2 f eng2 rr
2503
+ fengr3 f eng3 rr
2504
+ fengr4 f eng4 rr
2505
+ fengr5 f eng5 rr
2506
+ for1 f o1 rr
2507
+ for2 f o2 rr
2508
+ for3 f o3 rr
2509
+ for4 f o4 rr
2510
+ for5 f o5 rr
2511
+ four1 f ou1 rr
2512
+ four2 f ou2 rr
2513
+ four3 f ou3 rr
2514
+ four4 f ou4 rr
2515
+ four5 f ou5 rr
2516
+ fur1 f u1 rr
2517
+ fur2 f u2 rr
2518
+ fur3 f u3 rr
2519
+ fur4 f u4 rr
2520
+ fur5 f u5 rr
2521
+ gar1 g a1 rr
2522
+ gar2 g a2 rr
2523
+ gar3 g a3 rr
2524
+ gar4 g a4 rr
2525
+ gar5 g a5 rr
2526
+ gair1 g ai1 rr
2527
+ gair2 g ai2 rr
2528
+ gair3 g ai3 rr
2529
+ gair4 g ai4 rr
2530
+ gair5 g ai5 rr
2531
+ ganr1 g an1 rr
2532
+ ganr2 g an2 rr
2533
+ ganr3 g an3 rr
2534
+ ganr4 g an4 rr
2535
+ ganr5 g an5 rr
2536
+ gangr1 g ang1 rr
2537
+ gangr2 g ang2 rr
2538
+ gangr3 g ang3 rr
2539
+ gangr4 g ang4 rr
2540
+ gangr5 g ang5 rr
2541
+ gaor1 g ao1 rr
2542
+ gaor2 g ao2 rr
2543
+ gaor3 g ao3 rr
2544
+ gaor4 g ao4 rr
2545
+ gaor5 g ao5 rr
2546
+ ger1 g e1 rr
2547
+ ger2 g e2 rr
2548
+ ger3 g e3 rr
2549
+ ger4 g e4 rr
2550
+ ger5 g e5 rr
2551
+ geir1 g ei1 rr
2552
+ geir2 g ei2 rr
2553
+ geir3 g ei3 rr
2554
+ geir4 g ei4 rr
2555
+ geir5 g ei5 rr
2556
+ genr1 g en1 rr
2557
+ genr2 g en2 rr
2558
+ genr3 g en3 rr
2559
+ genr4 g en4 rr
2560
+ genr5 g en5 rr
2561
+ gengr1 g eng1 rr
2562
+ gengr2 g eng2 rr
2563
+ gengr3 g eng3 rr
2564
+ gengr4 g eng4 rr
2565
+ gengr5 g eng5 rr
2566
+ gongr1 g ong1 rr
2567
+ gongr2 g ong2 rr
2568
+ gongr3 g ong3 rr
2569
+ gongr4 g ong4 rr
2570
+ gongr5 g ong5 rr
2571
+ gour1 g ou1 rr
2572
+ gour2 g ou2 rr
2573
+ gour3 g ou3 rr
2574
+ gour4 g ou4 rr
2575
+ gour5 g ou5 rr
2576
+ gur1 g u1 rr
2577
+ gur2 g u2 rr
2578
+ gur3 g u3 rr
2579
+ gur4 g u4 rr
2580
+ gur5 g u5 rr
2581
+ guar1 g ua1 rr
2582
+ guar2 g ua2 rr
2583
+ guar3 g ua3 rr
2584
+ guar4 g ua4 rr
2585
+ guar5 g ua5 rr
2586
+ guair1 g uai1 rr
2587
+ guair2 g uai2 rr
2588
+ guair3 g uai3 rr
2589
+ guair4 g uai4 rr
2590
+ guair5 g uai5 rr
2591
+ guanr1 g uan1 rr
2592
+ guanr2 g uan2 rr
2593
+ guanr3 g uan3 rr
2594
+ guanr4 g uan4 rr
2595
+ guanr5 g uan5 rr
2596
+ guangr1 g uang1 rr
2597
+ guangr2 g uang2 rr
2598
+ guangr3 g uang3 rr
2599
+ guangr4 g uang4 rr
2600
+ guangr5 g uang5 rr
2601
+ guir1 g uei1 rr
2602
+ guir2 g uei2 rr
2603
+ guir3 g uei3 rr
2604
+ guir4 g uei4 rr
2605
+ guir5 g uei5 rr
2606
+ gunr1 g uen1 rr
2607
+ gunr2 g uen2 rr
2608
+ gunr3 g uen3 rr
2609
+ gunr4 g uen4 rr
2610
+ gunr5 g uen5 rr
2611
+ guor1 g uo1 rr
2612
+ guor2 g uo2 rr
2613
+ guor3 g uo3 rr
2614
+ guor4 g uo4 rr
2615
+ guor5 g uo5 rr
2616
+ har1 h a1 rr
2617
+ har2 h a2 rr
2618
+ har3 h a3 rr
2619
+ har4 h a4 rr
2620
+ har5 h a5 rr
2621
+ hair1 h ai1 rr
2622
+ hair2 h ai2 rr
2623
+ hair3 h ai3 rr
2624
+ hair4 h ai4 rr
2625
+ hair5 h ai5 rr
2626
+ hanr1 h an1 rr
2627
+ hanr2 h an2 rr
2628
+ hanr3 h an3 rr
2629
+ hanr4 h an4 rr
2630
+ hanr5 h an5 rr
2631
+ hangr1 h ang1 rr
2632
+ hangr2 h ang2 rr
2633
+ hangr3 h ang3 rr
2634
+ hangr4 h ang4 rr
2635
+ hangr5 h ang5 rr
2636
+ haor1 h ao1 rr
2637
+ haor2 h ao2 rr
2638
+ haor3 h ao3 rr
2639
+ haor4 h ao4 rr
2640
+ haor5 h ao5 rr
2641
+ her1 h e1 rr
2642
+ her2 h e2 rr
2643
+ her3 h e3 rr
2644
+ her4 h e4 rr
2645
+ her5 h e5 rr
2646
+ heir1 h ei1 rr
2647
+ heir2 h ei2 rr
2648
+ heir3 h ei3 rr
2649
+ heir4 h ei4 rr
2650
+ heir5 h ei5 rr
2651
+ henr1 h en1 rr
2652
+ henr2 h en2 rr
2653
+ henr3 h en3 rr
2654
+ henr4 h en4 rr
2655
+ henr5 h en5 rr
2656
+ hengr1 h eng1 rr
2657
+ hengr2 h eng2 rr
2658
+ hengr3 h eng3 rr
2659
+ hengr4 h eng4 rr
2660
+ hengr5 h eng5 rr
2661
+ hongr1 h ong1 rr
2662
+ hongr2 h ong2 rr
2663
+ hongr3 h ong3 rr
2664
+ hongr4 h ong4 rr
2665
+ hongr5 h ong5 rr
2666
+ hour1 h ou1 rr
2667
+ hour2 h ou2 rr
2668
+ hour3 h ou3 rr
2669
+ hour4 h ou4 rr
2670
+ hour5 h ou5 rr
2671
+ hur1 h u1 rr
2672
+ hur2 h u2 rr
2673
+ hur3 h u3 rr
2674
+ hur4 h u4 rr
2675
+ hur5 h u5 rr
2676
+ huar1 h ua1 rr
2677
+ huar2 h ua2 rr
2678
+ huar3 h ua3 rr
2679
+ huar4 h ua4 rr
2680
+ huar5 h ua5 rr
2681
+ huair1 h uai1 rr
2682
+ huair2 h uai2 rr
2683
+ huair3 h uai3 rr
2684
+ huair4 h uai4 rr
2685
+ huair5 h uai5 rr
2686
+ huanr1 h uan1 rr
2687
+ huanr2 h uan2 rr
2688
+ huanr3 h uan3 rr
2689
+ huanr4 h uan4 rr
2690
+ huanr5 h uan5 rr
2691
+ huangr1 h uang1 rr
2692
+ huangr2 h uang2 rr
2693
+ huangr3 h uang3 rr
2694
+ huangr4 h uang4 rr
2695
+ huangr5 h uang5 rr
2696
+ huir1 h uei1 rr
2697
+ huir2 h uei2 rr
2698
+ huir3 h uei3 rr
2699
+ huir4 h uei4 rr
2700
+ huir5 h uei5 rr
2701
+ hunr1 h uen1 rr
2702
+ hunr2 h uen2 rr
2703
+ hunr3 h uen3 rr
2704
+ hunr4 h uen4 rr
2705
+ hunr5 h uen5 rr
2706
+ huor1 h uo1 rr
2707
+ huor2 h uo2 rr
2708
+ huor3 h uo3 rr
2709
+ huor4 h uo4 rr
2710
+ huor5 h uo5 rr
2711
+ jir1 j i1 rr
2712
+ jir2 j i2 rr
2713
+ jir3 j i3 rr
2714
+ jir4 j i4 rr
2715
+ jir5 j i5 rr
2716
+ jiar1 j ia1 rr
2717
+ jiar2 j ia2 rr
2718
+ jiar3 j ia3 rr
2719
+ jiar4 j ia4 rr
2720
+ jiar5 j ia5 rr
2721
+ jianr1 j ian1 rr
2722
+ jianr2 j ian2 rr
2723
+ jianr3 j ian3 rr
2724
+ jianr4 j ian4 rr
2725
+ jianr5 j ian5 rr
2726
+ jiangr1 j iang1 rr
2727
+ jiangr2 j iang2 rr
2728
+ jiangr3 j iang3 rr
2729
+ jiangr4 j iang4 rr
2730
+ jiangr5 j iang5 rr
2731
+ jiaor1 j iao1 rr
2732
+ jiaor2 j iao2 rr
2733
+ jiaor3 j iao3 rr
2734
+ jiaor4 j iao4 rr
2735
+ jiaor5 j iao5 rr
2736
+ jier1 j ie1 rr
2737
+ jier2 j ie2 rr
2738
+ jier3 j ie3 rr
2739
+ jier4 j ie4 rr
2740
+ jier5 j ie5 rr
2741
+ jinr1 j in1 rr
2742
+ jinr2 j in2 rr
2743
+ jinr3 j in3 rr
2744
+ jinr4 j in4 rr
2745
+ jinr5 j in5 rr
2746
+ jingr1 j ing1 rr
2747
+ jingr2 j ing2 rr
2748
+ jingr3 j ing3 rr
2749
+ jingr4 j ing4 rr
2750
+ jingr5 j ing5 rr
2751
+ jiongr1 j iong1 rr
2752
+ jiongr2 j iong2 rr
2753
+ jiongr3 j iong3 rr
2754
+ jiongr4 j iong4 rr
2755
+ jiongr5 j iong5 rr
2756
+ jiur1 j iou1 rr
2757
+ jiur2 j iou2 rr
2758
+ jiur3 j iou3 rr
2759
+ jiur4 j iou4 rr
2760
+ jiur5 j iou5 rr
2761
+ jur1 j v1 rr
2762
+ jur2 j v2 rr
2763
+ jur3 j v3 rr
2764
+ jur4 j v4 rr
2765
+ jur5 j v5 rr
2766
+ juanr1 j van1 rr
2767
+ juanr2 j van2 rr
2768
+ juanr3 j van3 rr
2769
+ juanr4 j van4 rr
2770
+ juanr5 j van5 rr
2771
+ juer1 j ve1 rr
2772
+ juer2 j ve2 rr
2773
+ juer3 j ve3 rr
2774
+ juer4 j ve4 rr
2775
+ juer5 j ve5 rr
2776
+ junr1 j vn1 rr
2777
+ junr2 j vn2 rr
2778
+ junr3 j vn3 rr
2779
+ junr4 j vn4 rr
2780
+ junr5 j vn5 rr
2781
+ kar1 k a1 rr
2782
+ kar2 k a2 rr
2783
+ kar3 k a3 rr
2784
+ kar4 k a4 rr
2785
+ kar5 k a5 rr
2786
+ kair1 k ai1 rr
2787
+ kair2 k ai2 rr
2788
+ kair3 k ai3 rr
2789
+ kair4 k ai4 rr
2790
+ kair5 k ai5 rr
2791
+ kanr1 k an1 rr
2792
+ kanr2 k an2 rr
2793
+ kanr3 k an3 rr
2794
+ kanr4 k an4 rr
2795
+ kanr5 k an5 rr
2796
+ kangr1 k ang1 rr
2797
+ kangr2 k ang2 rr
2798
+ kangr3 k ang3 rr
2799
+ kangr4 k ang4 rr
2800
+ kangr5 k ang5 rr
2801
+ kaor1 k ao1 rr
2802
+ kaor2 k ao2 rr
2803
+ kaor3 k ao3 rr
2804
+ kaor4 k ao4 rr
2805
+ kaor5 k ao5 rr
2806
+ ker1 k e1 rr
2807
+ ker2 k e2 rr
2808
+ ker3 k e3 rr
2809
+ ker4 k e4 rr
2810
+ ker5 k e5 rr
2811
+ keir1 k ei1 rr
2812
+ keir2 k ei2 rr
2813
+ keir3 k ei3 rr
2814
+ keir4 k ei4 rr
2815
+ keir5 k ei5 rr
2816
+ kenr1 k en1 rr
2817
+ kenr2 k en2 rr
2818
+ kenr3 k en3 rr
2819
+ kenr4 k en4 rr
2820
+ kenr5 k en5 rr
2821
+ kengr1 k eng1 rr
2822
+ kengr2 k eng2 rr
2823
+ kengr3 k eng3 rr
2824
+ kengr4 k eng4 rr
2825
+ kengr5 k eng5 rr
2826
+ kongr1 k ong1 rr
2827
+ kongr2 k ong2 rr
2828
+ kongr3 k ong3 rr
2829
+ kongr4 k ong4 rr
2830
+ kongr5 k ong5 rr
2831
+ kour1 k ou1 rr
2832
+ kour2 k ou2 rr
2833
+ kour3 k ou3 rr
2834
+ kour4 k ou4 rr
2835
+ kour5 k ou5 rr
2836
+ kur1 k u1 rr
2837
+ kur2 k u2 rr
2838
+ kur3 k u3 rr
2839
+ kur4 k u4 rr
2840
+ kur5 k u5 rr
2841
+ kuar1 k ua1 rr
2842
+ kuar2 k ua2 rr
2843
+ kuar3 k ua3 rr
2844
+ kuar4 k ua4 rr
2845
+ kuar5 k ua5 rr
2846
+ kuair1 k uai1 rr
2847
+ kuair2 k uai2 rr
2848
+ kuair3 k uai3 rr
2849
+ kuair4 k uai4 rr
2850
+ kuair5 k uai5 rr
2851
+ kuanr1 k uan1 rr
2852
+ kuanr2 k uan2 rr
2853
+ kuanr3 k uan3 rr
2854
+ kuanr4 k uan4 rr
2855
+ kuanr5 k uan5 rr
2856
+ kuangr1 k uang1 rr
2857
+ kuangr2 k uang2 rr
2858
+ kuangr3 k uang3 rr
2859
+ kuangr4 k uang4 rr
2860
+ kuangr5 k uang5 rr
2861
+ kuir1 k uei1 rr
2862
+ kuir2 k uei2 rr
2863
+ kuir3 k uei3 rr
2864
+ kuir4 k uei4 rr
2865
+ kuir5 k uei5 rr
2866
+ kunr1 k uen1 rr
2867
+ kunr2 k uen2 rr
2868
+ kunr3 k uen3 rr
2869
+ kunr4 k uen4 rr
2870
+ kunr5 k uen5 rr
2871
+ kuor1 k uo1 rr
2872
+ kuor2 k uo2 rr
2873
+ kuor3 k uo3 rr
2874
+ kuor4 k uo4 rr
2875
+ kuor5 k uo5 rr
2876
+ lar1 l a1 rr
2877
+ lar2 l a2 rr
2878
+ lar3 l a3 rr
2879
+ lar4 l a4 rr
2880
+ lar5 l a5 rr
2881
+ lair1 l ai1 rr
2882
+ lair2 l ai2 rr
2883
+ lair3 l ai3 rr
2884
+ lair4 l ai4 rr
2885
+ lair5 l ai5 rr
2886
+ lanr1 l an1 rr
2887
+ lanr2 l an2 rr
2888
+ lanr3 l an3 rr
2889
+ lanr4 l an4 rr
2890
+ lanr5 l an5 rr
2891
+ langr1 l ang1 rr
2892
+ langr2 l ang2 rr
2893
+ langr3 l ang3 rr
2894
+ langr4 l ang4 rr
2895
+ langr5 l ang5 rr
2896
+ laor1 l ao1 rr
2897
+ laor2 l ao2 rr
2898
+ laor3 l ao3 rr
2899
+ laor4 l ao4 rr
2900
+ laor5 l ao5 rr
2901
+ ler1 l e1 rr
2902
+ ler2 l e2 rr
2903
+ ler3 l e3 rr
2904
+ ler4 l e4 rr
2905
+ ler5 l e5 rr
2906
+ leir1 l ei1 rr
2907
+ leir2 l ei2 rr
2908
+ leir3 l ei3 rr
2909
+ leir4 l ei4 rr
2910
+ leir5 l ei5 rr
2911
+ lengr1 l eng1 rr
2912
+ lengr2 l eng2 rr
2913
+ lengr3 l eng3 rr
2914
+ lengr4 l eng4 rr
2915
+ lengr5 l eng5 rr
2916
+ lir1 l i1 rr
2917
+ lir2 l i2 rr
2918
+ lir3 l i3 rr
2919
+ lir4 l i4 rr
2920
+ lir5 l i5 rr
2921
+ liar1 l ia1 rr
2922
+ liar2 l ia2 rr
2923
+ liar3 l ia3 rr
2924
+ liar4 l ia4 rr
2925
+ liar5 l ia5 rr
2926
+ lianr1 l ian1 rr
2927
+ lianr2 l ian2 rr
2928
+ lianr3 l ian3 rr
2929
+ lianr4 l ian4 rr
2930
+ lianr5 l ian5 rr
2931
+ liangr1 l iang1 rr
2932
+ liangr2 l iang2 rr
2933
+ liangr3 l iang3 rr
2934
+ liangr4 l iang4 rr
2935
+ liangr5 l iang5 rr
2936
+ liaor1 l iao1 rr
2937
+ liaor2 l iao2 rr
2938
+ liaor3 l iao3 rr
2939
+ liaor4 l iao4 rr
2940
+ liaor5 l iao5 rr
2941
+ lier1 l ie1 rr
2942
+ lier2 l ie2 rr
2943
+ lier3 l ie3 rr
2944
+ lier4 l ie4 rr
2945
+ lier5 l ie5 rr
2946
+ linr1 l in1 rr
2947
+ linr2 l in2 rr
2948
+ linr3 l in3 rr
2949
+ linr4 l in4 rr
2950
+ linr5 l in5 rr
2951
+ lingr1 l ing1 rr
2952
+ lingr2 l ing2 rr
2953
+ lingr3 l ing3 rr
2954
+ lingr4 l ing4 rr
2955
+ lingr5 l ing5 rr
2956
+ liur1 l iou1 rr
2957
+ liur2 l iou2 rr
2958
+ liur3 l iou3 rr
2959
+ liur4 l iou4 rr
2960
+ liur5 l iou5 rr
2961
+ lor1 l o1 rr
2962
+ lor2 l o2 rr
2963
+ lor3 l o3 rr
2964
+ lor4 l o4 rr
2965
+ lor5 l o5 rr
2966
+ longr1 l ong1 rr
2967
+ longr2 l ong2 rr
2968
+ longr3 l ong3 rr
2969
+ longr4 l ong4 rr
2970
+ longr5 l ong5 rr
2971
+ lour1 l ou1 rr
2972
+ lour2 l ou2 rr
2973
+ lour3 l ou3 rr
2974
+ lour4 l ou4 rr
2975
+ lour5 l ou5 rr
2976
+ lur1 l u1 rr
2977
+ lur2 l u2 rr
2978
+ lur3 l u3 rr
2979
+ lur4 l u4 rr
2980
+ lur5 l u5 rr
2981
+ luanr1 l uan1 rr
2982
+ luanr2 l uan2 rr
2983
+ luanr3 l uan3 rr
2984
+ luanr4 l uan4 rr
2985
+ luanr5 l uan5 rr
2986
+ luer1 l ve1 rr
2987
+ luer2 l ve2 rr
2988
+ luer3 l ve3 rr
2989
+ luer4 l ve4 rr
2990
+ luer5 l ve5 rr
2991
+ lver1 l ve1 rr
2992
+ lver2 l ve2 rr
2993
+ lver3 l ve3 rr
2994
+ lver4 l ve4 rr
2995
+ lver5 l ve5 rr
2996
+ lunr1 l uen1 rr
2997
+ lunr2 l uen2 rr
2998
+ lunr3 l uen3 rr
2999
+ lunr4 l uen4 rr
3000
+ lunr5 l uen5 rr
3001
+ luor1 l uo1 rr
3002
+ luor2 l uo2 rr
3003
+ luor3 l uo3 rr
3004
+ luor4 l uo4 rr
3005
+ luor5 l uo5 rr
3006
+ lvr1 l v1 rr
3007
+ lvr2 l v2 rr
3008
+ lvr3 l v3 rr
3009
+ lvr4 l v4 rr
3010
+ lvr5 l v5 rr
3011
+ mar1 m a1 rr
3012
+ mar2 m a2 rr
3013
+ mar3 m a3 rr
3014
+ mar4 m a4 rr
3015
+ mar5 m a5 rr
3016
+ mair1 m ai1 rr
3017
+ mair2 m ai2 rr
3018
+ mair3 m ai3 rr
3019
+ mair4 m ai4 rr
3020
+ mair5 m ai5 rr
3021
+ manr1 m an1 rr
3022
+ manr2 m an2 rr
3023
+ manr3 m an3 rr
3024
+ manr4 m an4 rr
3025
+ manr5 m an5 rr
3026
+ mangr1 m ang1 rr
3027
+ mangr2 m ang2 rr
3028
+ mangr3 m ang3 rr
3029
+ mangr4 m ang4 rr
3030
+ mangr5 m ang5 rr
3031
+ maor1 m ao1 rr
3032
+ maor2 m ao2 rr
3033
+ maor3 m ao3 rr
3034
+ maor4 m ao4 rr
3035
+ maor5 m ao5 rr
3036
+ mer1 m e1 rr
3037
+ mer2 m e2 rr
3038
+ mer3 m e3 rr
3039
+ mer4 m e4 rr
3040
+ mer5 m e5 rr
3041
+ meir1 m ei1 rr
3042
+ meir2 m ei2 rr
3043
+ meir3 m ei3 rr
3044
+ meir4 m ei4 rr
3045
+ meir5 m ei5 rr
3046
+ menr1 m en1 rr
3047
+ menr2 m en2 rr
3048
+ menr3 m en3 rr
3049
+ menr4 m en4 rr
3050
+ menr5 m en5 rr
3051
+ mengr1 m eng1 rr
3052
+ mengr2 m eng2 rr
3053
+ mengr3 m eng3 rr
3054
+ mengr4 m eng4 rr
3055
+ mengr5 m eng5 rr
3056
+ mir1 m i1 rr
3057
+ mir2 m i2 rr
3058
+ mir3 m i3 rr
3059
+ mir4 m i4 rr
3060
+ mir5 m i5 rr
3061
+ mianr1 m ian1 rr
3062
+ mianr2 m ian2 rr
3063
+ mianr3 m ian3 rr
3064
+ mianr4 m ian4 rr
3065
+ mianr5 m ian5 rr
3066
+ miaor1 m iao1 rr
3067
+ miaor2 m iao2 rr
3068
+ miaor3 m iao3 rr
3069
+ miaor4 m iao4 rr
3070
+ miaor5 m iao5 rr
3071
+ mier1 m ie1 rr
3072
+ mier2 m ie2 rr
3073
+ mier3 m ie3 rr
3074
+ mier4 m ie4 rr
3075
+ mier5 m ie5 rr
3076
+ minr1 m in1 rr
3077
+ minr2 m in2 rr
3078
+ minr3 m in3 rr
3079
+ minr4 m in4 rr
3080
+ minr5 m in5 rr
3081
+ mingr1 m ing1 rr
3082
+ mingr2 m ing2 rr
3083
+ mingr3 m ing3 rr
3084
+ mingr4 m ing4 rr
3085
+ mingr5 m ing5 rr
3086
+ miur1 m iou1 rr
3087
+ miur2 m iou2 rr
3088
+ miur3 m iou3 rr
3089
+ miur4 m iou4 rr
3090
+ miur5 m iou5 rr
3091
+ mor1 m o1 rr
3092
+ mor2 m o2 rr
3093
+ mor3 m o3 rr
3094
+ mor4 m o4 rr
3095
+ mor5 m o5 rr
3096
+ mour1 m ou1 rr
3097
+ mour2 m ou2 rr
3098
+ mour3 m ou3 rr
3099
+ mour4 m ou4 rr
3100
+ mour5 m ou5 rr
3101
+ mur1 m u1 rr
3102
+ mur2 m u2 rr
3103
+ mur3 m u3 rr
3104
+ mur4 m u4 rr
3105
+ mur5 m u5 rr
3106
+ nar1 n a1 rr
3107
+ nar2 n a2 rr
3108
+ nar3 n a3 rr
3109
+ nar4 n a4 rr
3110
+ nar5 n a5 rr
3111
+ nair1 n ai1 rr
3112
+ nair2 n ai2 rr
3113
+ nair3 n ai3 rr
3114
+ nair4 n ai4 rr
3115
+ nair5 n ai5 rr
3116
+ nanr1 n an1 rr
3117
+ nanr2 n an2 rr
3118
+ nanr3 n an3 rr
3119
+ nanr4 n an4 rr
3120
+ nanr5 n an5 rr
3121
+ nangr1 n ang1 rr
3122
+ nangr2 n ang2 rr
3123
+ nangr3 n ang3 rr
3124
+ nangr4 n ang4 rr
3125
+ nangr5 n ang5 rr
3126
+ naor1 n ao1 rr
3127
+ naor2 n ao2 rr
3128
+ naor3 n ao3 rr
3129
+ naor4 n ao4 rr
3130
+ naor5 n ao5 rr
3131
+ ner1 n e1 rr
3132
+ ner2 n e2 rr
3133
+ ner3 n e3 rr
3134
+ ner4 n e4 rr
3135
+ ner5 n e5 rr
3136
+ neir1 n ei1 rr
3137
+ neir2 n ei2 rr
3138
+ neir3 n ei3 rr
3139
+ neir4 n ei4 rr
3140
+ neir5 n ei5 rr
3141
+ nenr1 n en1 rr
3142
+ nenr2 n en2 rr
3143
+ nenr3 n en3 rr
3144
+ nenr4 n en4 rr
3145
+ nenr5 n en5 rr
3146
+ nengr1 n eng1 rr
3147
+ nengr2 n eng2 rr
3148
+ nengr3 n eng3 rr
3149
+ nengr4 n eng4 rr
3150
+ nengr5 n eng5 rr
3151
+ nir1 n i1 rr
3152
+ nir2 n i2 rr
3153
+ nir3 n i3 rr
3154
+ nir4 n i4 rr
3155
+ nir5 n i5 rr
3156
+ nianr1 n ian1 rr
3157
+ nianr2 n ian2 rr
3158
+ nianr3 n ian3 rr
3159
+ nianr4 n ian4 rr
3160
+ nianr5 n ian5 rr
3161
+ niangr1 n iang1 rr
3162
+ niangr2 n iang2 rr
3163
+ niangr3 n iang3 rr
3164
+ niangr4 n iang4 rr
3165
+ niangr5 n iang5 rr
3166
+ niaor1 n iao1 rr
3167
+ niaor2 n iao2 rr
3168
+ niaor3 n iao3 rr
3169
+ niaor4 n iao4 rr
3170
+ niaor5 n iao5 rr
3171
+ nier1 n ie1 rr
3172
+ nier2 n ie2 rr
3173
+ nier3 n ie3 rr
3174
+ nier4 n ie4 rr
3175
+ nier5 n ie5 rr
3176
+ ninr1 n in1 rr
3177
+ ninr2 n in2 rr
3178
+ ninr3 n in3 rr
3179
+ ninr4 n in4 rr
3180
+ ninr5 n in5 rr
3181
+ ningr1 n ing1 rr
3182
+ ningr2 n ing2 rr
3183
+ ningr3 n ing3 rr
3184
+ ningr4 n ing4 rr
3185
+ ningr5 n ing5 rr
3186
+ niur1 n iou1 rr
3187
+ niur2 n iou2 rr
3188
+ niur3 n iou3 rr
3189
+ niur4 n iou4 rr
3190
+ niur5 n iou5 rr
3191
+ nongr1 n ong1 rr
3192
+ nongr2 n ong2 rr
3193
+ nongr3 n ong3 rr
3194
+ nongr4 n ong4 rr
3195
+ nongr5 n ong5 rr
3196
+ nour1 n ou1 rr
3197
+ nour2 n ou2 rr
3198
+ nour3 n ou3 rr
3199
+ nour4 n ou4 rr
3200
+ nour5 n ou5 rr
3201
+ nur1 n u1 rr
3202
+ nur2 n u2 rr
3203
+ nur3 n u3 rr
3204
+ nur4 n u4 rr
3205
+ nur5 n u5 rr
3206
+ nuanr1 n uan1 rr
3207
+ nuanr2 n uan2 rr
3208
+ nuanr3 n uan3 rr
3209
+ nuanr4 n uan4 rr
3210
+ nuanr5 n uan5 rr
3211
+ nuer1 n ve1 rr
3212
+ nuer2 n ve2 rr
3213
+ nuer3 n ve3 rr
3214
+ nuer4 n ve4 rr
3215
+ nuer5 n ve5 rr
3216
+ nver1 n ve1 rr
3217
+ nver2 n ve2 rr
3218
+ nver3 n ve3 rr
3219
+ nver4 n ve4 rr
3220
+ nver5 n ve5 rr
3221
+ nuor1 n uo1 rr
3222
+ nuor2 n uo2 rr
3223
+ nuor3 n uo3 rr
3224
+ nuor4 n uo4 rr
3225
+ nuor5 n uo5 rr
3226
+ nvr1 n v1 rr
3227
+ nvr2 n v2 rr
3228
+ nvr3 n v3 rr
3229
+ nvr4 n v4 rr
3230
+ nvr5 n v5 rr
3231
+ or1 o1 rr
3232
+ or2 o2 rr
3233
+ or3 o3 rr
3234
+ or4 o4 rr
3235
+ or5 o5 rr
3236
+ our1 ou1 rr
3237
+ our2 ou2 rr
3238
+ our3 ou3 rr
3239
+ our4 ou4 rr
3240
+ our5 ou5 rr
3241
+ par1 p a1 rr
3242
+ par2 p a2 rr
3243
+ par3 p a3 rr
3244
+ par4 p a4 rr
3245
+ par5 p a5 rr
3246
+ pair1 p ai1 rr
3247
+ pair2 p ai2 rr
3248
+ pair3 p ai3 rr
3249
+ pair4 p ai4 rr
3250
+ pair5 p ai5 rr
3251
+ panr1 p an1 rr
3252
+ panr2 p an2 rr
3253
+ panr3 p an3 rr
3254
+ panr4 p an4 rr
3255
+ panr5 p an5 rr
3256
+ pangr1 p ang1 rr
3257
+ pangr2 p ang2 rr
3258
+ pangr3 p ang3 rr
3259
+ pangr4 p ang4 rr
3260
+ pangr5 p ang5 rr
3261
+ paor1 p ao1 rr
3262
+ paor2 p ao2 rr
3263
+ paor3 p ao3 rr
3264
+ paor4 p ao4 rr
3265
+ paor5 p ao5 rr
3266
+ peir1 p ei1 rr
3267
+ peir2 p ei2 rr
3268
+ peir3 p ei3 rr
3269
+ peir4 p ei4 rr
3270
+ peir5 p ei5 rr
3271
+ penr1 p en1 rr
3272
+ penr2 p en2 rr
3273
+ penr3 p en3 rr
3274
+ penr4 p en4 rr
3275
+ penr5 p en5 rr
3276
+ pengr1 p eng1 rr
3277
+ pengr2 p eng2 rr
3278
+ pengr3 p eng3 rr
3279
+ pengr4 p eng4 rr
3280
+ pengr5 p eng5 rr
3281
+ pir1 p i1 rr
3282
+ pir2 p i2 rr
3283
+ pir3 p i3 rr
3284
+ pir4 p i4 rr
3285
+ pir5 p i5 rr
3286
+ pianr1 p ian1 rr
3287
+ pianr2 p ian2 rr
3288
+ pianr3 p ian3 rr
3289
+ pianr4 p ian4 rr
3290
+ pianr5 p ian5 rr
3291
+ piaor1 p iao1 rr
3292
+ piaor2 p iao2 rr
3293
+ piaor3 p iao3 rr
3294
+ piaor4 p iao4 rr
3295
+ piaor5 p iao5 rr
3296
+ pier1 p ie1 rr
3297
+ pier2 p ie2 rr
3298
+ pier3 p ie3 rr
3299
+ pier4 p ie4 rr
3300
+ pier5 p ie5 rr
3301
+ pinr1 p in1 rr
3302
+ pinr2 p in2 rr
3303
+ pinr3 p in3 rr
3304
+ pinr4 p in4 rr
3305
+ pinr5 p in5 rr
3306
+ pingr1 p ing1 rr
3307
+ pingr2 p ing2 rr
3308
+ pingr3 p ing3 rr
3309
+ pingr4 p ing4 rr
3310
+ pingr5 p ing5 rr
3311
+ por1 p o1 rr
3312
+ por2 p o2 rr
3313
+ por3 p o3 rr
3314
+ por4 p o4 rr
3315
+ por5 p o5 rr
3316
+ pour1 p ou1 rr
3317
+ pour2 p ou2 rr
3318
+ pour3 p ou3 rr
3319
+ pour4 p ou4 rr
3320
+ pour5 p ou5 rr
3321
+ pur1 p u1 rr
3322
+ pur2 p u2 rr
3323
+ pur3 p u3 rr
3324
+ pur4 p u4 rr
3325
+ pur5 p u5 rr
3326
+ qir1 q i1 rr
3327
+ qir2 q i2 rr
3328
+ qir3 q i3 rr
3329
+ qir4 q i4 rr
3330
+ qir5 q i5 rr
3331
+ qiar1 q ia1 rr
3332
+ qiar2 q ia2 rr
3333
+ qiar3 q ia3 rr
3334
+ qiar4 q ia4 rr
3335
+ qiar5 q ia5 rr
3336
+ qianr1 q ian1 rr
3337
+ qianr2 q ian2 rr
3338
+ qianr3 q ian3 rr
3339
+ qianr4 q ian4 rr
3340
+ qianr5 q ian5 rr
3341
+ qiangr1 q iang1 rr
3342
+ qiangr2 q iang2 rr
3343
+ qiangr3 q iang3 rr
3344
+ qiangr4 q iang4 rr
3345
+ qiangr5 q iang5 rr
3346
+ qiaor1 q iao1 rr
3347
+ qiaor2 q iao2 rr
3348
+ qiaor3 q iao3 rr
3349
+ qiaor4 q iao4 rr
3350
+ qiaor5 q iao5 rr
3351
+ qier1 q ie1 rr
3352
+ qier2 q ie2 rr
3353
+ qier3 q ie3 rr
3354
+ qier4 q ie4 rr
3355
+ qier5 q ie5 rr
3356
+ qinr1 q in1 rr
3357
+ qinr2 q in2 rr
3358
+ qinr3 q in3 rr
3359
+ qinr4 q in4 rr
3360
+ qinr5 q in5 rr
3361
+ qingr1 q ing1 rr
3362
+ qingr2 q ing2 rr
3363
+ qingr3 q ing3 rr
3364
+ qingr4 q ing4 rr
3365
+ qingr5 q ing5 rr
3366
+ qiongr1 q iong1 rr
3367
+ qiongr2 q iong2 rr
3368
+ qiongr3 q iong3 rr
3369
+ qiongr4 q iong4 rr
3370
+ qiongr5 q iong5 rr
3371
+ qiur1 q iou1 rr
3372
+ qiur2 q iou2 rr
3373
+ qiur3 q iou3 rr
3374
+ qiur4 q iou4 rr
3375
+ qiur5 q iou5 rr
3376
+ qur1 q v1 rr
3377
+ qur2 q v2 rr
3378
+ qur3 q v3 rr
3379
+ qur4 q v4 rr
3380
+ qur5 q v5 rr
3381
+ quanr1 q van1 rr
3382
+ quanr2 q van2 rr
3383
+ quanr3 q van3 rr
3384
+ quanr4 q van4 rr
3385
+ quanr5 q van5 rr
3386
+ quer1 q ve1 rr
3387
+ quer2 q ve2 rr
3388
+ quer3 q ve3 rr
3389
+ quer4 q ve4 rr
3390
+ quer5 q ve5 rr
3391
+ qunr1 q vn1 rr
3392
+ qunr2 q vn2 rr
3393
+ qunr3 q vn3 rr
3394
+ qunr4 q vn4 rr
3395
+ qunr5 q vn5 rr
3396
+ ranr1 r an1 rr
3397
+ ranr2 r an2 rr
3398
+ ranr3 r an3 rr
3399
+ ranr4 r an4 rr
3400
+ ranr5 r an5 rr
3401
+ rangr1 r ang1 rr
3402
+ rangr2 r ang2 rr
3403
+ rangr3 r ang3 rr
3404
+ rangr4 r ang4 rr
3405
+ rangr5 r ang5 rr
3406
+ raor1 r ao1 rr
3407
+ raor2 r ao2 rr
3408
+ raor3 r ao3 rr
3409
+ raor4 r ao4 rr
3410
+ raor5 r ao5 rr
3411
+ rer1 r e1 rr
3412
+ rer2 r e2 rr
3413
+ rer3 r e3 rr
3414
+ rer4 r e4 rr
3415
+ rer5 r e5 rr
3416
+ renr1 r en1 rr
3417
+ renr2 r en2 rr
3418
+ renr3 r en3 rr
3419
+ renr4 r en4 rr
3420
+ renr5 r en5 rr
3421
+ rengr1 r eng1 rr
3422
+ rengr2 r eng2 rr
3423
+ rengr3 r eng3 rr
3424
+ rengr4 r eng4 rr
3425
+ rengr5 r eng5 rr
3426
+ rir1 r iii1 rr
3427
+ rir2 r iii2 rr
3428
+ rir3 r iii3 rr
3429
+ rir4 r iii4 rr
3430
+ rir5 r iii5 rr
3431
+ rongr1 r ong1 rr
3432
+ rongr2 r ong2 rr
3433
+ rongr3 r ong3 rr
3434
+ rongr4 r ong4 rr
3435
+ rongr5 r ong5 rr
3436
+ rour1 r ou1 rr
3437
+ rour2 r ou2 rr
3438
+ rour3 r ou3 rr
3439
+ rour4 r ou4 rr
3440
+ rour5 r ou5 rr
3441
+ rur1 r u1 rr
3442
+ rur2 r u2 rr
3443
+ rur3 r u3 rr
3444
+ rur4 r u4 rr
3445
+ rur5 r u5 rr
3446
+ ruar1 r ua1 rr
3447
+ ruar2 r ua2 rr
3448
+ ruar3 r ua3 rr
3449
+ ruar4 r ua4 rr
3450
+ ruar5 r ua5 rr
3451
+ ruanr1 r uan1 rr
3452
+ ruanr2 r uan2 rr
3453
+ ruanr3 r uan3 rr
3454
+ ruanr4 r uan4 rr
3455
+ ruanr5 r uan5 rr
3456
+ ruir1 r uei1 rr
3457
+ ruir2 r uei2 rr
3458
+ ruir3 r uei3 rr
3459
+ ruir4 r uei4 rr
3460
+ ruir5 r uei5 rr
3461
+ runr1 r uen1 rr
3462
+ runr2 r uen2 rr
3463
+ runr3 r uen3 rr
3464
+ runr4 r uen4 rr
3465
+ runr5 r uen5 rr
3466
+ ruor1 r uo1 rr
3467
+ ruor2 r uo2 rr
3468
+ ruor3 r uo3 rr
3469
+ ruor4 r uo4 rr
3470
+ ruor5 r uo5 rr
3471
+ sar1 s a1 rr
3472
+ sar2 s a2 rr
3473
+ sar3 s a3 rr
3474
+ sar4 s a4 rr
3475
+ sar5 s a5 rr
3476
+ sair1 s ai1 rr
3477
+ sair2 s ai2 rr
3478
+ sair3 s ai3 rr
3479
+ sair4 s ai4 rr
3480
+ sair5 s ai5 rr
3481
+ sanr1 s an1 rr
3482
+ sanr2 s an2 rr
3483
+ sanr3 s an3 rr
3484
+ sanr4 s an4 rr
3485
+ sanr5 s an5 rr
3486
+ sangr1 s ang1 rr
3487
+ sangr2 s ang2 rr
3488
+ sangr3 s ang3 rr
3489
+ sangr4 s ang4 rr
3490
+ sangr5 s ang5 rr
3491
+ saor1 s ao1 rr
3492
+ saor2 s ao2 rr
3493
+ saor3 s ao3 rr
3494
+ saor4 s ao4 rr
3495
+ saor5 s ao5 rr
3496
+ ser1 s e1 rr
3497
+ ser2 s e2 rr
3498
+ ser3 s e3 rr
3499
+ ser4 s e4 rr
3500
+ ser5 s e5 rr
3501
+ senr1 s en1 rr
3502
+ senr2 s en2 rr
3503
+ senr3 s en3 rr
3504
+ senr4 s en4 rr
3505
+ senr5 s en5 rr
3506
+ sengr1 s eng1 rr
3507
+ sengr2 s eng2 rr
3508
+ sengr3 s eng3 rr
3509
+ sengr4 s eng4 rr
3510
+ sengr5 s eng5 rr
3511
+ shar1 sh a1 rr
3512
+ shar2 sh a2 rr
3513
+ shar3 sh a3 rr
3514
+ shar4 sh a4 rr
3515
+ shar5 sh a5 rr
3516
+ shair1 sh ai1 rr
3517
+ shair2 sh ai2 rr
3518
+ shair3 sh ai3 rr
3519
+ shair4 sh ai4 rr
3520
+ shair5 sh ai5 rr
3521
+ shanr1 sh an1 rr
3522
+ shanr2 sh an2 rr
3523
+ shanr3 sh an3 rr
3524
+ shanr4 sh an4 rr
3525
+ shanr5 sh an5 rr
3526
+ shangr1 sh ang1 rr
3527
+ shangr2 sh ang2 rr
3528
+ shangr3 sh ang3 rr
3529
+ shangr4 sh ang4 rr
3530
+ shangr5 sh ang5 rr
3531
+ shaor1 sh ao1 rr
3532
+ shaor2 sh ao2 rr
3533
+ shaor3 sh ao3 rr
3534
+ shaor4 sh ao4 rr
3535
+ shaor5 sh ao5 rr
3536
+ sher1 sh e1 rr
3537
+ sher2 sh e2 rr
3538
+ sher3 sh e3 rr
3539
+ sher4 sh e4 rr
3540
+ sher5 sh e5 rr
3541
+ sheir1 sh ei1 rr
3542
+ sheir2 sh ei2 rr
3543
+ sheir3 sh ei3 rr
3544
+ sheir4 sh ei4 rr
3545
+ sheir5 sh ei5 rr
3546
+ shenr1 sh en1 rr
3547
+ shenr2 sh en2 rr
3548
+ shenr3 sh en3 rr
3549
+ shenr4 sh en4 rr
3550
+ shenr5 sh en5 rr
3551
+ shengr1 sh eng1 rr
3552
+ shengr2 sh eng2 rr
3553
+ shengr3 sh eng3 rr
3554
+ shengr4 sh eng4 rr
3555
+ shengr5 sh eng5 rr
3556
+ shir1 sh iii1 rr
3557
+ shir2 sh iii2 rr
3558
+ shir3 sh iii3 rr
3559
+ shir4 sh iii4 rr
3560
+ shir5 sh iii5 rr
3561
+ shour1 sh ou1 rr
3562
+ shour2 sh ou2 rr
3563
+ shour3 sh ou3 rr
3564
+ shour4 sh ou4 rr
3565
+ shour5 sh ou5 rr
3566
+ shur1 sh u1 rr
3567
+ shur2 sh u2 rr
3568
+ shur3 sh u3 rr
3569
+ shur4 sh u4 rr
3570
+ shur5 sh u5 rr
3571
+ shuar1 sh ua1 rr
3572
+ shuar2 sh ua2 rr
3573
+ shuar3 sh ua3 rr
3574
+ shuar4 sh ua4 rr
3575
+ shuar5 sh ua5 rr
3576
+ shuair1 sh uai1 rr
3577
+ shuair2 sh uai2 rr
3578
+ shuair3 sh uai3 rr
3579
+ shuair4 sh uai4 rr
3580
+ shuair5 sh uai5 rr
3581
+ shuanr1 sh uan1 rr
3582
+ shuanr2 sh uan2 rr
3583
+ shuanr3 sh uan3 rr
3584
+ shuanr4 sh uan4 rr
3585
+ shuanr5 sh uan5 rr
3586
+ shuangr1 sh uang1 rr
3587
+ shuangr2 sh uang2 rr
3588
+ shuangr3 sh uang3 rr
3589
+ shuangr4 sh uang4 rr
3590
+ shuangr5 sh uang5 rr
3591
+ shuir1 sh uei1 rr
3592
+ shuir2 sh uei2 rr
3593
+ shuir3 sh uei3 rr
3594
+ shuir4 sh uei4 rr
3595
+ shuir5 sh uei5 rr
3596
+ shunr1 sh uen1 rr
3597
+ shunr2 sh uen2 rr
3598
+ shunr3 sh uen3 rr
3599
+ shunr4 sh uen4 rr
3600
+ shunr5 sh uen5 rr
3601
+ shuor1 sh uo1 rr
3602
+ shuor2 sh uo2 rr
3603
+ shuor3 sh uo3 rr
3604
+ shuor4 sh uo4 rr
3605
+ shuor5 sh uo5 rr
3606
+ sir1 s ii1 rr
3607
+ sir2 s ii2 rr
3608
+ sir3 s ii3 rr
3609
+ sir4 s ii4 rr
3610
+ sir5 s ii5 rr
3611
+ songr1 s ong1 rr
3612
+ songr2 s ong2 rr
3613
+ songr3 s ong3 rr
3614
+ songr4 s ong4 rr
3615
+ songr5 s ong5 rr
3616
+ sour1 s ou1 rr
3617
+ sour2 s ou2 rr
3618
+ sour3 s ou3 rr
3619
+ sour4 s ou4 rr
3620
+ sour5 s ou5 rr
3621
+ sur1 s u1 rr
3622
+ sur2 s u2 rr
3623
+ sur3 s u3 rr
3624
+ sur4 s u4 rr
3625
+ sur5 s u5 rr
3626
+ suanr1 s uan1 rr
3627
+ suanr2 s uan2 rr
3628
+ suanr3 s uan3 rr
3629
+ suanr4 s uan4 rr
3630
+ suanr5 s uan5 rr
3631
+ suir1 s uei1 rr
3632
+ suir2 s uei2 rr
3633
+ suir3 s uei3 rr
3634
+ suir4 s uei4 rr
3635
+ suir5 s uei5 rr
3636
+ sunr1 s uen1 rr
3637
+ sunr2 s uen2 rr
3638
+ sunr3 s uen3 rr
3639
+ sunr4 s uen4 rr
3640
+ sunr5 s uen5 rr
3641
+ suor1 s uo1 rr
3642
+ suor2 s uo2 rr
3643
+ suor3 s uo3 rr
3644
+ suor4 s uo4 rr
3645
+ suor5 s uo5 rr
3646
+ tar1 t a1 rr
3647
+ tar2 t a2 rr
3648
+ tar3 t a3 rr
3649
+ tar4 t a4 rr
3650
+ tar5 t a5 rr
3651
+ tair1 t ai1 rr
3652
+ tair2 t ai2 rr
3653
+ tair3 t ai3 rr
3654
+ tair4 t ai4 rr
3655
+ tair5 t ai5 rr
3656
+ tanr1 t an1 rr
3657
+ tanr2 t an2 rr
3658
+ tanr3 t an3 rr
3659
+ tanr4 t an4 rr
3660
+ tanr5 t an5 rr
3661
+ tangr1 t ang1 rr
3662
+ tangr2 t ang2 rr
3663
+ tangr3 t ang3 rr
3664
+ tangr4 t ang4 rr
3665
+ tangr5 t ang5 rr
3666
+ taor1 t ao1 rr
3667
+ taor2 t ao2 rr
3668
+ taor3 t ao3 rr
3669
+ taor4 t ao4 rr
3670
+ taor5 t ao5 rr
3671
+ ter1 t e1 rr
3672
+ ter2 t e2 rr
3673
+ ter3 t e3 rr
3674
+ ter4 t e4 rr
3675
+ ter5 t e5 rr
3676
+ teir1 t ei1 rr
3677
+ teir2 t ei2 rr
3678
+ teir3 t ei3 rr
3679
+ teir4 t ei4 rr
3680
+ teir5 t ei5 rr
3681
+ tengr1 t eng1 rr
3682
+ tengr2 t eng2 rr
3683
+ tengr3 t eng3 rr
3684
+ tengr4 t eng4 rr
3685
+ tengr5 t eng5 rr
3686
+ tir1 t i1 rr
3687
+ tir2 t i2 rr
3688
+ tir3 t i3 rr
3689
+ tir4 t i4 rr
3690
+ tir5 t i5 rr
3691
+ tianr1 t ian1 rr
3692
+ tianr2 t ian2 rr
3693
+ tianr3 t ian3 rr
3694
+ tianr4 t ian4 rr
3695
+ tianr5 t ian5 rr
3696
+ tiaor1 t iao1 rr
3697
+ tiaor2 t iao2 rr
3698
+ tiaor3 t iao3 rr
3699
+ tiaor4 t iao4 rr
3700
+ tiaor5 t iao5 rr
3701
+ tier1 t ie1 rr
3702
+ tier2 t ie2 rr
3703
+ tier3 t ie3 rr
3704
+ tier4 t ie4 rr
3705
+ tier5 t ie5 rr
3706
+ tingr1 t ing1 rr
3707
+ tingr2 t ing2 rr
3708
+ tingr3 t ing3 rr
3709
+ tingr4 t ing4 rr
3710
+ tingr5 t ing5 rr
3711
+ tongr1 t ong1 rr
3712
+ tongr2 t ong2 rr
3713
+ tongr3 t ong3 rr
3714
+ tongr4 t ong4 rr
3715
+ tongr5 t ong5 rr
3716
+ tour1 t ou1 rr
3717
+ tour2 t ou2 rr
3718
+ tour3 t ou3 rr
3719
+ tour4 t ou4 rr
3720
+ tour5 t ou5 rr
3721
+ tur1 t u1 rr
3722
+ tur2 t u2 rr
3723
+ tur3 t u3 rr
3724
+ tur4 t u4 rr
3725
+ tur5 t u5 rr
3726
+ tuanr1 t uan1 rr
3727
+ tuanr2 t uan2 rr
3728
+ tuanr3 t uan3 rr
3729
+ tuanr4 t uan4 rr
3730
+ tuanr5 t uan5 rr
3731
+ tuir1 t uei1 rr
3732
+ tuir2 t uei2 rr
3733
+ tuir3 t uei3 rr
3734
+ tuir4 t uei4 rr
3735
+ tuir5 t uei5 rr
3736
+ tunr1 t uen1 rr
3737
+ tunr2 t uen2 rr
3738
+ tunr3 t uen3 rr
3739
+ tunr4 t uen4 rr
3740
+ tunr5 t uen5 rr
3741
+ tuor1 t uo1 rr
3742
+ tuor2 t uo2 rr
3743
+ tuor3 t uo3 rr
3744
+ tuor4 t uo4 rr
3745
+ tuor5 t uo5 rr
3746
+ war1 w ua1 rr
3747
+ war2 w ua2 rr
3748
+ war3 w ua3 rr
3749
+ war4 w ua4 rr
3750
+ war5 w ua5 rr
3751
+ wair1 w uai1 rr
3752
+ wair2 w uai2 rr
3753
+ wair3 w uai3 rr
3754
+ wair4 w uai4 rr
3755
+ wair5 w uai5 rr
3756
+ wanr1 w uan1 rr
3757
+ wanr2 w uan2 rr
3758
+ wanr3 w uan3 rr
3759
+ wanr4 w uan4 rr
3760
+ wanr5 w uan5 rr
3761
+ wangr1 w uang1 rr
3762
+ wangr2 w uang2 rr
3763
+ wangr3 w uang3 rr
3764
+ wangr4 w uang4 rr
3765
+ wangr5 w uang5 rr
3766
+ weir1 w uei1 rr
3767
+ weir2 w uei2 rr
3768
+ weir3 w uei3 rr
3769
+ weir4 w uei4 rr
3770
+ weir5 w uei5 rr
3771
+ wenr1 w uen1 rr
3772
+ wenr2 w uen2 rr
3773
+ wenr3 w uen3 rr
3774
+ wenr4 w uen4 rr
3775
+ wenr5 w uen5 rr
3776
+ wengr1 w uen1 rr
3777
+ wengr2 w uen2 rr
3778
+ wengr3 w uen3 rr
3779
+ wengr4 w uen4 rr
3780
+ wengr5 w uen5 rr
3781
+ wor1 w uo1 rr
3782
+ wor2 w uo2 rr
3783
+ wor3 w uo3 rr
3784
+ wor4 w uo4 rr
3785
+ wor5 w uo5 rr
3786
+ wur1 w u1 rr
3787
+ wur2 w u2 rr
3788
+ wur3 w u3 rr
3789
+ wur4 w u4 rr
3790
+ wur5 w u5 rr
3791
+ xir1 x i1 rr
3792
+ xir2 x i2 rr
3793
+ xir3 x i3 rr
3794
+ xir4 x i4 rr
3795
+ xir5 x i5 rr
3796
+ xiar1 x ia1 rr
3797
+ xiar2 x ia2 rr
3798
+ xiar3 x ia3 rr
3799
+ xiar4 x ia4 rr
3800
+ xiar5 x ia5 rr
3801
+ xianr1 x ian1 rr
3802
+ xianr2 x ian2 rr
3803
+ xianr3 x ian3 rr
3804
+ xianr4 x ian4 rr
3805
+ xianr5 x ian5 rr
3806
+ xiangr1 x iang1 rr
3807
+ xiangr2 x iang2 rr
3808
+ xiangr3 x iang3 rr
3809
+ xiangr4 x iang4 rr
3810
+ xiangr5 x iang5 rr
3811
+ xiaor1 x iao1 rr
3812
+ xiaor2 x iao2 rr
3813
+ xiaor3 x iao3 rr
3814
+ xiaor4 x iao4 rr
3815
+ xiaor5 x iao5 rr
3816
+ xier1 x ie1 rr
3817
+ xier2 x ie2 rr
3818
+ xier3 x ie3 rr
3819
+ xier4 x ie4 rr
3820
+ xier5 x ie5 rr
3821
+ xinr1 x in1 rr
3822
+ xinr2 x in2 rr
3823
+ xinr3 x in3 rr
3824
+ xinr4 x in4 rr
3825
+ xinr5 x in5 rr
3826
+ xingr1 x ing1 rr
3827
+ xingr2 x ing2 rr
3828
+ xingr3 x ing3 rr
3829
+ xingr4 x ing4 rr
3830
+ xingr5 x ing5 rr
3831
+ xiongr1 x iong1 rr
3832
+ xiongr2 x iong2 rr
3833
+ xiongr3 x iong3 rr
3834
+ xiongr4 x iong4 rr
3835
+ xiongr5 x iong5 rr
3836
+ xiur1 x iou1 rr
3837
+ xiur2 x iou2 rr
3838
+ xiur3 x iou3 rr
3839
+ xiur4 x iou4 rr
3840
+ xiur5 x iou5 rr
3841
+ xur1 x v1 rr
3842
+ xur2 x v2 rr
3843
+ xur3 x v3 rr
3844
+ xur4 x v4 rr
3845
+ xur5 x v5 rr
3846
+ xuanr1 x van1 rr
3847
+ xuanr2 x van2 rr
3848
+ xuanr3 x van3 rr
3849
+ xuanr4 x van4 rr
3850
+ xuanr5 x van5 rr
3851
+ xuer1 x ve1 rr
3852
+ xuer2 x ve2 rr
3853
+ xuer3 x ve3 rr
3854
+ xuer4 x ve4 rr
3855
+ xuer5 x ve5 rr
3856
+ xunr1 x vn1 rr
3857
+ xunr2 x vn2 rr
3858
+ xunr3 x vn3 rr
3859
+ xunr4 x vn4 rr
3860
+ xunr5 x vn5 rr
3861
+ yar1 y ia1 rr
3862
+ yar2 y ia2 rr
3863
+ yar3 y ia3 rr
3864
+ yar4 y ia4 rr
3865
+ yar5 y ia5 rr
3866
+ yanr1 y ian1 rr
3867
+ yanr2 y ian2 rr
3868
+ yanr3 y ian3 rr
3869
+ yanr4 y ian4 rr
3870
+ yanr5 y ian5 rr
3871
+ yangr1 y iang1 rr
3872
+ yangr2 y iang2 rr
3873
+ yangr3 y iang3 rr
3874
+ yangr4 y iang4 rr
3875
+ yangr5 y iang5 rr
3876
+ yaor1 y iao1 rr
3877
+ yaor2 y iao2 rr
3878
+ yaor3 y iao3 rr
3879
+ yaor4 y iao4 rr
3880
+ yaor5 y iao5 rr
3881
+ yer1 y ie1 rr
3882
+ yer2 y ie2 rr
3883
+ yer3 y ie3 rr
3884
+ yer4 y ie4 rr
3885
+ yer5 y ie5 rr
3886
+ yir1 y i1 rr
3887
+ yir2 y i2 rr
3888
+ yir3 y i3 rr
3889
+ yir4 y i4 rr
3890
+ yir5 y i5 rr
3891
+ yinr1 y in1 rr
3892
+ yinr2 y in2 rr
3893
+ yinr3 y in3 rr
3894
+ yinr4 y in4 rr
3895
+ yinr5 y in5 rr
3896
+ yingr1 y ing1 rr
3897
+ yingr2 y ing2 rr
3898
+ yingr3 y ing3 rr
3899
+ yingr4 y ing4 rr
3900
+ yingr5 y ing5 rr
3901
+ yor1 y iou1 rr
3902
+ yor2 y iou2 rr
3903
+ yor3 y iou3 rr
3904
+ yor4 y iou4 rr
3905
+ yor5 y iou5 rr
3906
+ yongr1 y iong1 rr
3907
+ yongr2 y iong2 rr
3908
+ yongr3 y iong3 rr
3909
+ yongr4 y iong4 rr
3910
+ yongr5 y iong5 rr
3911
+ your1 y iou1 rr
3912
+ your2 y iou2 rr
3913
+ your3 y iou3 rr
3914
+ your4 y iou4 rr
3915
+ your5 y iou5 rr
3916
+ yur1 y v1 rr
3917
+ yur2 y v2 rr
3918
+ yur3 y v3 rr
3919
+ yur4 y v4 rr
3920
+ yur5 y v5 rr
3921
+ yuanr1 y van1 rr
3922
+ yuanr2 y van2 rr
3923
+ yuanr3 y van3 rr
3924
+ yuanr4 y van4 rr
3925
+ yuanr5 y van5 rr
3926
+ yuer1 y ve1 rr
3927
+ yuer2 y ve2 rr
3928
+ yuer3 y ve3 rr
3929
+ yuer4 y ve4 rr
3930
+ yuer5 y ve5 rr
3931
+ yunr1 y vn1 rr
3932
+ yunr2 y vn2 rr
3933
+ yunr3 y vn3 rr
3934
+ yunr4 y vn4 rr
3935
+ yunr5 y vn5 rr
3936
+ zar1 z a1 rr
3937
+ zar2 z a2 rr
3938
+ zar3 z a3 rr
3939
+ zar4 z a4 rr
3940
+ zar5 z a5 rr
3941
+ zair1 z ai1 rr
3942
+ zair2 z ai2 rr
3943
+ zair3 z ai3 rr
3944
+ zair4 z ai4 rr
3945
+ zair5 z ai5 rr
3946
+ zanr1 z an1 rr
3947
+ zanr2 z an2 rr
3948
+ zanr3 z an3 rr
3949
+ zanr4 z an4 rr
3950
+ zanr5 z an5 rr
3951
+ zangr1 z ang1 rr
3952
+ zangr2 z ang2 rr
3953
+ zangr3 z ang3 rr
3954
+ zangr4 z ang4 rr
3955
+ zangr5 z ang5 rr
3956
+ zaor1 z ao1 rr
3957
+ zaor2 z ao2 rr
3958
+ zaor3 z ao3 rr
3959
+ zaor4 z ao4 rr
3960
+ zaor5 z ao5 rr
3961
+ zer1 z e1 rr
3962
+ zer2 z e2 rr
3963
+ zer3 z e3 rr
3964
+ zer4 z e4 rr
3965
+ zer5 z e5 rr
3966
+ zeir1 z ei1 rr
3967
+ zeir2 z ei2 rr
3968
+ zeir3 z ei3 rr
3969
+ zeir4 z ei4 rr
3970
+ zeir5 z ei5 rr
3971
+ zenr1 z en1 rr
3972
+ zenr2 z en2 rr
3973
+ zenr3 z en3 rr
3974
+ zenr4 z en4 rr
3975
+ zenr5 z en5 rr
3976
+ zengr1 z eng1 rr
3977
+ zengr2 z eng2 rr
3978
+ zengr3 z eng3 rr
3979
+ zengr4 z eng4 rr
3980
+ zengr5 z eng5 rr
3981
+ zhar1 zh a1 rr
3982
+ zhar2 zh a2 rr
3983
+ zhar3 zh a3 rr
3984
+ zhar4 zh a4 rr
3985
+ zhar5 zh a5 rr
3986
+ zhair1 zh ai1 rr
3987
+ zhair2 zh ai2 rr
3988
+ zhair3 zh ai3 rr
3989
+ zhair4 zh ai4 rr
3990
+ zhair5 zh ai5 rr
3991
+ zhanr1 zh an1 rr
3992
+ zhanr2 zh an2 rr
3993
+ zhanr3 zh an3 rr
3994
+ zhanr4 zh an4 rr
3995
+ zhanr5 zh an5 rr
3996
+ zhangr1 zh ang1 rr
3997
+ zhangr2 zh ang2 rr
3998
+ zhangr3 zh ang3 rr
3999
+ zhangr4 zh ang4 rr
4000
+ zhangr5 zh ang5 rr
4001
+ zhaor1 zh ao1 rr
4002
+ zhaor2 zh ao2 rr
4003
+ zhaor3 zh ao3 rr
4004
+ zhaor4 zh ao4 rr
4005
+ zhaor5 zh ao5 rr
4006
+ zher1 zh e1 rr
4007
+ zher2 zh e2 rr
4008
+ zher3 zh e3 rr
4009
+ zher4 zh e4 rr
4010
+ zher5 zh e5 rr
4011
+ zheir1 zh ei1 rr
4012
+ zheir2 zh ei2 rr
4013
+ zheir3 zh ei3 rr
4014
+ zheir4 zh ei4 rr
4015
+ zheir5 zh ei5 rr
4016
+ zhenr1 zh en1 rr
4017
+ zhenr2 zh en2 rr
4018
+ zhenr3 zh en3 rr
4019
+ zhenr4 zh en4 rr
4020
+ zhenr5 zh en5 rr
4021
+ zhengr1 zh eng1 rr
4022
+ zhengr2 zh eng2 rr
4023
+ zhengr3 zh eng3 rr
4024
+ zhengr4 zh eng4 rr
4025
+ zhengr5 zh eng5 rr
4026
+ zhir1 zh iii1 rr
4027
+ zhir2 zh iii2 rr
4028
+ zhir3 zh iii3 rr
4029
+ zhir4 zh iii4 rr
4030
+ zhir5 zh iii5 rr
4031
+ zhongr1 zh ong1 rr
4032
+ zhongr2 zh ong2 rr
4033
+ zhongr3 zh ong3 rr
4034
+ zhongr4 zh ong4 rr
4035
+ zhongr5 zh ong5 rr
4036
+ zhour1 zh ou1 rr
4037
+ zhour2 zh ou2 rr
4038
+ zhour3 zh ou3 rr
4039
+ zhour4 zh ou4 rr
4040
+ zhour5 zh ou5 rr
4041
+ zhur1 zh u1 rr
4042
+ zhur2 zh u2 rr
4043
+ zhur3 zh u3 rr
4044
+ zhur4 zh u4 rr
4045
+ zhur5 zh u5 rr
4046
+ zhuar1 zh ua1 rr
4047
+ zhuar2 zh ua2 rr
4048
+ zhuar3 zh ua3 rr
4049
+ zhuar4 zh ua4 rr
4050
+ zhuar5 zh ua5 rr
4051
+ zhuair1 zh uai1 rr
4052
+ zhuair2 zh uai2 rr
4053
+ zhuair3 zh uai3 rr
4054
+ zhuair4 zh uai4 rr
4055
+ zhuair5 zh uai5 rr
4056
+ zhuanr1 zh uan1 rr
4057
+ zhuanr2 zh uan2 rr
4058
+ zhuanr3 zh uan3 rr
4059
+ zhuanr4 zh uan4 rr
4060
+ zhuanr5 zh uan5 rr
4061
+ zhuangr1 zh uang1 rr
4062
+ zhuangr2 zh uang2 rr
4063
+ zhuangr3 zh uang3 rr
4064
+ zhuangr4 zh uang4 rr
4065
+ zhuangr5 zh uang5 rr
4066
+ zhuir1 zh uei1 rr
4067
+ zhuir2 zh uei2 rr
4068
+ zhuir3 zh uei3 rr
4069
+ zhuir4 zh uei4 rr
4070
+ zhuir5 zh uei5 rr
4071
+ zhunr1 zh uen1 rr
4072
+ zhunr2 zh uen2 rr
4073
+ zhunr3 zh uen3 rr
4074
+ zhunr4 zh uen4 rr
4075
+ zhunr5 zh uen5 rr
4076
+ zhuor1 zh uo1 rr
4077
+ zhuor2 zh uo2 rr
4078
+ zhuor3 zh uo3 rr
4079
+ zhuor4 zh uo4 rr
4080
+ zhuor5 zh uo5 rr
4081
+ zir1 z ii1 rr
4082
+ zir2 z ii2 rr
4083
+ zir3 z ii3 rr
4084
+ zir4 z ii4 rr
4085
+ zir5 z ii5 rr
4086
+ zongr1 z ong1 rr
4087
+ zongr2 z ong2 rr
4088
+ zongr3 z ong3 rr
4089
+ zongr4 z ong4 rr
4090
+ zongr5 z ong5 rr
4091
+ zour1 z ou1 rr
4092
+ zour2 z ou2 rr
4093
+ zour3 z ou3 rr
4094
+ zour4 z ou4 rr
4095
+ zour5 z ou5 rr
4096
+ zur1 z u1 rr
4097
+ zur2 z u2 rr
4098
+ zur3 z u3 rr
4099
+ zur4 z u4 rr
4100
+ zur5 z u5 rr
4101
+ zuanr1 z uan1 rr
4102
+ zuanr2 z uan2 rr
4103
+ zuanr3 z uan3 rr
4104
+ zuanr4 z uan4 rr
4105
+ zuanr5 z uan5 rr
4106
+ zuir1 z uei1 rr
4107
+ zuir2 z uei2 rr
4108
+ zuir3 z uei3 rr
4109
+ zuir4 z uei4 rr
4110
+ zuir5 z uei5 rr
4111
+ zunr1 z uen1 rr
4112
+ zunr2 z uen2 rr
4113
+ zunr3 z uen3 rr
4114
+ zunr4 z uen4 rr
4115
+ zunr5 z uen5 rr
4116
+ zuor1 z uo1 rr
4117
+ zuor2 z uo2 rr
4118
+ zuor3 z uo3 rr
4119
+ zuor4 z uo4 rr
4120
+ zuor5 z uo5 rr
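Each row in the erhua block above maps a rhotacized, toned pinyin syllable (e.g. bangr2) to its initial (b), its toned final (ang2), and a trailing rr marker for the r-coloring. A minimal loading sketch, assuming only the whitespace-separated column layout visible above; the helper name load_pinyin_lexicon is hypothetical and not part of this repo:

def load_pinyin_lexicon(path="lemas_tts/infer/text_norm/pinyin-lexicon-r.txt"):
    """Load rows like "bangr2 b ang2 rr" into {syllable: [phone, ...]}."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                # first column is the toned syllable; the rest is its phone sequence
                lexicon[parts[0]] = parts[1:]
    return lexicon

# e.g. load_pinyin_lexicon()["bangr2"] -> ["b", "ang2", "rr"]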
lemas_tts/infer/text_norm/symbols.py ADDED
@@ -0,0 +1,419 @@
1
+ pinyin_dict = {
2
+ "a": ("^", "a"),
3
+ "ai": ("^", "ai"),
4
+ "an": ("^", "an"),
5
+ "ang": ("^", "ang"),
6
+ "ao": ("^", "ao"),
7
+ "ba": ("b", "a"),
8
+ "bai": ("b", "ai"),
9
+ "ban": ("b", "an"),
10
+ "bang": ("b", "ang"),
11
+ "bao": ("b", "ao"),
12
+ "be": ("b", "e"),
13
+ "bei": ("b", "ei"),
14
+ "ben": ("b", "en"),
15
+ "beng": ("b", "eng"),
16
+ "bi": ("b", "i"),
17
+ "bian": ("b", "ian"),
18
+ "biao": ("b", "iao"),
19
+ "bie": ("b", "ie"),
20
+ "bin": ("b", "in"),
21
+ "bing": ("b", "ing"),
22
+ "bo": ("b", "o"),
23
+ "bu": ("b", "u"),
24
+ "ca": ("c", "a"),
25
+ "cai": ("c", "ai"),
26
+ "can": ("c", "an"),
27
+ "cang": ("c", "ang"),
28
+ "cao": ("c", "ao"),
29
+ "ce": ("c", "e"),
30
+ "cen": ("c", "en"),
31
+ "ceng": ("c", "eng"),
32
+ "cha": ("ch", "a"),
33
+ "chai": ("ch", "ai"),
34
+ "chan": ("ch", "an"),
35
+ "chang": ("ch", "ang"),
36
+ "chao": ("ch", "ao"),
37
+ "che": ("ch", "e"),
38
+ "chen": ("ch", "en"),
39
+ "cheng": ("ch", "eng"),
40
+ "chi": ("ch", "iii"),
41
+ "chong": ("ch", "ong"),
42
+ "chou": ("ch", "ou"),
43
+ "chu": ("ch", "u"),
44
+ "chua": ("ch", "ua"),
45
+ "chuai": ("ch", "uai"),
46
+ "chuan": ("ch", "uan"),
47
+ "chuang": ("ch", "uang"),
48
+ "chui": ("ch", "uei"),
49
+ "chun": ("ch", "uen"),
50
+ "chuo": ("ch", "uo"),
51
+ "ci": ("c", "ii"),
52
+ "cong": ("c", "ong"),
53
+ "cou": ("c", "ou"),
54
+ "cu": ("c", "u"),
55
+ "cuan": ("c", "uan"),
56
+ "cui": ("c", "uei"),
57
+ "cun": ("c", "uen"),
58
+ "cuo": ("c", "uo"),
59
+ "da": ("d", "a"),
60
+ "dai": ("d", "ai"),
61
+ "dan": ("d", "an"),
62
+ "dang": ("d", "ang"),
63
+ "dao": ("d", "ao"),
64
+ "de": ("d", "e"),
65
+ "dei": ("d", "ei"),
66
+ "den": ("d", "en"),
67
+ "deng": ("d", "eng"),
68
+ "di": ("d", "i"),
69
+ "dia": ("d", "ia"),
70
+ "dian": ("d", "ian"),
71
+ "diao": ("d", "iao"),
72
+ "die": ("d", "ie"),
73
+ "ding": ("d", "ing"),
74
+ "diu": ("d", "iou"),
75
+ "dong": ("d", "ong"),
76
+ "dou": ("d", "ou"),
77
+ "du": ("d", "u"),
78
+ "duan": ("d", "uan"),
79
+ "dui": ("d", "uei"),
80
+ "dun": ("d", "uen"),
81
+ "duo": ("d", "uo"),
82
+ "e": ("^", "e"),
83
+ "ei": ("^", "ei"),
84
+ "en": ("^", "en"),
85
+ "ng": ("^", "en"),
86
+ "eng": ("^", "eng"),
87
+ "er": ("^", "er"),
88
+ "fa": ("f", "a"),
89
+ "fan": ("f", "an"),
90
+ "fang": ("f", "ang"),
91
+ "fei": ("f", "ei"),
92
+ "fen": ("f", "en"),
93
+ "feng": ("f", "eng"),
94
+ "fo": ("f", "o"),
95
+ "fou": ("f", "ou"),
96
+ "fu": ("f", "u"),
97
+ "ga": ("g", "a"),
98
+ "gai": ("g", "ai"),
99
+ "gan": ("g", "an"),
100
+ "gang": ("g", "ang"),
101
+ "gao": ("g", "ao"),
102
+ "ge": ("g", "e"),
103
+ "gei": ("g", "ei"),
104
+ "gen": ("g", "en"),
105
+ "geng": ("g", "eng"),
106
+ "gong": ("g", "ong"),
107
+ "gou": ("g", "ou"),
108
+ "gu": ("g", "u"),
109
+ "gua": ("g", "ua"),
110
+ "guai": ("g", "uai"),
111
+ "guan": ("g", "uan"),
112
+ "guang": ("g", "uang"),
113
+ "gui": ("g", "uei"),
114
+ "gun": ("g", "uen"),
115
+ "guo": ("g", "uo"),
116
+ "ha": ("h", "a"),
117
+ "hai": ("h", "ai"),
118
+ "han": ("h", "an"),
119
+ "hang": ("h", "ang"),
120
+ "hao": ("h", "ao"),
121
+ "he": ("h", "e"),
122
+ "hei": ("h", "ei"),
123
+ "hen": ("h", "en"),
124
+ "heng": ("h", "eng"),
125
+ "hong": ("h", "ong"),
126
+ "hou": ("h", "ou"),
127
+ "hu": ("h", "u"),
128
+ "hua": ("h", "ua"),
129
+ "huai": ("h", "uai"),
130
+ "huan": ("h", "uan"),
131
+ "huang": ("h", "uang"),
132
+ "hui": ("h", "uei"),
133
+ "hun": ("h", "uen"),
134
+ "huo": ("h", "uo"),
135
+ "ji": ("j", "i"),
136
+ "jia": ("j", "ia"),
137
+ "jian": ("j", "ian"),
138
+ "jiang": ("j", "iang"),
139
+ "jiao": ("j", "iao"),
140
+ "jie": ("j", "ie"),
141
+ "jin": ("j", "in"),
142
+ "jing": ("j", "ing"),
143
+ "jiong": ("j", "iong"),
144
+ "jiu": ("j", "iou"),
145
+ "ju": ("j", "v"),
146
+ "juan": ("j", "van"),
147
+ "jue": ("j", "ve"),
148
+ "jun": ("j", "vn"),
149
+ "ka": ("k", "a"),
150
+ "kai": ("k", "ai"),
151
+ "kan": ("k", "an"),
152
+ "kang": ("k", "ang"),
153
+ "kao": ("k", "ao"),
154
+ "ke": ("k", "e"),
155
+ "kei": ("k", "ei"),
156
+ "ken": ("k", "en"),
157
+ "keng": ("k", "eng"),
158
+ "kong": ("k", "ong"),
159
+ "kou": ("k", "ou"),
160
+ "ku": ("k", "u"),
161
+ "kua": ("k", "ua"),
162
+ "kuai": ("k", "uai"),
163
+ "kuan": ("k", "uan"),
164
+ "kuang": ("k", "uang"),
165
+ "kui": ("k", "uei"),
166
+ "kun": ("k", "uen"),
167
+ "kuo": ("k", "uo"),
168
+ "la": ("l", "a"),
169
+ "lai": ("l", "ai"),
170
+ "lan": ("l", "an"),
171
+ "lang": ("l", "ang"),
172
+ "lao": ("l", "ao"),
173
+ "le": ("l", "e"),
174
+ "lei": ("l", "ei"),
175
+ "leng": ("l", "eng"),
176
+ "li": ("l", "i"),
177
+ "lia": ("l", "ia"),
178
+ "lian": ("l", "ian"),
179
+ "liang": ("l", "iang"),
180
+ "liao": ("l", "iao"),
181
+ "lie": ("l", "ie"),
182
+ "lin": ("l", "in"),
183
+ "ling": ("l", "ing"),
184
+ "liu": ("l", "iou"),
185
+ "lo": ("l", "o"),
186
+ "long": ("l", "ong"),
187
+ "lou": ("l", "ou"),
188
+ "lu": ("l", "u"),
189
+ "lv": ("l", "v"),
190
+ "luan": ("l", "uan"),
191
+ "lve": ("l", "ve"),
192
+ "lue": ("l", "ve"),
193
+ "lun": ("l", "uen"),
194
+ "luo": ("l", "uo"),
195
+ "ma": ("m", "a"),
196
+ "mai": ("m", "ai"),
197
+ "man": ("m", "an"),
198
+ "mang": ("m", "ang"),
199
+ "mao": ("m", "ao"),
200
+ "me": ("m", "e"),
201
+ "mei": ("m", "ei"),
202
+ "men": ("m", "en"),
203
+ "meng": ("m", "eng"),
204
+ "mi": ("m", "i"),
205
+ "mian": ("m", "ian"),
206
+ "miao": ("m", "iao"),
207
+ "mie": ("m", "ie"),
208
+ "min": ("m", "in"),
209
+ "ming": ("m", "ing"),
210
+ "miu": ("m", "iou"),
211
+ "mo": ("m", "o"),
212
+ "mou": ("m", "ou"),
213
+ "mu": ("m", "u"),
214
+ "na": ("n", "a"),
215
+ "nai": ("n", "ai"),
216
+ "nan": ("n", "an"),
217
+ "nang": ("n", "ang"),
218
+ "nao": ("n", "ao"),
219
+ "ne": ("n", "e"),
220
+ "nei": ("n", "ei"),
221
+ "nen": ("n", "en"),
222
+ "neng": ("n", "eng"),
223
+ "ni": ("n", "i"),
224
+ "nia": ("n", "ia"),
225
+ "nian": ("n", "ian"),
226
+ "niang": ("n", "iang"),
227
+ "niao": ("n", "iao"),
228
+ "nie": ("n", "ie"),
229
+ "nin": ("n", "in"),
230
+ "ning": ("n", "ing"),
231
+ "niu": ("n", "iou"),
232
+ "nong": ("n", "ong"),
233
+ "nou": ("n", "ou"),
234
+ "nu": ("n", "u"),
235
+ "nv": ("n", "v"),
236
+ "nuan": ("n", "uan"),
237
+ "nve": ("n", "ve"),
238
+ "nue": ("n", "ve"),
239
+ "nuo": ("n", "uo"),
240
+ "o": ("^", "o"),
241
+ "ou": ("^", "ou"),
242
+ "pa": ("p", "a"),
243
+ "pai": ("p", "ai"),
244
+ "pan": ("p", "an"),
245
+ "pang": ("p", "ang"),
246
+ "pao": ("p", "ao"),
247
+ "pe": ("p", "e"),
248
+ "pei": ("p", "ei"),
249
+ "pen": ("p", "en"),
250
+ "peng": ("p", "eng"),
251
+ "pi": ("p", "i"),
252
+ "pian": ("p", "ian"),
253
+ "piao": ("p", "iao"),
254
+ "pie": ("p", "ie"),
255
+ "pin": ("p", "in"),
256
+ "ping": ("p", "ing"),
257
+ "po": ("p", "o"),
258
+ "pou": ("p", "ou"),
259
+ "pu": ("p", "u"),
260
+ "qi": ("q", "i"),
261
+ "qia": ("q", "ia"),
262
+ "qian": ("q", "ian"),
263
+ "qiang": ("q", "iang"),
264
+ "qiao": ("q", "iao"),
265
+ "qie": ("q", "ie"),
266
+ "qin": ("q", "in"),
267
+ "qing": ("q", "ing"),
268
+ "qiong": ("q", "iong"),
269
+ "qiu": ("q", "iou"),
270
+ "qu": ("q", "v"),
271
+ "quan": ("q", "van"),
272
+ "que": ("q", "ve"),
273
+ "qun": ("q", "vn"),
274
+ "ran": ("r", "an"),
275
+ "rang": ("r", "ang"),
276
+ "rao": ("r", "ao"),
277
+ "re": ("r", "e"),
278
+ "ren": ("r", "en"),
279
+ "reng": ("r", "eng"),
280
+ "ri": ("r", "iii"),
281
+ "rong": ("r", "ong"),
282
+ "rou": ("r", "ou"),
283
+ "ru": ("r", "u"),
284
+ "rua": ("r", "ua"),
285
+ "ruan": ("r", "uan"),
286
+ "rui": ("r", "uei"),
287
+ "run": ("r", "uen"),
288
+ "ruo": ("r", "uo"),
289
+ "sa": ("s", "a"),
290
+ "sai": ("s", "ai"),
291
+ "san": ("s", "an"),
292
+ "sang": ("s", "ang"),
293
+ "sao": ("s", "ao"),
294
+ "se": ("s", "e"),
295
+ "sen": ("s", "en"),
296
+ "seng": ("s", "eng"),
297
+ "sha": ("sh", "a"),
298
+ "shai": ("sh", "ai"),
299
+ "shan": ("sh", "an"),
300
+ "shang": ("sh", "ang"),
301
+ "shao": ("sh", "ao"),
302
+ "she": ("sh", "e"),
303
+ "shei": ("sh", "ei"),
304
+ "shen": ("sh", "en"),
305
+ "sheng": ("sh", "eng"),
306
+ "shi": ("sh", "iii"),
307
+ "shou": ("sh", "ou"),
308
+ "shu": ("sh", "u"),
309
+ "shua": ("sh", "ua"),
310
+ "shuai": ("sh", "uai"),
311
+ "shuan": ("sh", "uan"),
312
+ "shuang": ("sh", "uang"),
313
+ "shui": ("sh", "uei"),
314
+ "shun": ("sh", "uen"),
315
+ "shuo": ("sh", "uo"),
316
+ "si": ("s", "ii"),
317
+ "song": ("s", "ong"),
318
+ "sou": ("s", "ou"),
319
+ "su": ("s", "u"),
320
+ "suan": ("s", "uan"),
321
+ "sui": ("s", "uei"),
322
+ "sun": ("s", "uen"),
323
+ "suo": ("s", "uo"),
324
+ "ta": ("t", "a"),
325
+ "tai": ("t", "ai"),
326
+ "tan": ("t", "an"),
327
+ "tang": ("t", "ang"),
328
+ "tao": ("t", "ao"),
329
+ "te": ("t", "e"),
330
+ "tei": ("t", "ei"),
331
+ "teng": ("t", "eng"),
332
+ "ti": ("t", "i"),
333
+ "tian": ("t", "ian"),
334
+ "tiao": ("t", "iao"),
335
+ "tie": ("t", "ie"),
336
+ "ting": ("t", "ing"),
337
+ "tong": ("t", "ong"),
338
+ "tou": ("t", "ou"),
339
+ "tu": ("t", "u"),
340
+ "tuan": ("t", "uan"),
341
+ "tui": ("t", "uei"),
342
+ "tun": ("t", "uen"),
343
+ "tuo": ("t", "uo"),
344
+ "wa": ("^", "ua"),
345
+ "wai": ("^", "uai"),
346
+ "wan": ("^", "uan"),
347
+ "wang": ("^", "uang"),
348
+ "wei": ("^", "uei"),
349
+ "wen": ("^", "uen"),
350
+ "weng": ("^", "ueng"),
351
+ "wo": ("^", "uo"),
352
+ "wu": ("^", "u"),
353
+ "xi": ("x", "i"),
354
+ "xia": ("x", "ia"),
355
+ "xian": ("x", "ian"),
356
+ "xiang": ("x", "iang"),
357
+ "xiao": ("x", "iao"),
358
+ "xie": ("x", "ie"),
359
+ "xin": ("x", "in"),
360
+ "xing": ("x", "ing"),
361
+ "xiong": ("x", "iong"),
362
+ "xiu": ("x", "iou"),
363
+ "xu": ("x", "v"),
364
+ "xuan": ("x", "van"),
365
+ "xue": ("x", "ve"),
366
+ "xun": ("x", "vn"),
367
+ "ya": ("^", "ia"),
368
+ "yan": ("^", "ian"),
369
+ "yang": ("^", "iang"),
370
+ "yao": ("^", "iao"),
371
+ "ye": ("^", "ie"),
372
+ "yi": ("^", "i"),
373
+ "yin": ("^", "in"),
374
+ "ying": ("^", "ing"),
375
+ "yo": ("^", "iou"),
376
+ "yong": ("^", "iong"),
377
+ "you": ("^", "iou"),
378
+ "yu": ("^", "v"),
379
+ "yuan": ("^", "van"),
380
+ "yue": ("^", "ve"),
381
+ "yun": ("^", "vn"),
382
+ "za": ("z", "a"),
383
+ "zai": ("z", "ai"),
384
+ "zan": ("z", "an"),
385
+ "zang": ("z", "ang"),
386
+ "zao": ("z", "ao"),
387
+ "ze": ("z", "e"),
388
+ "zei": ("z", "ei"),
389
+ "zen": ("z", "en"),
390
+ "zeng": ("z", "eng"),
391
+ "zha": ("zh", "a"),
392
+ "zhai": ("zh", "ai"),
393
+ "zhan": ("zh", "an"),
394
+ "zhang": ("zh", "ang"),
395
+ "zhao": ("zh", "ao"),
396
+ "zhe": ("zh", "e"),
397
+ "zhei": ("zh", "ei"),
398
+ "zhen": ("zh", "en"),
399
+ "zheng": ("zh", "eng"),
400
+ "zhi": ("zh", "iii"),
401
+ "zhong": ("zh", "ong"),
402
+ "zhou": ("zh", "ou"),
403
+ "zhu": ("zh", "u"),
404
+ "zhua": ("zh", "ua"),
405
+ "zhuai": ("zh", "uai"),
406
+ "zhuan": ("zh", "uan"),
407
+ "zhuang": ("zh", "uang"),
408
+ "zhui": ("zh", "uei"),
409
+ "zhun": ("zh", "uen"),
410
+ "zhuo": ("zh", "uo"),
411
+ "zi": ("z", "ii"),
412
+ "zong": ("z", "ong"),
413
+ "zou": ("z", "ou"),
414
+ "zu": ("z", "u"),
415
+ "zuan": ("z", "uan"),
416
+ "zui": ("z", "uei"),
417
+ "zun": ("z", "uen"),
418
+ "zuo": ("z", "uo"),
419
+ }
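pinyin_dict above maps each toneless pinyin syllable to an (initial, final) pair, with "^" standing in for a null initial and "v" for the umlauted u. A small sketch, assuming the dict as defined above, of how a toned syllable can be split into the initial / toned-final form used by the lexicon (the helper name is hypothetical):

def split_toned_syllable(syllable):
    """Split e.g. "zhong1" into ("zh", "ong1") via pinyin_dict."""
    base, tone = syllable[:-1], syllable[-1]
    assert tone in "12345", f"expected a trailing tone digit in {syllable!r}"
    initial, final = pinyin_dict[base]
    return initial, final + tone

# e.g. split_toned_syllable("lv3") -> ("l", "v3"); split_toned_syllable("an4") -> ("^", "an4")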
lemas_tts/infer/text_norm/tokenizer.py ADDED
@@ -0,0 +1,219 @@
1
+ # cp from https://github.com/lifeiteng/vall-e/blob/main/valle/data/tokenizer.py
+ # Copyright 2023 (authors: Feiteng Li)
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import math
+ import re
+ from typing import Any, List, Pattern, Union
+
+ import torch
+ import torchaudio
+ # from lhotse.features import FeatureExtractor
+ # from lhotse.utils import Seconds, compute_num_frames
+ from phonemizer.backend import EspeakBackend
+ from phonemizer.backend.espeak.language_switch import LanguageSwitch
+ from phonemizer.backend.espeak.words_mismatch import WordMismatch
+ from phonemizer.punctuation import Punctuation
+ from phonemizer.separator import Separator
+
+
+ class TextTokenizer:
+     """Phonemize text with an espeak backend."""
+
+     def __init__(
+         self,
+         language="en-us",
+         backend="espeak",
+         separator=Separator(word="_", syllable="-", phone="|"),
+         preserve_punctuation=True,
+         punctuation_marks: Union[str, Pattern] = Punctuation.default_marks(),
+         with_stress: bool = False,
+         tie: Union[bool, str] = False,
+         language_switch: LanguageSwitch = "keep-flags",
+         words_mismatch: WordMismatch = "ignore",
+     ) -> None:
+         phonemizer = EspeakBackend(
+             language,
+             punctuation_marks=punctuation_marks,
+             preserve_punctuation=preserve_punctuation,
+             with_stress=with_stress,
+             tie=tie,
+             language_switch=language_switch,
+             words_mismatch=words_mismatch,
+         )
+
+         self.backend = phonemizer
+         self.separator = separator
+
+     def to_list(self, phonemized: str) -> List[str]:
+         fields = []
+         for word in phonemized.split(self.separator.word):
+             # "ɐ m|iː|n?" ɹ|ɪ|z|ɜː|v; h|ɪ|z.
+             pp = re.findall(r"\w+|[^\w\s]", word, re.UNICODE)
+             fields.extend(
+                 [p for p in pp if p != self.separator.phone]
+                 + [self.separator.word]
+             )
+         assert len("".join(fields[:-1])) == len(phonemized) - phonemized.count(
+             self.separator.phone
+         )
+         return fields[:-1]
+
+     def __call__(self, text, strip=True) -> List[str]:
+         if isinstance(text, str):
+             text = [text]
+         phones = []
+         for txt in text:
+             if txt == '':
+                 continue
+             if txt[0] == '#':  # pass "#n" prosody tags through untouched
+                 phones.append(txt)
+             else:
+                 ipa = self.backend.phonemize([txt], separator=self.separator, strip=strip, njobs=1)
+                 phones += self.to_list(ipa[0])
+         return phones
+
+
+ def tokenize_text(tokenizer: TextTokenizer, text: str) -> List[str]:
+     phonemes = tokenizer([text.strip()])
+     return phonemes  # k2symbols
+
+
+ _PAUSE_SYMBOL = {'、': ',', ',': ',', '。': ',', '!': '!', '?': '?', ':': ':'}
+
+
+ def _replace(match):
+     word = match.group(0)
+     return _PAUSE_SYMBOL[word]
+
+
+ def txt2phone(tokenizer: TextTokenizer, text: str):
+     text = re.sub('|'.join(_PAUSE_SYMBOL.keys()), _replace, text)
+     text = re.split(r"(#\d)", text)
+     phones = []
+     for txt in text:
+         if txt == '':
+             continue
+         if txt[0] == '#':
+             phones.append(txt)
+         else:
+             ipa = tokenizer.backend.phonemize([txt], separator=tokenizer.separator, strip=True, njobs=1)
+             phones += tokenizer.to_list(ipa[0])
+     phones = "|".join(phones).replace("(|", "(").replace("|)", ")")
+     # phones = ["(cmn)"] + phones.split("|")
+     return phones
+
+
+ def convert_audio(wav: torch.Tensor, sr: int, target_sr: int, target_channels: int):
+     assert wav.shape[0] in [1, 2], "Audio must be mono or stereo."
+     if target_channels == 1:
+         wav = wav.mean(0, keepdim=True)
+     elif target_channels == 2:
+         *shape, _, length = wav.shape
+         wav = wav.expand(*shape, target_channels, length)
+     elif wav.shape[0] == 1:
+         wav = wav.expand(target_channels, -1)
+     wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
+     return wav
+
+
+ class AudioTokenizer:
+     """EnCodec audio tokenizer."""
+
+     def __init__(
+         self,
+         device: Any = None,
+         signature=None,
+     ) -> None:
+         from audiocraft.solvers import CompressionSolver
+         model = CompressionSolver.model_from_checkpoint(signature)
+         self.sample_rate = model.sample_rate
+         self.channels = model.channels
+
+         if not device:
+             device = torch.device("cpu")
+             if torch.cuda.is_available():
+                 device = torch.device("cuda:0")
+
+         self._device = device
+
+         self.codec = model.to(device)
+
+     @property
+     def device(self):
+         return self._device
+
+     def encode(self, wav: torch.Tensor) -> torch.Tensor:
+         codes = self.codec.encode(wav.to(self.device))
+         return [(codes[0], None)]
+
+     def decode(self, frames: torch.Tensor) -> torch.Tensor:
+         frames = frames[0][0]  # [1, 4, T]
+         return self.codec.decode(frames)
+
+
+ def tokenize_audio(tokenizer: AudioTokenizer, audio, offset=-1, num_frames=-1):
+     # Load and pre-process the audio waveform
+     if isinstance(audio, str):
+         if offset != -1 and num_frames != -1:
+             wav, sr = torchaudio.load(audio, frame_offset=offset, num_frames=num_frames)
+         else:
+             wav, sr = torchaudio.load(audio)
+         wav = convert_audio(wav, sr, tokenizer.sample_rate, tokenizer.channels)
+         wav = wav.unsqueeze(0)
+     else:
+         wav = audio.unsqueeze(0).unsqueeze(0)
+     # Extract discrete codes from EnCodec
+     with torch.no_grad():
+         encoded_frames = tokenizer.encode(wav)
+     return encoded_frames
+
+
+ class AudioSR:
+     """Descript Audio Codec (DAC) wrapper used for audio super-resolution."""
+
+     def __init__(
+         self,
+         model_path,
+         device="cpu",
+     ) -> None:
+         import dac
+         self.codec = dac.DAC.load(model_path)
+         self.codec.to(device)
+         self.codec.eval()
+
+         self.sample_rate = self.codec.sample_rate
+         self.channels = 1
+         self._device = device
+
+     @property
+     def device(self):
+         return self._device
+
+     def encode(self, wav: torch.Tensor) -> torch.Tensor:
+         length = wav.shape[-1]
+         right_pad = math.ceil(length / self.codec.hop_length) * self.codec.hop_length - length
+         wav = torch.nn.functional.pad(wav, (0, right_pad))
+         z, codes, _, _, _ = self.codec.encode(wav.to(self._device))
+         return [(codes, z)]
+
+     def decode(self, frames: torch.Tensor) -> torch.Tensor:
+         # decode from the continuous latents rather than the discrete codes
+         z = frames[0][1]  # [1, 2048, T]
+         with torch.no_grad():
+             y = self.codec.decode(z)
+         return y
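A quick usage sketch for the tokenizer above, assuming the `espeak-ng` system package is installed (the phonemizer backend shells out to it); the sample sentences are illustrative only.

```python
from lemas_tts.infer.text_norm.tokenizer import TextTokenizer, txt2phone

tokenizer = TextTokenizer(language="en-us")

# Flat phone list; "#n" prosody tags would be passed through untouched.
print(tokenizer("Hello world."))

# Pipe-joined phone string, with pause punctuation normalized first.
print(txt2phone(tokenizer, "Hello#1 world。"))
```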
lemas_tts/infer/text_norm/txt2pinyin.py ADDED
@@ -0,0 +1,225 @@
+ import multiprocessing
+ from concurrent.futures import ProcessPoolExecutor
+ import argparse
+ import os, re
+ from tqdm import tqdm
+ from pypinyin import Style
+ from pypinyin.contrib.neutral_tone import NeutralToneWith5Mixin
+ from pypinyin.converter import DefaultConverter
+ from pypinyin.core import Pinyin
+ import jieba
+ jieba.set_dictionary(os.path.join(os.path.dirname(__file__), 'jieba_dict.txt'))
+
+ from .symbols import pinyin_dict
+ from .cn_tn import NSWNormalizer
+
+
+ zh_pattern = re.compile("[\u4e00-\u9fa5]")
+ alpha_pattern = re.compile(r"[a-zA-Z]")
+
+
+ def is_zh(word):
+     match = zh_pattern.search(word)
+     return match is not None
+
+
+ def is_alpha(word):
+     match = alpha_pattern.search(word)
+     return match is not None
+
+
+ def get_phoneme_from_char_and_pinyin(chn_char, pinyin):
+     # we do not need #4, use sil to replace it
+     chn_char = chn_char.replace("#4", "")
+     char_len = len(chn_char)
+     i, j = 0, 0
+     result = []
+     while i < char_len:
+         cur_char = chn_char[i]
+         if is_zh(cur_char):
+             if pinyin[j][:-1] == 'n':  # handle the special pinyin of "嗯"
+                 pinyin[j] = 'en' + pinyin[j][-1]
+             if i < len(chn_char) - 2 and is_zh(chn_char[i:i + 3]) and pinyin[j][-1] == pinyin[j + 1][-1] == pinyin[j + 2][-1] == '3':  # tone sandhi for three consecutive third tones
+                 pinyin[j + 1] = pinyin[j + 1][:-1] + '2'
+             if i < len(chn_char) - 1 and pinyin[j][:-1] in pinyin_dict and is_zh(chn_char[i]) and is_zh(chn_char[i + 1]) and pinyin[j][-1] == pinyin[j + 1][-1] == '3':  # tone sandhi for two consecutive third tones
+                 pinyin[j] = pinyin[j][:-1] + '2'
+             if pinyin[j][:-1] not in pinyin_dict:  # handle erhua (rhotic) syllables
+                 assert chn_char[i + 1] == "儿", f"current_char : {cur_char}, next_char: {chn_char[i+1]}, cur_pinyin: {pinyin[j]}"
+                 assert pinyin[j][-2] == "r"
+                 tone = pinyin[j][-1]
+                 a = pinyin[j][:-2]
+                 # a1, a2 = pinyin_dict[a]
+                 # result += [a1, a2 + tone, "er5"]
+                 result += [a + tone, "er5"]
+                 if i + 2 < char_len and chn_char[i + 2] != "#":
+                     result.append("#0")
+                 i += 2
+                 j += 1
+             else:
+                 tone = pinyin[j][-1]
+                 a = pinyin[j][:-1]
+                 a1, a2 = pinyin_dict[a]  # a="wen" -> a1="^", a2="en"
+                 # result += [a1, a2 + tone]  # result = [zh, ong1, ^, en2]
+                 result.append(a + tone)
+                 # if i + 1 < char_len and chn_char[i + 1] != "#":  # append a #0 after each character
+                 #     result.append("#0")
+                 i += 1
+                 j += 1
+
+         # TODO support English alpha
+         # elif is_alpha(cur_char):
+         #     result += ALPHA_PHONE_DICT[cur_char.upper()]
+         #     if i + 1 < char_len and chn_char[i + 1] not in "#、,。!?:":  # append a #0 after each character
+         #         result.append("#0")
+         #     i += 1
+         #     j += 1  # baker alpha dataset "ABC" in pinyin
+         elif cur_char == "#":
+             result.append(chn_char[i:i + 2])
+             i += 2
+         elif cur_char in _PAUSE_SYMBOL:  # insert a pause at punctuation
+             result.pop()  # drop the trailing #0
+             result.append("#3")
+             i += 1
+         else:
+             # ignore the unknown char
+             i += 1
+     if result[-1] == "#0":  # drop the final #0 (instead of appending sil)
+         result = result[:-1]
+     # if result[-1] != "sil":
+     #     result.append("sil")
+     assert j == len(pinyin)
+     return result
+
+
+ # _PAUSE_SYMBOL = {'、', ',', '。', ',', '!', '!', '?', ':', ':', '《', '》', '·', '(', ')', '(', ')'}
+ _PAUSE_SYMBOL = {'.': '.', '、': ',', ',': ',', '。': '.', ',': ',', '!': '!', '!': '!', '?': '?', '?': '?', ':': ',', ':': ',', '——': ','}
+
+
+ class MyConverter(NeutralToneWith5Mixin, DefaultConverter):
+     pass
+
+
+ def checkErHuaYin(text, GT_pinyin):
+     new_pinyin = []
+     check_pattern = re.compile(r"[\t\.\!\?\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()“”:;]+")
+     check_text = check_pattern.sub('', text)
+     if len(check_text) > len(GT_pinyin) and '儿' in check_text:
+         for i in range(len(GT_pinyin)):
+             if GT_pinyin[i][-2] == 'r' and GT_pinyin[i][:2] != 'er' and check_text[i + 1] == '儿':
+                 new_pinyin.append(GT_pinyin[i][:-2] + GT_pinyin[i][-1])
+                 new_pinyin.append('er5')
+                 replace_word = check_text[i:i + 2]
+                 replace_pattern = re.compile(replace_word)
+                 check_text = replace_pattern.sub(replace_word[:-1], check_text, count=1)
+             else:
+                 new_pinyin.append(GT_pinyin[i])
+         GT_pinyin = new_pinyin
+     return GT_pinyin
+
+
+ def change_tone_in_bu_or_yi(chars, pinyin_list):
+     location_yi = [m.start() for m in re.finditer(r'一', chars)]
+     location_bu = [m.start() for m in re.finditer(r'不', chars)]
+     for l in location_yi:
+         if 0 < l < len(chars) - 1 and chars[l - 1] == chars[l + 1]:
+             pinyin_list[l] = 'yi5'
+         elif l < len(chars) - 1 and pinyin_list[l + 1][-1] == '4':
+             pinyin_list[l] = 'yi2'
+     for l in location_bu:
+         if l < len(chars) - 1 and pinyin_list[l + 1][-1] == '4':
+             pinyin_list[l] = 'bu2'
+     return pinyin_list
+
+
+ def txt2pinyin(text, pinyin_parser):
+     phonemes = []
+     text = NSWNormalizer(text.strip()).normalize().upper()
+     texts = text.split(' ')
+     for text in texts:
+         text_list = list(jieba.cut(text))
+         for words in text_list:
+             if words in _PAUSE_SYMBOL:
+                 if phonemes:
+                     phonemes[-1] += _PAUSE_SYMBOL[words]
+             elif re.search("[\u4e00-\u9fa5]+", words):
+                 pinyin = pinyin_parser(words, style=Style.TONE3, errors="ignore")
+                 new_pinyin = []
+                 for x in pinyin:
+                     x = "".join(x)
+                     if "#" not in x:
+                         new_pinyin.append(x)
+                 new_pinyin = change_tone_in_bu_or_yi(words, new_pinyin) if len(words) > 1 and words[-1] not in {"一", "不"} else new_pinyin
+                 phoneme = get_phoneme_from_char_and_pinyin(words, new_pinyin)  # phoneme sequence, e.g. [sil c e4 #0 sh iii4 #0 ^ uen2 #0 b en3 sil], as a list of strings
+                 phonemes += phoneme
+             elif re.search(r"[a-zA-Z]", words):
+                 phonemes.append(words.upper())
+     phones = " ".join(phonemes)
+     return phones
+
+
+ def process_batch(text_list, save_dir):
+     my_pinyin = Pinyin(MyConverter())
+     pinyin_parser = my_pinyin.pinyin
+
+     for text_info in tqdm(text_list):
+         try:
+             name, text = text_info
+             save_path = os.path.join(save_dir, name + ".txt")
+             phones = txt2pinyin(text, pinyin_parser)
+             open(save_path, 'w', encoding='utf-8').write(phones)
+         except Exception as e:
+             print(text_info, e)
+
+
+ def parallel_process(filenames, num_processes, save_dir):
+     with ProcessPoolExecutor(max_workers=num_processes) as executor:
+         tasks = []
+         for i in range(num_processes):
+             start = int(i * len(filenames) / num_processes)
+             end = int((i + 1) * len(filenames) / num_processes)
+             chunk = filenames[start:end]
+             tasks.append(executor.submit(process_batch, chunk, save_dir))
+
+         for task in tqdm(tasks):
+             task.result()
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "--text_file", type=str, default="", help="path to input text file")
+     parser.add_argument(
+         "--save_dir", type=str, default="", help="path to output text file")
+     parser.add_argument(
+         '--workers', type=int, default=4, help='Setting the number of processes to the number of CPU cores is recommended')
+     args = parser.parse_args()
+
+     os.makedirs(args.save_dir, exist_ok=True)
+
+     filenames = open(args.text_file, 'r', encoding='utf-8').readlines()
+     filenames = [x.strip().split('\t') for x in tqdm(filenames)]
+     filenames = [[x[0], x[-1]] for x in filenames]
+     print(len(filenames))
+     multiprocessing.set_start_method("spawn", force=True)
+
+     if args.workers == 0:
+         args.workers = os.cpu_count()
+
+     parallel_process(filenames, args.workers, args.save_dir)
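A minimal usage sketch for the pipeline above (text normalization, jieba segmentation, pypinyin, tone sandhi), assuming the package is importable and `jieba_dict.txt` ships alongside the module; the sentence and the commented output are illustrative only.

```python
from pypinyin.core import Pinyin
from lemas_tts.infer.text_norm.txt2pinyin import MyConverter, txt2pinyin

pinyin_parser = Pinyin(MyConverter()).pinyin  # neutral tone rendered as "5"
print(txt2pinyin("今天天气不错。", pinyin_parser))
# e.g. "jin1 tian1 tian1 qi4 bu2 cuo4." (exact phones depend on segmentation)
```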
lemas_tts/infer/utils_infer.py ADDED
@@ -0,0 +1,651 @@
+ # A unified script for the inference process
+ # Make adjustments inside functions, and consider both the gradio and cli scripts if the function output format needs to change
+ import os
+ import sys
+ from pathlib import Path
+ from concurrent.futures import ThreadPoolExecutor
+
+ os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"  # for MPS device compatibility
+ sys.path.append(f"{os.path.dirname(os.path.abspath(__file__))}/../../third_party/BigVGAN/")
+
+ import hashlib
+ import re
+ import tempfile
+ from importlib.resources import files
+
+ import matplotlib
+
+ matplotlib.use("Agg")
+
+ import matplotlib.pylab as plt
+ import numpy as np
+ import torch
+ import torchaudio
+ import tqdm
+ from huggingface_hub import hf_hub_download
+ from pydub import AudioSegment, silence
+ from transformers import pipeline
+ from vocos import Vocos
+
+ from lemas_tts.model.cfm import CFM
+ from lemas_tts.model.utils import (
+     get_tokenizer,
+     convert_char_to_pinyin,
+ )
+
+
+ def _find_repo_root(start: Path) -> Path:
+     """Locate the repo root by looking for a `pretrained_models` folder upwards."""
+     for p in [start, *start.parents]:
+         if (p / "pretrained_models").is_dir():
+             return p
+     cwd = Path.cwd()
+     if (cwd / "pretrained_models").is_dir():
+         return cwd
+     return start
+
+
+ # Resolve repository layout for pretrained assets when running from the source tree
+ THIS_FILE = Path(__file__).resolve()
+ REPO_ROOT = _find_repo_root(THIS_FILE)
+ PRETRAINED_ROOT = REPO_ROOT / "pretrained_models"
+ CKPTS_ROOT = PRETRAINED_ROOT / "ckpts"
+
+ _ref_audio_cache = {}
+
+ device = (
+     "cuda"
+     if torch.cuda.is_available()
+     else "xpu"
+     if torch.xpu.is_available()
+     else "mps"
+     if torch.backends.mps.is_available()
+     else "cpu"
+ )
+
+ # -----------------------------------------
+
+ target_sample_rate = 24000
+ n_mel_channels = 100
+ hop_length = 256
+ win_length = 1024
+ n_fft = 1024
+ mel_spec_type = "vocos"
+ target_rms = 0.1
+ cross_fade_duration = 0.15
+ ode_method = "euler"
+ nfe_step = 32  # 16, 32
+ cfg_strength = 3.0
+ sway_sampling_coef = 1
+ speed = 1.0
+ fix_duration = None
+
+ # -----------------------------------------
+
+
+ # chunk text into smaller pieces
+
+
+ def chunk_text(text, max_chars=135):
+     """
+     Splits the input text into chunks, each with a maximum number of characters.
+
+     Args:
+         text (str): The text to be split.
+         max_chars (int): The maximum number of characters per chunk.
+
+     Returns:
+         List[str]: A list of text chunks.
+     """
+     chunks = []
+     current_chunk = ""
+     # Split the text into sentences based on punctuation followed by whitespace
+     sentences = re.split(r"(?<=[;:,.!?])\s+|(?<=[;:,。!?])", text)
+
+     for sentence in sentences:
+         if len(current_chunk.encode("utf-8")) + len(sentence.encode("utf-8")) <= max_chars:
+             current_chunk += sentence + " " if sentence and len(sentence[-1].encode("utf-8")) == 1 else sentence
+         else:
+             if current_chunk:
+                 chunks.append(current_chunk.strip())
+             current_chunk = sentence + " " if sentence and len(sentence[-1].encode("utf-8")) == 1 else sentence
+
+     if current_chunk:
+         chunks.append(current_chunk.strip())
+
+     return chunks
+
+
+ # load vocoder
+ def load_vocoder(vocoder_name="vocos", is_local=False, local_path="", device=device, hf_cache_dir=None):
+     if vocoder_name == "vocos":
+         # vocoder = Vocos.from_pretrained("charactr/vocos-mel-24khz").to(device)
+         if is_local:
+             print(f"Load vocos from local path {local_path}")
+             config_path = f"{local_path}/config.yaml"
+             model_path = f"{local_path}/pytorch_model.bin"
+         else:
+             print("Download Vocos from huggingface charactr/vocos-mel-24khz")
+             repo_id = "charactr/vocos-mel-24khz"
+             config_path = hf_hub_download(repo_id=repo_id, cache_dir=hf_cache_dir, filename="config.yaml")
+             model_path = hf_hub_download(repo_id=repo_id, cache_dir=hf_cache_dir, filename="pytorch_model.bin")
+         vocoder = Vocos.from_hparams(config_path)
+         state_dict = torch.load(model_path, map_location="cpu", weights_only=True)
+         from vocos.feature_extractors import EncodecFeatures
+
+         if isinstance(vocoder.feature_extractor, EncodecFeatures):
+             encodec_parameters = {
+                 "feature_extractor.encodec." + key: value
+                 for key, value in vocoder.feature_extractor.encodec.state_dict().items()
+             }
+             state_dict.update(encodec_parameters)
+         vocoder.load_state_dict(state_dict)
+         vocoder = vocoder.eval().to(device)
+     elif vocoder_name == "bigvgan":
+         try:
+             from third_party.BigVGAN import bigvgan
+         except ImportError:
+             print("You need to follow the README to init the submodule and patch the BigVGAN source code.")
+             raise
+         if is_local:
+             # download generator from https://huggingface.co/nvidia/bigvgan_v2_24khz_100band_256x/tree/main
+             vocoder = bigvgan.BigVGAN.from_pretrained(local_path, use_cuda_kernel=False)
+         else:
+             vocoder = bigvgan.BigVGAN.from_pretrained(
+                 "nvidia/bigvgan_v2_24khz_100band_256x", use_cuda_kernel=False, cache_dir=hf_cache_dir
+             )
+
+         vocoder.remove_weight_norm()
+         vocoder = vocoder.eval().to(device)
+     return vocoder
+
+
+ # load asr pipeline
+
+ asr_pipe = None
+
+
+ def initialize_asr_pipeline(device: str = device, dtype=None):
+     if dtype is None:
+         dtype = (
+             torch.float16
+             if "cuda" in device
+             and torch.cuda.get_device_properties(device).major >= 7
+             and not torch.cuda.get_device_name().endswith("[ZLUDA]")
+             else torch.float32
+         )
+     global asr_pipe
+     asr_pipe = pipeline(
+         "automatic-speech-recognition",
+         model="openai/whisper-large-v3-turbo",
+         torch_dtype=dtype,
+         device=device,
+     )
+
+
+ # transcribe
+
+
+ def transcribe(ref_audio, language=None):
+     global asr_pipe
+     if asr_pipe is None:
+         initialize_asr_pipeline(device=device)
+     return asr_pipe(
+         ref_audio,
+         chunk_length_s=30,
+         batch_size=128,
+         generate_kwargs={"task": "transcribe", "language": language} if language else {"task": "transcribe"},
+         return_timestamps=False,
+     )["text"].strip()
+
+
+ # load model checkpoint for inference
+
+
+ def load_checkpoint(model, ckpt_path, device: str, dtype=None, use_ema=True):
+     if dtype is None:
+         dtype = (
+             torch.float16
+             if "cuda" in device
+             and torch.cuda.get_device_properties(device).major >= 7
+             and not torch.cuda.get_device_name().endswith("[ZLUDA]")
+             else torch.float32
+         )
+     model = model.to(dtype)
+
+     ckpt_type = ckpt_path.split(".")[-1]
+     if ckpt_type == "safetensors":
+         from safetensors.torch import load_file
+
+         checkpoint = load_file(ckpt_path, device=device)
+     else:
+         checkpoint = torch.load(ckpt_path, map_location=device, weights_only=True)
+
+     if use_ema:
+         if ckpt_type == "safetensors":
+             checkpoint = {"ema_model_state_dict": checkpoint}
+         checkpoint["model_state_dict"] = {
+             k.replace("ema_model.", ""): v
+             for k, v in checkpoint["ema_model_state_dict"].items()
+             if k not in ["initted", "step"]
+         }
+
+         # patch for backward compatibility, 305e3ea
+         for key in ["mel_spec.mel_stft.mel_scale.fb", "mel_spec.mel_stft.spectrogram.window", "ctc.proj.0.weight", "ctc.proj.0.bias", "ctc.ctc_proj.weight", "ctc.ctc_proj.bias"]:
+             if key in checkpoint["model_state_dict"]:
+                 del checkpoint["model_state_dict"][key]
+
+         model.load_state_dict(checkpoint["model_state_dict"])
+     else:
+         if ckpt_type == "safetensors":
+             checkpoint = {"model_state_dict": checkpoint}
+         model.load_state_dict(checkpoint["model_state_dict"])
+
+     del checkpoint
+     torch.cuda.empty_cache()
+
+     return model.to(device)
+
+
+ # load model for inference
+
+
+ def load_model(
+     model_cls,
+     model_cfg,
+     ckpt_path,
+     mel_spec_type=mel_spec_type,
+     vocab_file="",
+     ode_method=ode_method,
+     use_ema=True,
+     device=device,
+     use_prosody_encoder=False,
+     prosody_cfg_path="",
+     prosody_ckpt_path="",
+ ):
+     if vocab_file == "":
+         vocab_file = str(files("lemas_tts").joinpath("infer/examples/vocab.txt"))
+     tokenizer = "custom"
+
+     print("\nvocab : ", vocab_file)
+     print("token : ", tokenizer)
+     print("model : ", ckpt_path, "\n")
+
+     vocab_char_map, vocab_size = get_tokenizer(vocab_file, tokenizer)
+
+     # Resolve prosody encoder assets if requested but paths not provided
+     if use_prosody_encoder:
+         if not prosody_cfg_path:
+             prosody_cfg_path = str(CKPTS_ROOT / "prosody_encoder" / "pretssel_cfg.json")
+         if not prosody_ckpt_path:
+             prosody_ckpt_path = str(CKPTS_ROOT / "prosody_encoder" / "prosody_encoder_UnitY2.pt")
+     model = CFM(
+         transformer=model_cls(**model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels, use_prosody_encoder=use_prosody_encoder),
+         mel_spec_kwargs=dict(
+             n_fft=n_fft,
+             hop_length=hop_length,
+             win_length=win_length,
+             n_mel_channels=n_mel_channels,
+             target_sample_rate=target_sample_rate,
+             mel_spec_type=mel_spec_type,
+         ),
+         odeint_kwargs=dict(
+             method=ode_method,
+         ),
+         vocab_char_map=vocab_char_map,
+         use_prosody_encoder=use_prosody_encoder,
+         prosody_cfg_path=prosody_cfg_path,
+         prosody_ckpt_path=prosody_ckpt_path,
+     ).to(device)
+
+     dtype = torch.float32 if mel_spec_type == "bigvgan" else None
+     model = load_checkpoint(model, ckpt_path, device, dtype=dtype, use_ema=use_ema)
+
+     return model
+
+
+ def remove_silence_edges(audio, silence_threshold=-42):
+     # Remove silence from the start
+     non_silent_start_idx = silence.detect_leading_silence(audio, silence_threshold=silence_threshold)
+     audio = audio[non_silent_start_idx:]
+
+     # Remove silence from the end
+     non_silent_end_duration = audio.duration_seconds
+     for ms in reversed(audio):
+         if ms.dBFS > silence_threshold:
+             break
+         non_silent_end_duration -= 0.001
+     trimmed_audio = audio[: int(non_silent_end_duration * 1000)]
+
+     return trimmed_audio
+
+
+ # preprocess reference audio and text
+
+
+ def preprocess_ref_audio_text(ref_audio_orig, ref_text, clip_short=True, show_info=print):
+     show_info("Converting audio...")
+     with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
+         aseg = AudioSegment.from_file(ref_audio_orig)
+
+         if clip_short:
+             # 1. try to find long silence for clipping
+             non_silent_segs = silence.split_on_silence(
+                 aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000, seek_step=10
+             )
+             non_silent_wave = AudioSegment.silent(duration=0)
+             for non_silent_seg in non_silent_segs:
+                 if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 12000:
+                     show_info("Audio is over 12s, clipping short. (1)")
+                     break
+                 non_silent_wave += non_silent_seg
+
+             # 2. try to find short silence for clipping if 1. failed
+             if len(non_silent_wave) > 12000:
+                 non_silent_segs = silence.split_on_silence(
+                     aseg, min_silence_len=100, silence_thresh=-40, keep_silence=1000, seek_step=10
+                 )
+                 non_silent_wave = AudioSegment.silent(duration=0)
+                 for non_silent_seg in non_silent_segs:
+                     if len(non_silent_wave) > 6000 and len(non_silent_wave + non_silent_seg) > 12000:
+                         show_info("Audio is over 12s, clipping short. (2)")
+                         break
+                     non_silent_wave += non_silent_seg
+
+             aseg = non_silent_wave
+
+             # 3. if no proper silence found for clipping
+             if len(aseg) > 12000:
+                 aseg = aseg[:12000]
+                 show_info("Audio is over 12s, clipping short. (3)")
+
+         aseg = remove_silence_edges(aseg) + AudioSegment.silent(duration=50)
+         aseg.export(f.name, format="wav")
+         ref_audio = f.name
+
+     # Compute a hash of the reference audio file
+     with open(ref_audio, "rb") as audio_file:
+         audio_data = audio_file.read()
+         audio_hash = hashlib.md5(audio_data).hexdigest()
+
+     if not ref_text.strip():
+         global _ref_audio_cache
+         if audio_hash in _ref_audio_cache:
+             # Use cached asr transcription
+             show_info("Using cached reference text...")
+             ref_text = _ref_audio_cache[audio_hash]
+         else:
+             show_info("No reference text provided, transcribing reference audio...")
+             ref_text = transcribe(ref_audio)
+             # Cache the transcribed text (not caching custom ref_text, enabling users to do manual tweaks)
+             _ref_audio_cache[audio_hash] = ref_text
+     else:
+         show_info("Using custom reference text...")
+
+     # Ensure ref_text ends with proper sentence-ending punctuation
+     if not ref_text.endswith(". ") and not ref_text.endswith("。"):
+         if ref_text.endswith("."):
+             ref_text += " "
+         else:
+             ref_text += ". "
+
+     print("\nref_text  ", ref_text)
+
+     return ref_audio, ref_text
+
+
+ # infer process: chunk text -> infer batches [i.e. infer_batch_process()]
+
+
+ def infer_process(
+     ref_audio,
+     ref_text,
+     gen_text,
+     model_obj,
+     vocoder,
+     mel_spec_type=mel_spec_type,
+     show_info=print,
+     progress=tqdm,
+     target_rms=target_rms,
+     cross_fade_duration=cross_fade_duration,
+     nfe_step=nfe_step,
+     cfg_strength=cfg_strength,
+     sway_sampling_coef=sway_sampling_coef,
+     use_acc_grl=True,
+     use_prosody_encoder=True,
+     ref_ratio=None,
+     no_ref_audio=False,
+     speed=speed,
+     fix_duration=fix_duration,
+     device=device,
+ ):
+     # Split the input text into batches
+     audio, sr = torchaudio.load(ref_audio)
+
+     if isinstance(ref_text, str):
+         max_chars = int(len(ref_text.encode("utf-8")) / (audio.shape[-1] / sr) * (22 - audio.shape[-1] / sr))
+         gen_text_batches = chunk_text(gen_text, max_chars=max_chars)
+     else:
+         gen_text_batches = gen_text
+
+     print("ref_text:", ref_text)
+     for i, gen_text in enumerate(gen_text_batches):
+         print(f"gen_text {i}", gen_text)
+     print("\n")
+
+     show_info(f"Generating audio in {len(gen_text_batches)} batches...")
+     return next(
+         infer_batch_process(
+             (audio, sr),
+             ref_text,
+             gen_text_batches,
+             model_obj,
+             vocoder,
+             mel_spec_type=mel_spec_type,
+             progress=progress,
+             target_rms=target_rms,
+             cross_fade_duration=cross_fade_duration,
+             nfe_step=nfe_step,
+             cfg_strength=cfg_strength,
+             sway_sampling_coef=sway_sampling_coef,
+             use_acc_grl=use_acc_grl,
+             use_prosody_encoder=use_prosody_encoder,
+             ref_ratio=ref_ratio,
+             no_ref_audio=no_ref_audio,
+             speed=speed,
+             fix_duration=fix_duration,
+             device=device,
+         )
+     )
+
+
+ # infer batches
+
+
+ def infer_batch_process(
+     ref_audio,
+     ref_text,
+     gen_text_batches,
+     model_obj,
+     vocoder,
+     mel_spec_type="vocos",
+     progress=tqdm,
+     target_rms=0.1,
+     cross_fade_duration=0.15,
+     nfe_step=32,
+     cfg_strength=2.0,
+     sway_sampling_coef=-1,
+     use_acc_grl=True,
+     use_prosody_encoder=True,
+     ref_ratio=None,
+     no_ref_audio=False,
+     speed=1,
+     fix_duration=None,
+     device=None,
+     streaming=False,
+     chunk_size=2048,
+ ):
+     audio, sr = ref_audio
+     if audio.shape[0] > 1:
+         audio = torch.mean(audio, dim=0, keepdim=True)
+
+     rms = torch.sqrt(torch.mean(torch.square(audio)))
+     if rms < target_rms:
+         audio = audio * target_rms / rms
+     if sr != target_sample_rate:
+         resampler = torchaudio.transforms.Resample(sr, target_sample_rate)
+         audio = resampler(audio)
+     audio = audio.to(device)
+
+     generated_waves = []
+     spectrograms = []
+
+     if isinstance(ref_text, str):
+         if ref_text and len(ref_text[-1].encode("utf-8")) == 1:
+             ref_text = ref_text + " "
+
+     def process_batch(gen_text):
+         local_speed = speed
+
+         if isinstance(ref_text, str):
+             if len(gen_text.encode("utf-8")) < 10:
+                 local_speed = 0.3
+
+             # Prepare the text
+             text_list = [ref_text + gen_text]
+             final_text_list = convert_char_to_pinyin(text_list)
+         else:
+             final_text_list = [ref_text + gen_text]
+         print("final_text_list:", final_text_list)
+
+         ref_audio_len = audio.shape[-1] // hop_length
+         if fix_duration is not None:
+             duration = int(fix_duration * target_sample_rate / hop_length)
+         else:
+             # Calculate duration
+             ref_text_len = len(ref_text)  # .encode("utf-8")
+             gen_text_len = len(gen_text)  # .encode("utf-8")
+             duration = ref_audio_len + int(ref_audio_len / ref_text_len * gen_text_len / local_speed)
+
+         # inference
+         with torch.inference_mode():
+             generated, _ = model_obj.sample(
+                 cond=audio,
+                 text=final_text_list,
+                 duration=duration,
+                 steps=nfe_step,
+                 cfg_strength=cfg_strength,
+                 sway_sampling_coef=sway_sampling_coef,
+                 use_acc_grl=use_acc_grl,
+                 use_prosody_encoder=use_prosody_encoder,
+                 ref_ratio=ref_ratio,
+                 no_ref_audio=no_ref_audio,
+             )
+             del _
+
+             generated = generated.to(torch.float32)  # generated mel spectrogram
+             generated = generated[:, ref_audio_len:, :]
+             generated = generated.permute(0, 2, 1)
+             if mel_spec_type == "vocos":
+                 generated_wave = vocoder.decode(generated)
+             elif mel_spec_type == "bigvgan":
+                 generated_wave = vocoder(generated)
+             if rms < target_rms:
+                 generated_wave = generated_wave * rms / target_rms
+
+             # wav -> numpy
+             # generated_wave = torch.clip(generated_wave, -0.999, 0.999)
+             generated_wave = generated_wave.squeeze().cpu().numpy()
+
+             if streaming:
+                 for j in range(0, len(generated_wave), chunk_size):
+                     yield generated_wave[j : j + chunk_size], target_sample_rate
+             else:
+                 generated_cpu = generated[0].cpu().numpy()
+                 del generated
+                 yield generated_wave, generated_cpu
+
+     if streaming:
+         for gen_text in progress.tqdm(gen_text_batches) if progress is not None else gen_text_batches:
+             for chunk in process_batch(gen_text):
+                 yield chunk
+     else:
+         with ThreadPoolExecutor() as executor:
+             futures = [executor.submit(process_batch, gen_text) for gen_text in gen_text_batches]
+             for future in progress.tqdm(futures) if progress is not None else futures:
+                 result = future.result()
+                 if result:
+                     generated_wave, generated_mel_spec = next(result)
+                     generated_waves.append(generated_wave)
+                     spectrograms.append(generated_mel_spec)
+
+         if generated_waves:
+             if cross_fade_duration <= 0:
+                 # Simply concatenate
+                 final_wave = np.concatenate(generated_waves)
+             else:
+                 # Combine all generated waves with cross-fading
+                 final_wave = generated_waves[0]
+                 for i in range(1, len(generated_waves)):
+                     prev_wave = final_wave
+                     next_wave = generated_waves[i]
+
+                     # Calculate cross-fade samples, ensuring it does not exceed wave lengths
+                     cross_fade_samples = int(cross_fade_duration * target_sample_rate)
+                     cross_fade_samples = min(cross_fade_samples, len(prev_wave), len(next_wave))
+
+                     if cross_fade_samples <= 0:
+                         # No overlap possible, concatenate
+                         final_wave = np.concatenate([prev_wave, next_wave])
+                         continue
+
+                     # Overlapping parts
+                     prev_overlap = prev_wave[-cross_fade_samples:]
+                     next_overlap = next_wave[:cross_fade_samples]
+
+                     # Fade out and fade in
+                     fade_out = np.linspace(1, 0, cross_fade_samples)
+                     fade_in = np.linspace(0, 1, cross_fade_samples)
+
+                     # Cross-faded overlap
+                     cross_faded_overlap = prev_overlap * fade_out + next_overlap * fade_in
+
+                     # Combine
+                     new_wave = np.concatenate(
+                         [prev_wave[:-cross_fade_samples], cross_faded_overlap, next_wave[cross_fade_samples:]]
+                     )
+
+                     final_wave = new_wave
+
+             # Create a combined spectrogram
+             combined_spectrogram = np.concatenate(spectrograms, axis=1)
+             final_wave = np.clip(final_wave, -0.999, 0.999)
+             yield final_wave, target_sample_rate, combined_spectrogram
+
+         else:
+             yield None, target_sample_rate, None
+
+
+ # remove silence from generated wav
+
+
+ def remove_silence_for_generated_wav(filename):
+     aseg = AudioSegment.from_file(filename)
+     non_silent_segs = silence.split_on_silence(
+         aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=500, seek_step=10
+     )
+     non_silent_wave = AudioSegment.silent(duration=0)
+     for non_silent_seg in non_silent_segs:
+         non_silent_wave += non_silent_seg
+     aseg = non_silent_wave
+     aseg.export(filename, format="wav")
+
+
+ # save spectrogram
+
+
+ def save_spectrogram(spectrogram, path):
+     plt.figure(figsize=(12, 4))
+     plt.imshow(spectrogram, origin="lower", aspect="auto")
+     plt.colorbar()
+     plt.savefig(path)
+     plt.close()
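A small usage sketch for `chunk_text` above: the budget is measured in UTF-8 bytes, so CJK characters count as three each. The sample string is illustrative only.

```python
from lemas_tts.infer.utils_infer import chunk_text

for chunk in chunk_text("First sentence. Second one! 这是第三句。", max_chars=20):
    print(repr(chunk))
# Each chunk stays near the byte budget unless a single sentence already exceeds it.
```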
lemas_tts/model/backbones/README.md ADDED
@@ -0,0 +1,20 @@
+ ## Backbones quick introduction
+
+ ### unett.py
+ - flat unet transformer
+ - structure same as in the e2-tts & voicebox papers, except using rotary pos emb
+ - possible abs pos emb & convnextv2 blocks for embedded text before concat
+
+ ### dit.py
+ - adaln-zero dit
+ - embedded timestep as condition
+ - concatted noised_input + masked_cond + embedded_text, linear proj in
+ - possible abs pos emb & convnextv2 blocks for embedded text before concat
+ - possible long skip connection (first layer to last layer)
+
+ ### mmdit.py
+ - stable diffusion 3 block structure
+ - timestep as condition
+ - left stream: text embedded and applied an abs pos emb
+ - right stream: masked_cond & noised_input concatted, with the same conv pos emb as unett
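To make the dit.py bullet concrete, the conditioning inputs are concatenated along the channel axis and linearly projected into the model width, as in this self-contained sketch (the tensor sizes are illustrative, not the model's actual config):

```python
import torch
import torch.nn as nn

mel_dim, text_dim, dim, seq_len = 100, 100, 512, 32
proj = nn.Linear(mel_dim * 2 + text_dim, dim)  # mirrors InputEmbedding.proj below

noised_input = torch.randn(1, seq_len, mel_dim)
masked_cond = torch.randn(1, seq_len, mel_dim)
embedded_text = torch.randn(1, seq_len, text_dim)

x = proj(torch.cat((noised_input, masked_cond, embedded_text), dim=-1))
print(x.shape)  # torch.Size([1, 32, 512])
```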
lemas_tts/model/backbones/dit.py ADDED
@@ -0,0 +1,254 @@
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ from typing import Optional
13
+
14
+ import torch
15
+ from torch import nn
16
+ import torch.nn.functional as F
17
+
18
+ from x_transformers.x_transformers import RotaryEmbedding
19
+
20
+ from lemas_tts.model.modules import (
21
+ TimestepEmbedding,
22
+ ConvNeXtV2Block,
23
+ ConvPositionEmbedding,
24
+ DiTBlock,
25
+ AdaLayerNorm_Final,
26
+ precompute_freqs_cis,
27
+ get_pos_embed_indices,
28
+ )
29
+ from lemas_tts.model.backbones.ecapa_tdnn import ECAPA_TDNN
30
+
31
+ # Text embedding
32
+
33
+
34
+ class TextEmbedding(nn.Module):
35
+ def __init__(self, text_num_embeds, text_dim, mask_padding=True, conv_layers=0, conv_mult=2):
36
+ super().__init__()
37
+ self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
38
+
39
+ self.mask_padding = mask_padding # mask filler and batch padding tokens or not
40
+
41
+ if conv_layers > 0:
42
+ self.extra_modeling = True
43
+ self.precompute_max_pos = 4096 # ~44s of 24khz audio
44
+ self.register_buffer("freqs_cis", precompute_freqs_cis(text_dim, self.precompute_max_pos), persistent=False)
45
+ self.text_blocks = nn.Sequential(
46
+ *[ConvNeXtV2Block(text_dim, text_dim * conv_mult) for _ in range(conv_layers)]
47
+ )
48
+ else:
49
+ self.extra_modeling = False
50
+
51
+ def forward(self, text: int["b nt"], seq_len, drop_text=False): # noqa: F722
52
+ text = text + 1 # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
53
+ text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
54
+ batch, text_len = text.shape[0], text.shape[1]
55
+ text = F.pad(text, (0, seq_len - text_len), value=0)
56
+ if self.mask_padding:
57
+ text_mask = text == 0
58
+
59
+ if drop_text: # cfg for text
60
+ text = torch.zeros_like(text)
61
+
62
+ text = self.text_embed(text) # b n -> b n d
63
+
64
+ # possible extra modeling
65
+ if self.extra_modeling:
66
+ # sinus pos emb
67
+ batch_start = torch.zeros((batch,), dtype=torch.long)
68
+ pos_idx = get_pos_embed_indices(batch_start, seq_len, max_pos=self.precompute_max_pos)
69
+ text_pos_embed = self.freqs_cis[pos_idx]
70
+ text = text + text_pos_embed
71
+
72
+ # convnextv2 blocks
73
+ if self.mask_padding:
74
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
75
+ for block in self.text_blocks:
76
+ text = block(text)
77
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
78
+ else:
79
+ text = self.text_blocks(text)
80
+
81
+ return text
82
+
83
+
84
+ # noised input audio and context mixing embedding
85
+
86
+
87
+ class InputEmbedding(nn.Module):
88
+ def __init__(self, mel_dim, text_dim, out_dim):
89
+ super().__init__()
90
+ self.proj = nn.Linear(mel_dim * 2 + text_dim, out_dim)
91
+ self.conv_pos_embed = ConvPositionEmbedding(dim=out_dim)
92
+
93
+ def forward(self, x: float["b n d"], cond: float["b n d"], text_embed: float["b n d"], drop_audio_cond=False): # noqa: F722
94
+ if drop_audio_cond: # cfg for cond audio
95
+ cond = torch.zeros_like(cond)
96
+
97
+ x = self.proj(torch.cat((x, cond, text_embed), dim=-1))
98
+ x = self.conv_pos_embed(x) + x
99
+ return x
100
+
101
+
102
+ # Transformer backbone using DiT blocks
103
+
104
+
105
+ class DiT(nn.Module):
106
+ def __init__(
107
+ self,
108
+ *,
109
+ dim,
110
+ depth=8,
111
+ heads=8,
112
+ dim_head=64,
113
+ dropout=0.1,
114
+ ff_mult=4,
115
+ mel_dim=100,
116
+ text_num_embeds=256,
117
+ text_dim=None,
118
+ text_mask_padding=True,
119
+ qk_norm=None,
120
+ conv_layers=0,
121
+ pe_attn_head=None,
122
+ long_skip_connection=False,
123
+ checkpoint_activations=False,
124
+ use_prosody_encoder=False,
125
+ ):
126
+ super().__init__()
127
+
128
+ self.time_embed = TimestepEmbedding(dim)
129
+ if text_dim is None:
130
+ text_dim = mel_dim
131
+ self.text_embed = TextEmbedding(
132
+ text_num_embeds, text_dim, mask_padding=text_mask_padding, conv_layers=conv_layers
133
+ )
134
+ # project prosody embeddings (512-dim) to text_dim for conditioning
135
+ self.use_prosody_encoder = use_prosody_encoder
136
+ if use_prosody_encoder:
137
+ self.prosody_text_proj = nn.Linear(512, text_dim)
138
+ else:
139
+ self.prosody_text_proj = None
140
+ self.text_cond, self.text_uncond = None, None # text cache
141
+ self.input_embed = InputEmbedding(mel_dim, text_dim, dim)
142
+
143
+ self.rotary_embed = RotaryEmbedding(dim_head)
144
+
145
+ self.dim = dim
146
+ self.depth = depth
147
+
148
+ self.transformer_blocks = nn.ModuleList(
149
+ [
150
+ DiTBlock(
151
+ dim=dim,
152
+ heads=heads,
153
+ dim_head=dim_head,
154
+ ff_mult=ff_mult,
155
+ dropout=dropout,
156
+ qk_norm=qk_norm,
157
+ pe_attn_head=pe_attn_head,
158
+ )
159
+ for _ in range(depth)
160
+ ]
161
+ )
162
+ self.long_skip_connection = nn.Linear(dim * 2, dim, bias=False) if long_skip_connection else None
163
+
164
+ self.norm_out = AdaLayerNorm_Final(dim) # final modulation
165
+ self.proj_out = nn.Linear(dim, mel_dim)
166
+
167
+ self.checkpoint_activations = checkpoint_activations
168
+
169
+ self.initialize_weights()
170
+
171
+ def initialize_weights(self):
172
+ # Zero-out AdaLN layers in DiT blocks:
173
+ for block in self.transformer_blocks:
174
+ nn.init.constant_(block.attn_norm.linear.weight, 0)
175
+ nn.init.constant_(block.attn_norm.linear.bias, 0)
176
+
177
+ # Zero-out output layers:
178
+ nn.init.constant_(self.norm_out.linear.weight, 0)
179
+ nn.init.constant_(self.norm_out.linear.bias, 0)
180
+ nn.init.constant_(self.proj_out.weight, 0)
181
+ nn.init.constant_(self.proj_out.bias, 0)
182
+
183
+ def ckpt_wrapper(self, module):
184
+ # https://github.com/chuanyangjin/fast-DiT/blob/main/models.py
185
+ def ckpt_forward(*inputs):
186
+ outputs = module(*inputs)
187
+ return outputs
188
+
189
+ return ckpt_forward
190
+
191
+ def clear_cache(self):
192
+ self.text_cond, self.text_uncond = None, None
193
+
194
+ def forward(
195
+ self,
196
+ x: float["b n d"], # nosied input audio # noqa: F722
197
+ cond: float["b n d"], # masked cond audio # noqa: F722
198
+ text: int["b nt"], # text # noqa: F722
199
+ time: float["b"] | float[""], # time step # noqa: F821 F722
200
+ drop_audio_cond, # cfg for cond audio
201
+ drop_text, # cfg for text
202
+ mask: bool["b n"] | None = None, # noqa: F722
203
+ cache=False,
204
+ prosody_text: Optional[torch.Tensor] = None,
205
+ ):
206
+ batch, seq_len = x.shape[0], x.shape[1]
207
+ if time.ndim == 0:
208
+ time = time.repeat(batch)
209
+
210
+ # t: conditioning time, text: text, x: noised audio + cond audio + text
211
+ t = self.time_embed(time)
212
+ if cache:
213
+ if drop_text:
214
+ if self.text_uncond is None:
215
+ self.text_uncond = self.text_embed(text, seq_len, drop_text=True)
216
+ text_embed = self.text_uncond
217
+ else:
218
+ if self.text_cond is None:
219
+ self.text_cond = self.text_embed(text, seq_len, drop_text=False)
220
+ text_embed = self.text_cond
221
+ else:
222
+ text_embed = self.text_embed(text, seq_len, drop_text=drop_text)
223
+
224
+ # optional prosody conditioning on text side
225
+ if prosody_text is not None and self.use_prosody_encoder:
226
+ # prosody_text: (B, T_text, 512) -> project to text_dim and align to seq_len
227
+ pt = self.prosody_text_proj(prosody_text)
228
+ if pt.size(1) < seq_len:
229
+ pad_len = seq_len - pt.size(1)
230
+ pt = F.pad(pt, (0, 0, 0, pad_len))
231
+ elif pt.size(1) > seq_len:
232
+ pt = pt[:, :seq_len]
233
+ text_embed = text_embed + pt
234
+ x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)
235
+
236
+ rope = self.rotary_embed.forward_from_seq_len(seq_len)
237
+
238
+ if self.long_skip_connection is not None:
239
+ residual = x
240
+
241
+ for block in self.transformer_blocks:
242
+ if self.checkpoint_activations:
243
+ # https://pytorch.org/docs/stable/checkpoint.html#torch.utils.checkpoint.checkpoint
244
+ x = torch.utils.checkpoint.checkpoint(self.ckpt_wrapper(block), x, t, mask, rope, use_reentrant=False)
245
+ else:
246
+ x = block(x, t, mask=mask, rope=rope)
247
+
248
+ if self.long_skip_connection is not None:
249
+ x = self.long_skip_connection(torch.cat((x, residual), dim=-1))
250
+
251
+ x = self.norm_out(x, t)
252
+ output = self.proj_out(x)
253
+
254
+ return output
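A minimal forward-pass sketch for the `DiT` backbone above, using a tiny configuration and random tensors purely to illustrate the expected shapes; it assumes the bundled `lemas_tts.model.modules` blocks behave as in the F5-TTS defaults, and real checkpoints use much larger settings.

```python
import torch
from lemas_tts.model.backbones.dit import DiT

model = DiT(dim=64, depth=2, heads=2, dim_head=32, mel_dim=100, text_num_embeds=256)

b, n, nt = 2, 48, 16
x = torch.randn(b, n, 100)             # noised mel input
cond = torch.randn(b, n, 100)          # masked conditioning mel
text = torch.randint(0, 256, (b, nt))  # tokenized text, padded internally to n
time = torch.rand(b)                   # flow-matching timestep

out = model(x, cond, text, time, drop_audio_cond=False, drop_text=False)
print(out.shape)  # torch.Size([2, 48, 100])
```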
lemas_tts/model/backbones/ecapa_tdnn.py ADDED
@@ -0,0 +1,931 @@
+ """A popular speaker recognition and diarization model.
2
+
3
+ Authors
4
+ * Hwidong Na 2020
5
+ """
6
+
7
+ import math
8
+ import os
9
+ import torch # noqa: F401
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+
13
+
14
+ def length_to_mask(length, max_len=None, dtype=None, device=None):
15
+ """Creates a binary mask for each sequence.
16
+
17
+ Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3
18
+
19
+ Arguments
20
+ ---------
21
+ length : torch.LongTensor
22
+ Containing the length of each sequence in the batch. Must be 1D.
23
+ max_len : int
24
+ Max length for the mask, also the size of the second dimension.
25
+ dtype : torch.dtype, default: None
26
+ The dtype of the generated mask.
27
+ device: torch.device, default: None
28
+ The device to put the mask variable.
29
+
30
+ Returns
31
+ -------
32
+ mask : tensor
33
+ The binary mask.
34
+
35
+ Example
36
+ -------
37
+ >>> length=torch.Tensor([1,2,3])
38
+ >>> mask=length_to_mask(length)
39
+ >>> mask
40
+ tensor([[1., 0., 0.],
41
+ [1., 1., 0.],
42
+ [1., 1., 1.]])
43
+ """
44
+ assert len(length.shape) == 1
45
+
46
+ if max_len is None:
47
+ max_len = length.max().long().item() # using arange to generate mask
48
+ mask = torch.arange(max_len, device=length.device, dtype=length.dtype).expand(
49
+ len(length), max_len
50
+ ) < length.unsqueeze(1)
51
+
52
+ if dtype is None:
53
+ dtype = length.dtype
54
+
55
+ if device is None:
56
+ device = length.device
57
+
58
+ mask = torch.as_tensor(mask, dtype=dtype, device=device)
59
+ return mask
60
+
61
+
62
+ def get_padding_elem(L_in: int, stride: int, kernel_size: int, dilation: int):
63
+ """This function computes the number of elements to add for zero-padding.
64
+
65
+ Arguments
66
+ ---------
67
+ L_in : int
68
+ stride: int
69
+ kernel_size : int
70
+ dilation : int
71
+ """
72
+ if stride > 1:
73
+ n_steps = math.ceil(((L_in - kernel_size * dilation) / stride) + 1)
74
+ L_out = stride * (n_steps - 1) + kernel_size * dilation
75
+ padding = [kernel_size // 2, kernel_size // 2]
76
+
77
+ else:
78
+ L_out = (L_in - dilation * (kernel_size - 1) - 1) // stride + 1
79
+
80
+ padding = [(L_in - L_out) // 2, (L_in - L_out) // 2]
81
+ return padding
82
+
83
+
84
+ class Conv1d(nn.Module):
85
+ """This function implements 1d convolution.
86
+
87
+ Arguments
88
+ ---------
89
+ out_channels : int
90
+ It is the number of output channels.
91
+ kernel_size : int
92
+ Kernel size of the convolutional filters.
93
+ input_shape : tuple
94
+ The shape of the input. Alternatively use ``in_channels``.
95
+ in_channels : int
96
+ The number of input channels. Alternatively use ``input_shape``.
97
+ stride : int
98
+ Stride factor of the convolutional filters. When the stride factor > 1,
99
+ a decimation in time is performed.
100
+ dilation : int
101
+ Dilation factor of the convolutional filters.
102
+ padding : str
103
+ (same, valid, causal). If "valid", no padding is performed.
104
+ If "same" and stride is 1, output shape is the same as the input shape.
105
+ "causal" results in causal (dilated) convolutions.
106
+ padding_mode : str
107
+ This flag specifies the type of padding. See torch.nn documentation
108
+ for more information.
109
+ skip_transpose : bool
110
+ If False, uses batch x time x channel convention of speechbrain.
111
+ If True, uses batch x channel x time convention.
112
+
113
+ Example
114
+ -------
115
+ >>> inp_tensor = torch.rand([10, 40, 16])
116
+ >>> cnn_1d = Conv1d(
117
+ ... input_shape=inp_tensor.shape, out_channels=8, kernel_size=5
118
+ ... )
119
+ >>> out_tensor = cnn_1d(inp_tensor)
120
+ >>> out_tensor.shape
121
+ torch.Size([10, 40, 8])
122
+ """
123
+
124
+ def __init__(
125
+ self,
126
+ out_channels,
127
+ kernel_size,
128
+ input_shape=None,
129
+ in_channels=None,
130
+ stride=1,
131
+ dilation=1,
132
+ padding="same",
133
+ groups=1,
134
+ bias=True,
135
+ padding_mode="reflect",
136
+ skip_transpose=True,
137
+ ):
138
+ super().__init__()
139
+ self.kernel_size = kernel_size
140
+ self.stride = stride
141
+ self.dilation = dilation
142
+ self.padding = padding
143
+ self.padding_mode = padding_mode
144
+ self.unsqueeze = False
145
+ self.skip_transpose = skip_transpose
146
+
147
+ if input_shape is None and in_channels is None:
148
+ raise ValueError("Must provide one of input_shape or in_channels")
149
+
150
+ if in_channels is None:
151
+ in_channels = self._check_input_shape(input_shape)
152
+
153
+ self.conv = nn.Conv1d(
154
+ in_channels,
155
+ out_channels,
156
+ self.kernel_size,
157
+ stride=self.stride,
158
+ dilation=self.dilation,
159
+ padding=0,
160
+ groups=groups,
161
+ bias=bias,
162
+ )
163
+
164
+ def forward(self, x):
165
+ """Returns the output of the convolution.
166
+
167
+ Arguments
168
+ ---------
169
+ x : torch.Tensor (batch, time, channel)
170
+ input to convolve. 2d or 4d tensors are expected.
171
+ """
172
+
173
+ if not self.skip_transpose:
174
+ x = x.transpose(1, -1)
175
+
176
+ if self.unsqueeze:
177
+ x = x.unsqueeze(1)
178
+
179
+ if self.padding == "same":
180
+ x = self._manage_padding(x, self.kernel_size, self.dilation, self.stride)
181
+
182
+ elif self.padding == "causal":
183
+ num_pad = (self.kernel_size - 1) * self.dilation
184
+ x = F.pad(x, (num_pad, 0))
185
+
186
+ elif self.padding == "valid":
187
+ pass
188
+
189
+ else:
190
+ raise ValueError(
191
+ "Padding must be 'same', 'valid' or 'causal'. Got " + self.padding
192
+ )
193
+
194
+ wx = self.conv(x.to(self.conv.weight.dtype))
195
+
196
+ if self.unsqueeze:
197
+ wx = wx.squeeze(1)
198
+
199
+ if not self.skip_transpose:
200
+ wx = wx.transpose(1, -1)
201
+
202
+ return wx
203
+
204
+ def _manage_padding(
205
+ self,
206
+ x,
207
+ kernel_size: int,
208
+ dilation: int,
209
+ stride: int,
210
+ ):
211
+ """This function performs zero-padding on the time axis
212
+ such that their lengths is unchanged after the convolution.
213
+
214
+ Arguments
215
+ ---------
216
+ x : torch.Tensor
217
+ Input tensor.
218
+ kernel_size : int
219
+ Size of kernel.
220
+ dilation : int
221
+ Dilation used.
222
+ stride : int
223
+ Stride.
224
+ """
225
+
226
+ # Detecting input shape
227
+ L_in = x.shape[-1]
228
+
229
+ # Time padding
230
+ padding = get_padding_elem(L_in, stride, kernel_size, dilation)
231
+
232
+ # Applying padding
233
+ x = F.pad(x, padding, mode=self.padding_mode)
234
+
235
+ return x
236
+
237
+ def _check_input_shape(self, shape):
238
+ """Checks the input shape and returns the number of input channels."""
239
+
240
+ if len(shape) == 2:
241
+ self.unsqueeze = True
242
+ in_channels = 1
243
+ elif self.skip_transpose:
244
+ in_channels = shape[1]
245
+ elif len(shape) == 3:
246
+ in_channels = shape[2]
247
+ else:
248
+ raise ValueError("conv1d expects 2d, 3d inputs. Got " + str(len(shape)))
249
+
250
+ # Kernel size must be odd
251
+ if self.kernel_size % 2 == 0:
252
+ raise ValueError(
253
+ "The field kernel size must be an odd number. Got %s."
254
+ % (self.kernel_size)
255
+ )
256
+ return in_channels
257
+
258
+
259
+ class Fp32BatchNorm(nn.Module):
260
+ def __init__(self, sync=True, *args, **kwargs):
261
+ super().__init__()
262
+
263
+ if (
264
+ not torch.distributed.is_initialized()
265
+ or torch.distributed.get_world_size() == 1
266
+ ):
267
+ sync = False
268
+
269
+ if sync:
270
+ self.bn = nn.SyncBatchNorm(*args, **kwargs)
271
+ else:
272
+ self.bn = nn.BatchNorm1d(*args, **kwargs)
273
+
274
+ self.sync = sync
275
+
276
+ def forward(self, input):
277
+ if self.bn.running_mean.dtype != torch.float:
278
+ if self.sync:
279
+ self.bn.running_mean = self.bn.running_mean.float()
280
+ self.bn.running_var = self.bn.running_var.float()
281
+ if self.bn.affine:
282
+ try:
283
+ self.bn.weight = self.bn.weight.float()
284
+ self.bn.bias = self.bn.bias.float()
285
+ except:
286
+ self.bn.float()
287
+ else:
288
+ self.bn.float()
289
+
290
+ output = self.bn(input.float())
291
+ return output.type_as(input)
292
+
293
+
294
+ class BatchNorm1d(nn.Module):
295
+ """Applies 1d batch normalization to the input tensor.
296
+
297
+ Arguments
298
+ ---------
299
+ input_shape : tuple
300
+ The expected shape of the input. Alternatively, use ``input_size``.
301
+ input_size : int
302
+ The expected size of the input. Alternatively, use ``input_shape``.
303
+ eps : float
304
+ This value is added to std deviation estimation to improve the numerical
305
+ stability.
306
+ momentum : float
307
+ It is a value used for the running_mean and running_var computation.
308
+ affine : bool
309
+ When set to True, the affine parameters are learned.
310
+ track_running_stats : bool
311
+ When set to True, this module tracks the running mean and variance,
312
+ and when set to False, this module does not track such statistics.
313
+ combine_batch_time : bool
314
+ When true, it combines the batch and time axes.
315
+
316
+
317
+ Example
318
+ -------
319
+ >>> input = torch.randn(100, 10)
320
+ >>> norm = BatchNorm1d(input_shape=input.shape)
321
+ >>> output = norm(input)
322
+ >>> output.shape
323
+ torch.Size([100, 10])
324
+ """
325
+
326
+ def __init__(
327
+ self,
328
+ input_shape=None,
329
+ input_size=None,
330
+ eps=1e-05,
331
+ momentum=0.1,
332
+ affine=True,
333
+ track_running_stats=True,
334
+ combine_batch_time=False,
335
+ skip_transpose=True,
336
+ enabled=True,
337
+ ):
338
+ super().__init__()
339
+ self.combine_batch_time = combine_batch_time
340
+ self.skip_transpose = skip_transpose
341
+
342
+ if input_size is None and skip_transpose:
343
+ input_size = input_shape[1]
344
+ elif input_size is None:
345
+ input_size = input_shape[-1]
346
+
347
+ if enabled:
348
+ self.norm = Fp32BatchNorm(
349
+ num_features=input_size,
350
+ eps=eps,
351
+ momentum=momentum,
352
+ affine=affine,
353
+ track_running_stats=track_running_stats,
354
+ )
355
+ else:
356
+ self.norm = nn.Identity()
357
+
358
+ def forward(self, x):
359
+ """Returns the normalized input tensor.
360
+
361
+ Arguments
362
+ ---------
363
+ x : torch.Tensor (batch, time, [channels])
364
+ input to normalize. 2d or 3d tensors are expected in input
365
+ 4d tensors can be used when combine_dims=True.
366
+ """
367
+ shape_or = x.shape
368
+ if self.combine_batch_time:
369
+ if x.ndim == 3:
370
+ x = x.reshape(shape_or[0] * shape_or[1], shape_or[2])
371
+ else:
372
+ x = x.reshape(shape_or[0] * shape_or[1], shape_or[3], shape_or[2])
373
+
374
+ elif not self.skip_transpose:
375
+ x = x.transpose(-1, 1)
376
+
377
+ x_n = self.norm(x)
378
+
379
+ if self.combine_batch_time:
380
+ x_n = x_n.reshape(shape_or)
381
+ elif not self.skip_transpose:
382
+ x_n = x_n.transpose(1, -1)
383
+
384
+ return x_n
385
+
386
+
387
+ class Linear(torch.nn.Module):
388
+ """Computes a linear transformation y = wx + b.
389
+
390
+ Arguments
391
+ ---------
392
+ n_neurons : int
393
+ It is the number of output neurons (i.e, the dimensionality of the
394
+ output).
395
+ bias : bool
396
+ If True, the additive bias b is adopted.
397
+ combine_dims : bool
398
+ If True and the input is 4D, combine 3rd and 4th dimensions of input.
399
+
400
+ Example
401
+ -------
402
+ >>> inputs = torch.rand(10, 50, 40)
403
+ >>> lin_t = Linear(input_shape=(10, 50, 40), n_neurons=100)
404
+ >>> output = lin_t(inputs)
405
+ >>> output.shape
406
+ torch.Size([10, 50, 100])
407
+ """
408
+
409
+ def __init__(
410
+ self,
411
+ n_neurons,
412
+ input_shape=None,
413
+ input_size=None,
414
+ bias=True,
415
+ combine_dims=False,
416
+ ):
417
+ super().__init__()
418
+ self.combine_dims = combine_dims
419
+
420
+ if input_shape is None and input_size is None:
421
+ raise ValueError("Expected one of input_shape or input_size")
422
+
423
+ if input_size is None:
424
+ input_size = input_shape[-1]
425
+ if len(input_shape) == 4 and self.combine_dims:
426
+ input_size = input_shape[2] * input_shape[3]
427
+
428
+ # Weights are initialized following pytorch approach
429
+ self.w = nn.Linear(input_size, n_neurons, bias=bias)
430
+
431
+ def forward(self, x):
432
+ """Returns the linear transformation of input tensor.
433
+
434
+ Arguments
435
+ ---------
436
+ x : torch.Tensor
437
+ Input to transform linearly.
438
+ """
439
+ if x.ndim == 4 and self.combine_dims:
440
+ x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3])
441
+
442
+ wx = self.w(x)
443
+
444
+ return wx
445
+
446
+
447
+ class TDNNBlock(nn.Module):
448
+ """An implementation of TDNN.
449
+
450
+ Arguments
451
+ ---------
452
+ in_channels : int
453
+ Number of input channels.
454
+ out_channels : int
455
+ The number of output channels.
456
+ kernel_size : int
457
+ The kernel size of the TDNN blocks.
458
+ dilation : int
459
+ The dilation of the TDNN block.
460
+ activation : torch class
461
+ A class for constructing the activation layers.
462
+
463
+ Example
464
+ -------
465
+ >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
466
+ >>> layer = TDNNBlock(64, 64, kernel_size=3, dilation=1)
467
+ >>> out_tensor = layer(inp_tensor).transpose(1, 2)
468
+ >>> out_tensor.shape
469
+ torch.Size([8, 120, 64])
470
+ """
471
+
472
+ def __init__(
473
+ self,
474
+ in_channels,
475
+ out_channels,
476
+ kernel_size,
477
+ dilation,
478
+ activation=nn.ReLU,
479
+ batch_norm=True,
480
+ ):
481
+ super(TDNNBlock, self).__init__()
482
+ self.conv = Conv1d(
483
+ in_channels=in_channels,
484
+ out_channels=out_channels,
485
+ kernel_size=kernel_size,
486
+ dilation=dilation,
487
+ )
488
+ self.activation = activation()
489
+ self.norm = BatchNorm1d(input_size=out_channels, enabled=batch_norm)
490
+
491
+ def forward(self, x):
492
+ return self.norm(self.activation(self.conv(x)))
493
+
494
+
495
+ class Res2NetBlock(torch.nn.Module):
496
+ """An implementation of Res2NetBlock w/ dilation.
497
+
498
+ Arguments
499
+ ---------
500
+ in_channels : int
501
+ The number of channels expected in the input.
502
+ out_channels : int
503
+ The number of output channels.
504
+ scale : int
505
+ The scale of the Res2Net block.
506
+ kernel_size: int
507
+ The kernel size of the Res2Net block.
508
+ dilation : int
509
+ The dilation of the Res2Net block.
510
+
511
+ Example
512
+ -------
513
+ >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
514
+ >>> layer = Res2NetBlock(64, 64, scale=4, dilation=3)
515
+ >>> out_tensor = layer(inp_tensor).transpose(1, 2)
516
+ >>> out_tensor.shape
517
+ torch.Size([8, 120, 64])
518
+ """
519
+
520
+ def __init__(
521
+ self,
522
+ in_channels,
523
+ out_channels,
524
+ scale=8,
525
+ kernel_size=3,
526
+ dilation=1,
527
+ batch_norm=True,
528
+ ):
529
+ super(Res2NetBlock, self).__init__()
530
+ assert in_channels % scale == 0
531
+ assert out_channels % scale == 0
532
+
533
+ in_channel = in_channels // scale
534
+ hidden_channel = out_channels // scale
535
+
536
+ self.blocks = nn.ModuleList(
537
+ [
538
+ TDNNBlock(
539
+ in_channel,
540
+ hidden_channel,
541
+ kernel_size=kernel_size,
542
+ dilation=dilation,
543
+ batch_norm=batch_norm,
544
+ )
545
+ for i in range(scale - 1)
546
+ ]
547
+ )
548
+ self.scale = scale
549
+
550
+ def forward(self, x):
551
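+ # Res2Net hierarchical flow: the input is split channel-wise into `scale`
+ # groups; group 0 passes through unchanged, group 1 is filtered, and each
+ # later group is filtered after adding the previous group's output, so the
+ # effective receptive field grows with the group index.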
+ y = []
552
+ for i, x_i in enumerate(torch.chunk(x, self.scale, dim=1)):
553
+ if i == 0:
554
+ y_i = x_i
555
+ elif i == 1:
556
+ y_i = self.blocks[i - 1](x_i)
557
+ else:
558
+ y_i = self.blocks[i - 1](x_i + y_i)
559
+ y.append(y_i)
560
+ y = torch.cat(y, dim=1)
561
+ return y
562
+
563
+
564
+ class SEBlock(nn.Module):
565
+ """An implementation of squeeze-and-excitation block.
566
+
567
+ Arguments
568
+ ---------
569
+ in_channels : int
570
+ The number of input channels.
571
+ se_channels : int
572
+ The number of output channels after squeeze.
573
+ out_channels : int
574
+ The number of output channels.
575
+
576
+ Example
577
+ -------
578
+ >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
579
+ >>> se_layer = SEBlock(64, 16, 64)
580
+ >>> lengths = torch.rand((8,))
581
+ >>> out_tensor = se_layer(inp_tensor, lengths).transpose(1, 2)
582
+ >>> out_tensor.shape
583
+ torch.Size([8, 120, 64])
584
+ """
585
+
586
+ def __init__(self, in_channels, se_channels, out_channels):
587
+ super(SEBlock, self).__init__()
588
+
589
+ self.conv1 = Conv1d(
590
+ in_channels=in_channels, out_channels=se_channels, kernel_size=1
591
+ )
592
+ self.relu = torch.nn.ReLU(inplace=True)
593
+ self.conv2 = Conv1d(
594
+ in_channels=se_channels, out_channels=out_channels, kernel_size=1
595
+ )
596
+ self.sigmoid = torch.nn.Sigmoid()
597
+
598
+ def forward(self, x, lengths=None):
599
+ L = x.shape[-1]
600
+ if lengths is not None:
601
+ mask = length_to_mask(lengths * L, max_len=L, device=x.device)
602
+ mask = mask.unsqueeze(1)
603
+ total = mask.sum(dim=2, keepdim=True)
604
+ s = (x * mask).sum(dim=2, keepdim=True) / total
605
+ else:
606
+ s = x.mean(dim=2, keepdim=True)
607
+
608
+ s = self.relu(self.conv1(s))
609
+ s = self.sigmoid(self.conv2(s))
610
+
611
+ return s * x
612
+
613
+
614
+ class AttentiveStatisticsPooling(nn.Module):
615
+ """This class implements an attentive statistic pooling layer for each channel.
616
+ It returns the concatenated mean and std of the input tensor.
617
+
618
+ Arguments
619
+ ---------
620
+ channels: int
621
+ The number of input channels.
622
+ attention_channels: int
623
+ The number of attention channels.
624
+
625
+ Example
626
+ -------
627
+ >>> inp_tensor = torch.rand([8, 120, 64]).transpose(1, 2)
628
+ >>> asp_layer = AttentiveStatisticsPooling(64)
629
+ >>> lengths = torch.rand((8,))
630
+ >>> out_tensor = asp_layer(inp_tensor, lengths).transpose(1, 2)
631
+ >>> out_tensor.shape
632
+ torch.Size([8, 1, 128])
633
+ """
634
+
635
+ def __init__(
636
+ self, channels, attention_channels=128, global_context=True, batch_norm=True
637
+ ):
638
+ super().__init__()
639
+
640
+ self.eps = 1e-12
641
+ self.global_context = global_context
642
+ if global_context:
643
+ self.tdnn = TDNNBlock(
644
+ channels * 3, attention_channels, 1, 1, batch_norm=batch_norm
645
+ )
646
+ else:
647
+ self.tdnn = TDNNBlock(
648
+ channels, attention_channels, 1, 1, batch_norm=batch_norm
649
+ )
650
+ self.tanh = nn.Tanh()
651
+ self.conv = Conv1d(
652
+ in_channels=attention_channels, out_channels=channels, kernel_size=1
653
+ )
654
+
655
+ def forward(self, x, lengths=None):
656
+ """Calculates mean and std for a batch (input tensor).
657
+
658
+ Arguments
659
+ ---------
660
+ x : torch.Tensor
661
+ Tensor of shape [N, C, L].
662
+ """
663
+ L = x.shape[-1]
664
+
665
+ def _compute_statistics(x, m, dim=2, eps=self.eps):
666
+ mean = (m * x).sum(dim)
667
+ std = torch.sqrt((m * (x - mean.unsqueeze(dim)).pow(2)).sum(dim).clamp(eps))
668
+ return mean, std
669
+
670
+ if lengths is None:
671
+ lengths = torch.ones(x.shape[0], device=x.device)
672
+
673
+ # Make binary mask of shape [N, 1, L]
674
+ mask = length_to_mask(lengths * L, max_len=L, device=x.device)
675
+ mask = mask.unsqueeze(1)
676
+
677
+ # Expand the temporal context of the pooling layer by allowing the
678
+ # self-attention to look at global properties of the utterance.
679
+ if self.global_context:
680
+ # torch.std is unstable for backward computation
681
+ # https://github.com/pytorch/pytorch/issues/4320
682
+ total = mask.sum(dim=2, keepdim=True).float()
683
+ mean, std = _compute_statistics(x, mask / total)
684
+ mean = mean.unsqueeze(2).repeat(1, 1, L)
685
+ std = std.unsqueeze(2).repeat(1, 1, L)
686
+ attn = torch.cat([x, mean, std], dim=1)
687
+ else:
688
+ attn = x
689
+
690
+ # Apply layers
691
+ attn = self.conv(self.tanh(self.tdnn(attn)))
692
+
693
+ # Filter out zero-paddings
694
+ attn = attn.masked_fill(mask == 0, float("-inf"))
695
+
696
+ attn = F.softmax(attn, dim=2)
697
+ mean, std = _compute_statistics(x, attn)
698
+ # Append mean and std of the batch
699
+ pooled_stats = torch.cat((mean, std), dim=1)
700
+ pooled_stats = pooled_stats.unsqueeze(2)
701
+
702
+ return pooled_stats
703
+
704
+
705
+ class SERes2NetBlock(nn.Module):
706
+ """An implementation of the building block in ECAPA-TDNN, i.e.,
707
+ TDNN-Res2Net-TDNN-SEBlock.
708
+
709
+ Arguments
710
+ ---------
711
+ in_channels: int
+ The number of input channels.
+ out_channels: int
712
+ The number of output channels.
713
+ res2net_scale: int
714
+ The scale of the Res2Net block.
715
+ kernel_size: int
716
+ The kernel size of the TDNN blocks.
717
+ dilation: int
718
+ The dilation of the Res2Net block.
719
+ activation : torch class
720
+ A class for constructing the activation layers.
721
+
722
+ Example
723
+ -------
724
+ >>> x = torch.rand(8, 120, 64).transpose(1, 2)
725
+ >>> conv = SERes2NetBlock(64, 64, res2net_scale=4)
726
+ >>> out = conv(x).transpose(1, 2)
727
+ >>> out.shape
728
+ torch.Size([8, 120, 64])
729
+ """
730
+
731
+ def __init__(
732
+ self,
733
+ in_channels,
734
+ out_channels,
735
+ res2net_scale=8,
736
+ se_channels=128,
737
+ kernel_size=1,
738
+ dilation=1,
739
+ activation=torch.nn.ReLU,
740
+ batch_norm=True,
741
+ ):
742
+ super().__init__()
743
+ self.out_channels = out_channels
744
+ self.tdnn1 = TDNNBlock(
745
+ in_channels,
746
+ out_channels,
747
+ kernel_size=1,
748
+ dilation=1,
749
+ activation=activation,
750
+ batch_norm=batch_norm,
751
+ )
752
+ self.res2net_block = Res2NetBlock(
753
+ out_channels,
754
+ out_channels,
755
+ res2net_scale,
756
+ kernel_size,
757
+ dilation,
758
+ batch_norm=batch_norm,
759
+ )
760
+ self.tdnn2 = TDNNBlock(
761
+ out_channels,
762
+ out_channels,
763
+ kernel_size=1,
764
+ dilation=1,
765
+ activation=activation,
766
+ batch_norm=batch_norm,
767
+ )
768
+ self.se_block = SEBlock(out_channels, se_channels, out_channels)
769
+
770
+ self.shortcut = None
771
+ if in_channels != out_channels:
772
+ self.shortcut = Conv1d(
773
+ in_channels=in_channels,
774
+ out_channels=out_channels,
775
+ kernel_size=1,
776
+ )
777
+
778
+ def forward(self, x, lengths=None):
779
+ residual = x
780
+ if self.shortcut:
781
+ residual = self.shortcut(x)
782
+
783
+ x = self.tdnn1(x)
784
+ x = self.res2net_block(x)
785
+ x = self.tdnn2(x)
786
+ x = self.se_block(x, lengths)
787
+
788
+ return x + residual
789
+
790
+
791
+ class ECAPA_TDNN(torch.nn.Module):
792
+ """An implementation of the speaker embedding model from the paper
793
+ "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in
794
+ TDNN Based Speaker Verification" (https://arxiv.org/abs/2005.07143).
795
+
796
+ Arguments
797
+ ---------
798
+ input_size : int
799
+ Expected feature dimension (number of channels) of the input.
800
+ activation : torch class
801
+ A class for constructing the activation layers.
802
+ channels : list of ints
803
+ Output channels for TDNN/SERes2Net layer.
804
+ kernel_sizes : list of ints
805
+ List of kernel sizes for each layer.
806
+ dilations : list of ints
807
+ List of dilations for kernels in each layer.
808
+ lin_neurons : int
809
+ Number of neurons in linear layers (unused here: the final projection maps back to input_size).
810
+
811
+ Example
812
+ -------
813
+ >>> input_feats = torch.rand([5, 120, 80])
814
+ >>> compute_embedding = ECAPA_TDNN(80, lin_neurons=192)
815
+ >>> outputs = compute_embedding(input_feats)
816
+ >>> outputs.shape
817
+ torch.Size([5, 80])
818
+ """
819
+
820
+ def __init__(
821
+ self,
822
+ input_size,
823
+ lin_neurons=192,
824
+ activation=torch.nn.ReLU,
825
+ channels=[512, 512, 512, 512, 1536],
826
+ kernel_sizes=[5, 3, 3, 3, 1],
827
+ dilations=[1, 2, 3, 4, 1],
828
+ attention_channels=128,
829
+ res2net_scale=8,
830
+ se_channels=128,
831
+ global_context=True,
832
+ batch_norm=True,
833
+ ):
834
+
835
+ super().__init__()
836
+ assert len(channels) == len(kernel_sizes)
837
+ assert len(channels) == len(dilations)
838
+ self.channels = channels
839
+ self.blocks = nn.ModuleList()
840
+
841
+ # The initial TDNN layer
842
+ self.blocks.append(
843
+ TDNNBlock(
844
+ input_size,
845
+ channels[0],
846
+ kernel_sizes[0],
847
+ dilations[0],
848
+ activation,
849
+ batch_norm=batch_norm,
850
+ )
851
+ )
852
+
853
+ # SE-Res2Net layers
854
+ for i in range(1, len(channels) - 1):
855
+ self.blocks.append(
856
+ SERes2NetBlock(
857
+ channels[i - 1],
858
+ channels[i],
859
+ res2net_scale=res2net_scale,
860
+ se_channels=se_channels,
861
+ kernel_size=kernel_sizes[i],
862
+ dilation=dilations[i],
863
+ activation=activation,
864
+ batch_norm=batch_norm,
865
+ )
866
+ )
867
+
868
+ # Multi-layer feature aggregation
869
+ self.mfa = TDNNBlock(
870
+ channels[-1],
871
+ channels[-1],
872
+ kernel_sizes[-1],
873
+ dilations[-1],
874
+ activation,
875
+ batch_norm=batch_norm,
876
+ )
877
+
878
+ # Attentive Statistical Pooling
879
+ self.asp = AttentiveStatisticsPooling(
880
+ channels[-1],
881
+ attention_channels=attention_channels,
882
+ global_context=global_context,
883
+ batch_norm=batch_norm,
884
+ )
885
+ self.asp_bn = BatchNorm1d(input_size=channels[-1] * 2, enabled=batch_norm)
886
+
887
+ # Final linear transformation
888
+ self.fc = Conv1d(
889
+ in_channels=channels[-1] * 2,
890
+ out_channels=input_size, # lin_neurons,
891
+ kernel_size=1,
892
+ )
893
+
894
+ # @torch.cuda.amp.autocast(enabled=True, dtype=torch.float32)
895
+ def forward(self, x, lengths=None):
896
+ """Returns the embedding vector.
897
+
898
+ Arguments
899
+ ---------
900
+ x : torch.Tensor
901
+ Tensor of shape (batch, time, channel).
902
+ """
903
+ # Minimize transpose for efficiency
904
+ x = x.transpose(1, 2)
905
+
906
+ xl = []
907
+ for layer in self.blocks:
908
+ try:
909
+ x = layer(x, lengths=lengths)
910
+ except TypeError:
911
+ x = layer(x)
912
+ xl.append(x)
913
+
914
+ # Multi-layer feature aggregation
915
+ x = torch.cat(xl[1:], dim=1)
916
+ x = self.mfa(x)
917
+
918
+ # Attentive Statistical Pooling
919
+ x = self.asp(x, lengths=lengths)
920
+ x = self.asp_bn(x)
921
+
922
+ # Final linear transformation
923
+ x = self.fc(x)
924
+
925
+ x = x.squeeze(-1)
926
+ return x
927
+
928
+
929
+ if __name__ == "__main__":
930
+ model = ECAPA_TDNN(128, batch_norm=False)
931
+ # print(model)
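A minimal usage sketch of the module above (shapes follow the docstring examples; note the final projection maps back to `input_size`, so `lin_neurons` is effectively unused here):

```python
import torch
from lemas_tts.model.backbones.ecapa_tdnn import ECAPA_TDNN

model = ECAPA_TDNN(input_size=80, batch_norm=False)
feats = torch.rand(4, 120, 80)   # (batch, time, mel channels)
emb = model(feats)               # ASP pools over time; fc projects back to input_size
print(emb.shape)                 # expected: torch.Size([4, 80])
```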
lemas_tts/model/backbones/mmdit.py ADDED
@@ -0,0 +1,189 @@
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ import torch
13
+ from torch import nn
14
+
15
+ from x_transformers.x_transformers import RotaryEmbedding
16
+
17
+ from lemas_tts.model.modules import (
18
+ TimestepEmbedding,
19
+ ConvPositionEmbedding,
20
+ MMDiTBlock,
21
+ AdaLayerNorm_Final,
22
+ precompute_freqs_cis,
23
+ get_pos_embed_indices,
24
+ )
25
+
26
+
27
+ # text embedding
28
+
29
+
30
+ class TextEmbedding(nn.Module):
31
+ def __init__(self, out_dim, text_num_embeds, mask_padding=True):
32
+ super().__init__()
33
+ self.text_embed = nn.Embedding(text_num_embeds + 1, out_dim) # will use 0 as filler token
34
+
35
+ self.mask_padding = mask_padding # mask filler and batch padding tokens or not
36
+
37
+ self.precompute_max_pos = 1024
38
+ self.register_buffer("freqs_cis", precompute_freqs_cis(out_dim, self.precompute_max_pos), persistent=False)
39
+
40
+ def forward(self, text: int["b nt"], drop_text=False) -> int["b nt d"]: # noqa: F722
41
+ text = text + 1 # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
42
+ if self.mask_padding:
43
+ text_mask = text == 0
44
+
45
+ if drop_text: # cfg for text
46
+ text = torch.zeros_like(text)
47
+
48
+ text = self.text_embed(text) # b nt -> b nt d
49
+
50
+ # sinus pos emb
51
+ batch_start = torch.zeros((text.shape[0],), dtype=torch.long)
52
+ batch_text_len = text.shape[1]
53
+ pos_idx = get_pos_embed_indices(batch_start, batch_text_len, max_pos=self.precompute_max_pos)
54
+ text_pos_embed = self.freqs_cis[pos_idx]
55
+
56
+ text = text + text_pos_embed
57
+
58
+ if self.mask_padding:
59
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
60
+
61
+ return text
62
+
63
+
64
+ # noised input & masked cond audio embedding
65
+
66
+
67
+ class AudioEmbedding(nn.Module):
68
+ def __init__(self, in_dim, out_dim):
69
+ super().__init__()
70
+ self.linear = nn.Linear(2 * in_dim, out_dim)
71
+ self.conv_pos_embed = ConvPositionEmbedding(out_dim)
72
+
73
+ def forward(self, x: float["b n d"], cond: float["b n d"], drop_audio_cond=False): # noqa: F722
74
+ if drop_audio_cond:
75
+ cond = torch.zeros_like(cond)
76
+ x = torch.cat((x, cond), dim=-1)
77
+ x = self.linear(x)
78
+ x = self.conv_pos_embed(x) + x
79
+ return x
80
+
81
+
82
+ # Transformer backbone using MM-DiT blocks
83
+
84
+
85
+ class MMDiT(nn.Module):
86
+ def __init__(
87
+ self,
88
+ *,
89
+ dim,
90
+ depth=8,
91
+ heads=8,
92
+ dim_head=64,
93
+ dropout=0.1,
94
+ ff_mult=4,
95
+ mel_dim=100,
96
+ text_num_embeds=256,
97
+ text_mask_padding=True,
98
+ qk_norm=None,
99
+ ):
100
+ super().__init__()
101
+
102
+ self.time_embed = TimestepEmbedding(dim)
103
+ self.text_embed = TextEmbedding(dim, text_num_embeds, mask_padding=text_mask_padding)
104
+ self.text_cond, self.text_uncond = None, None # text cache
105
+ self.audio_embed = AudioEmbedding(mel_dim, dim)
106
+
107
+ self.rotary_embed = RotaryEmbedding(dim_head)
108
+
109
+ self.dim = dim
110
+ self.depth = depth
111
+
112
+ self.transformer_blocks = nn.ModuleList(
113
+ [
114
+ MMDiTBlock(
115
+ dim=dim,
116
+ heads=heads,
117
+ dim_head=dim_head,
118
+ dropout=dropout,
119
+ ff_mult=ff_mult,
120
+ context_pre_only=i == depth - 1,
121
+ qk_norm=qk_norm,
122
+ )
123
+ for i in range(depth)
124
+ ]
125
+ )
126
+ self.norm_out = AdaLayerNorm_Final(dim) # final modulation
127
+ self.proj_out = nn.Linear(dim, mel_dim)
128
+
129
+ self.initialize_weights()
130
+
131
+ def initialize_weights(self):
132
+ # Zero-out AdaLN layers in MMDiT blocks:
133
+ for block in self.transformer_blocks:
134
+ nn.init.constant_(block.attn_norm_x.linear.weight, 0)
135
+ nn.init.constant_(block.attn_norm_x.linear.bias, 0)
136
+ nn.init.constant_(block.attn_norm_c.linear.weight, 0)
137
+ nn.init.constant_(block.attn_norm_c.linear.bias, 0)
138
+
139
+ # Zero-out output layers:
140
+ nn.init.constant_(self.norm_out.linear.weight, 0)
141
+ nn.init.constant_(self.norm_out.linear.bias, 0)
142
+ nn.init.constant_(self.proj_out.weight, 0)
143
+ nn.init.constant_(self.proj_out.bias, 0)
144
+
145
+ def clear_cache(self):
146
+ self.text_cond, self.text_uncond = None, None
147
+
148
+ def forward(
149
+ self,
150
+ x: float["b n d"], # noised input audio # noqa: F722
151
+ cond: float["b n d"], # masked cond audio # noqa: F722
152
+ text: int["b nt"], # text # noqa: F722
153
+ time: float["b"] | float[""], # time step # noqa: F821 F722
154
+ drop_audio_cond, # cfg for cond audio
155
+ drop_text, # cfg for text
156
+ mask: bool["b n"] | None = None, # noqa: F722
157
+ cache=False,
158
+ ):
159
+ batch = x.shape[0]
160
+ if time.ndim == 0:
161
+ time = time.repeat(batch)
162
+
163
+ # t: conditioning (time), c: context (text only), x: noised input audio fused with masked cond audio
164
+ t = self.time_embed(time)
165
+ if cache:
166
+ if drop_text:
167
+ if self.text_uncond is None:
168
+ self.text_uncond = self.text_embed(text, drop_text=True)
169
+ c = self.text_uncond
170
+ else:
171
+ if self.text_cond is None:
172
+ self.text_cond = self.text_embed(text, drop_text=False)
173
+ c = self.text_cond
174
+ else:
175
+ c = self.text_embed(text, drop_text=drop_text)
176
+ x = self.audio_embed(x, cond, drop_audio_cond=drop_audio_cond)
177
+
178
+ seq_len = x.shape[1]
179
+ text_len = text.shape[1]
180
+ rope_audio = self.rotary_embed.forward_from_seq_len(seq_len)
181
+ rope_text = self.rotary_embed.forward_from_seq_len(text_len)
182
+
183
+ for block in self.transformer_blocks:
184
+ c, x = block(x, c, t, mask=mask, rope=rope_audio, c_rope=rope_text)
185
+
186
+ x = self.norm_out(x, t)
187
+ output = self.proj_out(x)
188
+
189
+ return output
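The `cache` flag above lets a sampler reuse the conditional and unconditional text embeddings across ODE steps. A hedged sketch of one classifier-free-guidance step driving this forward (`model`, the tensors, and `cfg_strength` are placeholders, not defined in this file; the guidance rule shown is the common F5-style one, an assumption here):

```python
# Two forward passes per step; text embeddings are computed once via cache=True.
pred_cond = model(x, cond, text, t, drop_audio_cond=False, drop_text=False, cache=True)
pred_null = model(x, cond, text, t, drop_audio_cond=True, drop_text=True, cache=True)
pred = pred_cond + (pred_cond - pred_null) * cfg_strength
model.clear_cache()  # drop cached text embeddings before the next utterance
```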
lemas_tts/model/backbones/prosody_encoder.py ADDED
@@ -0,0 +1,433 @@
1
+ """
2
+ Prosody encoder backbone based on the Pretssel ECAPA-TDNN architecture.
3
+
4
+ This module provides:
5
+ - ProsodyEncoder: wraps an ECAPA-TDNN model to produce utterance-level
6
+ prosody embeddings from 80-dim FBANK features.
7
+ - extract_fbank_16k: utility to compute 80-bin FBANK from 16kHz audio.
8
+
9
+ It is self-contained (no fairseq2 dependency) and can be used inside
10
+ CFM or other models as a conditioning network.
11
+ """
12
+
13
+ from __future__ import annotations
14
+
15
+ from pathlib import Path
16
+ from typing import List, Optional, Tuple
17
+ import json
18
+
19
+ import torch
20
+ import torchaudio
21
+ from torch import Tensor
22
+ from torch import nn
23
+ from torch.nn import Conv1d, LayerNorm, Module, ModuleList, ReLU, Sigmoid, Tanh, init
24
+ import torch.nn.functional as F
25
+
26
+
27
+ AUDIO_SAMPLE_RATE = 16_000
28
+
29
+
30
+ class ECAPA_TDNN(Module):
31
+ """
32
+ ECAPA-TDNN core used in the Pretssel prosody encoder.
33
+
34
+ Expects input features of shape (B, T, C) with C=80 and returns
35
+ a normalized embedding of shape (B, embed_dim).
36
+ """
37
+
38
+ def __init__(
39
+ self,
40
+ channels: List[int],
41
+ kernel_sizes: List[int],
42
+ dilations: List[int],
43
+ attention_channels: int,
44
+ res2net_scale: int,
45
+ se_channels: int,
46
+ global_context: bool,
47
+ groups: List[int],
48
+ embed_dim: int,
49
+ input_dim: int,
50
+ ):
51
+ super().__init__()
52
+ assert len(channels) == len(kernel_sizes) == len(dilations)
53
+ self.channels = channels
54
+ self.embed_dim = embed_dim
55
+ self.blocks = ModuleList()
56
+
57
+ self.blocks.append(
58
+ TDNNBlock(
59
+ input_dim,
60
+ channels[0],
61
+ kernel_sizes[0],
62
+ dilations[0],
63
+ groups[0],
64
+ )
65
+ )
66
+
67
+ for i in range(1, len(channels) - 1):
68
+ self.blocks.append(
69
+ SERes2NetBlock(
70
+ channels[i - 1],
71
+ channels[i],
72
+ res2net_scale=res2net_scale,
73
+ se_channels=se_channels,
74
+ kernel_size=kernel_sizes[i],
75
+ dilation=dilations[i],
76
+ groups=groups[i],
77
+ )
78
+ )
79
+
80
+ self.mfa = TDNNBlock(
81
+ channels[-1],
82
+ channels[-1],
83
+ kernel_sizes[-1],
84
+ dilations[-1],
85
+ groups=groups[-1],
86
+ )
87
+
88
+ self.asp = AttentiveStatisticsPooling(
89
+ channels[-1],
90
+ attention_channels=attention_channels,
91
+ global_context=global_context,
92
+ )
93
+ self.asp_norm = LayerNorm(channels[-1] * 2, eps=1e-12)
94
+
95
+ self.fc = Conv1d(
96
+ in_channels=channels[-1] * 2,
97
+ out_channels=embed_dim,
98
+ kernel_size=1,
99
+ )
100
+
101
+ self.reset_parameters()
102
+
103
+ def reset_parameters(self) -> None:
104
+ def encoder_init(m: Module) -> None:
105
+ if isinstance(m, Conv1d):
106
+ init.xavier_uniform_(m.weight, init.calculate_gain("relu"))
107
+
108
+ self.apply(encoder_init)
109
+
110
+ def forward(
111
+ self,
112
+ x: Tensor,
113
+ padding_mask: Optional[Tensor] = None,
114
+ ) -> Tensor:
115
+ # x: (B, T, C)
116
+ x = x.transpose(1, 2) # (B, C, T)
117
+
118
+ xl = []
119
+ for layer in self.blocks:
120
+ x = layer(x, padding_mask=padding_mask)
121
+ xl.append(x)
122
+
123
+ x = torch.cat(xl[1:], dim=1)
124
+ x = self.mfa(x)
125
+
126
+ x = self.asp(x, padding_mask=padding_mask)
127
+ x = self.asp_norm(x.transpose(1, 2)).transpose(1, 2)
128
+
129
+ x = self.fc(x)
130
+
131
+ x = x.transpose(1, 2).squeeze(1) # (B, embed_dim)
132
+ return F.normalize(x, dim=-1)
133
+
134
+
135
+ class TDNNBlock(Module):
136
+ def __init__(
137
+ self,
138
+ in_channels: int,
139
+ out_channels: int,
140
+ kernel_size: int,
141
+ dilation: int,
142
+ groups: int = 1,
143
+ ):
144
+ super().__init__()
145
+ self.conv = Conv1d(
146
+ in_channels=in_channels,
147
+ out_channels=out_channels,
148
+ kernel_size=kernel_size,
149
+ dilation=dilation,
150
+ padding=dilation * (kernel_size - 1) // 2,
151
+ groups=groups,
152
+ )
153
+ self.activation = ReLU()
154
+ self.norm = LayerNorm(out_channels, eps=1e-12)
155
+
156
+ def forward(self, x: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
157
+ x = self.activation(self.conv(x))
158
+ return self.norm(x.transpose(1, 2)).transpose(1, 2)
159
+
160
+
161
+ class Res2NetBlock(Module):
162
+ def __init__(
163
+ self,
164
+ in_channels: int,
165
+ out_channels: int,
166
+ scale: int = 8,
167
+ kernel_size: int = 3,
168
+ dilation: int = 1,
169
+ ):
170
+ super().__init__()
171
+ assert in_channels % scale == 0
172
+ assert out_channels % scale == 0
173
+
174
+ in_channel = in_channels // scale
175
+ hidden_channel = out_channels // scale
176
+ self.blocks = ModuleList(
177
+ [
178
+ TDNNBlock(
179
+ in_channel,
180
+ hidden_channel,
181
+ kernel_size=kernel_size,
182
+ dilation=dilation,
183
+ )
184
+ for _ in range(scale - 1)
185
+ ]
186
+ )
187
+ self.scale = scale
188
+
189
+ def forward(self, x: Tensor) -> Tensor:
190
+ y = []
191
+ for i, x_i in enumerate(torch.chunk(x, self.scale, dim=1)):
192
+ if i == 0:
193
+ y_i = x_i
194
+ elif i == 1:
195
+ y_i = self.blocks[i - 1](x_i)
196
+ else:
197
+ y_i = self.blocks[i - 1](x_i + y_i)
198
+ y.append(y_i)
199
+ return torch.cat(y, dim=1)
200
+
201
+
202
+ class SEBlock(Module):
203
+ def __init__(
204
+ self,
205
+ in_channels: int,
206
+ se_channels: int,
207
+ out_channels: int,
208
+ ):
209
+ super().__init__()
210
+ self.conv1 = Conv1d(in_channels=in_channels, out_channels=se_channels, kernel_size=1)
211
+ self.relu = ReLU(inplace=True)
212
+ self.conv2 = Conv1d(in_channels=se_channels, out_channels=out_channels, kernel_size=1)
213
+ self.sigmoid = Sigmoid()
214
+
215
+ def forward(self, x: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
216
+ if padding_mask is not None:
217
+ # padding_mask: (B, T) with 1 for valid, 0 for pad
218
+ mask = padding_mask.unsqueeze(1) # (B, 1, T)
219
+ lengths = mask.sum(dim=2, keepdim=True)
220
+ s = (x * mask).sum(dim=2, keepdim=True) / torch.clamp(lengths, min=1.0)
221
+ else:
222
+ s = x.mean(dim=2, keepdim=True)
223
+
224
+ s = self.relu(self.conv1(s))
225
+ s = self.sigmoid(self.conv2(s))
226
+ return s * x
227
+
228
+
229
+ class AttentiveStatisticsPooling(Module):
230
+ def __init__(
231
+ self, channels: int, attention_channels: int = 128, global_context: bool = True
232
+ ):
233
+ super().__init__()
234
+ self.eps = 1e-12
235
+ self.global_context = global_context
236
+ if global_context:
237
+ self.tdnn = TDNNBlock(channels * 3, attention_channels, 1, 1)
238
+ else:
239
+ self.tdnn = TDNNBlock(channels, attention_channels, 1, 1)
240
+
241
+ self.tanh = Tanh()
242
+ self.conv = Conv1d(in_channels=attention_channels, out_channels=channels, kernel_size=1)
243
+
244
+ def forward(self, x: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
245
+ # x: (N, C, L)
246
+ N, C, L = x.shape
247
+
248
+ def _compute_statistics(
249
+ x: Tensor, m: Tensor, dim: int = 2, eps: float = 1e-12
250
+ ) -> Tuple[Tensor, Tensor]:
251
+ mean = (m * x).sum(dim)
252
+ std = torch.sqrt((m * (x - mean.unsqueeze(dim)).pow(2)).sum(dim).clamp(eps))
253
+ return mean, std
254
+
255
+ if padding_mask is not None:
256
+ mask = padding_mask
257
+ else:
258
+ mask = torch.ones(N, L, device=x.device, dtype=x.dtype)
259
+ mask = mask.unsqueeze(1) # (N, 1, L)
260
+
261
+ if self.global_context:
262
+ total = mask.sum(dim=2, keepdim=True).to(x)
263
+ mean, std = _compute_statistics(x, mask / total)
264
+ mean = mean.unsqueeze(2).repeat(1, 1, L)
265
+ std = std.unsqueeze(2).repeat(1, 1, L)
266
+ attn = torch.cat([x, mean, std], dim=1)
267
+ else:
268
+ attn = x
269
+
270
+ attn = self.conv(self.tanh(self.tdnn(attn)))
271
+
272
+ attn = attn.masked_fill(mask == 0, float("-inf"))
273
+
274
+ attn = F.softmax(attn, dim=2)
275
+ mean, std = _compute_statistics(x, attn)
276
+ pooled_stats = torch.cat((mean, std), dim=1)
277
+ pooled_stats = pooled_stats.unsqueeze(2)
278
+ return pooled_stats
279
+
280
+
281
+ class SERes2NetBlock(Module):
282
+ def __init__(
283
+ self,
284
+ in_channels: int,
285
+ out_channels: int,
286
+ res2net_scale: int = 8,
287
+ se_channels: int = 128,
288
+ kernel_size: int = 1,
289
+ dilation: int = 1,
290
+ groups: int = 1,
291
+ ):
292
+ super().__init__()
293
+ self.out_channels = out_channels
294
+ self.tdnn1 = TDNNBlock(
295
+ in_channels,
296
+ out_channels,
297
+ kernel_size=1,
298
+ dilation=1,
299
+ groups=groups,
300
+ )
301
+ self.res2net_block = Res2NetBlock(
302
+ out_channels,
303
+ out_channels,
304
+ res2net_scale,
305
+ kernel_size,
306
+ dilation,
307
+ )
308
+ self.tdnn2 = TDNNBlock(
309
+ out_channels,
310
+ out_channels,
311
+ kernel_size=1,
312
+ dilation=1,
313
+ groups=groups,
314
+ )
315
+ self.se_block = SEBlock(out_channels, se_channels, out_channels)
316
+
317
+ self.shortcut = None
318
+ if in_channels != out_channels:
319
+ self.shortcut = Conv1d(
320
+ in_channels=in_channels,
321
+ out_channels=out_channels,
322
+ kernel_size=1,
323
+ )
324
+
325
+ def forward(self, x: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
326
+ residual = x
327
+ if self.shortcut:
328
+ residual = self.shortcut(x)
329
+
330
+ x = self.tdnn1(x)
331
+ x = self.res2net_block(x)
332
+ x = self.tdnn2(x)
333
+ x = self.se_block(x, padding_mask=padding_mask)
334
+ return x + residual
335
+
336
+
337
+ def extract_fbank_16k(audio_16k: Tensor) -> Tensor:
338
+ """
339
+ Compute 80-dim FBANK features from 16kHz audio.
340
+
341
+ Args:
342
+ audio_16k: Tensor of shape (T,) or (1, T)
343
+ Returns:
344
+ fbank: Tensor of shape (T_fbank, 80)
345
+ """
346
+ if audio_16k.ndim == 1:
347
+ audio_16k = audio_16k.unsqueeze(0)
348
+
349
+ # Ensure minimum length for kaldi.fbank window (default 25ms @16k -> 400 samples)
350
+ min_len = 400
351
+
352
+ if audio_16k.shape[-1] < min_len:
353
+ repeat_times = (min_len // audio_16k.shape[-1]) + 1
354
+ audio_16k = audio_16k.repeat(1, repeat_times) if audio_16k.dim() > 1 else audio_16k.repeat(repeat_times)
355
+
356
+ fbank = torchaudio.compliance.kaldi.fbank(
357
+ audio_16k,
358
+ num_mel_bins=80,
359
+ sample_frequency=AUDIO_SAMPLE_RATE,
360
+ )
361
+ return fbank
362
+
363
+
364
+ class ProsodyEncoder(nn.Module):
365
+ """
366
+ High-level wrapper for the Pretssel prosody encoder.
367
+
368
+ Usage:
369
+ encoder = ProsodyEncoder(cfg_path, ckpt_path, freeze=True)
370
+ emb = encoder(fbank_batch) # (B, 512)
371
+ """
372
+
373
+ def __init__(self, cfg_path: Path, ckpt_path: Path, freeze: bool = True):
374
+ super().__init__()
375
+ model_cfg = self._load_pretssel_model_cfg(cfg_path)
376
+ self.encoder = self._build_prosody_encoder(model_cfg)
377
+ self._load_prosody_encoder_state(self.encoder, ckpt_path)
378
+ if freeze:
379
+ for p in self.encoder.parameters():
380
+ p.requires_grad = False
381
+
382
+ @staticmethod
383
+ def _load_pretssel_model_cfg(cfg_path: Path) -> dict:
384
+ cfg = json.loads(cfg_path.read_text())
385
+ if "model" not in cfg:
386
+ raise ValueError(f"{cfg_path} does not contain a top-level 'model' key.")
387
+ return cfg["model"]
388
+
389
+ @staticmethod
390
+ def _build_prosody_encoder(model_cfg: dict) -> ECAPA_TDNN:
391
+ encoder = ECAPA_TDNN(
392
+ channels=model_cfg["prosody_channels"],
393
+ kernel_sizes=model_cfg["prosody_kernel_sizes"],
394
+ dilations=model_cfg["prosody_dilations"],
395
+ attention_channels=model_cfg["prosody_attention_channels"],
396
+ res2net_scale=model_cfg["prosody_res2net_scale"],
397
+ se_channels=model_cfg["prosody_se_channels"],
398
+ global_context=model_cfg["prosody_global_context"],
399
+ groups=model_cfg["prosody_groups"],
400
+ embed_dim=model_cfg["prosody_embed_dim"],
401
+ input_dim=model_cfg["input_feat_per_channel"],
402
+ )
403
+ return encoder
404
+
405
+ @staticmethod
406
+ def _load_prosody_encoder_state(model: Module, ckpt_path: Path) -> None:
407
+ state = torch.load(ckpt_path, map_location="cpu")
408
+ if isinstance(state, dict):
409
+ if all(isinstance(k, str) for k in state.keys()) and (
410
+ any(k.startswith("prosody_encoder.") for k in state.keys())
411
+ or any(k.startswith("prosody_encoder_model.") for k in state.keys())
412
+ ):
413
+ state = {
414
+ k.replace("prosody_encoder_model.", "", 1).replace("prosody_encoder.", "", 1): v
415
+ for k, v in state.items()
416
+ if k.startswith("prosody_encoder.") or k.startswith("prosody_encoder_model.")
417
+ }
418
+ missing, unexpected = model.load_state_dict(state, strict=False)
419
+ if missing or unexpected:
420
+ raise RuntimeError(
421
+ f"Error loading checkpoint {ckpt_path}: missing keys={missing}, "
422
+ f"unexpected keys={unexpected}"
423
+ )
424
+
425
+ def forward(self, fbank: Tensor, padding_mask: Optional[Tensor] = None) -> Tensor:
426
+ """
427
+ Args:
428
+ fbank: Tensor of shape (B, T, 80)
429
+ padding_mask: Optional tensor of shape (B, T) with 1 for valid.
430
+ Returns:
431
+ emb: Tensor of shape (B, 512)
432
+ """
433
+ return self.encoder(fbank, padding_mask=padding_mask)
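A minimal sketch tying `extract_fbank_16k` and `ProsodyEncoder` together; the config/checkpoint paths and the reference wav below are placeholders:

```python
import torchaudio
from pathlib import Path
from lemas_tts.model.backbones.prosody_encoder import (
    ProsodyEncoder, extract_fbank_16k, AUDIO_SAMPLE_RATE,
)

encoder = ProsodyEncoder(Path("pretssel_cfg.json"), Path("prosody_encoder.pt"), freeze=True)
wav, sr = torchaudio.load("reference.wav")
wav_16k = torchaudio.functional.resample(wav, sr, AUDIO_SAMPLE_RATE)
fbank = extract_fbank_16k(wav_16k[0])  # (T_fbank, 80)
emb = encoder(fbank.unsqueeze(0))      # (1, embed_dim), L2-normalized
```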
lemas_tts/model/backbones/unett.py ADDED
@@ -0,0 +1,250 @@
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+ from typing import Literal
12
+
13
+ import torch
14
+ from torch import nn
15
+ import torch.nn.functional as F
16
+
17
+ from x_transformers import RMSNorm
18
+ from x_transformers.x_transformers import RotaryEmbedding
19
+
20
+ from lemas_tts.model.modules import (
21
+ TimestepEmbedding,
22
+ ConvNeXtV2Block,
23
+ ConvPositionEmbedding,
24
+ Attention,
25
+ AttnProcessor,
26
+ FeedForward,
27
+ precompute_freqs_cis,
28
+ get_pos_embed_indices,
29
+ )
30
+
31
+
32
+ # Text embedding
33
+
34
+
35
+ class TextEmbedding(nn.Module):
36
+ def __init__(self, text_num_embeds, text_dim, mask_padding=True, conv_layers=0, conv_mult=2):
37
+ super().__init__()
38
+ self.text_embed = nn.Embedding(text_num_embeds + 1, text_dim) # use 0 as filler token
39
+
40
+ self.mask_padding = mask_padding # mask filler and batch padding tokens or not
41
+
42
+ if conv_layers > 0:
43
+ self.extra_modeling = True
44
+ self.precompute_max_pos = 4096 # ~44s of 24khz audio
45
+ self.register_buffer("freqs_cis", precompute_freqs_cis(text_dim, self.precompute_max_pos), persistent=False)
46
+ self.text_blocks = nn.Sequential(
47
+ *[ConvNeXtV2Block(text_dim, text_dim * conv_mult) for _ in range(conv_layers)]
48
+ )
49
+ else:
50
+ self.extra_modeling = False
51
+
52
+ def forward(self, text: int["b nt"], seq_len, drop_text=False): # noqa: F722
53
+ text = text + 1 # use 0 as filler token. preprocess of batch pad -1, see list_str_to_idx()
54
+ text = text[:, :seq_len] # curtail if character tokens are more than the mel spec tokens
55
+ batch, text_len = text.shape[0], text.shape[1]
56
+ text = F.pad(text, (0, seq_len - text_len), value=0)
57
+ if self.mask_padding:
58
+ text_mask = text == 0
59
+
60
+ if drop_text: # cfg for text
61
+ text = torch.zeros_like(text)
62
+
63
+ text = self.text_embed(text) # b n -> b n d
64
+
65
+ # possible extra modeling
66
+ if self.extra_modeling:
67
+ # sinus pos emb
68
+ batch_start = torch.zeros((batch,), dtype=torch.long)
69
+ pos_idx = get_pos_embed_indices(batch_start, seq_len, max_pos=self.precompute_max_pos)
70
+ text_pos_embed = self.freqs_cis[pos_idx]
71
+ text = text + text_pos_embed
72
+
73
+ # convnextv2 blocks
74
+ if self.mask_padding:
75
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
76
+ for block in self.text_blocks:
77
+ text = block(text)
78
+ text = text.masked_fill(text_mask.unsqueeze(-1).expand(-1, -1, text.size(-1)), 0.0)
79
+ else:
80
+ text = self.text_blocks(text)
81
+
82
+ return text
83
+
84
+
85
+ # noised input audio and context mixing embedding
86
+
87
+
88
+ class InputEmbedding(nn.Module):
89
+ def __init__(self, mel_dim, text_dim, out_dim):
90
+ super().__init__()
91
+ self.proj = nn.Linear(mel_dim * 2 + text_dim, out_dim)
92
+ self.conv_pos_embed = ConvPositionEmbedding(dim=out_dim)
93
+
94
+ def forward(self, x: float["b n d"], cond: float["b n d"], text_embed: float["b n d"], drop_audio_cond=False): # noqa: F722
95
+ if drop_audio_cond: # cfg for cond audio
96
+ cond = torch.zeros_like(cond)
97
+
98
+ x = self.proj(torch.cat((x, cond, text_embed), dim=-1))
99
+ x = self.conv_pos_embed(x) + x
100
+ return x
101
+
102
+
103
+ # Flat UNet Transformer backbone
104
+
105
+
106
+ class UNetT(nn.Module):
107
+ def __init__(
108
+ self,
109
+ *,
110
+ dim,
111
+ depth=8,
112
+ heads=8,
113
+ dim_head=64,
114
+ dropout=0.1,
115
+ ff_mult=4,
116
+ mel_dim=100,
117
+ text_num_embeds=256,
118
+ text_dim=None,
119
+ text_mask_padding=True,
120
+ qk_norm=None,
121
+ conv_layers=0,
122
+ pe_attn_head=None,
123
+ skip_connect_type: Literal["add", "concat", "none"] = "concat",
124
+ ):
125
+ super().__init__()
126
+ assert depth % 2 == 0, "UNet-Transformer's depth should be even."
127
+
128
+ self.time_embed = TimestepEmbedding(dim)
129
+ if text_dim is None:
130
+ text_dim = mel_dim
131
+ self.text_embed = TextEmbedding(
132
+ text_num_embeds, text_dim, mask_padding=text_mask_padding, conv_layers=conv_layers
133
+ )
134
+ self.text_cond, self.text_uncond = None, None # text cache
135
+ self.input_embed = InputEmbedding(mel_dim, text_dim, dim)
136
+
137
+ self.rotary_embed = RotaryEmbedding(dim_head)
138
+
139
+ # transformer layers & skip connections
140
+
141
+ self.dim = dim
142
+ self.skip_connect_type = skip_connect_type
143
+ needs_skip_proj = skip_connect_type == "concat"
144
+
145
+ self.depth = depth
146
+ self.layers = nn.ModuleList([])
147
+
148
+ for idx in range(depth):
149
+ is_later_half = idx >= (depth // 2)
150
+
151
+ attn_norm = RMSNorm(dim)
152
+ attn = Attention(
153
+ processor=AttnProcessor(pe_attn_head=pe_attn_head),
154
+ dim=dim,
155
+ heads=heads,
156
+ dim_head=dim_head,
157
+ dropout=dropout,
158
+ qk_norm=qk_norm,
159
+ )
160
+
161
+ ff_norm = RMSNorm(dim)
162
+ ff = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
163
+
164
+ skip_proj = nn.Linear(dim * 2, dim, bias=False) if needs_skip_proj and is_later_half else None
165
+
166
+ self.layers.append(
167
+ nn.ModuleList(
168
+ [
169
+ skip_proj,
170
+ attn_norm,
171
+ attn,
172
+ ff_norm,
173
+ ff,
174
+ ]
175
+ )
176
+ )
177
+
178
+ self.norm_out = RMSNorm(dim)
179
+ self.proj_out = nn.Linear(dim, mel_dim)
180
+
181
+ def clear_cache(self):
182
+ self.text_cond, self.text_uncond = None, None
183
+
184
+ def forward(
185
+ self,
186
+ x: float["b n d"], # noised input audio # noqa: F722
187
+ cond: float["b n d"], # masked cond audio # noqa: F722
188
+ text: int["b nt"], # text # noqa: F722
189
+ time: float["b"] | float[""], # time step # noqa: F821 F722
190
+ drop_audio_cond, # cfg for cond audio
191
+ drop_text, # cfg for text
192
+ mask: bool["b n"] | None = None, # noqa: F722
193
+ cache=False,
194
+ ):
195
+ batch, seq_len = x.shape[0], x.shape[1]
196
+ if time.ndim == 0:
197
+ time = time.repeat(batch)
198
+
199
+ # t: conditioning time; the text embedding and masked cond audio are fused into the noised input x below
200
+ t = self.time_embed(time)
201
+ if cache:
202
+ if drop_text:
203
+ if self.text_uncond is None:
204
+ self.text_uncond = self.text_embed(text, seq_len, drop_text=True)
205
+ text_embed = self.text_uncond
206
+ else:
207
+ if self.text_cond is None:
208
+ self.text_cond = self.text_embed(text, seq_len, drop_text=False)
209
+ text_embed = self.text_cond
210
+ else:
211
+ text_embed = self.text_embed(text, seq_len, drop_text=drop_text)
212
+ x = self.input_embed(x, cond, text_embed, drop_audio_cond=drop_audio_cond)
213
+
214
+ # prepend time t to input x, [b n d] -> [b n+1 d]
215
+ x = torch.cat([t.unsqueeze(1), x], dim=1) # pack t to x
216
+ if mask is not None:
217
+ mask = F.pad(mask, (1, 0), value=1)
218
+
219
+ rope = self.rotary_embed.forward_from_seq_len(seq_len + 1)
220
+
221
+ # flat unet transformer
222
+ skip_connect_type = self.skip_connect_type
223
+ skips = []
224
+ for idx, (maybe_skip_proj, attn_norm, attn, ff_norm, ff) in enumerate(self.layers):
225
+ layer = idx + 1
226
+
227
+ # skip connection logic
228
+ is_first_half = layer <= (self.depth // 2)
229
+ is_later_half = not is_first_half
230
+
231
+ if is_first_half:
232
+ skips.append(x)
233
+
234
+ if is_later_half:
235
+ skip = skips.pop()
236
+ if skip_connect_type == "concat":
237
+ x = torch.cat((x, skip), dim=-1)
238
+ x = maybe_skip_proj(x)
239
+ elif skip_connect_type == "add":
240
+ x = x + skip
241
+
242
+ # attention and feedforward blocks
243
+ x = attn(attn_norm(x), rope=rope, mask=mask) + x
244
+ x = ff(ff_norm(x)) + x
245
+
246
+ assert len(skips) == 0
247
+
248
+ x = self.norm_out(x)[:, 1:, :] # unpack t from x
249
+
250
+ return self.proj_out(x)
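A hedged instantiation sketch of the class above; the hyperparameters are illustrative only, not taken from any config in this repo:

```python
import torch
from lemas_tts.model.backbones.unett import UNetT

model = UNetT(dim=512, depth=8, heads=8, dim_head=64, mel_dim=100,
              text_num_embeds=256, conv_layers=4, skip_connect_type="concat")
x = torch.randn(2, 300, 100)            # noised mel (b, n, d)
cond = torch.zeros_like(x)              # masked cond audio
text = torch.randint(0, 256, (2, 50))   # token ids (batch padding uses -1 upstream)
t = torch.rand(2)                       # flow time step per sample
out = model(x, cond, text, t, drop_audio_cond=False, drop_text=False)
print(out.shape)                        # torch.Size([2, 300, 100])
```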
lemas_tts/model/cfm.py ADDED
@@ -0,0 +1,899 @@
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ from random import random
13
+ import random as _random
14
+ from typing import Callable, Dict, OrderedDict
15
+ import math
16
+ from pathlib import Path
17
+
18
+ import torch
19
+ import torch.nn.functional as F
20
+ import torchaudio
21
+ from torch import nn
22
+ from torch.nn.utils.rnn import pad_sequence
23
+ from torchdiffeq import odeint
24
+
25
+ from lemas_tts.model.modules import MelSpec
26
+ from lemas_tts.model.modules import MIEsitmator, AccentClassifier, grad_reverse
27
+ from lemas_tts.model.backbones.ecapa_tdnn import ECAPA_TDNN
28
+ from lemas_tts.model.backbones.prosody_encoder import ProsodyEncoder, extract_fbank_16k
29
+ from lemas_tts.model.utils import (
30
+ default,
31
+ exists,
32
+ lens_to_mask,
33
+ list_str_to_idx,
34
+ list_str_to_tensor,
35
+ mask_from_frac_lengths,
36
+ )
37
+
38
+
39
+ def clip_and_shuffle(mel, mel_len, sample_rate=24000, hop_length=256, ratio=None):
40
+ """
41
+ Randomly clip a mel-spectrogram segment and shuffle 1-second chunks to
42
+ create an accent-invariant conditioning segment.
43
+
44
+ This is an inference-time utility used by the accent GRL path.
45
+
46
+ Args:
47
+ mel: [n_mels, T]
48
+ mel_len: int, original mel length (T)
49
+ """
50
+ frames_per_second = int(sample_rate / hop_length) # ~94 frames/s at 24 kHz, hop 256 (93 after int truncation)
51
+
52
+ # ---- 1. Randomly crop 25%~75% of the original length (or ratio * length) ----
53
+ total_len = mel_len
54
+ if not ratio:
55
+ seg_len = _random.randint(int(0.25 * total_len), int(0.75 * total_len))
56
+ else:
57
+ seg_len = int(total_len * ratio)
58
+ start = _random.randint(0, max(0, total_len - seg_len))
59
+ mel_seg = mel[:, start : start + seg_len]
60
+
61
+ # ---- 2. Split into ~1-second chunks ----
62
+ n_chunks = (mel_seg.size(1) + frames_per_second - 1) // frames_per_second
63
+ chunks = []
64
+ for i in range(n_chunks):
65
+ chunk = mel_seg[:, i * frames_per_second : (i + 1) * frames_per_second]
66
+ chunks.append(chunk)
67
+
68
+ # ---- 3. Shuffle chunk order ----
69
+ _random.shuffle(chunks)
70
+ shuffled_mel = torch.cat(chunks, dim=1)
71
+
72
+ # ---- 4. Repeat random chunks until reaching original length ----
73
+ if shuffled_mel.size(1) < total_len:
74
+ repeat_chunks = []
75
+ while sum(c.size(1) for c in repeat_chunks) < total_len:
76
+ repeat_chunks.append(_random.choice(chunks))
77
+ shuffled_mel = torch.cat([shuffled_mel] + repeat_chunks, dim=1)
78
+
79
+ # ---- 5. Trim to exactly mel_len ----
80
+ shuffled_mel = shuffled_mel[:, :total_len]
81
+ assert shuffled_mel.shape == mel.shape, f"shuffled_mel.shape != mel.shape: {shuffled_mel.shape} != {mel.shape}"
82
+
83
+ return shuffled_mel
84
+
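A quick shape-level sketch of the helper above (random values, placeholder sizes):

```python
import torch

mel = torch.randn(100, 940)                     # (n_mels, T), ~10 s at 24 kHz with hop 256
shuffled = clip_and_shuffle(mel, mel.shape[1])  # random crop, 1 s chunk shuffle, pad back
assert shuffled.shape == mel.shape              # length restored by repetition + trim
```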
85
+ class CFM(nn.Module):
86
+ def __init__(
87
+ self,
88
+ transformer: nn.Module,
89
+ sigma=0.0,
90
+ odeint_kwargs: dict = dict(
91
+ # atol = 1e-5,
92
+ # rtol = 1e-5,
93
+ method="euler" # 'midpoint'
94
+ ),
95
+ audio_drop_prob=0.3,
96
+ text_drop_prob=0.1,
97
+ num_channels=None,
98
+ mel_spec_module: nn.Module | None = None,
99
+ mel_spec_kwargs: dict = dict(),
100
+ frac_lengths_mask: tuple[float, float] = (0.7, 1.0),
101
+ vocab_char_map: dict[str, int] | None = None,
102
+ use_ctc_loss: bool = False,
103
+ use_spk_enc: bool = False,
104
+ use_prosody_encoder: bool = False,
105
+ prosody_cfg_path: str | None = None,
106
+ prosody_ckpt_path: str | None = None,
107
+ ):
108
+ super().__init__()
109
+
110
+ self.frac_lengths_mask = frac_lengths_mask
111
+
112
+ # mel spec
113
+ self.mel_spec = default(mel_spec_module, MelSpec(**mel_spec_kwargs))
114
+ num_channels = default(num_channels, self.mel_spec.n_mel_channels)
115
+ self.num_channels = num_channels
116
+
117
+ # classifier-free guidance
118
+ self.audio_drop_prob = audio_drop_prob
119
+ self.text_drop_prob = text_drop_prob
120
+
121
+ # transformer
122
+ self.transformer = transformer
123
+ dim = transformer.dim
124
+ self.dim = dim
125
+
126
+ # conditional flow related
127
+ self.sigma = sigma
128
+
129
+ # sampling related
130
+ self.odeint_kwargs = odeint_kwargs
131
+
132
+ # vocab map for tokenization
133
+ self.vocab_char_map = vocab_char_map
134
+
135
+ # Prosody encoder (Pretssel ECAPA-TDNN)
136
+ self.use_prosody_encoder = (
137
+ use_prosody_encoder and prosody_cfg_path is not None and prosody_ckpt_path is not None
138
+ )
139
+ if self.use_prosody_encoder:
140
+ cfg_path = Path(prosody_cfg_path)
141
+ ckpt_path = Path(prosody_ckpt_path)
142
+ self.prosody_encoder = ProsodyEncoder(cfg_path, ckpt_path, freeze=True)
143
+ # 512-d prosody -> mel channel dimension
144
+ self.prosody_to_mel = nn.Linear(512, self.num_channels)
145
+ self.prosody_dropout = nn.Dropout(p=0.2)
146
+ else:
147
+ self.prosody_encoder = None
148
+
149
+ # Speaker encoder
150
+ self.use_spk_enc = use_spk_enc
151
+ if use_spk_enc:
152
+ self.speaker_encoder = ECAPA_TDNN(
153
+ self.num_channels,
154
+ self.dim,
155
+ channels=[512, 512, 512, 512, 1536],
156
+ kernel_sizes=[5, 3, 3, 3, 1],
157
+ dilations=[1, 2, 3, 4, 1],
158
+ attention_channels=128,
159
+ res2net_scale=4,
160
+ se_channels=128,
161
+ global_context=True,
162
+ batch_norm=True,
163
+ )
164
+ # self.load_partial_weights(self.speaker_encoder, "/cto_labs/vistring/zhaozhiyuan/outputs/F5-TTS/pretrain/speaker.bin", device="cpu")
165
+
166
+ self.use_ctc_loss = use_ctc_loss
167
+ if use_ctc_loss:
168
+ # print("vocab_char_map:", len(vocab_char_map)+1, "dim:", dim, "mel_spec_kwargs:",mel_spec_kwargs)
169
+ self.ctc = MIEsitmator(len(self.vocab_char_map), self.num_channels, self.dim, dropout=self.text_drop_prob)
170
+
171
+ self.accent_classifier = AccentClassifier(input_dim=self.num_channels, hidden_dim=self.dim, num_accents=12)
172
+ self.accent_criterion = nn.CrossEntropyLoss()
173
+
174
+ def load_partial_weights(self, model: nn.Module,
175
+ ckpt_path: str,
176
+ device="cpu",
177
+ verbose=True) -> int:
178
+ """
179
+ Load only parameters whose shapes match the model; skip the rest.
180
+ Returns the number of successfully loaded parameters.
181
+ """
182
+ state_dict = torch.load(ckpt_path, map_location=device)
183
+ model_dict = model.state_dict()
184
+
185
+ ok_count = 0
186
+ new_dict: OrderedDict[str, torch.Tensor] = OrderedDict()
187
+
188
+ for k, v in state_dict.items():
189
+ if k in model_dict and v.shape == model_dict[k].shape:
190
+ new_dict[k] = v
191
+ ok_count += 1
192
+ else:
193
+ if verbose:
194
+ print(f"[SKIP] {k} ckpt:{v.shape} model:{model_dict[k].shape if k in model_dict else 'N/A'}")
195
+
196
+ model_dict.update(new_dict)
197
+ model.load_state_dict(model_dict)
198
+ if verbose:
199
+ print(f"=> Loaded {ok_count}/{len(state_dict)} parameters successfully")
200
+ return ok_count
201
+
202
+ @property
203
+ def device(self):
204
+ return next(self.parameters()).device
205
+
206
+ @torch.no_grad()
207
+ def sample(
208
+ self,
209
+ cond: float["b n d"] | float["b nw"], # noqa: F722
210
+ text: int["b nt"] | list[str], # noqa: F722
211
+ duration: int | int["b"], # noqa: F821
212
+ *,
213
+ lens: int["b"] | None = None, # noqa: F821
214
+ steps=32,
215
+ cfg_strength=1.0,
216
+ sway_sampling_coef=None,
217
+ seed: int | None = None,
218
+ max_duration=4096,
219
+ vocoder: Callable[[float["b d n"]], float["b nw"]] | None = None, # noqa: F722
220
+ no_ref_audio=False,
221
+ duplicate_test=False,
222
+ t_inter=0.1,
223
+ edit_mask=None,
224
+ use_acc_grl = True,
225
+ use_prosody_encoder = True,
226
+ ref_ratio = 1,
227
+ ):
228
+ self.eval()
229
+
230
+ # raw wave -> mel, keep a copy for prosody encoder if available
231
+ raw_audio = None
232
+ if cond.ndim == 2:
233
+ raw_audio = cond.clone() # (B, nw)
234
+ cond = self.mel_spec(cond)
235
+ cond = cond.permute(0, 2, 1)
236
+ assert cond.shape[-1] == self.num_channels
237
+
238
+ cond = cond.to(next(self.parameters()).dtype)
239
+ cond_mean = cond.mean(dim=1, keepdim=True)
240
+ batch, cond_seq_len, device = *cond.shape[:2], cond.device
241
+ if not exists(lens):
242
+ lens = torch.full((batch,), cond_seq_len, device=device, dtype=torch.long)
243
+
244
+ # optional global prosody conditioning at inference (one embedding per sample)
245
+ prosody_mel_cond = None
246
+ prosody_text_cond = None
247
+ prosody_embeds = None
248
+ if self.prosody_encoder is not None and raw_audio is not None and use_prosody_encoder:
249
+ embeds = []
250
+ for b in range(batch):
251
+ audio_b = raw_audio[b].unsqueeze(0) # (1, nw)
252
+ src_sr = self.mel_spec.target_sample_rate
253
+ if src_sr != 16_000:
254
+ audio_16k = torchaudio.functional.resample(
255
+ audio_b, src_sr, 16_000
256
+ ).squeeze(0)
257
+ else:
258
+ audio_16k = audio_b.squeeze(0)
259
+ fbank = extract_fbank_16k(audio_16k)
260
+ fbank = fbank.unsqueeze(0).to(device=device, dtype=cond.dtype)
261
+ emb = self.prosody_encoder(fbank, padding_mask=None)[0] # (512,)
262
+ embeds.append(emb)
263
+ prosody_embeds = torch.stack(embeds, dim=0) # (B, 512)
264
+ # broadcast along mel and text
265
+ prosody_mel_cond = prosody_embeds[:, None, :].expand(-1, cond_seq_len, -1)
266
+
267
+ if use_acc_grl:
268
+ # rand_mel = clip_and_shuffle(cond.permute(0, 2, 1).squeeze(0), cond.shape[1])
269
+ # rand_mel = rand_mel.unsqueeze(0).permute(0, 2, 1)
270
+ # assert rand_mel.shape == cond.shape, f"Shape diff: rand_mel.shape: {rand_mel.shape}, cond.shape: {cond.shape}"
271
+ # cond_grl = grad_reverse(rand_mel, lambda_=1.0)
272
+
273
+ if ref_ratio < 1:
274
+ rand_mel = clip_and_shuffle(cond.permute(0, 2, 1).squeeze(0), cond.shape[1], ratio=ref_ratio)
275
+ rand_mel = rand_mel.unsqueeze(0).permute(0, 2, 1)
276
+ assert rand_mel.shape == cond.shape, f"Shape diff: rand_mel.shape: {rand_mel.shape}, cond.shape: {cond.shape}"
277
+ cond_grl = grad_reverse(rand_mel, lambda_=1.0)
278
+ else:
279
+ cond_grl = grad_reverse(cond, lambda_=1.0)
280
+ # print("cond:", cond.shape, cond.mean(), cond.max(), cond.min(), "rand_mel:", rand_mel.mean(), rand_mel.max(), rand_mel.min(), "cond_grl:", cond_grl.mean(), cond_grl.max(), cond_grl.min())
281
+
282
+ # text
283
+
284
+ if isinstance(text, list):
285
+ if exists(self.vocab_char_map):
286
+ text = list_str_to_idx(text, self.vocab_char_map).to(device)
287
+ else:
288
+ text = list_str_to_tensor(text).to(device)
289
+ assert text.shape[0] == batch
290
+
291
+ # duration
292
+
293
+ cond_mask = lens_to_mask(lens)
294
+ if edit_mask is not None:
295
+ cond_mask = cond_mask & edit_mask
296
+
297
+ if isinstance(duration, int):
298
+ duration = torch.full((batch,), duration, device=device, dtype=torch.long)
299
+
300
+ duration = torch.maximum(
301
+ torch.maximum((text != -1).sum(dim=-1), lens) + 1, duration
302
+ ) # duration at least text/audio prompt length plus one token, so something is generated
303
+ # clamp and convert max_duration to python int for padding ops
304
+ duration = duration.clamp(max=max_duration)
305
+ max_duration = int(duration.amax().item())
306
+
307
+ # duplicate test corner for inner time step observation
308
+ if duplicate_test:
309
+ test_cond = F.pad(cond, (0, 0, cond_seq_len, max_duration - 2 * cond_seq_len), value=0.0)
310
+
311
+ cond = F.pad(cond, (0, 0, 0, max_duration - cond_seq_len), value=0.0)
312
+
313
+ if prosody_mel_cond is not None:
314
+ prosody_mel_cond = F.pad(
315
+ prosody_mel_cond, (0, 0, 0, max_duration - cond_seq_len), value=0.0
316
+ )
317
+ prosody_mel_proj = self.prosody_to_mel(prosody_mel_cond)
318
+ cond = cond + prosody_mel_proj
319
+
320
+ if no_ref_audio:
321
+ random_cond = torch.randn_like(cond) * 0.1 + cond_mean
322
+ random_cond = random_cond / random_cond.mean(dim=1, keepdim=True) * cond_mean
323
+ print("cond:", cond.mean(), cond.max(), cond.min(), "random_cond:", random_cond.mean(), random_cond.max(), random_cond.min(), "mean_cond:", cond_mean.shape)
324
+ cond = random_cond
325
+
326
+ cond_mask = F.pad(cond_mask, (0, max_duration - cond_mask.shape[-1]), value=False)
327
+ cond_mask = cond_mask.unsqueeze(-1)
328
+
329
+ if use_acc_grl:
330
+ cond_grl = F.pad(cond_grl, (0, 0, 0, max_duration - cond_seq_len), value=0.0)
331
+
332
+
333
+ step_cond = torch.where(cond_mask, cond, torch.zeros_like(cond)) # allow direct control (cut cond audio) with lens passed in
334
+
335
+
336
+ if batch > 1:
337
+ mask = lens_to_mask(duration)
338
+ else: # save memory and speed up, as single inference need no mask currently
339
+ mask = None
340
+
341
+ # neural ode
342
+
343
+ def compute_sway_max(steps: int,
344
+ t_start: float = 0.0,
345
+ dtype=torch.float32,
346
+ min_ratio: float | None = None,
347
+ safety_factor: float = 0.5) -> float:
348
+ """
349
+ Compute a safe upper bound for sway_sampling_coef given steps and t_start.
350
+
351
+ - steps: number of ODE steps
352
+ - t_start: start time in [0,1)
353
+ - dtype: torch dtype (for machine eps)
354
+ - min_ratio: smallest distinguishable dt^p (if None, use conservative default)
355
+ - safety_factor: scale down the theoretical maximum to be safe
356
+ """
357
+ assert 0.0 <= t_start < 1.0
358
+ dt = (1.0 - t_start) / max(1, steps)
359
+ eps = torch.finfo(dtype).eps
360
+
361
+ if min_ratio is None:
362
+ # conservative default: ~100 * eps (float32 -> ~1e-5)
363
+ min_ratio = max(1e-9, 1e2 * float(eps))
364
+
365
+ if dt >= 0.9:
366
+ p_max = 11.0  # dt near 1 would blow up the formula below, so cap the exponent
367
+ else:
368
+ # solve dt^p >= min_ratio => p <= log(min_ratio)/log(dt)
369
+ p_max = math.log(min_ratio) / math.log(dt)
370
+
371
+ sway_max = max(0.0, p_max - 1.0)
372
+ sway_max = sway_max * float(safety_factor)
373
+ return torch.tensor(sway_max, device=device, dtype=dtype)
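+ # Worked example (float32): steps=32, t_start=0 gives dt = 1/32; with
+ # min_ratio=1e-9, p_max = ln(1e-9) / ln(1/32) ≈ 5.98, so the call below
+ # (safety_factor=0.7) yields sway_max ≈ 0.7 * (5.98 - 1) ≈ 3.49.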
374
+
375
+ # prepare text-side prosody conditioning if embeddings available
376
+ if prosody_embeds is not None:
377
+ text_len = text.shape[1]
378
+ prosody_text_cond = prosody_embeds[:, None, :].expand(-1, text_len, -1)
379
+ else:
380
+ prosody_text_cond = None
381
+
382
+ def fn(t, x):
383
+ # at each step, conditioning is fixed
384
+ # if use_spk_enc:
385
+ # mix_cond = t * cond + (1-t) * spk_emb
386
+ # step_cond = torch.where(cond_mask, mix_cond, torch.zeros_like(mix_cond))
387
+ if use_acc_grl:
388
+ step_cond = torch.where(cond_mask, cond_grl, torch.zeros_like(cond_grl))
389
+ else:
390
+ step_cond = torch.where(cond_mask, cond, torch.zeros_like(cond))
391
+
392
+ # predict flow
393
+ pred = self.transformer(
394
+ x=x,
395
+ cond=step_cond,
396
+ text=text,
397
+ time=t,
398
+ mask=mask,
399
+ drop_audio_cond=False,
400
+ drop_text=False,
401
+ cache=True,
402
+ prosody_text=prosody_text_cond,
403
+ )
404
+ if cfg_strength < 1e-5:
405
+ return pred
406
+
407
+ null_pred = self.transformer(
408
+ x=x,
409
+ cond=step_cond,
410
+ text=text,
411
+ time=t,
412
+ mask=mask,
413
+ drop_audio_cond=True,
414
+ drop_text=True,
415
+ cache=True,
416
+ prosody_text=prosody_text_cond,
417
+ )
418
+ # cfg_t = cfg_strength * torch.cos(0.5 * torch.pi * t)
419
+ # cfg_t = cfg_strength * (1 - t)
420
+ cfg_t = cfg_strength * ((1 - t) ** 2)
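+ # e.g. with cfg_strength=1.0 the guidance weight decays quadratically from
+ # 1.0 at t=0 to 0 at t=1: strong guidance early, pure conditional flow late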
421
+ # print("t:", t, "cfg_t:", cfg_t)
422
+ res = pred + (pred - null_pred) * cfg_t
423
+ # print("t:", t.item(), "\tres:", res.shape, res.mean().item(), res.max().item(), res.min().item(), "\tpred:", pred.mean().item(), pred.max().item(), pred.min().item(), "\tnull_pred:", null_pred.mean().item(), null_pred.max().item(), null_pred.min().item(), "\tcfg_t:", cfg_t.item())
424
+ res = res.clamp(-20, 20)
425
+ return res
426
+
427
+ # noise input
428
+ # to make sure batch inference result is same with different batch size, and for sure single inference
429
+ # still some difference maybe due to convolutional layers
430
+ y0 = []
431
+ for dur in duration:
432
+ if exists(seed):
433
+ torch.manual_seed(seed)
434
+ y0.append(torch.randn(dur, self.num_channels, device=self.device, dtype=step_cond.dtype))
435
+ y0 = pad_sequence(y0, padding_value=0, batch_first=True)
436
+
437
+ t_start = 0
438
+
439
+ # duplicate test corner for inner time step observation
440
+ if duplicate_test:
441
+ t_start = t_inter
442
+ y0 = (1 - t_start) * y0 + t_start * test_cond
443
+ steps = int(steps * (1 - t_start))
444
+
445
+ t = torch.linspace(t_start, 1, int(steps + 1), device=self.device, dtype=step_cond.dtype)
446
+
447
+ sway_max = compute_sway_max(steps, t_start=t_start, dtype=step_cond.dtype, min_ratio=1e-9, safety_factor=0.7)
448
+ if sway_sampling_coef is not None:
449
+ sway_sampling_coef = min(sway_max, sway_sampling_coef)
450
+ # t = t + sway_sampling_coef * (torch.cos(torch.pi / 2 * t) - 1 + t)
451
+ t = t ** (1 + sway_sampling_coef)
452
+ else:
453
+ t = t ** (1 + sway_max)
454
+ # print("t:",t, "sway_max:", sway_max, "sway_sampling_coef:", sway_sampling_coef)
455
+
456
+ trajectory = odeint(fn, y0, t, **self.odeint_kwargs)
457
+ self.transformer.clear_cache()
458
+
459
+ sampled = trajectory[-1]
460
+ out = sampled
461
+ out = torch.where(cond_mask, cond, out)
462
+
463
+ # compute the mean of the generated (zero-padded) part of out separately, then align it with cond's mean (scale so the two means roughly match)
464
+ if no_ref_audio:
465
+ out_mean = out[:,cond_seq_len:,:].mean(dim=1, keepdim=True)
466
+ out[:,cond_seq_len:,:] = out[:,cond_seq_len:,:] - (out_mean - cond_mean)
467
+ # print("out_mean:", out_mean.shape, out_mean.mean(), "cond_mean:", cond_mean.shape, cond_mean.mean(), "out:", out[:,cond_seq_len:,:].shape, out[:,cond_seq_len:,:].mean().item(), out[:,cond_seq_len:,:].max().item(), out[:,cond_seq_len:,:].min().item())
468
+
469
+ if exists(vocoder):
470
+ out = out.permute(0, 2, 1)
471
+ out = vocoder(out)
472
+ # print("out:", out.shape, "trajectory:", trajectory.shape)
473
+ return out, trajectory
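+ # Invocation sketch (instance/vocoder names and lengths are illustrative;
+ # the keyword arguments are the ones defined by sample() above):
+ # >>> wav, _ = cfm.sample(
+ # ...     cond=ref_wave_24k,                # (1, nw) raw wave or (1, n, 100) mel
+ # ...     text=["ref transcript. target text."],
+ # ...     duration=ref_frames + gen_frames, # total length in mel frames
+ # ...     steps=32, cfg_strength=1.0, vocoder=vocoder_fn,
+ # ... )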
474
+
475
+
476
+ def info_nce_speaker(self,
477
+ e_gt: torch.Tensor,
478
+ e_pred: torch.Tensor,
479
+ temperature: float = 0.1):
480
+ """
481
+ InfoNCE loss for speaker encoder training.
482
+ e_gt and e_pred from the same sample form a positive pair; all other in-batch pairs are negatives.
483
+
484
+ Args:
485
+ temperature: temperature scaling τ
486
+
487
+ Returns:
488
+ loss: scalar tensor (supports backward)
489
+ """
490
+ B = e_gt.size(0)
491
+ # 1. L2-normalize the embeddings
492
+ e_gt = F.normalize(e_gt, dim=1)
493
+ e_pred = F.normalize(e_pred, dim=1)
494
+
495
+ # 2. B×B similarity matrix (pred vs gt)
496
+ logits = torch.einsum('bd,cd->bc', e_pred, e_gt) / temperature # [B, B]
497
+
498
+ # 3. positive labels lie exactly on the diagonal
499
+ labels = torch.arange(B, device=logits.device)
500
+
501
+ # 4. InfoNCE = cross-entropy over in-batch negatives
502
+ loss = F.cross_entropy(logits, labels)
503
+ return loss
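+ # Sanity-check sketch (names illustrative): with matched embeddings the
+ # diagonal dominates and the loss is small; for unrelated embeddings it
+ # approaches ln(B), e.g. ln(8) ≈ 2.08 for a batch of 8.
+ # >>> e = F.normalize(torch.randn(8, 192), dim=1)
+ # >>> model.info_nce_speaker(e, e)                    # near zero
+ # >>> model.info_nce_speaker(e, torch.randn(8, 192))  # ≈ 2.08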
504
+
505
+
506
+ def forward_old(
507
+ self,
508
+ batchs: Dict[str, torch.Tensor],
509
+ # inp: float["b n d"] | float["b nw"], # mel or raw wave # noqa: F722
510
+ # text: int["b nt"] | list[str], # noqa: F722
511
+ *,
512
+ # lens: int["b"] | None = None, # noqa: F821
513
+ noise_scheduler: str | None = None,
514
+ ):
515
+
516
+ inp = batchs["mel"].permute(0, 2, 1)
517
+ lens = batchs["mel_lengths"]
518
+
519
+ rand_mel = batchs["rand_mel"].permute(0, 2, 1)
520
+
521
+ text = batchs["text"]
522
+ target_text_lengths = torch.tensor([len(x) for x in text], device=inp.device)
523
+
524
+ langs = batchs["langs"]
525
+
526
+ # print("inp:", inp.shape, "rand_mel:", rand_mel.shape, "lens:", lens, "target_text_lengths:", target_text_lengths, "langs:", langs)
527
+
528
+ # handle raw wave
529
+ if inp.ndim == 2:
530
+ inp = self.mel_spec(inp)
531
+ inp = inp.permute(0, 2, 1)
532
+ assert inp.shape[-1] == self.num_channels
533
+
534
+ batch, seq_len, dtype, device, _σ1 = *inp.shape[:2], inp.dtype, self.device, self.sigma
535
+ # print("inp_shape:", inp.shape, inp.max(), inp.min(), "dtype:", dtype, "device:", device, "σ1:", _σ1)
536
+
537
+ # handle text as string
538
+ if isinstance(text, list):
539
+ if exists(self.vocab_char_map):
540
+ text = list_str_to_idx(text, self.vocab_char_map).to(device)
541
+ else:
542
+ text = list_str_to_tensor(text).to(device)
543
+ assert text.shape[0] == batch
544
+
545
+ # lens and mask
546
+ if not exists(lens):
547
+ lens = torch.full((batch,), seq_len, device=device)
548
+
549
+ mask = lens_to_mask(lens, length=seq_len) # useless here, as collate_fn will pad to max length in batch
550
+
551
+ # get a random span to mask out for training conditionally
552
+ frac_lengths = torch.zeros((batch,), device=self.device).float().uniform_(*self.frac_lengths_mask)
553
+ rand_span_mask = mask_from_frac_lengths(lens, frac_lengths)
554
+
555
+ if exists(mask):
556
+ rand_span_mask &= mask
557
+
558
+ # mel is x1
559
+ x1 = inp
560
+
561
+ # x0 is gaussian noise
562
+ x0 = torch.randn_like(x1)
563
+
564
+ # time step
565
+ time = torch.rand((batch,), dtype=dtype, device=self.device)
566
+ # TODO. noise_scheduler
567
+
568
+ # sample xt (φ_t(x) in the paper)
569
+ t = time.unsqueeze(-1).unsqueeze(-1)
570
+ φ = (1 - t) * x0 + t * x1
571
+ flow = x1 - x0
572
+
573
+ # cond = torch.where(rand_span_mask[..., None], torch.zeros_like(rand_mel), rand_mel)
574
+ cond = torch.where(rand_span_mask[..., None], torch.zeros_like(x1), x1)
575
+
576
+ # print("seq_len:", seq_len, "lens:", lens)
577
+ if self.use_spk_enc: # use spk_emb with 50% probability
578
+
579
+ spk_emb = self.speaker_encoder(rand_mel, lens)
580
+ # global_emb: [batch, 1, dim] -> expand to [batch, seq_len, dim]
581
+ spk_emb = spk_emb.unsqueeze(1).expand_as(x1)
582
+ # print("spk_emb_shape:", spk_emb.shape)
583
+ # apply the mask
584
+ cond = torch.where(rand_span_mask[..., None], torch.zeros_like(spk_emb), spk_emb)
585
+ # assert cond.shape[0] == batch, "speaker encoder batch size mismatch"
586
+ # print("x1.shape:", x1.shape, "cond_shape:", cond.shape)
587
+
588
+ # draw a random scalar r and blend: cond * r + spk_emb * (1 - r)
589
+ rand_num = torch.rand((batch, 1, 1), dtype=dtype, device=self.device)
590
+ cond = cond * rand_num + spk_emb * (1 - rand_num)
591
+
592
+ cond_grl = grad_reverse(cond, lambda_=1.0)
593
+
594
+ # print("inp_shape:", inp.shape, "rand_span_mask:", rand_span_mask.shape)
595
+
596
+ # # # transformer and cfg training with a drop rate
597
+ # drop_audio_cond = random() < self.audio_drop_prob # p_drop in voicebox paper
598
+ # drop_text_cond = random() < self.text_drop_prob # p_drop in voicebox paper
599
+ drop_audio_cond = random() < self.audio_drop_prob # p_drop in voicebox paper
600
+ if random() < self.text_drop_prob: # p_uncond in voicebox paper
601
+ drop_audio_cond = True
602
+ drop_text_cond = True
603
+ else:
604
+ drop_text_cond = False
605
+
606
+ # print("drop_audio_cond:", drop_audio_cond, "drop_text_cond:", drop_text_cond)
607
+ # if want rigorously mask out padding, record in collate_fn in dataset.py, and pass in here
608
+ # adding mask will use more memory, thus also need to adjust batchsampler with scaled down threshold for long sequences
609
+ pred = self.transformer(x=φ, cond=cond_grl, text=text, time=time, drop_audio_cond=drop_audio_cond, drop_text=drop_text_cond)
610
+
611
+ # flow matching loss
612
+ pred_clamp = pred.float().clamp(-20, 20)
613
+ loss = F.mse_loss(pred_clamp, flow, reduction="none")
614
+ loss = loss[rand_span_mask] # [N]
615
+
616
+ # # # 1. global truncation: > 2 or NaN -> 0 (global)
617
+ # print("mse loss shape:", loss.shape, "loss max:", loss.max(), "loss min:", loss.min(), target_text_lengths[0])
618
+
619
+ # # 2. percentage of non-NaN values
620
+ # valid_mask = ~torch.isnan(loss)
621
+ # total_count = loss.numel() # total number of elements (all dims)
622
+ # valid_count = valid_mask.sum().item() # number of non-NaN elements
623
+ # valid_percentage = (valid_count / total_count) * 100
624
+ # print(f"mse loss: total_count: {total_count}", f"valid_count: {valid_count}", f"valid_percentage: {valid_percentage:.2f}%")
625
+
626
+ # valid_loss = loss[~torch.isnan(loss)]
627
+ loss = torch.where(torch.isnan(loss) | (loss > 300.0), 300.0, loss)
628
+ loss = loss.mean()
629
+
630
+ # loss = torch.tanh(torch.log1p(loss.mean())) # log scaling
631
+ # if len(valid_loss) > 0:
632
+ # clipped_loss = torch.clamp(valid_loss, max=150)
633
+ # loss = torch.tanh(torch.log1p(clipped_loss.mean())) # log scaling
634
+ # else:
635
+ # loss = torch.tensor(0.0, device=pred.device)
636
+
637
+
638
+ accent_logits = self.accent_classifier(cond_grl)
639
+ accent_logits_mean = accent_logits.mean(dim=1)
640
+ lang_labels = langs.to(accent_logits.device).long()
641
+ # print("langs:", lang_labels, "accent_logits:", accent_logits.shape, "accent_logits_mean:", accent_logits_mean.shape)
642
+ accent_loss = self.accent_criterion(accent_logits_mean, lang_labels)
643
+ # guard against NaN / Inf in accent_loss
644
+ if not torch.isfinite(accent_loss):
645
+ accent_loss = torch.zeros_like(accent_loss, device=accent_loss.device)
646
+ # accent_loss = torch.zeros_like(loss, device=loss.device, requires_grad=True)
647
+ loss += 0.1 * accent_loss
648
+
649
+ valid_indices = torch.where(time > 0.5)[0]
650
+ # print("torch.where(time > 0.5):", valid_indices, torch.where(time > 0.5))
651
+ if valid_indices.size(0) > 2:
652
+ # dynamically select samples that meet the condition
653
+ selected_gt = inp[valid_indices]
654
+ selected_pred = pred[valid_indices]
655
+ selected_text = text[valid_indices]
656
+ selected_lens = lens[valid_indices]
657
+ selected_target_lengths = target_text_lengths[valid_indices]
658
+ # print("pred:", selected_pred.shape, "valid_indices:", valid_indices, "lens:", selected_lens, "target_lengths:", selected_target_lengths)
659
+
660
+ if self.use_spk_enc and valid_indices.size(0) > 2:
661
+ # speaker encoder loss
662
+ e_gt = self.speaker_encoder(selected_gt, selected_lens)
663
+ e_pred = self.speaker_encoder(selected_pred, selected_lens)
664
+ spk_loss = self.info_nce_speaker(e_gt, e_pred)
665
+ if not torch.isnan(spk_loss).any(): # and spk_loss.item() > 1e-6:
666
+ loss = loss + spk_loss * 10.0
667
+ else:
668
+ spk_loss = torch.zeros_like(loss, device=loss.device, requires_grad=False)
669
+ else:
670
+ spk_loss = torch.zeros_like(loss, device=loss.device, requires_grad=False)
671
+ # print("spk_loss:", spk_loss)
672
+
673
+ # ctc loss
674
+ if self.use_ctc_loss and valid_indices.size(0) > 2:
675
+ # compute the CTC loss only when t > 0.5
676
+ ctc_loss = self.ctc(
677
+ decoder_outputs=selected_pred,
678
+ target_phones=selected_text,
679
+ decoder_lengths=selected_lens,
680
+ target_lengths=selected_target_lengths,
681
+ )
682
+ # print("loss:", loss, "ctc_loss:", ctc_loss, "time: ", time.shape, time[valid_indices].mean())
683
+ # only add the CTC loss if it contains no NaN
684
+ if not torch.isnan(ctc_loss).any() and ctc_loss.item() > 1e-6:
685
+ # ctc_scaled = torch.tanh(torch.log1p(ctc_loss))
686
+ ctc_scaled = ctc_loss
687
+ loss = loss + 0.1 * ctc_scaled
688
+ else:
689
+ ctc_scaled = torch.zeros_like(loss, device=loss.device, requires_grad=False)
690
+ # print("loss:", loss, "ctc_scaled:", ctc_scaled)
691
+ else:
692
+ ctc_scaled = torch.zeros_like(loss, device=loss.device, requires_grad=False)
693
+
694
+
695
+ # before finalizing the total loss
696
+ total_loss = loss # base flow loss + others you added
697
+ # note: we intentionally do NOT add 0.0 * pred.sum() etc. here, to avoid
698
+ # propagating NaNs from intermediate tensors into the loss scalar.
699
+
700
+ return total_loss, ctc_scaled, accent_loss, len(valid_indices), cond, pred
701
+
702
+
703
+ def forward(self, batchs: Dict[str, torch.Tensor], *, noise_scheduler: str | None = None):
704
+ """
705
+ Simplified forward version for accent-invariant flow matching.
706
+ Removes speaker encoder and CTC parts, keeps accent GRL.
707
+ """
708
+ inp = batchs["mel"].permute(0, 2, 1) # [B, T_mel, D]
709
+ lens = batchs["mel_lengths"]
710
+ text = batchs["text"]
711
+ langs = batchs["langs"]
712
+ audio_16k_list = batchs.get("audio_16k", None)
713
+ prosody_idx_list = batchs.get("prosody_idx", None)
714
+
715
+ # # ---- 4. randomly clip and shuffle segments ----
716
+ # rand_mel = [clip_and_shuffle(spec, spec.shape[-1]) for spec in batchs["mel"]]
717
+
718
+ # padded_rand_mel = []
719
+ # for spec in rand_mel:
720
+ # padding = (0, batchs["mel"].shape[-1] - spec.size(-1))
721
+ # padded_spec = F.pad(spec, padding, value=0)
722
+ # padded_rand_mel.append(padded_spec)
723
+ # rand_mel = torch.stack(padded_rand_mel).permute(0, 2, 1)
724
+ # assert rand_mel.shape == inp.shape, f"shape diff: rand_mel.shape: {rand_mel.shape}, inp.shape: {inp.shape}"
725
+
726
+ if inp.ndim == 2:
727
+ inp = self.mel_spec(inp).permute(0, 2, 1)
728
+ assert inp.shape[-1] == self.num_channels
729
+
730
+ batch, seq_len, dtype, device = *inp.shape[:2], inp.dtype, self.device
731
+
732
+ # --- handle text
733
+ if isinstance(text, list):
734
+ if exists(self.vocab_char_map):
735
+ text = list_str_to_idx(text, self.vocab_char_map).to(device)
736
+ else:
737
+ text = list_str_to_tensor(text).to(device)
738
+ assert text.shape[0] == batch
739
+ # print("text:", batchs["text"][0], text.shape, text[0], batchs["text_lengths"][0])
740
+ # --- prosody conditioning (compute embeddings per sub-utterance)
741
+ prosody_mel_cond = None
742
+ prosody_text_cond = None
743
+ if (
744
+ self.prosody_encoder is not None
745
+ and audio_16k_list is not None
746
+ and prosody_idx_list is not None
747
+ ):
748
+ # prepare zero tensors for each sample
749
+ T_mel = seq_len
750
+ T_text = text.shape[1]
751
+ prosody_mel_cond = torch.zeros(batch, T_mel, 512, device=device, dtype=dtype)
752
+ prosody_text_cond = torch.zeros(batch, T_text, 512, device=device, dtype=dtype)
753
+
754
+ # collect all segments, run encoder per segment
755
+ seg_embeds: list[Tensor] = []
756
+ seg_meta: list[tuple[int, int, int, int, int, int]] = []
757
+ for b in range(batch):
758
+ audio_b = audio_16k_list[b]
759
+ idx_list = prosody_idx_list[b]
760
+ if audio_b is None or idx_list is None:
761
+ continue
762
+ audio_b = audio_b.to(device=device, dtype=dtype)
763
+ for seg in idx_list:
764
+ text_start, text_end, mel_start, mel_end, audio_start, audio_end = seg
765
+ # clamp audio indices
766
+ audio_start = max(0, min(audio_start, audio_b.shape[0] - 1))
767
+ audio_end = max(audio_start + 1, min(audio_end, audio_b.shape[0]))
768
+ audio_seg = audio_b[audio_start:audio_end]
769
+ if audio_seg.numel() == 0:
770
+ continue
771
+ fbank = extract_fbank_16k(audio_seg) # (T_fbank, 80)
772
+ fbank = fbank.unsqueeze(0).to(device=device, dtype=dtype) # (1, T_fbank, 80)
773
+ with torch.no_grad():
774
+ emb = self.prosody_encoder(fbank, padding_mask=None)[0] # (512,)
775
+ seg_embeds.append(emb)
776
+ seg_meta.append(
777
+ (b, text_start, text_end, mel_start, mel_end)
778
+ )
779
+
780
+ if seg_embeds:
781
+ seg_embeds_tensor = torch.stack(seg_embeds, dim=0) # (N_seg, 512)
782
+ # scatter embeddings back to per-sample tensors
783
+ for emb, meta in zip(seg_embeds_tensor, seg_meta):
784
+ b, ts, te, ms, me = meta
785
+ emb_exp = emb.to(device=device, dtype=dtype)
786
+ prosody_mel_cond[b, ms:me, :] = emb_exp
787
+ prosody_text_cond[b, ts:te, :] = emb_exp
788
+
789
+ # dropout on prosody conditioning
790
+ prosody_mel_cond = self.prosody_dropout(prosody_mel_cond)
791
+ prosody_text_cond = self.prosody_dropout(prosody_text_cond)
792
+
793
+ # --- mask & random span
794
+ mask = lens_to_mask(lens, length=seq_len)
795
+ frac_lengths = torch.zeros((batch,), device=device).float().uniform_(*self.frac_lengths_mask)
796
+ rand_span_mask = mask_from_frac_lengths(lens, frac_lengths)
797
+ if exists(mask):
798
+ rand_span_mask &= mask
799
+
800
+ # --- flow setup
801
+ x1 = inp
802
+ x0 = torch.randn_like(x1)
803
+ time = torch.rand((batch,), dtype=dtype, device=device)
804
+ t = time[:, None, None]
805
+ φ = (1 - t) * x0 + t * x1
806
+ flow = x1 - x0
807
+
808
+ # --- conditional input (masked mel) + optional prosody
809
+ cond = torch.where(rand_span_mask[..., None], torch.zeros_like(x1), x1) # zero out the span to be generated
810
+ if prosody_mel_cond is not None:
811
+ prosody_mel_proj = self.prosody_to_mel(prosody_mel_cond) # (B, T_mel, num_channels)
812
+ # if needed, pad/crop to seq_len
813
+ if prosody_mel_proj.size(1) < seq_len:
814
+ pad_len = seq_len - prosody_mel_proj.size(1)
815
+ prosody_mel_proj = F.pad(prosody_mel_proj, (0, 0, 0, pad_len))
816
+ elif prosody_mel_proj.size(1) > seq_len:
817
+ prosody_mel_proj = prosody_mel_proj[:, :seq_len, :]
818
+ cond = cond + prosody_mel_proj
819
+
820
+ # --- Gradient reversal: encourage accent-invariant cond
821
+ cond_grl = grad_reverse(cond, lambda_=1.0)
822
+
823
+ # # --- random drop condition for CFG-like robustness
824
+ # drop_audio_cond = random() < self.audio_drop_prob
825
+ # drop_text_cond = random() < self.text_drop_prob if not drop_audio_cond else True
826
+
827
+ # safe per-batch random (tensor)
828
+ rand_for_drop = torch.rand(1, device=device)
829
+ drop_audio_cond = (rand_for_drop.item() < self.audio_drop_prob)
830
+ rand_for_text = torch.rand(1, device=device)
831
+ drop_text_cond = (rand_for_text.item() < self.text_drop_prob)
832
+
833
+ # --- main prediction
834
+ pred = self.transformer(
835
+ x=φ,
836
+ cond=cond_grl,
837
+ text=text,
838
+ time=time,
839
+ drop_audio_cond=drop_audio_cond,
840
+ drop_text=drop_text_cond,
841
+ prosody_text=prosody_text_cond,
842
+ )
843
+
844
+ # === FLOW LOSS (robust mask-weighted) ===
845
+ pred_clamp = pred.float().clamp(-20, 20)
846
+ per_elem_loss = F.mse_loss(pred_clamp, flow, reduction="none") # [B, T, D]
847
+
848
+ mask_exp = rand_span_mask.unsqueeze(-1).to(dtype=per_elem_loss.dtype) # [B, T, 1]
849
+ masked_loss = per_elem_loss * mask_exp # zeros where mask False
850
+
851
+ # total selected scalar (frames * dim)
852
+ n_selected = mask_exp.sum() * per_elem_loss.size(-1) # scalar
853
+ denom = torch.clamp(n_selected, min=1.0)
854
+
855
+ loss_sum = masked_loss.sum()
856
+ loss = loss_sum / denom
857
+ # numeric safety
858
+ loss = torch.where(torch.isnan(loss) | (loss > 300.0), torch.tensor(300.0, device=loss.device, dtype=loss.dtype), loss)
859
+
860
+ # === ACCENT LOSS ===
861
+ accent_logits = self.accent_classifier(cond_grl)
862
+ # pool across time -> [B, C]
863
+ accent_logits_mean = accent_logits.mean(dim=1)
864
+ lang_labels = langs.to(accent_logits_mean.device).long()
865
+ accent_loss = self.accent_criterion(accent_logits_mean, lang_labels)
866
+ # guard against NaN / Inf in accent_loss
867
+ if not torch.isfinite(accent_loss):
868
+ accent_loss = torch.zeros_like(accent_loss, device=accent_loss.device)
869
+
870
+ base_loss = loss + 0.1 * accent_loss
871
+
872
+ # === OPTIONAL CTC LOSS (robust, only on valid samples) ===
873
+ ctc_scaled = torch.tensor(0.0, device=device, dtype=dtype)
874
+ if getattr(self, "use_ctc_loss", False) and getattr(self, "ctc", None) is not None:
875
+ # select samples with larger t for CTC supervision (similar to forward_old)
876
+ valid_indices = torch.where(time > 0.5)[0]
877
+ if valid_indices.size(0) > 2:
878
+ selected_pred = pred[valid_indices]
879
+ selected_text = text[valid_indices]
880
+ selected_lens = lens[valid_indices]
881
+ # text was tokenized from list_str_to_idx, where padding is -1
882
+ selected_target_lengths = (selected_text != -1).sum(dim=-1)
883
+
884
+ ctc_loss = self.ctc(
885
+ decoder_outputs=selected_pred,
886
+ target_phones=selected_text,
887
+ decoder_lengths=selected_lens,
888
+ target_lengths=selected_target_lengths,
889
+ )
890
+ if torch.isfinite(ctc_loss) and ctc_loss.item() > 1e-6:
891
+ ctc_scaled = ctc_loss
892
+ base_loss = base_loss + 0.1 * ctc_scaled
893
+
894
+ total_loss = base_loss
895
+
896
+ # note: we intentionally do NOT add 0.0 * pred.sum() etc. here, to avoid
897
+ # propagating NaNs from intermediate tensors into the loss scalar.
898
+
899
+ return total_loss, accent_loss, ctc_scaled, cond, pred
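+ # Training-step sketch (optimizer and batch assembly are assumptions; the
+ # batch keys follow this forward(): mel, mel_lengths, text, langs, plus
+ # optional audio_16k / prosody_idx):
+ # >>> total, accent, ctc, cond, pred = model(batch)
+ # >>> total.backward()
+ # >>> optimizer.step(); optimizer.zero_grad()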
lemas_tts/model/modules.py ADDED
@@ -0,0 +1,802 @@
1
+ """
2
+ ein notation:
3
+ b - batch
4
+ n - sequence
5
+ nt - text sequence
6
+ nw - raw wave length
7
+ d - dimension
8
+ """
9
+
10
+ from __future__ import annotations
11
+
12
+ import math
13
+ from typing import Optional
14
+
15
+ import torch
16
+ import torch.nn.functional as F
17
+ import torchaudio
18
+ from librosa.filters import mel as librosa_mel_fn
19
+ from torch import nn
20
+ from x_transformers.x_transformers import apply_rotary_pos_emb
21
+ from torch.autograd import Function
22
+
23
+ # raw wav to mel spec
24
+
25
+
26
+ mel_basis_cache = {}
27
+ hann_window_cache = {}
28
+
29
+
30
+ def get_bigvgan_mel_spectrogram(
31
+ waveform,
32
+ n_fft=1024,
33
+ n_mel_channels=100,
34
+ target_sample_rate=24000,
35
+ hop_length=256,
36
+ win_length=1024,
37
+ fmin=0,
38
+ fmax=None,
39
+ center=False,
40
+ ): # Copy from https://github.com/NVIDIA/BigVGAN/tree/main
41
+ device = waveform.device
42
+ key = f"{n_fft}_{n_mel_channels}_{target_sample_rate}_{hop_length}_{win_length}_{fmin}_{fmax}_{device}"
43
+
44
+ if key not in mel_basis_cache:
45
+ mel = librosa_mel_fn(sr=target_sample_rate, n_fft=n_fft, n_mels=n_mel_channels, fmin=fmin, fmax=fmax)
46
+ mel_basis_cache[key] = torch.from_numpy(mel).float().to(device) # TODO: why they need .float()?
47
+ hann_window_cache[key] = torch.hann_window(win_length).to(device)
48
+
49
+ mel_basis = mel_basis_cache[key]
50
+ hann_window = hann_window_cache[key]
51
+
52
+ padding = (n_fft - hop_length) // 2
53
+ waveform = torch.nn.functional.pad(waveform.unsqueeze(1), (padding, padding), mode="reflect").squeeze(1)
54
+
55
+ spec = torch.stft(
56
+ waveform,
57
+ n_fft,
58
+ hop_length=hop_length,
59
+ win_length=win_length,
60
+ window=hann_window,
61
+ center=center,
62
+ pad_mode="reflect",
63
+ normalized=False,
64
+ onesided=True,
65
+ return_complex=True,
66
+ )
67
+ spec = torch.sqrt(torch.view_as_real(spec).pow(2).sum(-1) + 1e-9)
68
+
69
+ mel_spec = torch.matmul(mel_basis, spec)
70
+ mel_spec = torch.log(torch.clamp(mel_spec, min=1e-5))
71
+
72
+ return mel_spec
73
+
74
+
75
+ def get_vocos_mel_spectrogram(
76
+ waveform,
77
+ n_fft=1024,
78
+ n_mel_channels=100,
79
+ target_sample_rate=24000,
80
+ hop_length=256,
81
+ win_length=1024,
82
+ ):
83
+ mel_stft = torchaudio.transforms.MelSpectrogram(
84
+ sample_rate=target_sample_rate,
85
+ n_fft=n_fft,
86
+ win_length=win_length,
87
+ hop_length=hop_length,
88
+ n_mels=n_mel_channels,
89
+ power=1,
90
+ center=True,
91
+ normalized=False,
92
+ norm=None,
93
+ ).to(waveform.device)
94
+ if len(waveform.shape) == 3:
95
+ waveform = waveform.squeeze(1) # 'b 1 nw -> b nw'
96
+
97
+ assert len(waveform.shape) == 2
98
+
99
+ mel = mel_stft(waveform)
100
+ mel = mel.clamp(min=1e-5).log()
101
+ return mel
102
+
103
+
104
+ class MelSpec(nn.Module):
105
+ def __init__(
106
+ self,
107
+ n_fft=1024,
108
+ hop_length=256,
109
+ win_length=1024,
110
+ n_mel_channels=100,
111
+ target_sample_rate=24_000,
112
+ mel_spec_type="vocos",
113
+ ):
114
+ super().__init__()
115
+ assert mel_spec_type in ["vocos", "bigvgan"], "We only support two mel extraction backends: vocos or bigvgan"
116
+
117
+ self.n_fft = n_fft
118
+ self.hop_length = hop_length
119
+ self.win_length = win_length
120
+ self.n_mel_channels = n_mel_channels
121
+ self.target_sample_rate = target_sample_rate
122
+
123
+ if mel_spec_type == "vocos":
124
+ self.extractor = get_vocos_mel_spectrogram
125
+ elif mel_spec_type == "bigvgan":
126
+ self.extractor = get_bigvgan_mel_spectrogram
127
+
128
+ self.register_buffer("dummy", torch.tensor(0), persistent=False)
129
+
130
+ def forward(self, wav):
131
+ if self.dummy.device != wav.device:
132
+ self.to(wav.device)
133
+
134
+ mel = self.extractor(
135
+ waveform=wav,
136
+ n_fft=self.n_fft,
137
+ n_mel_channels=self.n_mel_channels,
138
+ target_sample_rate=self.target_sample_rate,
139
+ hop_length=self.hop_length,
140
+ win_length=self.win_length,
141
+ )
142
+
143
+ return mel
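+ # Shape sketch: 1 s of 24 kHz audio -> (batch, n_mel_channels, frames); with
+ # the vocos backend (center=True) that is 24000 // 256 + 1 = 94 frames:
+ # >>> MelSpec()(torch.randn(2, 24_000)).shape # torch.Size([2, 100, 94])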
144
+
145
+
146
+ # sinusoidal position embedding
147
+
148
+
149
+ class SinusPositionEmbedding(nn.Module):
150
+ def __init__(self, dim):
151
+ super().__init__()
152
+ self.dim = dim
153
+
154
+ def forward(self, x, scale=1000):
155
+ device = x.device
156
+ half_dim = self.dim // 2
157
+ emb = math.log(10000) / (half_dim - 1)
158
+ emb = torch.exp(torch.arange(half_dim, device=device).float() * -emb)
159
+ emb = scale * x.unsqueeze(1) * emb.unsqueeze(0)
160
+ emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
161
+ return emb
162
+
163
+
164
+ # convolutional position embedding
165
+
166
+
167
+ class ConvPositionEmbedding(nn.Module):
168
+ def __init__(self, dim, kernel_size=31, groups=16):
169
+ super().__init__()
170
+ assert kernel_size % 2 != 0
171
+ self.conv1d = nn.Sequential(
172
+ nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=kernel_size // 2),
173
+ nn.Mish(),
174
+ nn.Conv1d(dim, dim, kernel_size, groups=groups, padding=kernel_size // 2),
175
+ nn.Mish(),
176
+ )
177
+
178
+ def forward(self, x: float["b n d"], mask: bool["b n"] | None = None): # noqa: F722
179
+ if mask is not None:
180
+ mask = mask[..., None]
181
+ x = x.masked_fill(~mask, 0.0)
182
+
183
+ x = x.permute(0, 2, 1)
184
+ x = self.conv1d(x)
185
+ out = x.permute(0, 2, 1)
186
+
187
+ if mask is not None:
188
+ out = out.masked_fill(~mask, 0.0)
189
+
190
+ return out
191
+
192
+
193
+ # rotary positional embedding related
194
+
195
+
196
+ def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, theta_rescale_factor=1.0):
197
+ # proposed by reddit user bloc97, to rescale rotary embeddings to longer sequence length without fine-tuning
198
+ # has some connection to NTK literature
199
+ # https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
200
+ # https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py
201
+ theta *= theta_rescale_factor ** (dim / (dim - 2))
202
+ freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
203
+ t = torch.arange(end, device=freqs.device) # type: ignore
204
+ freqs = torch.outer(t, freqs).float() # type: ignore
205
+ freqs_cos = torch.cos(freqs) # real part
206
+ freqs_sin = torch.sin(freqs) # imaginary part
207
+ return torch.cat([freqs_cos, freqs_sin], dim=-1)
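+ # e.g. precompute_freqs_cis(64, 4096) returns a (4096, 64) table whose halves
+ # are the [cos | sin] of each position's angles; rows are later selected via
+ # get_pos_embed_indices and fed to apply_rotary_pos_emb.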
208
+
209
+
210
+ def get_pos_embed_indices(start, length, max_pos, scale=1.0):
211
+ # length = length if isinstance(length, int) else length.max()
212
+ scale = scale * torch.ones_like(start, dtype=torch.float32) # in case scale is a scalar
213
+ pos = (
214
+ start.unsqueeze(1)
215
+ + (torch.arange(length, device=start.device, dtype=torch.float32).unsqueeze(0) * scale.unsqueeze(1)).long()
216
+ )
217
+ # avoid extra long error.
218
+ pos = torch.where(pos < max_pos, pos, max_pos - 1)
219
+ return pos
220
+
221
+
222
+ # Global Response Normalization layer (from ConvNeXt-V2)
223
+
224
+
225
+ class GRN(nn.Module):
226
+ def __init__(self, dim):
227
+ super().__init__()
228
+ self.gamma = nn.Parameter(torch.zeros(1, 1, dim))
229
+ self.beta = nn.Parameter(torch.zeros(1, 1, dim))
230
+
231
+ def forward(self, x):
232
+ Gx = torch.norm(x, p=2, dim=1, keepdim=True)
233
+ Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)
234
+ return self.gamma * (x * Nx) + self.beta + x
235
+
236
+
237
+ # ConvNeXt-V2 Block https://github.com/facebookresearch/ConvNeXt-V2/blob/main/models/convnextv2.py
238
+ # ref: https://github.com/bfs18/e2_tts/blob/main/rfwave/modules.py#L108
239
+
240
+
241
+ class ConvNeXtV2Block(nn.Module):
242
+ def __init__(
243
+ self,
244
+ dim: int,
245
+ intermediate_dim: int,
246
+ dilation: int = 1,
247
+ ):
248
+ super().__init__()
249
+ padding = (dilation * (7 - 1)) // 2
250
+ self.dwconv = nn.Conv1d(
251
+ dim, dim, kernel_size=7, padding=padding, groups=dim, dilation=dilation
252
+ ) # depthwise conv
253
+ self.norm = nn.LayerNorm(dim, eps=1e-6)
254
+ self.pwconv1 = nn.Linear(dim, intermediate_dim) # pointwise/1x1 convs, implemented with linear layers
255
+ self.act = nn.GELU()
256
+ self.grn = GRN(intermediate_dim)
257
+ self.pwconv2 = nn.Linear(intermediate_dim, dim)
258
+
259
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
260
+ residual = x
261
+ x = x.transpose(1, 2) # b n d -> b d n
262
+ x = self.dwconv(x)
263
+ x = x.transpose(1, 2) # b d n -> b n d
264
+ x = self.norm(x)
265
+ x = self.pwconv1(x)
266
+ x = self.act(x)
267
+ x = self.grn(x)
268
+ x = self.pwconv2(x)
269
+ return residual + x
270
+
271
+
272
+ # RMSNorm
273
+
274
+
275
+ class RMSNorm(nn.Module):
276
+ def __init__(self, dim: int, eps: float):
277
+ super().__init__()
278
+ self.eps = eps
279
+ self.weight = nn.Parameter(torch.ones(dim))
280
+ self.native_rms_norm = tuple(int(v) for v in torch.__version__.split(".")[:2]) >= (2, 4)  # robust to versions like "2.10"
281
+
282
+ def forward(self, x):
283
+ if self.native_rms_norm:
284
+ if self.weight.dtype in [torch.float16, torch.bfloat16]:
285
+ x = x.to(self.weight.dtype)
286
+ x = F.rms_norm(x, normalized_shape=(x.shape[-1],), weight=self.weight, eps=self.eps)
287
+ else:
288
+ variance = x.to(torch.float32).pow(2).mean(-1, keepdim=True)
289
+ x = x * torch.rsqrt(variance + self.eps)
290
+ if self.weight.dtype in [torch.float16, torch.bfloat16]:
291
+ x = x.to(self.weight.dtype)
292
+ x = x * self.weight
293
+
294
+ return x
295
+
296
+
297
+ # AdaLayerNorm
298
+ # return with modulated x for attn input, and params for later mlp modulation
299
+
300
+
301
+ class AdaLayerNorm(nn.Module):
302
+ def __init__(self, dim):
303
+ super().__init__()
304
+
305
+ self.silu = nn.SiLU()
306
+ self.linear = nn.Linear(dim, dim * 6)
307
+
308
+ self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
309
+
310
+ def forward(self, x, emb=None):
311
+ emb = self.linear(self.silu(emb))
312
+ shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = torch.chunk(emb, 6, dim=1)
313
+
314
+ x = self.norm(x) * (1 + scale_msa[:, None]) + shift_msa[:, None]
315
+ return x, gate_msa, shift_mlp, scale_mlp, gate_mlp
316
+
317
+
318
+ # AdaLayerNorm for final layer
319
+ # return only with modulated x for attn input, cuz no more mlp modulation
320
+
321
+
322
+ class AdaLayerNorm_Final(nn.Module):
323
+ def __init__(self, dim):
324
+ super().__init__()
325
+
326
+ self.silu = nn.SiLU()
327
+ self.linear = nn.Linear(dim, dim * 2)
328
+
329
+ self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
330
+
331
+ def forward(self, x, emb):
332
+ emb = self.linear(self.silu(emb))
333
+ scale, shift = torch.chunk(emb, 2, dim=1)
334
+
335
+ x = self.norm(x) * (1 + scale)[:, None, :] + shift[:, None, :]
336
+ return x
337
+
338
+
339
+ # FeedForward
340
+
341
+
342
+ class FeedForward(nn.Module):
343
+ def __init__(self, dim, dim_out=None, mult=4, dropout=0.0, approximate: str = "none"):
344
+ super().__init__()
345
+ inner_dim = int(dim * mult)
346
+ dim_out = dim_out if dim_out is not None else dim
347
+
348
+ activation = nn.GELU(approximate=approximate)
349
+ project_in = nn.Sequential(nn.Linear(dim, inner_dim), activation)
350
+ self.ff = nn.Sequential(project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out))
351
+
352
+ def forward(self, x):
353
+ return self.ff(x)
354
+
355
+
356
+ # Attention with possible joint part
357
+ # modified from diffusers/src/diffusers/models/attention_processor.py
358
+
359
+
360
+ class Attention(nn.Module):
361
+ def __init__(
362
+ self,
363
+ processor: JointAttnProcessor | AttnProcessor,
364
+ dim: int,
365
+ heads: int = 8,
366
+ dim_head: int = 64,
367
+ dropout: float = 0.0,
368
+ context_dim: Optional[int] = None, # if not None -> joint attention
369
+ context_pre_only: bool = False,
370
+ qk_norm: Optional[str] = None,
371
+ ):
372
+ super().__init__()
373
+
374
+ if not hasattr(F, "scaled_dot_product_attention"):
375
+ raise ImportError("Attention requires PyTorch 2.0; please upgrade PyTorch to 2.0 to use it.")
376
+
377
+ self.processor = processor
378
+
379
+ self.dim = dim
380
+ self.heads = heads
381
+ self.inner_dim = dim_head * heads
382
+ self.dropout = dropout
383
+
384
+ self.context_dim = context_dim
385
+ self.context_pre_only = context_pre_only
386
+
387
+ self.to_q = nn.Linear(dim, self.inner_dim)
388
+ self.to_k = nn.Linear(dim, self.inner_dim)
389
+ self.to_v = nn.Linear(dim, self.inner_dim)
390
+
391
+ if qk_norm is None:
392
+ self.q_norm = None
393
+ self.k_norm = None
394
+ elif qk_norm == "rms_norm":
395
+ self.q_norm = RMSNorm(dim_head, eps=1e-6)
396
+ self.k_norm = RMSNorm(dim_head, eps=1e-6)
397
+ else:
398
+ raise ValueError(f"Unimplemented qk_norm: {qk_norm}")
399
+
400
+ if self.context_dim is not None:
401
+ self.to_q_c = nn.Linear(context_dim, self.inner_dim)
402
+ self.to_k_c = nn.Linear(context_dim, self.inner_dim)
403
+ self.to_v_c = nn.Linear(context_dim, self.inner_dim)
404
+ if qk_norm is None:
405
+ self.c_q_norm = None
406
+ self.c_k_norm = None
407
+ elif qk_norm == "rms_norm":
408
+ self.c_q_norm = RMSNorm(dim_head, eps=1e-6)
409
+ self.c_k_norm = RMSNorm(dim_head, eps=1e-6)
410
+
411
+ self.to_out = nn.ModuleList([])
412
+ self.to_out.append(nn.Linear(self.inner_dim, dim))
413
+ self.to_out.append(nn.Dropout(dropout))
414
+
415
+ if self.context_dim is not None and not self.context_pre_only:
416
+ self.to_out_c = nn.Linear(self.inner_dim, context_dim)
417
+
418
+ def forward(
419
+ self,
420
+ x: float["b n d"], # noised input x # noqa: F722
421
+ c: float["b n d"] = None, # context c # noqa: F722
422
+ mask: bool["b n"] | None = None, # noqa: F722
423
+ rope=None, # rotary position embedding for x
424
+ c_rope=None, # rotary position embedding for c
425
+ ) -> torch.Tensor:
426
+ if c is not None:
427
+ return self.processor(self, x, c=c, mask=mask, rope=rope, c_rope=c_rope)
428
+ else:
429
+ return self.processor(self, x, mask=mask, rope=rope)
430
+
431
+
432
+ # Attention processor
433
+
434
+
435
+ class AttnProcessor:
436
+ def __init__(
437
+ self,
438
+ pe_attn_head: int | None = None, # number of attention head to apply rope, None for all
439
+ ):
440
+ self.pe_attn_head = pe_attn_head
441
+
442
+ def __call__(
443
+ self,
444
+ attn: Attention,
445
+ x: float["b n d"], # noised input x # noqa: F722
446
+ mask: bool["b n"] | None = None, # noqa: F722
447
+ rope=None, # rotary position embedding
448
+ ) -> torch.FloatTensor:
449
+ batch_size = x.shape[0]
450
+
451
+ # `sample` projections
452
+ query = attn.to_q(x)
453
+ key = attn.to_k(x)
454
+ value = attn.to_v(x)
455
+
456
+ # attention
457
+ inner_dim = key.shape[-1]
458
+ head_dim = inner_dim // attn.heads
459
+ query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
460
+ key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
461
+ value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
462
+
463
+ # qk norm
464
+ if attn.q_norm is not None:
465
+ query = attn.q_norm(query)
466
+ if attn.k_norm is not None:
467
+ key = attn.k_norm(key)
468
+
469
+ # apply rotary position embedding
470
+ if rope is not None:
471
+ freqs, xpos_scale = rope
472
+ q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
473
+
474
+ if self.pe_attn_head is not None:
475
+ pn = self.pe_attn_head
476
+ query[:, :pn, :, :] = apply_rotary_pos_emb(query[:, :pn, :, :], freqs, q_xpos_scale)
477
+ key[:, :pn, :, :] = apply_rotary_pos_emb(key[:, :pn, :, :], freqs, k_xpos_scale)
478
+ else:
479
+ query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
480
+ key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)
481
+
482
+ # mask. e.g. inference got a batch with different target durations, mask out the padding
483
+ if mask is not None:
484
+ attn_mask = mask
485
+ attn_mask = attn_mask.unsqueeze(1).unsqueeze(1) # 'b n -> b 1 1 n'
486
+ attn_mask = attn_mask.expand(batch_size, attn.heads, query.shape[-2], key.shape[-2])
487
+ else:
488
+ attn_mask = None
489
+
490
+ x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
491
+ x = x.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
492
+ x = x.to(query.dtype)
493
+
494
+ # linear proj
495
+ x = attn.to_out[0](x)
496
+ # dropout
497
+ x = attn.to_out[1](x)
498
+
499
+ if mask is not None:
500
+ mask = mask.unsqueeze(-1)
501
+ x = x.masked_fill(~mask, 0.0)
502
+
503
+ return x
504
+
505
+
506
+ # Joint Attention processor for MM-DiT
507
+ # modified from diffusers/src/diffusers/models/attention_processor.py
508
+
509
+
510
+ class JointAttnProcessor:
511
+ def __init__(self):
512
+ pass
513
+
514
+ def __call__(
515
+ self,
516
+ attn: Attention,
517
+ x: float["b n d"], # noised input x # noqa: F722
518
+ c: float["b nt d"] = None, # context c, here text # noqa: F722
519
+ mask: bool["b n"] | None = None, # noqa: F722
520
+ rope=None, # rotary position embedding for x
521
+ c_rope=None, # rotary position embedding for c
522
+ ) -> torch.FloatTensor:
523
+ residual = x
524
+
525
+ batch_size = c.shape[0]
526
+
527
+ # `sample` projections
528
+ query = attn.to_q(x)
529
+ key = attn.to_k(x)
530
+ value = attn.to_v(x)
531
+
532
+ # `context` projections
533
+ c_query = attn.to_q_c(c)
534
+ c_key = attn.to_k_c(c)
535
+ c_value = attn.to_v_c(c)
536
+
537
+ # attention
538
+ inner_dim = key.shape[-1]
539
+ head_dim = inner_dim // attn.heads
540
+ query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
541
+ key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
542
+ value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
543
+ c_query = c_query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
544
+ c_key = c_key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
545
+ c_value = c_value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
546
+
547
+ # qk norm
548
+ if attn.q_norm is not None:
549
+ query = attn.q_norm(query)
550
+ if attn.k_norm is not None:
551
+ key = attn.k_norm(key)
552
+ if attn.c_q_norm is not None:
553
+ c_query = attn.c_q_norm(c_query)
554
+ if attn.c_k_norm is not None:
555
+ c_key = attn.c_k_norm(c_key)
556
+
557
+ # apply rope for context and noised input independently
558
+ if rope is not None:
559
+ freqs, xpos_scale = rope
560
+ q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
561
+ query = apply_rotary_pos_emb(query, freqs, q_xpos_scale)
562
+ key = apply_rotary_pos_emb(key, freqs, k_xpos_scale)
563
+ if c_rope is not None:
564
+ freqs, xpos_scale = c_rope
565
+ q_xpos_scale, k_xpos_scale = (xpos_scale, xpos_scale**-1.0) if xpos_scale is not None else (1.0, 1.0)
566
+ c_query = apply_rotary_pos_emb(c_query, freqs, q_xpos_scale)
567
+ c_key = apply_rotary_pos_emb(c_key, freqs, k_xpos_scale)
568
+
569
+ # joint attention
570
+ query = torch.cat([query, c_query], dim=2)
571
+ key = torch.cat([key, c_key], dim=2)
572
+ value = torch.cat([value, c_value], dim=2)
573
+
574
+ # mask. e.g. inference got a batch with different target durations, mask out the padding
575
+ if mask is not None:
576
+ attn_mask = F.pad(mask, (0, c.shape[1]), value=True) # no mask for c (text)
577
+ attn_mask = attn_mask.unsqueeze(1).unsqueeze(1) # 'b n -> b 1 1 n'
578
+ attn_mask = attn_mask.expand(batch_size, attn.heads, query.shape[-2], key.shape[-2])
579
+ else:
580
+ attn_mask = None
581
+
582
+ x = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=0.0, is_causal=False)
583
+ x = x.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
584
+ x = x.to(query.dtype)
585
+
586
+ # Split the attention outputs.
587
+ x, c = (
588
+ x[:, : residual.shape[1]],
589
+ x[:, residual.shape[1] :],
590
+ )
591
+
592
+ # linear proj
593
+ x = attn.to_out[0](x)
594
+ # dropout
595
+ x = attn.to_out[1](x)
596
+ if not attn.context_pre_only:
597
+ c = attn.to_out_c(c)
598
+
599
+ if mask is not None:
600
+ mask = mask.unsqueeze(-1)
601
+ x = x.masked_fill(~mask, 0.0)
602
+ # c = c.masked_fill(~mask, 0.) # no mask for c (text)
603
+
604
+ return x, c
605
+
606
+
607
+ # DiT Block
608
+
609
+
610
+ class DiTBlock(nn.Module):
611
+ def __init__(self, dim, heads, dim_head, ff_mult=4, dropout=0.1, qk_norm=None, pe_attn_head=None):
612
+ super().__init__()
613
+
614
+ self.attn_norm = AdaLayerNorm(dim)
615
+ self.attn = Attention(
616
+ processor=AttnProcessor(pe_attn_head=pe_attn_head),
617
+ dim=dim,
618
+ heads=heads,
619
+ dim_head=dim_head,
620
+ dropout=dropout,
621
+ qk_norm=qk_norm,
622
+ )
623
+
624
+ self.ff_norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
625
+ self.ff = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
626
+
627
+ def forward(self, x, t, mask=None, rope=None): # x: noised input, t: time embedding
628
+ # pre-norm & modulation for attention input
629
+ norm, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.attn_norm(x, emb=t)
630
+
631
+ # attention
632
+ attn_output = self.attn(x=norm, mask=mask, rope=rope)
633
+
634
+ # process attention output for input x
635
+ x = x + gate_msa.unsqueeze(1) * attn_output
636
+
637
+ norm = self.ff_norm(x) * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
638
+ ff_output = self.ff(norm)
639
+ x = x + gate_mlp.unsqueeze(1) * ff_output
640
+
641
+ return x
642
+
643
+
644
+ # MMDiT Block https://arxiv.org/abs/2403.03206
645
+
646
+
647
+ class MMDiTBlock(nn.Module):
648
+ r"""
649
+ modified from diffusers/src/diffusers/models/attention.py
650
+
651
+ notes.
652
+ _c: context related. text, cond, etc. (left part in sd3 fig2.b)
653
+ _x: noised input related. (right part)
654
+ context_pre_only: last layer only do prenorm + modulation cuz no more ffn
655
+ """
656
+
657
+ def __init__(
658
+ self, dim, heads, dim_head, ff_mult=4, dropout=0.1, context_dim=None, context_pre_only=False, qk_norm=None
659
+ ):
660
+ super().__init__()
661
+ if context_dim is None:
662
+ context_dim = dim
663
+ self.context_pre_only = context_pre_only
664
+
665
+ self.attn_norm_c = AdaLayerNorm_Final(context_dim) if context_pre_only else AdaLayerNorm(context_dim)
666
+ self.attn_norm_x = AdaLayerNorm(dim)
667
+ self.attn = Attention(
668
+ processor=JointAttnProcessor(),
669
+ dim=dim,
670
+ heads=heads,
671
+ dim_head=dim_head,
672
+ dropout=dropout,
673
+ context_dim=context_dim,
674
+ context_pre_only=context_pre_only,
675
+ qk_norm=qk_norm,
676
+ )
677
+
678
+ if not context_pre_only:
679
+ self.ff_norm_c = nn.LayerNorm(context_dim, elementwise_affine=False, eps=1e-6)
680
+ self.ff_c = FeedForward(dim=context_dim, mult=ff_mult, dropout=dropout, approximate="tanh")
681
+ else:
682
+ self.ff_norm_c = None
683
+ self.ff_c = None
684
+ self.ff_norm_x = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
685
+ self.ff_x = FeedForward(dim=dim, mult=ff_mult, dropout=dropout, approximate="tanh")
686
+
687
+ def forward(self, x, c, t, mask=None, rope=None, c_rope=None): # x: noised input, c: context, t: time embedding
688
+ # pre-norm & modulation for attention input
689
+ if self.context_pre_only:
690
+ norm_c = self.attn_norm_c(c, t)
691
+ else:
692
+ norm_c, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = self.attn_norm_c(c, emb=t)
693
+ norm_x, x_gate_msa, x_shift_mlp, x_scale_mlp, x_gate_mlp = self.attn_norm_x(x, emb=t)
694
+
695
+ # attention
696
+ x_attn_output, c_attn_output = self.attn(x=norm_x, c=norm_c, mask=mask, rope=rope, c_rope=c_rope)
697
+
698
+ # process attention output for context c
699
+ if self.context_pre_only:
700
+ c = None
701
+ else: # if not last layer
702
+ c = c + c_gate_msa.unsqueeze(1) * c_attn_output
703
+
704
+ norm_c = self.ff_norm_c(c) * (1 + c_scale_mlp[:, None]) + c_shift_mlp[:, None]
705
+ c_ff_output = self.ff_c(norm_c)
706
+ c = c + c_gate_mlp.unsqueeze(1) * c_ff_output
707
+
708
+ # process attention output for input x
709
+ x = x + x_gate_msa.unsqueeze(1) * x_attn_output
710
+
711
+ norm_x = self.ff_norm_x(x) * (1 + x_scale_mlp[:, None]) + x_shift_mlp[:, None]
712
+ x_ff_output = self.ff_x(norm_x)
713
+ x = x + x_gate_mlp.unsqueeze(1) * x_ff_output
714
+
715
+ return c, x
716
+
717
+
718
+ # time step conditioning embedding
719
+
720
+
721
+ class TimestepEmbedding(nn.Module):
722
+ def __init__(self, dim, freq_embed_dim=256):
723
+ super().__init__()
724
+ self.time_embed = SinusPositionEmbedding(freq_embed_dim)
725
+ self.time_mlp = nn.Sequential(nn.Linear(freq_embed_dim, dim), nn.SiLU(), nn.Linear(dim, dim))
726
+
727
+ def forward(self, timestep: float["b"]): # noqa: F821
728
+ time_hidden = self.time_embed(timestep)
729
+ time_hidden = time_hidden.to(timestep.dtype)
730
+ time = self.time_mlp(time_hidden) # b d
731
+ return time
732
+
733
+
734
+ class MIEsitmator(nn.Module):
735
+ def __init__(self, vocab_size, decoder_dim, hidden_size, dropout=0.5):
736
+ super(MIEsitmator, self).__init__()
737
+ self.proj = nn.Sequential(
738
+ torch.nn.Linear(decoder_dim, hidden_size, bias=True),
739
+ nn.ReLU(),
740
+ nn.Dropout(p=dropout)
741
+ )
742
+ self.ctc_proj = torch.nn.Linear(hidden_size, vocab_size + 1, bias=True)
743
+ self.ctc = nn.CTCLoss(blank=vocab_size, reduction='mean', zero_infinity=True)
744
+
745
+ def forward(self, decoder_outputs, target_phones, decoder_lengths, target_lengths):
746
+ out = self.proj(decoder_outputs.type(self.ctc_proj.weight.dtype))
747
+ log_probs = self.ctc_proj(out).log_softmax(dim=2)
748
+ log_probs = log_probs.transpose(1, 0)
749
+ ctc_loss = self.ctc(log_probs.float(), target_phones, decoder_lengths, target_lengths)
750
+ ctc_loss = ctc_loss / decoder_lengths.float()
751
+
752
+ # print("ctc_loss:", ctc_loss.shape, "ctc_max:", torch.max(ctc_loss), "ctc_min:", torch.min(ctc_loss), decoder_lengths[0])
753
+
754
+ # # 2. percentage of non-NaN values
755
+ # mask = ~torch.isnan(ctc_loss)
756
+ # total_count = ctc_loss.numel() # total number of elements (all dims)
757
+ # valid_count = mask.sum().item() # number of non-NaN elements
758
+ # valid_percentage = (valid_count / total_count) * 100
759
+ # print(f"ctc loss: total_count: {total_count}", f"valid_count: {valid_count}", f"valid_percentage: {valid_percentage:.2f}%")
760
+
761
+ # 3. replace NaN or values above 300 with 300
762
+ # ctc_loss = torch.where(torch.isnan(ctc_loss), 150.0, ctc_loss)
763
+ ctc_loss = torch.where((ctc_loss > 300.0) | torch.isnan(ctc_loss), 300.0, ctc_loss)
764
+ # ctc_loss = torch.nan_to_num(ctc_loss, nan=0.0, posinf=0.0, neginf=0.0)
765
+ # average by number of frames since taco_loss is averaged.
766
+ ctc_loss = ctc_loss.mean()
767
+ return ctc_loss
768
+
769
+ def inference(self, decoder_output):
770
+ out = self.proj(decoder_output.type(self.ctc_proj.weight.dtype))
771
+ log_probs = self.ctc_proj(out).log_softmax(dim=2)
772
+ log_probs = log_probs.transpose(1, 0)
773
+ return log_probs  # (T, B, vocab_size + 1); .item() would fail on a multi-element tensor
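+ # Shape sketch: decoder_outputs (B, T, D) -> log_probs (T, B, vocab_size + 1),
+ # with blank id vocab_size, matching nn.CTCLoss(blank=vocab_size) above.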
774
+
775
+
776
+ class AccentClassifier(nn.Module):
777
+ def __init__(self, input_dim, hidden_dim, num_accents, dropout=0.3):
778
+ super().__init__()
779
+ self.net = nn.Sequential(
780
+ nn.Linear(input_dim, hidden_dim),
781
+ nn.ReLU(),
782
+ nn.Dropout(dropout),
783
+ nn.Linear(hidden_dim, num_accents)
784
+ )
785
+
786
+ def forward(self, x):
787
+ return self.net(x)
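+ # e.g. AccentClassifier(input_dim=100, hidden_dim=256, num_accents=4) maps
+ # (B, T, 100) frame features to (B, T, 4) logits; callers in cfm.py mean-pool
+ # over T before the cross-entropy. (Dims here are illustrative.)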
788
+
789
+
790
+ class GradientReversalFunction(Function):
791
+ @staticmethod
792
+ def forward(ctx, x, lambda_):
793
+ ctx.lambda_ = lambda_
794
+ return x.view_as(x)
795
+
796
+ @staticmethod
797
+ def backward(ctx, grad_output):
798
+ return grad_output.neg() * ctx.lambda_, None
799
+
800
+ def grad_reverse(x, lambda_=1.0):
801
+ return GradientReversalFunction.apply(x, lambda_)
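+ # GRL sanity check: forward is the identity, backward flips and scales the sign.
+ # >>> x = torch.ones(3, requires_grad=True)
+ # >>> grad_reverse(x, lambda_=0.5).sum().backward()
+ # >>> x.grad # tensor([-0.5000, -0.5000, -0.5000])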
802
+
lemas_tts/model/utils.py ADDED
@@ -0,0 +1,190 @@
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ import random
5
+ from collections import defaultdict
6
+ from importlib.resources import files
7
+
8
+ import torch
9
+ from torch.nn.utils.rnn import pad_sequence
10
+
11
+ import jieba
12
+ from pypinyin import lazy_pinyin, Style
13
+ import sys
14
+
15
+ # seed everything
16
+
17
+
18
+ def seed_everything(seed=0):
19
+ random.seed(seed)
20
+ os.environ["PYTHONHASHSEED"] = str(seed)
21
+ torch.manual_seed(seed)
22
+ torch.cuda.manual_seed(seed)
23
+ torch.cuda.manual_seed_all(seed)
24
+ torch.backends.cudnn.deterministic = True
25
+ torch.backends.cudnn.benchmark = False
26
+
27
+
28
+ # helpers
29
+
30
+
31
+ def exists(v):
32
+ return v is not None
33
+
34
+
35
+ def default(v, d):
36
+ return v if exists(v) else d
37
+
38
+
39
+ # tensor helpers
40
+
41
+
42
+ def lens_to_mask(t: int["b"], length: int | None = None) -> bool["b n"]: # noqa: F722 F821
43
+ if not exists(length):
44
+ length = t.amax()
45
+
46
+ seq = torch.arange(length, device=t.device)
47
+ return seq[None, :] < t[:, None]
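+ # e.g. lens_to_mask(torch.tensor([2, 4])) ->
+ # tensor([[ True,  True, False, False],
+ #         [ True,  True,  True,  True]])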
48
+
49
+
50
+ def mask_from_start_end_indices(seq_len: int["b"], start: int["b"], end: int["b"]): # noqa: F722 F821
51
+ max_seq_len = seq_len.max().item()
52
+ seq = torch.arange(max_seq_len, device=start.device).long()
53
+ start_mask = seq[None, :] >= start[:, None]
54
+ end_mask = seq[None, :] < end[:, None]
55
+ return start_mask & end_mask
56
+
57
+
58
+ def mask_from_frac_lengths(seq_len: int["b"], frac_lengths: float["b"]): # noqa: F722 F821
59
+ lengths = (frac_lengths * seq_len).long()
60
+ max_start = seq_len - lengths
61
+
62
+ rand = torch.rand_like(frac_lengths)
63
+ start = (max_start * rand).long().clamp(min=0)
64
+ end = start + lengths
65
+
66
+ return mask_from_start_end_indices(seq_len, start, end)
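+ # e.g. seq_len=tensor([10]), frac_lengths=tensor([0.5]) masks one random
+ # contiguous span of 5 positions somewhere inside the first 10 frames.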
67
+
68
+
69
+ def maybe_masked_mean(t: float["b n d"], mask: bool["b n"] = None) -> float["b d"]: # noqa: F722
70
+ if not exists(mask):
71
+ return t.mean(dim=1)
72
+
73
+ t = torch.where(mask[:, :, None], t, torch.tensor(0.0, device=t.device))
74
+ num = t.sum(dim=1)
75
+ den = mask.float().sum(dim=1)
76
+
77
+ return num / den.clamp(min=1.0)
78
+
79
+
80
+ # simple utf-8 tokenizer, since paper went character based
81
+ def list_str_to_tensor(text: list[str], padding_value=-1) -> int["b nt"]: # noqa: F722
82
+ list_tensors = [torch.tensor([*bytes(t, "UTF-8")]) for t in text] # ByT5 style
83
+ text = pad_sequence(list_tensors, padding_value=padding_value, batch_first=True)
84
+ return text
85
+
86
+ # char tokenizer, based on custom dataset's extracted .txt file
87
+ def list_str_to_idx(
88
+ text: list[str] | list[list[str]],
89
+ vocab_char_map: dict[str, int], # {char: idx}
90
+ padding_value=-1,
91
+ ) -> int["b nt"]: # noqa: F722
92
+ list_idx_tensors = [torch.tensor([vocab_char_map.get(c, 0) for c in t]) for t in text] # pinyin or char style
93
+ text = pad_sequence(list_idx_tensors, padding_value=padding_value, batch_first=True)
94
+ return text
95
+
96
+
97
+ # Get tokenizer
98
+ def get_tokenizer(dataset_name, tokenizer: str = "pinyin"):
99
+ """
100
+ tokenizer - "pinyin" do g2p for only chinese characters, need .txt vocab_file
101
+ - "char" for char-wise tokenizer, need .txt vocab_file
102
+ - "byte" for utf-8 tokenizer
103
+ - "custom" if you're directly passing in a path to the vocab.txt you want to use
104
+ vocab_size - if use "pinyin", all available pinyin types, common alphabets (also those with accent) and symbols
105
+ - if use "char", derived from unfiltered character & symbol counts of custom dataset
106
+ - if use "byte", set to 256 (unicode byte range)
107
+ """
108
+ if tokenizer in ["pinyin", "char"]:
109
+ tokenizer_path = os.path.join(files("lemas_tts").joinpath("../../data"), f"{dataset_name}_{tokenizer}/vocab.txt")
110
+ with open(tokenizer_path, "r", encoding="utf-8") as f:
111
+ vocab_char_map = {}
112
+ for i, char in enumerate(f):
113
+ vocab_char_map[char[:-1]] = i
114
+ vocab_size = len(vocab_char_map)
115
+ assert vocab_char_map[" "] == 0, "make sure space is of idx 0 in vocab.txt, cuz 0 is used for unknown char"
116
+
117
+ elif tokenizer == "byte":
118
+ vocab_char_map = None
119
+ vocab_size = 256
120
+
121
+ elif tokenizer == "custom":
122
+ with open(dataset_name, "r", encoding="utf-8") as f:
123
+ vocab_char_map = {}
124
+ for i, char in enumerate(f):
125
+ vocab_char_map[char[:-1]] = i
126
+ vocab_size = len(vocab_char_map)
127
+
128
+ return vocab_char_map, vocab_size
129
+
130
+
131
+ # convert char to pinyin
132
+ def convert_char_to_pinyin(text_list, polyphone=True):
133
+ if jieba.dt.initialized is False:
134
+ jieba.default_logger.setLevel(50) # CRITICAL
135
+ jieba.initialize()
136
+
137
+ final_text_list = []
138
+ custom_trans = str.maketrans(
139
+ {";": ",", "“": '"', "”": '"', "‘": "'", "’": "'"}
140
+ ) # add custom trans here, to address oov
141
+
142
+ def is_chinese(c):
143
+ return (
144
+ "\u3100" <= c <= "\u9fff" # common chinese characters
145
+ )
146
+
147
+ for text in text_list:
148
+ char_list = []
149
+ text = text.translate(custom_trans)
150
+ from lemas_tts.infer.text_norm.cn_tn import NSWNormalizer  # cn_tn lives under infer/text_norm/
151
+ text = NSWNormalizer(text.strip()).normalize()
152
+ text = list(jieba.cut(text))
153
+ for seg in text:
154
+ seg_byte_len = len(bytes(seg, "UTF-8"))
155
+ if seg_byte_len == len(seg): # if pure alphabets and symbols
156
+ if char_list and seg_byte_len > 1 and char_list[-1] not in " :'\"":
157
+ char_list.append(" ")
158
+ char_list.extend(seg)
159
+ elif polyphone and seg_byte_len == 3 * len(seg): # if pure east asian characters
160
+ seg_ = lazy_pinyin(seg, style=Style.TONE3, tone_sandhi=True)
161
+ for i, c in enumerate(seg):
162
+ if is_chinese(c):
163
+ char_list.append(" ")
164
+ char_list.append(seg_[i])
165
+ else: # if mixed characters, alphabets and symbols
166
+ for c in seg:
167
+ if ord(c) < 256:
168
+ char_list.extend(c)
169
+ elif is_chinese(c):
170
+ char_list.append(" ")
171
+ char_list.extend(lazy_pinyin(c, style=Style.TONE3, tone_sandhi=True))
172
+ else:
173
+ char_list.append(c)
174
+ final_text_list.append(char_list)
175
+
176
+ return final_text_list
177
+
178
+
179
+ # filter func for dirty data with many repetitions
180
+
181
+
182
+ def repetition_found(text, length=2, tolerance=10):
183
+ pattern_count = defaultdict(int)
184
+ for i in range(len(text) - length + 1):
185
+ pattern = text[i : i + length]
186
+ pattern_count[pattern] += 1
187
+ for pattern, count in pattern_count.items():
188
+ if count > tolerance:
189
+ return True
190
+ return False
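A quick sketch of the tensor helpers above (lens_to_mask feeding maybe_masked_mean), with toy shapes chosen purely for illustration:

    import torch

    lengths = torch.tensor([2, 4])      # valid frames per sample (B,)
    mask = lens_to_mask(lengths)        # (B, 4) bool, True inside each sample's length
    feats = torch.randn(2, 4, 8)        # padded features (B, N, D)

    pooled = maybe_masked_mean(feats, mask)  # (B, D); padded frames are excluded from the mean
    assert pooled.shape == (2, 8)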
lemas_tts/scripts/inference_gradio.py ADDED
@@ -0,0 +1,584 @@
1
+ import gc
2
+ import os
3
+ import platform
4
+ import psutil
5
+ import tempfile
6
+ from glob import glob
7
+ import traceback
8
+ import click
9
+ import gradio as gr
10
+ import torch
11
+
12
+ import sys
13
+ from pathlib import Path
14
+
15
+ # Add the local code directory so that `lemas_tts` can be imported when running this
16
+ # script directly without installing the package.
17
+ THIS_FILE = Path(__file__).resolve()
18
+ SRC_ROOT = THIS_FILE.parents[2] # .../code
19
+ sys.path.append(str(SRC_ROOT))
20
+
21
+
22
+ def _find_repo_root(start: Path) -> Path:
23
+ """Locate the repo root by looking for a `pretrained_models` folder upwards."""
24
+ for p in [start, *start.parents]:
25
+ if (p / "pretrained_models").is_dir():
26
+ return p
27
+ cwd = Path.cwd()
28
+ if (cwd / "pretrained_models").is_dir():
29
+ return cwd
30
+ return start
31
+
32
+
33
+ REPO_ROOT = _find_repo_root(THIS_FILE)
34
+ PRETRAINED_ROOT = REPO_ROOT / "pretrained_models"
35
+ CKPTS_ROOT = PRETRAINED_ROOT / "ckpts"
36
+ DATA_ROOT = PRETRAINED_ROOT / "data"
37
+ UVR5_CODE_DIR = REPO_ROOT / "code" / "uvr5"
38
+ UVR5_MODEL_DIR = PRETRAINED_ROOT / "uvr5" / "models" / "MDX_Net_Models" / "model_data"
39
+
40
+ from lemas_tts.api import F5TTS
41
+ import torchaudio  # torch is already imported above
42
+ import soundfile as sf
43
+
44
+ # Global variables
45
+ tts_api = None
46
+ last_checkpoint = ""
47
+ last_device = ""
48
+ last_ema = None
49
+
50
+ # Device detection
51
+ device = (
52
+ "cuda"
53
+ if torch.cuda.is_available()
54
+ else "xpu"
55
+ if hasattr(torch, "xpu") and torch.xpu.is_available()  # torch.xpu is absent on older PyTorch builds
56
+ else "mps"
57
+ if torch.backends.mps.is_available()
58
+ else "cpu"
59
+ )
60
+
61
+
62
+ class UVR5:
63
+ def __init__(self, model_dir):
64
+ code_dir = str(UVR5_CODE_DIR)
65
+ self.model = self.load_model(str(model_dir), code_dir)
66
+
67
+ def load_model(self, model_dir, code_dir):
68
+ import sys, json, os
69
+ sys.path.append(code_dir)
70
+ from multiprocess_cuda_infer import ModelData, Inference
71
+ model_path = os.path.join(model_dir, 'Kim_Vocal_1.onnx')
72
+ config_path = os.path.join(model_dir, 'MDX-Net-Kim-Vocal1.json')
73
+ configs = json.loads(open(config_path, 'r', encoding='utf-8').read())
74
+ model_data = ModelData(
75
+ model_path=model_path,
76
+ audio_path = model_dir,
77
+ result_path = model_dir,
78
+ device = 'cpu',
79
+ process_method = "MDX-Net",
80
+ base_dir=code_dir,
81
+ **configs
82
+ )
83
+
84
+ uvr5_model = Inference(model_data, 'cpu')
85
+ uvr5_model.load_model(model_path, 1)
86
+ return uvr5_model
87
+
88
+ def denoise(self, audio_info):
89
+ print("denoise UVR5: ", audio_info)
90
+ input_audio = load_wav(audio_info, sr=44100, channel=2)
91
+ output_audio = self.model.demix_base({0:input_audio.squeeze()}, is_match_mix=False)
92
+ # transform = torchaudio.transforms.Resample(44100, 16000)
93
+ # output_audio = transform(output_audio)
94
+ return output_audio.squeeze().T.numpy(), 44100
95
+
96
+
97
+ denoise_model = UVR5(UVR5_MODEL_DIR)
98
+
99
+ def load_wav(audio_info, sr=16000, channel=1):
100
+ print("load audio:", audio_info)
101
+ audio, raw_sr = torchaudio.load(audio_info)
102
+ audio = audio.T if len(audio.shape) > 1 and audio.shape[1] == 2 else audio
103
+ audio = audio / torch.max(torch.abs(audio))
104
+ audio = audio.squeeze().float()
105
+ if channel == 1 and len(audio.shape) == 2: # stereo to mono
106
+ audio = audio.mean(dim=0, keepdim=True)
107
+ elif channel == 2 and len(audio.shape) == 1:
108
+ audio = torch.stack((audio, audio)) # mono to stereo
109
+ if raw_sr != sr:
110
+ audio = torchaudio.functional.resample(audio.squeeze(), raw_sr, sr)
111
+ audio = torch.clip(audio, -0.999, 0.999).squeeze()
112
+ return audio
113
+
114
+
115
+ def denoise(audio_info):
116
+ save_path = "./denoised_audio.wav"
117
+ denoised_audio, sr = denoise_model.denoise(audio_info)
118
+ sf.write(save_path, denoised_audio, sr, format='wav', subtype='PCM_24')
119
+ print("save denoised audio:", save_path)
120
+ return save_path
121
+
122
+ def cancel_denoise(audio_info):
123
+ return audio_info
124
+
125
+
126
+ def get_checkpoints_project(project_name=None, is_gradio=True):
127
+ """Get available checkpoint files"""
128
+ checkpoint_dir = [str(CKPTS_ROOT)]
129
+ if project_name is None:
130
+ # Look for checkpoints in common locations
131
+ files_checkpoints = []
132
+ for path in checkpoint_dir:
133
+ if os.path.isdir(path):
134
+ files_checkpoints.extend(glob(os.path.join(path, "**/*.pt"), recursive=True))
135
+ files_checkpoints.extend(glob(os.path.join(path, "**/*.safetensors"), recursive=True))
136
+ break
137
+ else:
138
+ # project_name = project_name.replace("_pinyin", "").replace("_char", "")
139
+ project_name = "_".join(["F5TTS_v1_Base", "vocos", "custom", project_name.replace("_custom", "")]) if project_name != "F5TTS_v1_Base" else project_name
140
+ if os.path.isdir(checkpoint_dir[0]):
141
+ files_checkpoints = glob(os.path.join(checkpoint_dir[0], project_name, "*.pt"))
142
+ files_checkpoints.extend(glob(os.path.join(checkpoint_dir[0], project_name, "*.safetensors")))
143
+ else:
144
+ files_checkpoints = []
145
+ print("files_checkpoints:", project_name, files_checkpoints)
146
+ # Separate pretrained and regular checkpoints
147
+ pretrained_checkpoints = [f for f in files_checkpoints if "pretrained_" in os.path.basename(f)]
148
+ regular_checkpoints = [
149
+ f
150
+ for f in files_checkpoints
151
+ if "pretrained_" not in os.path.basename(f) and "model_last.pt" not in os.path.basename(f)
152
+ ]
153
+ last_checkpoint = [f for f in files_checkpoints if "model_last.pt" in os.path.basename(f)]
154
+
155
+ # Sort regular checkpoints by number
156
+ try:
157
+ regular_checkpoints = sorted(
158
+ regular_checkpoints, key=lambda x: int(os.path.basename(x).split("_")[1].split(".")[0])
159
+ )
160
+ except (IndexError, ValueError):
161
+ regular_checkpoints = sorted(regular_checkpoints)
162
+
163
+ # Combine in order: pretrained, regular, last
164
+ files_checkpoints = pretrained_checkpoints + regular_checkpoints + last_checkpoint
165
+
166
+ select_checkpoint = None if not files_checkpoints else files_checkpoints[-1]
167
+
168
+ if is_gradio:
169
+ return gr.update(choices=files_checkpoints, value=select_checkpoint)
170
+
171
+ return files_checkpoints, select_checkpoint
172
+
173
+
174
+ def get_available_projects():
175
+ """Get available project names from data directory"""
176
+ data_path = str(DATA_ROOT)
177
+
178
+ project_list = []
179
+ if os.path.isdir(data_path):
180
+ for folder in os.listdir(data_path):
181
+ if "test" in folder:
182
+ continue
183
+ project_list.append(folder)
184
+
185
+ # Fallback to a sensible default if no projects are found
186
+ if not project_list:
187
+ project_list = ["multilingual_acc_grl_custom"]
188
+
189
+ return project_list
190
+
191
+
192
+ def infer(
193
+ project, file_checkpoint, exp_name, ref_text, ref_audio, denoise_audio, gen_text, nfe_step, use_ema, separate_langs, frontend, speed, cfg_strength, use_acc_grl, ref_ratio, no_ref_audio, sway_sampling_coef, use_prosody_encoder, seed
194
+ ):
195
+ global last_checkpoint, last_device, tts_api, last_ema
196
+
197
+ if not file_checkpoint or not os.path.isfile(file_checkpoint):  # guard against None before isfile
198
+ return None, "Checkpoint not found!", ""
199
+
200
+ if denoise_audio:
201
+ ref_audio = denoise_audio
202
+
203
+ device_test = device # Use the global device
204
+
205
+ if last_checkpoint != file_checkpoint or last_device != device_test or last_ema != use_ema or tts_api is None:
206
+ if last_checkpoint != file_checkpoint:
207
+ last_checkpoint = file_checkpoint
208
+
209
+ if last_device != device_test:
210
+ last_device = device_test
211
+
212
+ if last_ema != use_ema:
213
+ last_ema = use_ema
214
+
215
+ # Try to find vocab file
216
+ vocab_file = None
217
+ possible_vocab_paths = [
218
+ str(DATA_ROOT / project / "vocab.txt"),
219
+ # legacy fallbacks for older layouts
220
+ f"./data/{project}/vocab.txt",
221
+ f"../../data/{project}/vocab.txt",
222
+ "./data/Emilia_ZH_EN_pinyin/vocab.txt",
223
+ "../../data/Emilia_ZH_EN_pinyin/vocab.txt",
224
+ ]
225
+
226
+ for path in possible_vocab_paths:
227
+ if os.path.isfile(path):
228
+ vocab_file = path
229
+ break
230
+
231
+ if vocab_file is None:
232
+ return None, "Vocab file not found!", ""
233
+
234
+ try:
235
+ tts_api = F5TTS(
236
+ model=exp_name,
237
+ ckpt_file=file_checkpoint,
238
+ vocab_file=vocab_file,
239
+ device=device_test,
240
+ use_ema=use_ema,
241
+ frontend=frontend,
242
+ use_prosody_encoder=use_prosody_encoder,
243
+ prosody_cfg_path=str(CKPTS_ROOT / "prosody_encoder" / "pretssel_cfg.json"),
244
+ prosody_ckpt_path=str(CKPTS_ROOT / "prosody_encoder" / "prosody_encoder_UnitY2.pt"),
245
+ )
246
+ except Exception as e:
247
+ traceback.print_exc()
248
+ return None, f"Error loading model: {str(e)}", ""
249
+
250
+ print("Model loaded >>", device_test, file_checkpoint, use_ema)
251
+
252
+ if seed == -1: # -1 used for random
253
+ seed = None
254
+
255
+ try:
256
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
257
+ tts_api.infer(
258
+ ref_file=ref_audio,
259
+ ref_text=ref_text.strip(),
260
+ gen_text=gen_text.strip(),
261
+ nfe_step=nfe_step,
262
+ separate_langs=separate_langs,
263
+ speed=speed,
264
+ cfg_strength=cfg_strength,
265
+ sway_sampling_coef=sway_sampling_coef,
266
+ use_acc_grl=use_acc_grl,
267
+ ref_ratio=ref_ratio,
268
+ no_ref_audio=no_ref_audio,
269
+ use_prosody_encoder=use_prosody_encoder,
270
+ file_wave=f.name,
271
+ seed=seed,
272
+ )
273
+ return f.name, f"Device: {tts_api.device}", str(tts_api.seed)
274
+ except Exception as e:
275
+ traceback.print_exc()
276
+ return None, f"Inference error: {str(e)}", ""
277
+
278
+
279
+ def get_gpu_stats():
280
+ """Get GPU statistics"""
281
+ gpu_stats = ""
282
+
283
+ if torch.cuda.is_available():
284
+ gpu_count = torch.cuda.device_count()
285
+ for i in range(gpu_count):
286
+ gpu_name = torch.cuda.get_device_name(i)
287
+ gpu_properties = torch.cuda.get_device_properties(i)
288
+ total_memory = gpu_properties.total_memory / (1024**3) # in GB
289
+ allocated_memory = torch.cuda.memory_allocated(i) / (1024**2) # in MB
290
+ reserved_memory = torch.cuda.memory_reserved(i) / (1024**2) # in MB
291
+
292
+ gpu_stats += (
293
+ f"GPU {i} Name: {gpu_name}\n"
294
+ f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
295
+ f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
296
+ f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
297
+ )
298
+ elif hasattr(torch, "xpu") and torch.xpu.is_available():  # torch.xpu is absent on older PyTorch builds
299
+ gpu_count = torch.xpu.device_count()
300
+ for i in range(gpu_count):
301
+ gpu_name = torch.xpu.get_device_name(i)
302
+ gpu_properties = torch.xpu.get_device_properties(i)
303
+ total_memory = gpu_properties.total_memory / (1024**3) # in GB
304
+ allocated_memory = torch.xpu.memory_allocated(i) / (1024**2) # in MB
305
+ reserved_memory = torch.xpu.memory_reserved(i) / (1024**2) # in MB
306
+
307
+ gpu_stats += (
308
+ f"GPU {i} Name: {gpu_name}\n"
309
+ f"Total GPU memory (GPU {i}): {total_memory:.2f} GB\n"
310
+ f"Allocated GPU memory (GPU {i}): {allocated_memory:.2f} MB\n"
311
+ f"Reserved GPU memory (GPU {i}): {reserved_memory:.2f} MB\n\n"
312
+ )
313
+ elif torch.backends.mps.is_available():
314
+ gpu_count = 1
315
+ gpu_stats += "MPS GPU\n"
316
+ total_memory = psutil.virtual_memory().total / (
317
+ 1024**3
318
+ ) # Total system memory (MPS doesn't have its own memory)
319
+ allocated_memory = 0
320
+ reserved_memory = 0
321
+
322
+ gpu_stats += (
323
+ f"Total system memory: {total_memory:.2f} GB\n"
324
+ f"Allocated GPU memory (MPS): {allocated_memory:.2f} MB\n"
325
+ f"Reserved GPU memory (MPS): {reserved_memory:.2f} MB\n"
326
+ )
327
+
328
+ else:
329
+ gpu_stats = "No GPU available"
330
+
331
+ return gpu_stats
332
+
333
+
334
+ def get_cpu_stats():
335
+ """Get CPU statistics"""
336
+ cpu_usage = psutil.cpu_percent(interval=1)
337
+ memory_info = psutil.virtual_memory()
338
+ memory_used = memory_info.used / (1024**2)
339
+ memory_total = memory_info.total / (1024**2)
340
+ memory_percent = memory_info.percent
341
+
342
+ pid = os.getpid()
343
+ process = psutil.Process(pid)
344
+ nice_value = process.nice()
345
+
346
+ cpu_stats = (
347
+ f"CPU Usage: {cpu_usage:.2f}%\n"
348
+ f"System Memory: {memory_used:.2f} MB used / {memory_total:.2f} MB total ({memory_percent}% used)\n"
349
+ f"Process Priority (Nice value): {nice_value}"
350
+ )
351
+
352
+ return cpu_stats
353
+
354
+
355
+ def get_combined_stats():
356
+ """Get combined system stats"""
357
+ gpu_stats = get_gpu_stats()
358
+ cpu_stats = get_cpu_stats()
359
+ combined_stats = f"### GPU Stats\n{gpu_stats}\n\n### CPU Stats\n{cpu_stats}"
360
+ return combined_stats
361
+
362
+
363
+ # Create Gradio interface
364
+ with gr.Blocks(title="LEMAS-TTS Inference") as app:
365
+ gr.Markdown(
366
+ """
367
+ # Zero-Shot TTS
368
+
369
+ Set seed to -1 for random generation.
370
+ """
371
+ )
372
+ with gr.Accordion("Model configuration", open=False):
373
+ # Model configuration
374
+ with gr.Row():
375
+ exp_name = gr.Radio(
376
+ label="Model", choices=["F5TTS_v1_Base", "F5TTS_Base", "E2TTS_Base"], value="F5TTS_v1_Base", visible=False
377
+ )
378
+ # Project selection
379
+ available_projects = get_available_projects()
380
+
381
+ # Get initial checkpoints
382
+ list_checkpoints, checkpoint_select = get_checkpoints_project(available_projects[0] if available_projects else None, False)
383
+
384
+ with gr.Row():
385
+ with gr.Column(scale=1):
386
+ # load_models_btn = gr.Button(value="Load models")
387
+ cm_project = gr.Dropdown(
388
+ choices=available_projects,
389
+ value=available_projects[0] if available_projects else None,
390
+ label="Project",
391
+ allow_custom_value=True,
392
+ scale=4
393
+ )
394
+
395
+ with gr.Column(scale=5):
396
+ cm_checkpoint = gr.Dropdown(
397
+ choices=list_checkpoints, value=checkpoint_select, label="Checkpoints", allow_custom_value=True # scale=4,
398
+ )
399
+ bt_checkpoint_refresh = gr.Button("Refresh", scale=1)
400
+
401
+ with gr.Row():
402
+ ch_use_ema = gr.Checkbox(label="Use EMA", value=True, scale=2, info="Turn off at early stage might offer better results")
403
+ frontend = gr.Radio(label="Frontend", choices=["phone", "char", "bpe"], value="phone", scale=3)
404
+ separate_langs = gr.Checkbox(label="Separate Languages", value=True, scale=2, info="separate language tokens")
405
+
406
+ # Inference parameters
407
+ with gr.Row():
408
+ nfe_step = gr.Number(label="NFE Step", scale=1, value=64)
409
+ speed = gr.Slider(label="Speed", scale=3, value=1.0, minimum=0.5, maximum=1.5, step=0.1)
410
+ cfg_strength = gr.Slider(label="CFG Strength", scale=2, value=5.0, minimum=0.0, maximum=10.0, step=1)
411
+ sway_sampling_coef = gr.Slider(label="Sway Sampling Coef", scale=2, value=3, minimum=-1, maximum=5, step=0.1)
412
+ ref_ratio = gr.Slider(label="Ref Ratio", scale=2, value=1.0, minimum=0.0, maximum=1.0, step=0.1)
413
+ no_ref_audio = gr.Checkbox(label="No Reference Audio", value=False, scale=1, info="No mel condition")
414
+ use_acc_grl = gr.Checkbox(label="Use accent grl condition", value=False, scale=1, info="Use accent grl condition")
415
+ use_prosody_encoder = gr.Checkbox(label="Use prosody encoder", value=False, scale=1, info="Use prosody encoder")
416
+ seed = gr.Number(label="Random Seed", scale=1, value=5828684826493313192, minimum=-1)
417
+
418
+
419
+ # Input fields
420
+ ref_text = gr.Textbox(label="Reference Text", placeholder="Enter the text for the reference audio...")
421
+ ref_audio = gr.Audio(label="Reference Audio", type="filepath", interactive=True, show_download_button=True, editable=True)
422
+
423
+
424
+ with gr.Row():
425
+ denoise_btn = gr.Button(value="Denoise")
426
+ cancel_btn = gr.Button(value="Cancel Denoise")
427
+ denoise_audio = gr.Audio(label="Denoised Audio", value=None, type="filepath", interactive=True, show_download_button=True, editable=True)
428
+
429
+ gen_text = gr.Textbox(label="Text to Generate", placeholder="Enter the text you want to generate...")
430
+
431
+ # Inference button and outputs
432
+ with gr.Row():
433
+ txt_info_gpu = gr.Textbox("", label="Device Info")
434
+ seed_info = gr.Textbox(label="Used Random Seed")
435
+ check_button_infer = gr.Button("Generate Audio", variant="primary")
436
+
437
+ gen_audio = gr.Audio(label="Generated Audio", type="filepath", interactive=True, show_download_button=True, editable=True)
438
+
439
+ # Examples
440
+ examples = gr.Examples(
441
+ examples=[
442
+ [
443
+ "Ich glaub, mein Schwein pfeift.",
444
+ str(DATA_ROOT / "test_examples" / "de.wav"),
445
+ "我觉得我的猪在吹口哨。",
446
+ ],
447
+ [
448
+ "em, #1 I have a list of YouTubers, and I'm gonna be going to their houses and raiding them by.",
449
+ str(DATA_ROOT / "test_examples" / "en.wav"),
450
+ "我有一份 YouTuber 名单,我打算去他们家,对他们进行突袭。",
451
+ ],
452
+ [
453
+ "Te voy a dar un tip #1 que le copia a John Rockefeller, uno de los empresarios más picudos de la historia.",
454
+ str(DATA_ROOT / "test_examples" / "es.wav"),
455
+ "我要给你一个从历史上最精明的商人之一约翰·洛克菲勒那里抄来的秘诀。",
456
+ ],
457
+ [
458
+ "Per l'amor di Dio #1 fai, #2 se pensi di non poterti fermare, fallo #1 e fallo.",
459
+ str(DATA_ROOT / "test_examples" / "it.wav"),
460
+ "看在上帝的份上,去做吧,如果你认为你无法停止,那就去做吧,继续做下去。",
461
+ ],
462
+ [
463
+ "Nova, #1 dia 25 desse mês vai rolar operação the last Frontier.",
464
+ str(DATA_ROOT / "test_examples" / "pt.wav"),
465
+ "新消息,本月二十五日,'最后的边疆行动'将启动。",
466
+ ],
467
+ # ["Good morning! #1 ",
468
+ # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_0.wav",
469
+ # " #1"
470
+ # ],
471
+ # ["Good morning! #1 ",
472
+ # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_1.wav",
473
+ # " #1",
474
+ # ],
475
+ # ["Good morning! #1 ",
476
+ # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_2.wav",
477
+ # " #1",
478
+ # ],
479
+ # ["Oh, and in case I don't see ya, #1",
480
+ # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_3.wav",
481
+ # " #1",
482
+ # ],
483
+ # ["Good afternoon, good evening, and good night. #1",
484
+ # "/mnt/code/lemas/F5-TTS/data/trueman/recognition_d0a02641c090813574a8ec398220339f_4.wav",
485
+ # " #1",
486
+ # ],
487
+ ],
488
+ inputs=[
489
+ ref_text,
490
+ ref_audio,
491
+ gen_text,
492
+ ],
493
+ outputs=[gen_audio, txt_info_gpu, seed_info],
494
+ fn=infer,
495
+ cache_examples=False
496
+ )
497
+
498
+ # System Info section at the bottom
499
+ gr.Markdown("---")
500
+ gr.Markdown("## System Information")
501
+ with gr.Accordion("Update System Stats", open=False):
502
+ update_button = gr.Button("Update System Stats", scale=1)
503
+ output_box = gr.Textbox(label="GPU and CPU Information", lines=5, scale=5)
504
+
505
+ def update_stats():
506
+ return get_combined_stats()
507
+
508
+
509
+ denoise_btn.click(fn=denoise,
510
+ inputs=[ref_audio],
511
+ outputs=[denoise_audio])
512
+
513
+ cancel_btn.click(fn=cancel_denoise,
514
+ inputs=[ref_audio],
515
+ outputs=[denoise_audio])
516
+
517
+ # Event handlers
518
+ check_button_infer.click(
519
+ fn=infer,
520
+ inputs=[
521
+ cm_project,
522
+ cm_checkpoint,
523
+ exp_name,
524
+ ref_text,
525
+ ref_audio,
526
+ denoise_audio,
527
+ gen_text,
528
+ nfe_step,
529
+ ch_use_ema,
530
+ separate_langs,
531
+ frontend,
532
+ speed,
533
+ cfg_strength,
534
+ use_acc_grl,
535
+ ref_ratio,
536
+ no_ref_audio,
537
+ sway_sampling_coef,
538
+ use_prosody_encoder,
539
+ seed,
540
+ ],
541
+ outputs=[gen_audio, txt_info_gpu, seed_info],
542
+ )
543
+
544
+ bt_checkpoint_refresh.click(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
545
+ cm_project.change(fn=get_checkpoints_project, inputs=[cm_project], outputs=[cm_checkpoint])
546
+
547
+ ref_audio.change(
548
+ fn=lambda x: None,
549
+ inputs=[ref_audio],
550
+ outputs=[denoise_audio]
551
+ )
552
+
553
+ update_button.click(fn=update_stats, outputs=output_box)
554
+
555
+ # Auto-load system stats on startup
556
+ app.load(fn=update_stats, outputs=output_box)
557
+
558
+
559
+ @click.command()
560
+ @click.option("--port", "-p", default=7860, type=int, help="Port to run the app on")
561
+ @click.option("--host", "-H", default="0.0.0.0", help="Host to run the app on")
562
+ @click.option(
563
+ "--share",
564
+ "-s",
565
+ default=False,
566
+ is_flag=True,
567
+ help="Share the app via Gradio share link",
568
+ )
569
+ @click.option("--api", "-a", default=True, is_flag=True, help="Allow API access")
570
+ def main(port, host, share, api):
571
+ global app
572
+ print("Starting LEMAS-TTS Inference Interface...")
573
+ print(f"Device: {device}")
574
+ app.queue(api_open=api).launch(
575
+ server_name=host,
576
+ server_port=port,
577
+ share=share,
578
+ show_api=api,
579
+ allowed_paths=[str(DATA_ROOT)],
580
+ )
581
+
582
+
583
+ if __name__ == "__main__":
584
+ main()
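Once launched with --api enabled, the app exposes Gradio's auto-generated HTTP API; a hedged sketch of inspecting it remotely with gradio_client (host/port assume the defaults in main(); check the view_api() output before calling any endpoint, since endpoint names and argument order are not guaranteed here):

    from gradio_client import Client

    client = Client("http://localhost:7860")  # default host/port from main(), adjust as needed
    client.view_api()  # prints the generated endpoints and their signatures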
requirements.txt ADDED
@@ -0,0 +1,182 @@
1
+ --extra-index-url https://download.pytorch.org/whl/cu121
2
+ accelerate>=0.33.0
3
+ aiofiles==23.2.1
4
+ aiohappyeyeballs==2.6.1
5
+ aiohttp==3.13.2
6
+ aiosignal==1.4.0
7
+ annotated-doc==0.0.4
8
+ annotated-types==0.7.0
9
+ antlr4-python3-runtime==4.9.3
10
+ anyio==4.12.0
11
+ attrs==25.4.0
12
+ audioread==3.1.0
13
+ babel==2.17.0
14
+ bitsandbytes>0.37.0; platform_machine != "arm64" and platform_system != "Darwin"
15
+ boto3==1.42.16
16
+ botocore==1.42.16
17
+ brotli==1.2.0
18
+ cached_path
19
+ cachetools==6.2.4
20
+ certifi==2025.11.12
21
+ cffi==2.0.0
22
+ charset-normalizer==3.4.4
23
+ click
24
+ contourpy==1.3.2
25
+ csvw==3.7.0
26
+ cycler==0.12.1
27
+ datasets
28
+ decorator==5.2.1
29
+ dill==0.4.0
30
+ dlinfo==2.0.0
31
+ docopt==0.6.2
32
+ einops==0.8.1
33
+ einx==0.3.0
34
+ ema-pytorch==0.7.3
35
+ encodec==0.1.1
36
+ espeakng==1.0.2
37
+ espeak_phonemizer==1.3.1
38
+ fastapi==0.127.0
39
+ ffmpy==1.0.0
40
+ filelock==3.20.1
41
+ fonttools==4.61.1
42
+ frozendict==2.4.7
43
+ frozenlist==1.8.0
44
+ fsspec==2025.10.0
45
+ gitdb==4.0.12
46
+ GitPython==3.1.45
47
+ google-api-core==2.28.1
48
+ google-auth==2.45.0
49
+ google-cloud-core==2.5.0
50
+ google-cloud-storage==3.7.0
51
+ google-crc32c==1.8.0
52
+ google-resumable-media==2.8.0
53
+ googleapis-common-protos==1.72.0
54
+ gradio==5.38.0
55
+ gradio-client==1.11.0
56
+ groovy==0.1.2
57
+ h11==0.16.0
58
+ hf-xet==1.2.0
59
+ httpcore==1.0.9
60
+ httpx==0.28.1
61
+ huggingface-hub==0.36.0
62
+ hydra-core>=1.3.0
63
+ idna==3.11
64
+ isodate==0.7.2
65
+ jieba
66
+ Jinja2==3.1.6
67
+ jmespath==1.0.1
68
+ joblib==1.5.3
69
+ jsonschema==4.25.1
70
+ jsonschema-specifications==2025.9.1
71
+ kiwisolver==1.4.9
72
+ langid==1.1.6
73
+ language-tags==1.2.0
74
+ lazy_loader==0.4
75
+ librosa
76
+ llvmlite==0.42.0
77
+ loguru==0.7.3
78
+ markdown-it-py==4.0.0
79
+ MarkupSafe
80
+ matplotlib
81
+ mdurl==0.1.2
82
+ mpmath==1.3.0
83
+ msgpack==1.1.2
84
+ multidict==6.7.0
85
+ multiprocess==0.70.18
86
+ networkx==3.1
87
+ num2words==0.5.13
88
+ numba==0.59.0
89
+ numpy==1.26.0
90
+ nvidia-cublas-cu12==12.1.3.1
91
+ nvidia-cuda-cupti-cu12==12.1.105
92
+ nvidia-cuda-nvrtc-cu12==12.1.105
93
+ nvidia-cuda-runtime-cu12==12.1.105
94
+ nvidia-cudnn-cu12==8.9.2.26
95
+ nvidia-cufft-cu12==11.0.2.54
96
+ nvidia-cufile-cu12==1.11.1.6
97
+ nvidia-curand-cu12==10.3.2.106
98
+ nvidia-cusolver-cu12==11.4.5.107
99
+ nvidia-cusparse-cu12==12.1.0.106
100
+ nvidia-cusparselt-cu12==0.6.3
101
+ nvidia-nccl-cu12==2.20.5
102
+ nvidia-nvjitlink-cu12==12.6.85
103
+ nvidia-nvtx-cu12==12.1.105
104
+ omegaconf==2.3.0
105
+ onnx==1.16.0
106
+ onnxruntime
107
+ onnxruntime-gpu
108
+ orjson==3.11.5
109
+ packaging==25.0
110
+ pandas==2.3.3
111
+ phonemizer==3.3.0
112
+ pillow==11.3.0
113
+ platformdirs==4.5.1
114
+ pooch==1.8.2
115
+ propcache==0.4.1
116
+ proto-plus==1.27.0
117
+ protobuf==6.33.2
118
+ psutil==7.2.0
119
+ pyarrow==22.0.0
120
+ pyasn1==0.6.1
121
+ pyasn1_modules==0.4.2
122
+ pycparser==2.23
123
+ pydantic<=2.10.6
124
+ pydantic_core==2.27.2
125
+ pydub
126
+ py-espeak-ng==0.1.8
127
+ Pygments==2.19.2
128
+ pyparsing==3.3.1
129
+ pypinyin
130
+ pypinyin-dict
131
+ python-dateutil==2.9.0.post0
132
+ python-multipart==0.0.21
133
+ pytz==2025.2
134
+ PyYAML==6.0.3
135
+ rdflib==7.5.0
136
+ referencing==0.37.0
137
+ regex
138
+ requests==2.32.5
139
+ rfc3986==1.5.0
140
+ rich==13.9.4
141
+ rpds-py==0.30.0
142
+ rsa==4.9.1
143
+ s3transfer==0.16.0
144
+ safehttpx==0.1.7
145
+ safetensors
146
+ scikit-learn==1.7.1
147
+ scipy==1.15.3
148
+ segments==2.3.0
149
+ semantic-version==2.10.0
150
+ sentry-sdk==2.48.0
151
+ setuptools==80.9.0
152
+ shellingham==1.5.4
153
+ six==1.17.0
154
+ smmap==5.0.2
155
+ soundfile
156
+ soxr==1.0.0
157
+ starlette==0.50.0
158
+ sympy==1.14.0
159
+ termcolor==3.2.0
160
+ threadpoolctl==3.6.0
161
+ tokenizers==0.22.1
162
+ tomli
163
+ tomlkit==0.13.3
164
+ torch==2.3.1
165
+ torchaudio==2.3.1
166
+ torchdiffeq==0.2.4
167
+ tqdm>=4.65.0
168
+ transformers
169
+ transformers-stream-generator
170
+ triton==2.3.1
171
+ typer==0.16.0
172
+ typing_extensions==4.12.2
173
+ tzdata==2025.3
174
+ uritemplate==4.2.0
175
+ urllib3==2.6.2
176
+ uroman
177
+ uvicorn==0.40.0
178
+ vocos
179
+ x-transformers>=1.31.14
180
+ xxhash==3.6.0
181
+ yarl==1.22.0
182
+ zhconv
uvr5/gui_data/constants.py ADDED
@@ -0,0 +1,1147 @@
1
+ import platform
2
+
3
+ #Platform Details
4
+ OPERATING_SYSTEM = platform.system()
5
+ SYSTEM_ARCH = platform.platform()
6
+ SYSTEM_PROC = platform.processor()
7
+ ARM = 'arm'
8
+
9
+ #Main Font
10
+ MAIN_FONT_NAME = "Century Gothic"
11
+
12
+ #Model Types
13
+ VR_ARCH_TYPE = 'VR Arc'
14
+ MDX_ARCH_TYPE = 'MDX-Net'
15
+ DEMUCS_ARCH_TYPE = 'Demucs'
16
+ VR_ARCH_PM = 'VR Architecture'
17
+ ENSEMBLE_MODE = 'Ensemble Mode'
18
+ ENSEMBLE_STEM_CHECK = 'Ensemble Stem'
19
+ SECONDARY_MODEL = 'Secondary Model'
20
+ DEMUCS_6_STEM_MODEL = 'htdemucs_6s'
21
+
22
+ DEMUCS_V3_ARCH_TYPE = 'Demucs v3'
23
+ DEMUCS_V4_ARCH_TYPE = 'Demucs v4'
24
+ DEMUCS_NEWER_ARCH_TYPES = [DEMUCS_V3_ARCH_TYPE, DEMUCS_V4_ARCH_TYPE]
25
+
26
+ DEMUCS_V1 = 'v1'
27
+ DEMUCS_V2 = 'v2'
28
+ DEMUCS_V3 = 'v3'
29
+ DEMUCS_V4 = 'v4'
30
+
31
+ DEMUCS_V1_TAG = 'v1 | '
32
+ DEMUCS_V2_TAG = 'v2 | '
33
+ DEMUCS_V3_TAG = 'v3 | '
34
+ DEMUCS_V4_TAG = 'v4 | '
35
+ DEMUCS_NEWER_TAGS = [DEMUCS_V3_TAG, DEMUCS_V4_TAG]
36
+
37
+ DEMUCS_VERSION_MAPPER = {
38
+ DEMUCS_V1:DEMUCS_V1_TAG,
39
+ DEMUCS_V2:DEMUCS_V2_TAG,
40
+ DEMUCS_V3:DEMUCS_V3_TAG,
41
+ DEMUCS_V4:DEMUCS_V4_TAG}
42
+
43
+ #Download Center
44
+ DOWNLOAD_FAILED = 'Download Failed'
45
+ DOWNLOAD_STOPPED = 'Download Stopped'
46
+ DOWNLOAD_COMPLETE = 'Download Complete'
47
+ DOWNLOAD_UPDATE_COMPLETE = 'Update Download Complete'
48
+ SETTINGS_MENU_EXIT = 'exit'
49
+ NO_CONNECTION = 'No Internet Connection'
50
+ VIP_SELECTION = 'VIP:'
51
+ DEVELOPER_SELECTION = 'VIP:'
52
+ NO_NEW_MODELS = 'All Available Models Downloaded'
53
+ ENSEMBLE_PARTITION = ': '
54
+ NO_MODEL = 'No Model Selected'
55
+ CHOOSE_MODEL = 'Choose Model'
56
+ SINGLE_DOWNLOAD = 'Downloading Item 1/1...'
57
+ DOWNLOADING_ITEM = 'Downloading Item'
58
+ FILE_EXISTS = 'File already exists!'
59
+ DOWNLOADING_UPDATE = 'Downloading Update...'
60
+ DOWNLOAD_MORE = 'Download More Models'
61
+
62
+ #Menu Options
63
+
64
+ AUTO_SELECT = 'Auto'
65
+
66
+ #LINKS
67
+ DOWNLOAD_CHECKS = "https://raw.githubusercontent.com/TRvlvr/application_data/main/filelists/download_checks.json"
68
+ MDX_MODEL_DATA_LINK = "https://raw.githubusercontent.com/TRvlvr/application_data/main/mdx_model_data/model_data.json"
69
+ VR_MODEL_DATA_LINK = "https://raw.githubusercontent.com/TRvlvr/application_data/main/vr_model_data/model_data.json"
70
+
71
+ DEMUCS_MODEL_NAME_DATA_LINK = "https://raw.githubusercontent.com/TRvlvr/application_data/main/demucs_model_data/model_name_mapper.json"
72
+ MDX_MODEL_NAME_DATA_LINK = "https://raw.githubusercontent.com/TRvlvr/application_data/main/mdx_model_data/model_name_mapper.json"
73
+
74
+ DONATE_LINK_BMAC = "https://www.buymeacoffee.com/uvr5"
75
+ DONATE_LINK_PATREON = "https://www.patreon.com/uvr"
76
+
77
+ #DOWNLOAD REPOS
78
+ NORMAL_REPO = "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/"
79
+ UPDATE_REPO = "https://github.com/TRvlvr/model_repo/releases/download/uvr_update_patches/"
80
+
81
+ UPDATE_MAC_ARM_REPO = "https://github.com/Anjok07/ultimatevocalremovergui/releases/download/v5.5.0/Ultimate_Vocal_Remover_v5_5_MacOS_arm64.dmg"
82
+ UPDATE_MAC_X86_64_REPO = "https://github.com/Anjok07/ultimatevocalremovergui/releases/download/v5.5.0/Ultimate_Vocal_Remover_v5_5_MacOS_x86_64.dmg"
83
+ UPDATE_LINUX_REPO = "https://github.com/Anjok07/ultimatevocalremovergui#linux-installation"
84
+ UPDATE_REPO = "https://github.com/TRvlvr/model_repo/releases/download/uvr_update_patches/"
85
+
86
+ ISSUE_LINK = 'https://github.com/Anjok07/ultimatevocalremovergui/issues/new'
87
+ VIP_REPO = b'\xf3\xc2W\x19\x1foI)\xc2\xa9\xcc\xb67(Z\xf5',\
88
+ b'gAAAAABjQAIQ-NpNMMxMedpKHHb7ze_nqB05hw0YhbOy3pFzuzDrfqumn8_qvraxEoUpZC5ZXC0gGvfDxFMqyq9VWbYKlA67SUFI_wZB6QoVyGI581vs7kaGfUqlXHIdDS6tQ_U-BfjbEAK9EU_74-R2zXjz8Xzekw=='
89
+ NO_CODE = 'incorrect_code'
90
+
91
+ #Extensions
92
+
93
+ ONNX = '.onnx'
94
+ CKPT = '.ckpt'
95
+ YAML = '.yaml'
96
+ PTH = '.pth'
97
+ TH_EXT = '.th'
98
+ JSON = '.json'
99
+
100
+ #GUI Buttons
101
+
102
+ START_PROCESSING = 'Start Processing'
103
+ WAIT_PROCESSING = 'Please wait...'
104
+ STOP_PROCESSING = 'Halting process, please wait...'
105
+ LOADING_MODELS = 'Loading models...'
106
+
107
+ #---Messages and Logs----
108
+
109
+ MISSING_MODEL = 'missing'
110
+ MODEL_PRESENT = 'present'
111
+
112
+ UNRECOGNIZED_MODEL = 'Unrecognized Model Detected', ' is an unrecognized model.\n\n' + \
113
+ 'Would you like to select the correct parameters before continuing?'
114
+
115
+ STOP_PROCESS_CONFIRM = 'Confirmation', 'You are about to stop all active processes.\n\nAre you sure you wish to continue?'
116
+ NO_ENSEMBLE_SELECTED = 'No Models Selected', 'Please select ensemble and try again.'
117
+ PICKLE_CORRU = 'File Corrupted', 'Unable to load this ensemble.\n\n' + \
118
+ 'Would you like to remove this ensemble from your list?'
119
+ DELETE_ENS_ENTRY = 'Confirm Removal', 'Are you sure you want to remove this entry?'
120
+
121
+ ALL_STEMS = 'All Stems'
122
+ VOCAL_STEM = 'Vocals'
123
+ INST_STEM = 'Instrumental'
124
+ OTHER_STEM = 'Other'
125
+ BASS_STEM = 'Bass'
126
+ DRUM_STEM = 'Drums'
127
+ GUITAR_STEM = 'Guitar'
128
+ PIANO_STEM = 'Piano'
129
+ SYNTH_STEM = 'Synthesizer'
130
+ STRINGS_STEM = 'Strings'
131
+ WOODWINDS_STEM = 'Woodwinds'
132
+ BRASS_STEM = 'Brass'
133
+ WIND_INST_STEM = 'Wind Inst'
134
+ NO_OTHER_STEM = 'No Other'
135
+ NO_BASS_STEM = 'No Bass'
136
+ NO_DRUM_STEM = 'No Drums'
137
+ NO_GUITAR_STEM = 'No Guitar'
138
+ NO_PIANO_STEM = 'No Piano'
139
+ NO_SYNTH_STEM = 'No Synthesizer'
140
+ NO_STRINGS_STEM = 'No Strings'
141
+ NO_WOODWINDS_STEM = 'No Woodwinds'
142
+ NO_WIND_INST_STEM = 'No Wind Inst'
143
+ NO_BRASS_STEM = 'No Brass'
144
+ PRIMARY_STEM = 'Primary Stem'
145
+ SECONDARY_STEM = 'Secondary Stem'
146
+
147
+ #Other Constants
148
+ DEMUCS_2_SOURCE = ["instrumental", "vocals"]
149
+ DEMUCS_4_SOURCE = ["drums", "bass", "other", "vocals"]
150
+
151
+ DEMUCS_2_SOURCE_MAPPER = {
152
+ INST_STEM: 0,
153
+ VOCAL_STEM: 1}
154
+
155
+ DEMUCS_4_SOURCE_MAPPER = {
156
+ BASS_STEM: 0,
157
+ DRUM_STEM: 1,
158
+ OTHER_STEM: 2,
159
+ VOCAL_STEM: 3}
160
+
161
+ DEMUCS_6_SOURCE_MAPPER = {
162
+ BASS_STEM: 0,
163
+ DRUM_STEM: 1,
164
+ OTHER_STEM: 2,
165
+ VOCAL_STEM: 3,
166
+ GUITAR_STEM:4,
167
+ PIANO_STEM:5}
168
+
169
+ DEMUCS_4_SOURCE_LIST = [BASS_STEM, DRUM_STEM, OTHER_STEM, VOCAL_STEM]
170
+ DEMUCS_6_SOURCE_LIST = [BASS_STEM, DRUM_STEM, OTHER_STEM, VOCAL_STEM, GUITAR_STEM, PIANO_STEM]
171
+
172
+ DEMUCS_UVR_MODEL = 'UVR_Model'
173
+
174
+ CHOOSE_STEM_PAIR = 'Choose Stem Pair'
175
+
176
+ STEM_SET_MENU = (VOCAL_STEM,
177
+ INST_STEM,
178
+ OTHER_STEM,
179
+ BASS_STEM,
180
+ DRUM_STEM,
181
+ GUITAR_STEM,
182
+ PIANO_STEM,
183
+ SYNTH_STEM,
184
+ STRINGS_STEM,
185
+ WOODWINDS_STEM,
186
+ BRASS_STEM,
187
+ WIND_INST_STEM,
188
+ NO_OTHER_STEM,
189
+ NO_BASS_STEM,
190
+ NO_DRUM_STEM,
191
+ NO_GUITAR_STEM,
192
+ NO_PIANO_STEM,
193
+ NO_SYNTH_STEM,
194
+ NO_STRINGS_STEM,
195
+ NO_WOODWINDS_STEM,
196
+ NO_BRASS_STEM,
197
+ NO_WIND_INST_STEM)
198
+
199
+ STEM_PAIR_MAPPER = {
200
+ VOCAL_STEM: INST_STEM,
201
+ INST_STEM: VOCAL_STEM,
202
+ OTHER_STEM: NO_OTHER_STEM,
203
+ BASS_STEM: NO_BASS_STEM,
204
+ DRUM_STEM: NO_DRUM_STEM,
205
+ GUITAR_STEM: NO_GUITAR_STEM,
206
+ PIANO_STEM: NO_PIANO_STEM,
207
+ SYNTH_STEM: NO_SYNTH_STEM,
208
+ STRINGS_STEM: NO_STRINGS_STEM,
209
+ WOODWINDS_STEM: NO_WOODWINDS_STEM,
210
+ BRASS_STEM: NO_BRASS_STEM,
211
+ WIND_INST_STEM: NO_WIND_INST_STEM,
212
+ NO_OTHER_STEM: OTHER_STEM,
213
+ NO_BASS_STEM: BASS_STEM,
214
+ NO_DRUM_STEM: DRUM_STEM,
215
+ NO_GUITAR_STEM: GUITAR_STEM,
216
+ NO_PIANO_STEM: PIANO_STEM,
217
+ NO_SYNTH_STEM: SYNTH_STEM,
218
+ NO_STRINGS_STEM: STRINGS_STEM,
219
+ NO_WOODWINDS_STEM: WOODWINDS_STEM,
220
+ NO_BRASS_STEM: BRASS_STEM,
221
+ NO_WIND_INST_STEM: WIND_INST_STEM,
222
+ PRIMARY_STEM: SECONDARY_STEM}
223
+
224
+ NON_ACCOM_STEMS = (
225
+ VOCAL_STEM,
226
+ OTHER_STEM,
227
+ BASS_STEM,
228
+ DRUM_STEM,
229
+ GUITAR_STEM,
230
+ PIANO_STEM,
231
+ SYNTH_STEM,
232
+ STRINGS_STEM,
233
+ WOODWINDS_STEM,
234
+ BRASS_STEM,
235
+ WIND_INST_STEM)
236
+
237
+ MDX_NET_FREQ_CUT = [VOCAL_STEM, INST_STEM]
238
+
239
+ DEMUCS_4_STEM_OPTIONS = (ALL_STEMS, VOCAL_STEM, OTHER_STEM, BASS_STEM, DRUM_STEM)
240
+ DEMUCS_6_STEM_OPTIONS = (ALL_STEMS, VOCAL_STEM, OTHER_STEM, BASS_STEM, DRUM_STEM, GUITAR_STEM, PIANO_STEM)
241
+ DEMUCS_2_STEM_OPTIONS = (VOCAL_STEM, INST_STEM)
242
+ DEMUCS_4_STEM_CHECK = (OTHER_STEM, BASS_STEM, DRUM_STEM)
243
+
244
+ #Menu Dropdowns
245
+
246
+ VOCAL_PAIR = f'{VOCAL_STEM}/{INST_STEM}'
247
+ INST_PAIR = f'{INST_STEM}/{VOCAL_STEM}'
248
+ OTHER_PAIR = f'{OTHER_STEM}/{NO_OTHER_STEM}'
249
+ DRUM_PAIR = f'{DRUM_STEM}/{NO_DRUM_STEM}'
250
+ BASS_PAIR = f'{BASS_STEM}/{NO_BASS_STEM}'
251
+ FOUR_STEM_ENSEMBLE = '4 Stem Ensemble'
252
+
253
+ ENSEMBLE_MAIN_STEM = (CHOOSE_STEM_PAIR, VOCAL_PAIR, OTHER_PAIR, DRUM_PAIR, BASS_PAIR, FOUR_STEM_ENSEMBLE)
254
+
255
+ MIN_SPEC = 'Min Spec'
256
+ MAX_SPEC = 'Max Spec'
257
+ AUDIO_AVERAGE = 'Average'
258
+
259
+ MAX_MIN = f'{MAX_SPEC}/{MIN_SPEC}'
260
+ MAX_MAX = f'{MAX_SPEC}/{MAX_SPEC}'
261
+ MAX_AVE = f'{MAX_SPEC}/{AUDIO_AVERAGE}'
262
+ MIN_MAX = f'{MIN_SPEC}/{MAX_SPEC}'
263
+ MIN_MIX = f'{MIN_SPEC}/{MIN_SPEC}'
264
+ MIN_AVE = f'{MIN_SPEC}/{AUDIO_AVERAGE}'
265
+ AVE_MAX = f'{AUDIO_AVERAGE}/{MAX_SPEC}'
266
+ AVE_MIN = f'{AUDIO_AVERAGE}/{MIN_SPEC}'
267
+ AVE_AVE = f'{AUDIO_AVERAGE}/{AUDIO_AVERAGE}'
268
+
269
+ ENSEMBLE_TYPE = (MAX_MIN, MAX_MAX, MAX_AVE, MIN_MAX, MIN_MIX, MIN_AVE, AVE_MAX, AVE_MIN, AVE_AVE)
270
+ ENSEMBLE_TYPE_4_STEM = (MAX_SPEC, MIN_SPEC, AUDIO_AVERAGE)
271
+
272
+ BATCH_MODE = 'Batch Mode'
273
+ BETA_VERSION = 'BETA'
274
+ DEF_OPT = 'Default'
275
+
276
+ CHUNKS = (AUTO_SELECT, '1', '5', '10', '15', '20',
277
+ '25', '30', '35', '40', '45', '50',
278
+ '55', '60', '65', '70', '75', '80',
279
+ '85', '90', '95', 'Full')
280
+
281
+ BATCH_SIZE = (DEF_OPT, '2', '3', '4', '5',
282
+ '6', '7', '8', '9', '10')
283
+
284
+ VOL_COMPENSATION = (AUTO_SELECT, '1.035', '1.08')
285
+
286
+ MARGIN_SIZE = ('44100', '22050', '11025')
287
+
288
+ AUDIO_TOOLS = 'Audio Tools'
289
+
290
+ MANUAL_ENSEMBLE = 'Manual Ensemble'
291
+ TIME_STRETCH = 'Time Stretch'
292
+ CHANGE_PITCH = 'Change Pitch'
293
+ ALIGN_INPUTS = 'Align Inputs'
294
+
295
+ if OPERATING_SYSTEM == 'Windows' or OPERATING_SYSTEM == 'Darwin':
296
+ AUDIO_TOOL_OPTIONS = (MANUAL_ENSEMBLE, TIME_STRETCH, CHANGE_PITCH, ALIGN_INPUTS)
297
+ else:
298
+ AUDIO_TOOL_OPTIONS = (MANUAL_ENSEMBLE, ALIGN_INPUTS)
299
+
300
+ MANUAL_ENSEMBLE_OPTIONS = (MIN_SPEC, MAX_SPEC, AUDIO_AVERAGE)
301
+
302
+ PROCESS_METHODS = (VR_ARCH_PM, MDX_ARCH_TYPE, DEMUCS_ARCH_TYPE, ENSEMBLE_MODE, AUDIO_TOOLS)
303
+
304
+ DEMUCS_SEGMENTS = ('Default', '1', '5', '10', '15', '20',
305
+ '25', '30', '35', '40', '45', '50',
306
+ '55', '60', '65', '70', '75', '80',
307
+ '85', '90', '95', '100')
308
+
309
+ DEMUCS_SHIFTS = (0, 1, 2, 3, 4, 5,
310
+ 6, 7, 8, 9, 10, 11,
311
+ 12, 13, 14, 15, 16, 17,
312
+ 18, 19, 20)
313
+
314
+ DEMUCS_OVERLAP = (0.25, 0.50, 0.75, 0.99)
315
+
316
+ VR_AGGRESSION = (1, 2, 3, 4, 5,
317
+ 6, 7, 8, 9, 10, 11,
318
+ 12, 13, 14, 15, 16, 17,
319
+ 18, 19, 20)
320
+
321
+ VR_WINDOW = ('320', '512','1024')
322
+ VR_CROP = ('256', '512', '1024')
323
+ POST_PROCESSES_THREASHOLD_VALUES = ('0.1', '0.2', '0.3')
324
+
325
+ MDX_POP_PRO = ('MDX-NET_Noise_Profile_14_kHz', 'MDX-NET_Noise_Profile_17_kHz', 'MDX-NET_Noise_Profile_Full_Band')
326
+ MDX_POP_STEMS = ('Vocals', 'Instrumental', 'Other', 'Drums', 'Bass')
327
+ MDX_POP_NFFT = ('4096', '5120', '6144', '7680', '8192', '16384')
328
+ MDX_POP_DIMF = ('2048', '3072', '4096')
329
+
330
+ SAVE_ENSEMBLE = 'Save Ensemble'
331
+ CLEAR_ENSEMBLE = 'Clear Selection(s)'
332
+ MENU_SEPARATOR = 35*'•'
333
+ CHOOSE_ENSEMBLE_OPTION = 'Choose Option'
334
+
335
+ INVALID_ENTRY = 'Invalid Input, Please Try Again'
336
+ ENSEMBLE_INPUT_RULE = '1. Only letters, numbers, spaces, and dashes allowed.\n2. No dashes or spaces at the start or end of input.'
337
+
338
+ ENSEMBLE_OPTIONS = (SAVE_ENSEMBLE, CLEAR_ENSEMBLE)
339
+ ENSEMBLE_CHECK = 'ensemble check'
340
+
341
+ SELECT_SAVED_ENSEMBLE = 'Select Saved Ensemble'
342
+ SELECT_SAVED_SETTING = 'Select Saved Setting'
343
+ ENSEMBLE_OPTION = "Ensemble Customization Options"
344
+ MDX_OPTION = "Advanced MDX-Net Options"
345
+ DEMUCS_OPTION = "Advanced Demucs Options"
346
+ VR_OPTION = "Advanced VR Options"
347
+ HELP_OPTION = "Open Information Guide"
348
+ ERROR_OPTION = "Open Error Log"
349
+ VERIFY_BEGIN = 'Verifying file '
350
+ SAMPLE_BEGIN = 'Creating Sample '
351
+ MODEL_MISSING_CHECK = 'Model Missing:'
352
+
353
+ # Audio Player
354
+
355
+ PLAYING_SONG = ": Playing"
356
+ PAUSE_SONG = ": Paused"
357
+ STOP_SONG = ": Stopped"
358
+
359
+ SELECTED_VER = 'Selected'
360
+ DETECTED_VER = 'Detected'
361
+
362
+ SAMPLE_MODE_CHECKBOX = lambda v:f'Sample Mode ({v}s)'
363
+ REMOVED_FILES = lambda r, e:f'Audio Input Verification Report:\n\nRemoved Files:\n\n{r}\n\nError Details:\n\n{e}'
364
+ ADVANCED_SETTINGS = (ENSEMBLE_OPTION, MDX_OPTION, DEMUCS_OPTION, VR_OPTION, HELP_OPTION, ERROR_OPTION)
365
+
366
+ WAV = 'WAV'
367
+ FLAC = 'FLAC'
368
+ MP3 = 'MP3'
369
+
370
+ MP3_BIT_RATES = ('96k', '128k', '160k', '224k', '256k', '320k')
371
+ WAV_TYPE = ('PCM_U8', 'PCM_16', 'PCM_24', 'PCM_32', '32-bit Float', '64-bit Float')
372
+
373
+ SELECT_SAVED_SET = 'Choose Option'
374
+ SAVE_SETTINGS = 'Save Current Settings'
375
+ RESET_TO_DEFAULT = 'Reset to Default'
376
+ RESET_FULL_TO_DEFAULT = 'Reset to Default'
377
+ RESET_PM_TO_DEFAULT = 'Reset All Application Settings to Default'
378
+
379
+ SAVE_SET_OPTIONS = (SAVE_SETTINGS, RESET_TO_DEFAULT)
380
+
381
+ TIME_PITCH = ('1.0', '2.0', '3.0', '4.0')
382
+ TIME_TEXT = '_time_stretched'
383
+ PITCH_TEXT = '_pitch_shifted'
384
+
385
+ #RegEx Input Validation
386
+
387
+ REG_PITCH = r'^[-+]?(1[0]|[0-9]([.][0-9]*)?)$'
388
+ REG_TIME = r'^[+]?(1[0]|[0-9]([.][0-9]*)?)$'
389
+ REG_COMPENSATION = r'\b^(1[0]|[0-9]([.][0-9]*)?|Auto|None)$\b'
390
+ REG_THES_POSTPORCESS = r'\b^([0]([.][0-9]{0,6})?)$\b'
391
+ REG_CHUNKS = r'\b^(200|1[0-9][0-9]|[1-9][0-9]?|Auto|Full)$\b'
392
+ REG_CHUNKS_DEMUCS = r'\b^(200|1[0-9][0-9]|[1-9][0-9]?|Auto|Full)$\b'
393
+ REG_MARGIN = r'\b^[0-9]*$\b'
394
+ REG_SEGMENTS = r'\b^(200|1[0-9][0-9]|[1-9][0-9]?|Default)$\b'
395
+ REG_SAVE_INPUT = r'\b^([a-zA-Z0-9 -]{0,25})$\b'
396
+ REG_AGGRESSION = r'^[-+]?[0-9]\d*?$'
397
+ REG_WINDOW = r'\b^[0-9]{0,4}$\b'
398
+ REG_SHIFTS = r'\b^[0-9]*$\b'
399
+ REG_BATCHES = r'\b^([0-9]*?|Default)$\b'
400
+ REG_OVERLAP = r'\b^([0]([.][0-9]{0,6})?|None)$\b'
401
+
402
+ # Sub Menu
403
+
404
+ VR_ARCH_SETTING_LOAD = 'Load for VR Arch'
405
+ MDX_SETTING_LOAD = 'Load for MDX-Net'
406
+ DEMUCS_SETTING_LOAD = 'Load for Demucs'
407
+ ALL_ARCH_SETTING_LOAD = 'Load for Full Application'
408
+
409
+ # Mappers
410
+
411
+ DEFAULT_DATA = {
412
+
413
+ 'chosen_process_method': MDX_ARCH_TYPE,
414
+ 'vr_model': CHOOSE_MODEL,
415
+ 'aggression_setting': 10,
416
+ 'window_size': 512,
417
+ 'batch_size': 4,
418
+ 'crop_size': 256,
419
+ 'is_tta': False,
420
+ 'is_output_image': False,
421
+ 'is_post_process': False,
422
+ 'is_high_end_process': False,
423
+ 'post_process_threshold': 0.2,
424
+ 'vr_voc_inst_secondary_model': NO_MODEL,
425
+ 'vr_other_secondary_model': NO_MODEL,
426
+ 'vr_bass_secondary_model': NO_MODEL,
427
+ 'vr_drums_secondary_model': NO_MODEL,
428
+ 'vr_is_secondary_model_activate': False,
429
+ 'vr_voc_inst_secondary_model_scale': 0.9,
430
+ 'vr_other_secondary_model_scale': 0.7,
431
+ 'vr_bass_secondary_model_scale': 0.5,
432
+ 'vr_drums_secondary_model_scale': 0.5,
433
+ 'demucs_model': CHOOSE_MODEL,
434
+ 'demucs_stems': ALL_STEMS,
435
+ 'segment': DEMUCS_SEGMENTS[0],
436
+ 'overlap': DEMUCS_OVERLAP[0],
437
+ 'shifts': 2,
438
+ 'chunks_demucs': CHUNKS[0],
439
+ 'margin_demucs': 44100,
440
+ 'is_chunk_demucs': False,
441
+ 'is_chunk_mdxnet': False,
442
+ 'is_primary_stem_only_Demucs': False,
443
+ 'is_secondary_stem_only_Demucs': False,
444
+ 'is_split_mode': True,
445
+ 'is_demucs_combine_stems': True,
446
+ 'demucs_voc_inst_secondary_model': NO_MODEL,
447
+ 'demucs_other_secondary_model': NO_MODEL,
448
+ 'demucs_bass_secondary_model': NO_MODEL,
449
+ 'demucs_drums_secondary_model': NO_MODEL,
450
+ 'demucs_is_secondary_model_activate': False,
451
+ 'demucs_voc_inst_secondary_model_scale': 0.9,
452
+ 'demucs_other_secondary_model_scale': 0.7,
453
+ 'demucs_bass_secondary_model_scale': 0.5,
454
+ 'demucs_drums_secondary_model_scale': 0.5,
456
+ 'demucs_pre_proc_model': NO_MODEL,
457
+ 'is_demucs_pre_proc_model_activate': False,
458
+ 'is_demucs_pre_proc_model_inst_mix': False,
459
+ 'mdx_net_model': CHOOSE_MODEL,
460
+ 'chunks': CHUNKS[0],
461
+ 'margin': 44100,
462
+ 'compensate': AUTO_SELECT,
463
+ 'is_denoise': False,
464
+ 'is_invert_spec': False,
465
+ 'is_mixer_mode': False,
466
+ 'mdx_batch_size': DEF_OPT,
467
+ 'mdx_voc_inst_secondary_model': NO_MODEL,
468
+ 'mdx_other_secondary_model': NO_MODEL,
469
+ 'mdx_bass_secondary_model': NO_MODEL,
470
+ 'mdx_drums_secondary_model': NO_MODEL,
471
+ 'mdx_is_secondary_model_activate': False,
472
+ 'mdx_voc_inst_secondary_model_scale': 0.9,
473
+ 'mdx_other_secondary_model_scale': 0.7,
474
+ 'mdx_bass_secondary_model_scale': 0.5,
475
+ 'mdx_drums_secondary_model_scale': 0.5,
476
+ 'is_save_all_outputs_ensemble': True,
477
+ 'is_append_ensemble_name': False,
478
+ 'chosen_audio_tool': AUDIO_TOOL_OPTIONS[0],
479
+ 'choose_algorithm': MANUAL_ENSEMBLE_OPTIONS[0],
480
+ 'time_stretch_rate': 2.0,
481
+ 'pitch_rate': 2.0,
482
+ 'is_gpu_conversion': False,
483
+ 'is_primary_stem_only': False,
484
+ 'is_secondary_stem_only': False,
485
+ 'is_testing_audio': False,
486
+ 'is_add_model_name': False,
487
+ 'is_accept_any_input': False,
488
+ 'is_task_complete': False,
489
+ 'is_normalization': False,
490
+ 'is_create_model_folder': False,
491
+ 'mp3_bit_set': '320k',
492
+ 'save_format': WAV,
493
+ 'wav_type_set': 'PCM_16',
494
+ 'user_code': '',
495
+ 'export_path': '',
496
+ 'input_paths': [],
497
+ 'lastDir': None,
498
+ 'export_path': '',
499
+ 'model_hash_table': None,
500
+ 'help_hints_var': False,
501
+ 'model_sample_mode': False,
502
+ 'model_sample_mode_duration': 30
503
+ }
504
+
505
+ SETTING_CHECK = ('vr_model',
506
+ 'aggression_setting',
507
+ 'window_size',
508
+ 'batch_size',
509
+ 'crop_size',
510
+ 'is_tta',
511
+ 'is_output_image',
512
+ 'is_post_process',
513
+ 'is_high_end_process',
514
+ 'post_process_threshold',
515
+ 'vr_voc_inst_secondary_model',
516
+ 'vr_other_secondary_model',
517
+ 'vr_bass_secondary_model',
518
+ 'vr_drums_secondary_model',
519
+ 'vr_is_secondary_model_activate',
520
+ 'vr_voc_inst_secondary_model_scale',
521
+ 'vr_other_secondary_model_scale',
522
+ 'vr_bass_secondary_model_scale',
523
+ 'vr_drums_secondary_model_scale',
524
+ 'demucs_model',
525
+ 'segment',
526
+ 'overlap',
527
+ 'shifts',
528
+ 'chunks_demucs',
529
+ 'margin_demucs',
530
+ 'is_chunk_demucs',
531
+ 'is_primary_stem_only_Demucs',
532
+ 'is_secondary_stem_only_Demucs',
533
+ 'is_split_mode',
534
+ 'is_demucs_combine_stems',
535
+ 'demucs_voc_inst_secondary_model',
536
+ 'demucs_other_secondary_model',
537
+ 'demucs_bass_secondary_model',
538
+ 'demucs_drums_secondary_model',
539
+ 'demucs_is_secondary_model_activate',
540
+ 'demucs_voc_inst_secondary_model_scale',
541
+ 'demucs_other_secondary_model_scale',
542
+ 'demucs_bass_secondary_model_scale',
543
+ 'demucs_drums_secondary_model_scale',
544
+ 'demucs_stems',
545
+ 'mdx_net_model',
546
+ 'chunks',
547
+ 'margin',
548
+ 'compensate',
549
+ 'is_denoise',
550
+ 'is_invert_spec',
551
+ 'mdx_batch_size',
552
+ 'mdx_voc_inst_secondary_model',
553
+ 'mdx_other_secondary_model',
554
+ 'mdx_bass_secondary_model',
555
+ 'mdx_drums_secondary_model',
556
+ 'mdx_is_secondary_model_activate',
557
+ 'mdx_voc_inst_secondary_model_scale',
558
+ 'mdx_other_secondary_model_scale',
559
+ 'mdx_bass_secondary_model_scale',
560
+ 'mdx_drums_secondary_model_scale',
561
+ 'is_save_all_outputs_ensemble',
562
+ 'is_append_ensemble_name',
563
+ 'chosen_audio_tool',
564
+ 'choose_algorithm',
565
+ 'time_stretch_rate',
566
+ 'pitch_rate',
567
+ 'is_primary_stem_only',
568
+ 'is_secondary_stem_only',
569
+ 'is_testing_audio',
570
+ 'is_add_model_name',
571
+ "is_accept_any_input",
572
+ 'is_task_complete',
573
+ 'is_create_model_folder',
574
+ 'mp3_bit_set',
575
+ 'save_format',
576
+ 'wav_type_set',
577
+ 'user_code',
578
+ 'is_gpu_conversion',
579
+ 'is_normalization',
580
+ 'help_hints_var',
581
+ 'model_sample_mode',
582
+ 'model_sample_mode_duration')
583
+
584
+ # Message Box Text
585
+
586
+ INVALID_INPUT = 'Invalid Input', 'The input is invalid.\n\nPlease verify the input still exists or is valid and try again.'
587
+ INVALID_EXPORT = 'Invalid Export Directory', 'You have selected an invalid export directory.\n\nPlease make sure the selected directory still exists.'
588
+ INVALID_ENSEMBLE = 'Not Enough Models', 'You must select 2 or more models to run an ensemble.'
589
+ INVALID_MODEL = 'No Model Chosen', 'You must select a model to continue.'
590
+ MISSING_MODEL = 'Model Missing', 'The selected model is missing or not valid.'
591
+ ERROR_OCCURED = 'Error Occured', '\n\nWould you like to open the error log for more details?\n'
592
+
593
+ # GUI Text Constants
594
+
595
+ BACK_TO_MAIN_MENU = 'Back to Main Menu'
596
+
597
+ # Help Hint Text
598
+
599
+ INTERNAL_MODEL_ATT = 'Internal model attribute. \n\n ***Do not change this setting if you are unsure!***'
600
+ STOP_HELP = 'Halts any running processes. \n A pop-up window will ask the user to confirm the action.'
601
+ SETTINGS_HELP = 'Opens the main settings guide. This window includes the \"Download Center\"'
602
+ COMMAND_TEXT_HELP = 'Provides information on the progress of the current process.'
603
+ SAVE_CURRENT_SETTINGS_HELP = 'Allows the user to open any saved settings or save the current application settings.'
604
+ CHUNKS_HELP = ('For MDX-Net, all values use the same amount of resources. Using chunks is no longer recommended.\n\n' + \
605
+ '• This option is now only for output quality.\n' + \
606
+ '• Some tracks may fare better depending on the value.\n' + \
607
+ '• Some tracks may fare worse depending on the value.\n' + \
608
+ '• Larger chunk sizes will take less time to process.\n' +\
609
+ '• Smaller chunk sizes will take more time to process.\n')
610
+ CHUNKS_DEMUCS_HELP = ('This option allows the user to reduce (or increase) RAM or V-RAM usage.\n\n' + \
611
+ '• Smaller chunk sizes use less RAM or V-RAM but can also increase processing times.\n' + \
612
+ '• Larger chunk sizes use more RAM or V-RAM but can also reduce processing times.\n' + \
613
+ '• Selecting \"Auto\" calculates an appropriate chuck size based on how much RAM or V-RAM your system has.\n' + \
614
+ '• Selecting \"Full\" will process the track as one whole chunk. (not recommended)\n' + \
615
+ '• The default selection is \"Auto\".')
616
+ MARGIN_HELP = 'Selects the frequency margins to slice the chunks from.\n\n• The recommended margin size is 44100.\n• Other values can give unpredictable results.'
617
+ AGGRESSION_SETTING_HELP = ('This option allows you to set how strong the primary stem extraction will be.\n\n' + \
618
+ '• The range is 0-100.\n' + \
619
+ '• Higher values perform deeper extractions.\n' + \
620
+ '• The default is 10 for instrumental & vocal models.\n' + \
621
+ '• Values over 10 can result in muddy-sounding instrumentals for the non-vocal models')
622
+ WINDOW_SIZE_HELP = ('The smaller your window size, the better your conversions will be. \nHowever, a smaller window means longer conversion times and heavier resource usage.\n\n' + \
623
+ 'Breakdown of the selectable window size values:\n' + \
624
+ '• 1024 - Low conversion quality, shortest conversion time, low resource usage.\n' + \
625
+ '• 512 - Average conversion quality, average conversion time, normal resource usage.\n' + \
626
+ '• 320 - Better conversion quality.')
627
+ DEMUCS_STEMS_HELP = ('Here, you can choose which stem to extract using the selected model.\n\n' +\
628
+ 'Stem Selections:\n\n' +\
629
+ '• All Stems - Saves all of the stems the model is able to extract.\n' +\
630
+ '• Vocals - Pulls vocal stem only.\n' +\
631
+ '• Other - Pulls other stem only.\n' +\
632
+ '• Bass - Pulls bass stem only.\n' +\
633
+ '• Drums - Pulls drum stem only.\n')
634
+ SEGMENT_HELP = ('This option allows the user to reduce (or increase) RAM or V-RAM usage.\n\n' + \
635
+ '• Smaller segment sizes use less RAM or V-RAM but can also increase processing times.\n' + \
636
+ '• Larger segment sizes use more RAM or V-RAM but can also reduce processing times.\n' + \
637
+ '• Selecting \"Default\" uses the recommended segment size.\n' + \
638
+ '• It is recommended that you not use segments with \"Chunking\".')
639
+ ENSEMBLE_MAIN_STEM_HELP = 'Allows the user to select the type of stems they wish to ensemble.\n\nOptions:\n\n' +\
640
+ f'• {VOCAL_PAIR} - The primary stem will be the vocals and the secondary stem will be the instrumental\n' +\
641
+ f'• {OTHER_PAIR} - The primary stem will be other and the secondary stem will be no other (the mixture without the \'other\' stem)\n' +\
642
+ f'• {BASS_PAIR} - The primary stem will be bass and the secondary stem will be no bass (the mixture without the \'bass\' stem)\n' +\
643
+ f'• {DRUM_PAIR} - The primary stem will be drums and the secondary stem will be no drums (the mixture without the \'drums\' stem)\n' +\
644
+ f'• {FOUR_STEM_ENSEMBLE} - This option will gather all the 4 stem Demucs models and ensemble all of the outputs.\n'
645
+ ENSEMBLE_TYPE_HELP = 'Allows the user to select the ensemble algorithm to be used to generate the final output.\n\nExample & Other Note:\n\n' +\
646
+ f'• {MAX_MIN} - If this option is chosen, the primary stem outputs will be processed through \nthe \'Max Spec\' algorithm, and the secondary stem will be processed through the \'Min Spec\' algorithm.\n' +\
647
+ f'• Only a single algorithm will be shown when the \'4 Stem Ensemble\' option is chosen.\n\nAlgorithm Details:\n\n' +\
648
+ f'• {MAX_SPEC} - This algorithm combines the final results and generates the highest possible output from them.\nFor example, if this algorithm were processing vocal stems, you would get the fullest possible \n' +\
649
+ 'result, making the ensembled vocal stem sound cleaner. However, it might introduce more unwanted artifacts.\n' +\
650
+ f'• {MIN_SPEC} - This algorithm combines the results and generates the lowest possible output from them.\nFor example, if this algorithm were processing instrumental stems, you would get the cleanest possible \n' +\
651
+ 'result, eliminating more unwanted artifacts. However, the result might also sound \'muddy\' and lack a fuller sound.\n' +\
652
+ f'• {AUDIO_AVERAGE} - This algorithm simply combines the results and averages all of them together. \n'
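As a rough illustration of the three algorithms (a hedged sketch, not the application's implementation), they reduce aligned magnitude spectrograms elementwise:

    import numpy as np

    def ensemble_sketch(mags: list, algorithm: str):  # mags: list of (freq, time) arrays
        stack = np.stack(mags)
        if algorithm == 'Max Spec':
            return stack.max(axis=0)   # fullest output, may carry more artifacts
        if algorithm == 'Min Spec':
            return stack.min(axis=0)   # cleanest output, can sound 'muddy'
        return stack.mean(axis=0)      # 'Average'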
653
+ ENSEMBLE_LISTBOX_HELP = 'List of all the models available for the selected main stem pair.'
654
+ IS_GPU_CONVERSION_HELP = ('When checked, the application will attempt to use your GPU (if you have one).\n' +\
655
+ 'If you do not have a GPU but have this checked, the application will default to your CPU.\n\n' +\
656
+ 'Note: CPU conversions are much slower than those processed through the GPU.')
657
+ SAVE_STEM_ONLY_HELP = 'Allows the user to save only the selected stem.'
658
+ IS_NORMALIZATION_HELP = 'Normalizes output to prevent clipping.'
659
+ CROP_SIZE_HELP = '**Only compatible with select models!**\n\n The setting should match the training crop-size value. Leave as is if unsure.'
660
+ IS_TTA_HELP = ('This option performs Test-Time-Augmentation to improve the separation quality.\n\n' +\
661
+ 'Note: Having this selected will increase the time it takes to complete a conversion')
662
+ IS_POST_PROCESS_HELP = ('This option can potentially identify leftover instrumental artifacts within the vocal outputs. \nThis option may improve the separation of some songs.\n\n' +\
663
+ 'Note: Selecting this option can adversely affect the conversion process, depending on the track. Because of this, it is only recommended as a last resort.')
664
+ IS_HIGH_END_PROCESS_HELP = 'The application will mirror the missing frequency range of the output.'
665
+ SHIFTS_HELP = ('Performs multiple predictions with random shifts of the input and averages them.\n\n' +\
666
+ '• The higher the number of shifts, the longer the prediction will take. \n- Not recommended unless you have a GPU.')
667
+ OVERLAP_HELP = 'This option controls the amount of overlap between prediction windows (for Demucs one window is 10 seconds)'
668
+ IS_CHUNK_DEMUCS_HELP = '• Enables \"Chunks\".\n• We recommend you not enable this option with \"Split Mode\" enabled or with the Demucs v4 Models.'
669
+ IS_CHUNK_MDX_NET_HELP = '• Enables \"Chunks\".\n• Using this option for MDX-Net no longer affects RAM usage.\n• Having this enabled will affect output quality, for better or worse depending on the set value.'
670
+ IS_SPLIT_MODE_HELP = ('• Enables \"Segments\". \n• We recommend you not enable this option with \"Enable Chunks\".\n' +\
671
+ '• Deselecting this option is only recommended for those with powerful PCs or if using \"Chunk\" mode instead.')
672
+ IS_DEMUCS_COMBINE_STEMS_HELP = 'The application will create the secondary stem by combining the remaining stems \ninstead of inverting the primary stem with the mixture.'
673
+ COMPENSATE_HELP = 'Compensates the audio of the primary stems to allow for a better secondary stem.'
674
+ IS_DENOISE_HELP = '• This option removes a majority of the noise generated by the MDX-Net models.\n• The conversion will take nearly twice as long with this enabled.'
675
+ CLEAR_CACHE_HELP = 'Clears any user selected model settings for previously unrecognized models.'
676
+ IS_SAVE_ALL_OUTPUTS_ENSEMBLE_HELP = 'Enabling this option will keep all individual outputs generated by an ensemble.'
677
+ IS_APPEND_ENSEMBLE_NAME_HELP = 'The application will append the ensemble name to the final output \nwhen this option is enabled.'
678
+ DONATE_HELP = 'Takes the user to an external website to donate to this project!'
679
+ IS_INVERT_SPEC_HELP = '• This option may produce a better secondary stem.\n• Inverts the primary stem with the mixture using spectrograms instead of waveforms.\n• This inversion method is slightly slower.'
680
+ IS_MIXER_MODE_HELP = '• This option may improve separations for outputs from 4-stem models.\n• Might produce more noise.\n• This option might slow down separation time.'
681
+ IS_TESTING_AUDIO_HELP = 'Appends a unique 10 digit number to output files so the user \ncan compare results with different settings.'
682
+ IS_MODEL_TESTING_AUDIO_HELP = 'Appends the model name to output files so the user \ncan compare results with different settings.'
683
+ IS_ACCEPT_ANY_INPUT_HELP = 'The application will accept any input when enabled, even if it does not have an audio format extension.\n\nThis is for experimental purposes, and having it enabled is not recommended.'
684
+ IS_TASK_COMPLETE_HELP = 'When enabled, chimes will be heard when a process completes or fails.'
685
+ IS_CREATE_MODEL_FOLDER_HELP = 'Two new directories will be generated for the outputs in \nthe export directory after each conversion.\n\n' +\
686
+ '• First directory - Named after the model.\n' +\
687
+ '• Second directory - Named after the track.\n\n' +\
688
+ '• Example: \n\n' +\
689
+ '─ Export Directory\n' +\
690
+ ' └── First Directory\n' +\
691
+ ' └── Second Directory\n' +\
692
+ ' └── Output File(s)'
693
+ DELETE_YOUR_SETTINGS_HELP = 'This menu contains your saved settings. You will be asked to \nconfirm if you wish to delete the selected setting.'
694
+ SET_STEM_NAME_HELP = 'Choose the primary stem for the selected model.'
695
+ MDX_DIM_T_SET_HELP = INTERNAL_MODEL_ATT
696
+ MDX_DIM_F_SET_HELP = INTERNAL_MODEL_ATT
697
+ MDX_N_FFT_SCALE_SET_HELP = 'Set the N_FFT size the model was trained with.'
698
+ POPUP_COMPENSATE_HELP = f'Choose the appropriate volume compensation for the selected model.\n\nReminder: {COMPENSATE_HELP}'
699
+ VR_MODEL_PARAM_HELP = 'Choose the parameters needed to run the selected model.'
700
+ CHOSEN_ENSEMBLE_HELP = 'Select a saved ensemble or save the current ensemble.\n\nDefault Selections:\n\n• Save the current ensemble.\n• Clears all current model selections.'
701
+ CHOSEN_PROCESS_METHOD_HELP = 'Here, you choose between different AI networks and algorithms to process your track.\n\n' +\
702
+ 'There are five options:\n\n' +\
703
+ '• VR Architecture - These models use magnitude spectrograms for Source Separation.\n' +\
704
+ '• MDX-Net - These models use Hybrid Spectrogram/Waveform for Source Separation.\n' +\
705
+ '• Demucs v3 - These models use Hybrid Spectrogram/Waveform for Source Separation.\n' +\
706
+ '• Ensemble Mode - Here, you can get the best results from multiple models and networks.\n' +\
707
+ '• Audio Tools - These are additional tools for added convenience.'
708
+ INPUT_FOLDER_ENTRY_HELP = 'Select Input:\n\nHere is where you select the audio file(s) you wish to process.'
709
+ INPUT_FOLDER_ENTRY_HELP_2 = 'Input Option Menu:\n\nClick here to access the input option menu.'
710
+ OUTPUT_FOLDER_ENTRY_HELP = 'Select Output:\n\nHere is where you select the directory where your processed files are to be saved.'
711
+ INPUT_FOLDER_BUTTON_HELP = 'Open Input Folder Button: \n\nOpens the directory containing the selected input audio file(s).'
712
+ OUTPUT_FOLDER_BUTTON_HELP = 'Open Output Folder Button: \n\nOpens the selected output folder.'
713
+ CHOOSE_MODEL_HELP = 'Each process method comes with its own set of options and models.\n\nHere is where you choose the model associated with the selected process method.'
714
+ FORMAT_SETTING_HELP = 'Save outputs as '
715
+ SECONDARY_MODEL_ACTIVATE_HELP = 'When enabled, the application will run an additional inference with the selected model(s) above.'
716
+ SECONDARY_MODEL_HELP = 'Choose the secondary model associated with this stem you wish to run with the current process method.'
717
+ SECONDARY_MODEL_SCALE_HELP = 'The scale determines how the final audio outputs will be averaged between the primary and secondary models.\n\nFor example:\n\n' +\
718
+ '• 10% - 10 percent of the main model result will be factored into the final result.\n' +\
719
+ '• 50% - The results from the main and secondary models will be averaged evenly.\n' +\
720
+ '• 90% - 90 percent of the main model result will be factored into the final result.'
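In other words, the scale acts as a linear-interpolation weight between the two model outputs; a minimal sketch, assuming two equal-length arrays:

    def blend(primary, secondary, scale):
        # scale = 0.9 keeps 90% of the primary model's result
        return scale * primary + (1 - scale) * secondary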
721
+ PRE_PROC_MODEL_ACTIVATE_HELP = 'The application will run an inference with the selected model above, pulling only the instrumental stem when enabled. \nFrom there, all of the non-vocal stems will be pulled from the generated instrumental.\n\nNotes:\n\n' +\
722
+ '• This option can significantly reduce vocal bleed within the non-vocal stems.\n' +\
723
+ '• It is only available in Demucs.\n' +\
724
+ '• It is only compatible with non-vocal and non-instrumental stem outputs.\n' +\
725
+ '• This will increase the total processing time.\n' +\
726
+ '• Only VR and MDX-Net Vocal or Instrumental models are selectable above.'
727
+
728
+ AUDIO_TOOLS_HELP = 'Here, you choose between different audio tools to process your track.\n\n' +\
729
+ '• Manual Ensemble - You must have 2 or more files selected as your inputs. Allows the user to run their tracks through \nthe same algorithms used in Ensemble Mode.\n' +\
730
+ '• Align Inputs - You must have exactly 2 files selected as your inputs. The second input will be aligned with the first input.\n' +\
731
+ '• Time Stretch - The user can speed up or slow down the selected inputs.\n' +\
732
+ '• Change Pitch - The user can change the pitch for the selected inputs.\n'
733
+ PRE_PROC_MODEL_INST_MIX_HELP = 'When enabled, the application will generate a third output without the selected stem and vocals.'
734
+ MODEL_SAMPLE_MODE_HELP = 'Allows the user to process only part of a track to sample settings or a model without \nrunning a full conversion.\n\nNotes:\n\n' +\
735
+ '• The number in the parentheses is the current number of seconds the generated sample will be.\n' +\
736
+ '• You can choose the number of seconds to extract from the track in the \"Additional Settings\" menu.'
737
+
738
+ POST_PROCESS_THREASHOLD_HELP = 'Allows the user to control the intensity of the \"Post Process\" option.\n\nNotes:\n\n' +\
739
+ '• Higher values potentially remove more artifacts. However, bleed might increase.\n' +\
740
+ '• Lower values limit artifact removal.'
741
+
742
+ BATCH_SIZE_HELP = 'Specify the number of batches to be processed at a time.\n\nNotes:\n\n' +\
743
+ '• Higher values mean more RAM usage but slightly faster processing times.\n' +\
744
+ '• Lower values mean less RAM usage but slightly longer processing times.\n' +\
745
+ '• Batch size value has no effect on output quality.'
746
+
747
+ # Warning Messages
748
+
749
+ STORAGE_ERROR = 'Insufficient Storage', 'There is not enough storage on the main drive to continue. Your main drive must have at least 3 GB of storage in order for this application to function properly. \n\nPlease ensure your main drive has at least 3 GB of storage and try again.\n\n'
750
+ STORAGE_WARNING = 'Available Storage Low', 'Your main drive is running low on storage. Your main drive must have at least 3 GB of storage in order for this application to function properly.\n\n'
751
+ CONFIRM_WARNING = '\nAre you sure you wish to continue?'
752
+ PROCESS_FAILED = 'Process failed, please see error log\n'
753
+ EXIT_PROCESS_ERROR = 'Active Process', 'Please stop the active process or wait for it to complete before you exit.'
754
+ EXIT_HALTED_PROCESS_ERROR = 'Halting Process', 'Please wait for the application to finish halting the process before exiting.'
755
+ EXIT_DOWNLOAD_ERROR = 'Active Download', 'Please stop the download or wait for it to complete before you exit.'
756
+ SET_TO_DEFAULT_PROCESS_ERROR = 'Active Process', 'You cannot reset all of the application settings during an active process.'
757
+ SET_TO_ANY_PROCESS_ERROR = 'Active Process', 'You cannot reset the application settings during an active process.'
758
+ RESET_ALL_TO_DEFAULT_WARNING = 'Reset Settings Confirmation', 'All application settings will be set to factory default.\n\nAre you sure you wish to continue?'
759
+ AUDIO_VERIFICATION_CHECK = lambda i, e:f'++++++++++++++++++++++++++++++++++++++++++++++++++++\n\nBroken File Removed: \n\n{i}\n\nError Details:\n\n{e}\n++++++++++++++++++++++++++++++++++++++++++++++++++++'
760
+ INVALID_ONNX_MODEL_ERROR = 'Invalid Model', 'The file selected is not a valid MDX-Net model. Please see the error log for more information.'
761
+
762
+
763
+ # Separation Text
764
+
765
+ LOADING_MODEL = 'Loading model...'
766
+ INFERENCE_STEP_1 = 'Running inference...'
767
+ INFERENCE_STEP_1_SEC = 'Running inference (secondary model)...'
768
+ INFERENCE_STEP_1_4_STEM = lambda stem:f'Running inference (secondary model for {stem})...'
769
+ INFERENCE_STEP_1_PRE = 'Running inference (pre-process model)...'
770
+ INFERENCE_STEP_2_PRE = lambda pm, m:f'Loading pre-process model ({pm}: {m})...'
771
+ INFERENCE_STEP_2_SEC = lambda pm, m:f'Loading secondary model ({pm}: {m})...'
772
+ INFERENCE_STEP_2_SEC_CACHED_MODOEL = lambda pm, m:f'Secondary model ({pm}: {m}) cache loaded.\n'
773
+ INFERENCE_STEP_2_PRE_CACHED_MODOEL = lambda pm, m:f'Pre-process model ({pm}: {m}) cache loaded.\n'
774
+ INFERENCE_STEP_2_SEC_CACHED = 'Loading cached secondary model source(s)... Done!\n'
775
+ INFERENCE_STEP_2_PRIMARY_CACHED = 'Model cache loaded.\n'
776
+ INFERENCE_STEP_2 = 'Inference complete.'
777
+ SAVING_STEM = 'Saving ', ' stem...'
778
+ SAVING_ALL_STEMS = 'Saving all stems...'
779
+ ENSEMBLING_OUTPUTS = 'Ensembling outputs...'
780
+ DONE = ' Done!\n'
781
+ ENSEMBLES_SAVED = 'Ensembled outputs saved!\n\n'
782
+ NEW_LINES = "\n\n"
783
+ NEW_LINE = "\n"
784
+ NO_LINE = ''
785
+
786
+ # Widget Placements
787
+
788
+ MAIN_ROW_Y = -15, -17
789
+ MAIN_ROW_X = -4, 21
790
+ MAIN_ROW_WIDTH = -53
791
+ MAIN_ROW_2_Y = -15, -17
792
+ MAIN_ROW_2_X = -28, 1
793
+ CHECK_BOX_Y = 0
794
+ CHECK_BOX_X = 20
795
+ CHECK_BOX_WIDTH = -50
796
+ CHECK_BOX_HEIGHT = 2
797
+ LEFT_ROW_WIDTH = -10
798
+ LABEL_HEIGHT = -5
799
+ OPTION_HEIGHT = 7
800
+ LOW_MENU_Y = 18, 16
801
+ FFMPEG_EXT = (".aac", ".aiff", ".alac" ,".flac", ".FLAC", ".mov", ".mp4", ".MP4",
802
+ ".m4a", ".M4A", ".mp2", ".mp3", "MP3", ".mpc", ".mpc8",
803
+ ".mpeg", ".ogg", ".OGG", ".tta", ".wav", ".wave", ".WAV", ".WAVE", ".wma", ".webm", ".eac3", ".mkv")
804
+
805
+ FFMPEG_MORE_EXT = (".aa", ".aac", ".ac3", ".aiff", ".alac", ".avi", ".f4v",".flac", ".flic", ".flv",
806
+ ".m4v",".mlv", ".mov", ".mp4", ".m4a", ".mp2", ".mp3", ".mp4", ".mpc", ".mpc8",
807
+ ".mpeg", ".ogg", ".tta", ".tty", ".vcd", ".wav", ".wma")
808
+ ANY_EXT = ""
809
+
810
+ # Secondary Menu Constants
811
+
812
+ VOCAL_PAIR_PLACEMENT = 1, 2, 3, 4
813
+ OTHER_PAIR_PLACEMENT = 5, 6, 7, 8
814
+ BASS_PAIR_PLACEMENT = 9, 10, 11, 12
815
+ DRUMS_PAIR_PLACEMENT = 13, 14, 15, 16
816
+
817
+ # Drag n Drop String Checks
818
+
819
+ DOUBLE_BRACKET = "} {"
820
+ RIGHT_BRACKET = "}"
821
+ LEFT_BRACKET = "{"
822
+
823
+ # Manual Downloads
824
+
825
+ VR_PLACEMENT_TEXT = 'Place models in \"models/VR_Models\" directory.'
826
+ MDX_PLACEMENT_TEXT = 'Place models in \"models/MDX_Net_Models\" directory.'
827
+ DEMUCS_PLACEMENT_TEXT = 'Place models in \"models/Demucs_Models\" directory.'
828
+ DEMUCS_V3_V4_PLACEMENT_TEXT = 'Place items in \"models/Demucs_Models/v3_v4_repo\" directory.'
829
+
830
+ FULL_DOWNLOAD_LIST_VR = {
831
+ "VR Arch Single Model v5: 1_HP-UVR": "1_HP-UVR.pth",
832
+ "VR Arch Single Model v5: 2_HP-UVR": "2_HP-UVR.pth",
833
+ "VR Arch Single Model v5: 3_HP-Vocal-UVR": "3_HP-Vocal-UVR.pth",
834
+ "VR Arch Single Model v5: 4_HP-Vocal-UVR": "4_HP-Vocal-UVR.pth",
835
+ "VR Arch Single Model v5: 5_HP-Karaoke-UVR": "5_HP-Karaoke-UVR.pth",
836
+ "VR Arch Single Model v5: 6_HP-Karaoke-UVR": "6_HP-Karaoke-UVR.pth",
837
+ "VR Arch Single Model v5: 7_HP2-UVR": "7_HP2-UVR.pth",
838
+ "VR Arch Single Model v5: 8_HP2-UVR": "8_HP2-UVR.pth",
839
+ "VR Arch Single Model v5: 9_HP2-UVR": "9_HP2-UVR.pth",
840
+ "VR Arch Single Model v5: 10_SP-UVR-2B-32000-1": "10_SP-UVR-2B-32000-1.pth",
841
+ "VR Arch Single Model v5: 11_SP-UVR-2B-32000-2": "11_SP-UVR-2B-32000-2.pth",
842
+ "VR Arch Single Model v5: 12_SP-UVR-3B-44100": "12_SP-UVR-3B-44100.pth",
843
+ "VR Arch Single Model v5: 13_SP-UVR-4B-44100-1": "13_SP-UVR-4B-44100-1.pth",
844
+ "VR Arch Single Model v5: 14_SP-UVR-4B-44100-2": "14_SP-UVR-4B-44100-2.pth",
845
+ "VR Arch Single Model v5: 15_SP-UVR-MID-44100-1": "15_SP-UVR-MID-44100-1.pth",
846
+ "VR Arch Single Model v5: 16_SP-UVR-MID-44100-2": "16_SP-UVR-MID-44100-2.pth",
847
+ "VR Arch Single Model v4: MGM_HIGHEND_v4": "MGM_HIGHEND_v4.pth",
848
+ "VR Arch Single Model v4: MGM_LOWEND_A_v4": "MGM_LOWEND_A_v4.pth",
849
+ "VR Arch Single Model v4: MGM_LOWEND_B_v4": "MGM_LOWEND_B_v4.pth",
850
+ "VR Arch Single Model v4: MGM_MAIN_v4": "MGM_MAIN_v4.pth"
851
+ }
852
+
853
+ FULL_DOWNLOAD_LIST_MDX = {
854
+ "MDX-Net Model: UVR-MDX-NET Main": "UVR_MDXNET_Main.onnx",
855
+ "MDX-Net Model: UVR-MDX-NET Inst Main": "UVR-MDX-NET-Inst_Main.onnx",
856
+ "MDX-Net Model: UVR-MDX-NET 1": "UVR_MDXNET_1_9703.onnx",
857
+ "MDX-Net Model: UVR-MDX-NET 2": "UVR_MDXNET_2_9682.onnx",
858
+ "MDX-Net Model: UVR-MDX-NET 3": "UVR_MDXNET_3_9662.onnx",
859
+ "MDX-Net Model: UVR-MDX-NET Inst 1": "UVR-MDX-NET-Inst_1.onnx",
860
+ "MDX-Net Model: UVR-MDX-NET Inst 2": "UVR-MDX-NET-Inst_2.onnx",
861
+ "MDX-Net Model: UVR-MDX-NET Inst 3": "UVR-MDX-NET-Inst_3.onnx",
862
+ "MDX-Net Model: UVR-MDX-NET Karaoke": "UVR_MDXNET_KARA.onnx",
863
+ "MDX-Net Model: UVR_MDXNET_9482": "UVR_MDXNET_9482.onnx",
864
+ "MDX-Net Model: Kim_Vocal_1": "Kim_Vocal_1.onnx",
865
+ "MDX-Net Model: kuielab_a_vocals": "kuielab_a_vocals.onnx",
866
+ "MDX-Net Model: kuielab_a_other": "kuielab_a_other.onnx",
867
+ "MDX-Net Model: kuielab_a_bass": "kuielab_a_bass.onnx",
868
+ "MDX-Net Model: kuielab_a_drums": "kuielab_a_drums.onnx",
869
+ "MDX-Net Model: kuielab_b_vocals": "kuielab_b_vocals.onnx",
870
+ "MDX-Net Model: kuielab_b_other": "kuielab_b_other.onnx",
871
+ "MDX-Net Model: kuielab_b_bass": "kuielab_b_bass.onnx",
872
+ "MDX-Net Model: kuielab_b_drums": "kuielab_b_drums.onnx"}
873
+
874
+ FULL_DOWNLOAD_LIST_DEMUCS = {
875
+
876
+ "Demucs v4: htdemucs_ft":{
877
+ "f7e0c4bc-ba3fe64a.th":"https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/f7e0c4bc-ba3fe64a.th",
878
+ "d12395a8-e57c48e6.th":"https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/d12395a8-e57c48e6.th",
879
+ "92cfc3b6-ef3bcb9c.th":"https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/92cfc3b6-ef3bcb9c.th",
880
+ "04573f0d-f3cf25b2.th":"https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/04573f0d-f3cf25b2.th",
881
+ "htdemucs_ft.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/htdemucs_ft.yaml"
882
+ },
883
+
884
+ "Demucs v4: htdemucs":{
885
+ "955717e8-8726e21a.th": "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/955717e8-8726e21a.th",
886
+ "htdemucs.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/htdemucs.yaml"
887
+ },
888
+
889
+ "Demucs v4: hdemucs_mmi":{
890
+ "75fc33f5-1941ce65.th": "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/75fc33f5-1941ce65.th",
891
+ "hdemucs_mmi.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/hdemucs_mmi.yaml"
892
+ },
893
+ "Demucs v4: htdemucs_6s":{
894
+ "5c90dfd2-34c22ccb.th": "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/5c90dfd2-34c22ccb.th",
895
+ "htdemucs_6s.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/htdemucs_6s.yaml"
896
+ },
897
+ "Demucs v3: mdx":{
898
+ "0d19c1c6-0f06f20e.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/0d19c1c6-0f06f20e.th",
899
+ "7ecf8ec1-70f50cc9.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/7ecf8ec1-70f50cc9.th",
900
+ "c511e2ab-fe698775.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/c511e2ab-fe698775.th",
901
+ "7d865c68-3d5dd56b.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/7d865c68-3d5dd56b.th",
902
+ "mdx.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/mdx.yaml"
903
+ },
904
+
905
+ "Demucs v3: mdx_q":{
906
+ "6b9c2ca1-3fd82607.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/6b9c2ca1-3fd82607.th",
907
+ "b72baf4e-8778635e.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/b72baf4e-8778635e.th",
908
+ "42e558d4-196e0e1b.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/42e558d4-196e0e1b.th",
909
+ "305bc58f-18378783.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/305bc58f-18378783.th",
910
+ "mdx_q.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/mdx_q.yaml"
911
+ },
912
+
913
+ "Demucs v3: mdx_extra":{
914
+ "e51eebcc-c1b80bdd.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/e51eebcc-c1b80bdd.th",
915
+ "a1d90b5c-ae9d2452.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/a1d90b5c-ae9d2452.th",
916
+ "5d2d6c55-db83574e.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/5d2d6c55-db83574e.th",
917
+ "cfa93e08-61801ae1.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/cfa93e08-61801ae1.th",
918
+ "mdx_extra.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/mdx_extra.yaml"
919
+ },
920
+
921
+ "Demucs v3: mdx_extra_q": {
922
+ "83fc094f-4a16d450.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/83fc094f-4a16d450.th",
923
+ "464b36d7-e5a9386e.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/464b36d7-e5a9386e.th",
924
+ "14fc6a69-a89dd0ee.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/14fc6a69-a89dd0ee.th",
925
+ "7fd6ef75-a905dd85.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/7fd6ef75-a905dd85.th",
926
+ "mdx_extra_q.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/mdx_extra_q.yaml"
927
+ },
928
+
929
+ "Demucs v3: UVR Model":{
930
+ "ebf34a2db.th": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/ebf34a2db.th",
931
+ "UVR_Demucs_Model_1.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/UVR_Demucs_Model_1.yaml"
932
+ },
933
+
934
+ "Demucs v3: repro_mdx_a":{
935
+ "9a6b4851-03af0aa6.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/9a6b4851-03af0aa6.th",
936
+ "1ef250f1-592467ce.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/1ef250f1-592467ce.th",
937
+ "fa0cb7f9-100d8bf4.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/fa0cb7f9-100d8bf4.th",
938
+ "902315c2-b39ce9c9.th": "https://dl.fbaipublicfiles.com/demucs/mdx_final/902315c2-b39ce9c9.th",
939
+ "repro_mdx_a.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/repro_mdx_a.yaml"
940
+ },
941
+
942
+ "Demucs v3: repro_mdx_a_time_only":{
943
+ "9a6b4851-03af0aa6.th":"https://dl.fbaipublicfiles.com/demucs/mdx_final/9a6b4851-03af0aa6.th",
944
+ "1ef250f1-592467ce.th":"https://dl.fbaipublicfiles.com/demucs/mdx_final/1ef250f1-592467ce.th",
945
+ "repro_mdx_a_time_only.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/repro_mdx_a_time_only.yaml"
946
+ },
947
+
948
+ "Demucs v3: repro_mdx_a_hybrid_only":{
949
+ "fa0cb7f9-100d8bf4.th":"https://dl.fbaipublicfiles.com/demucs/mdx_final/fa0cb7f9-100d8bf4.th",
950
+ "902315c2-b39ce9c9.th":"https://dl.fbaipublicfiles.com/demucs/mdx_final/902315c2-b39ce9c9.th",
951
+ "repro_mdx_a_hybrid_only.yaml": "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/repro_mdx_a_hybrid_only.yaml"
952
+ },
953
+
954
+ "Demucs v2: demucs": {
955
+ "demucs-e07c671f.th": "https://dl.fbaipublicfiles.com/demucs/v3.0/demucs-e07c671f.th"
956
+ },
957
+
958
+ "Demucs v2: demucs_extra": {
959
+ "demucs_extra-3646af93.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/demucs_extra-3646af93.th"
960
+ },
961
+
962
+ "Demucs v2: demucs48_hq": {
963
+ "demucs48_hq-28a1282c.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/demucs48_hq-28a1282c.th"
964
+ },
965
+
966
+ "Demucs v2: tasnet": {
967
+ "tasnet-beb46fac.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/tasnet-beb46fac.th"
968
+ },
969
+
970
+ "Demucs v2: tasnet_extra": {
971
+ "tasnet_extra-df3777b2.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/tasnet_extra-df3777b2.th"
972
+ },
973
+
974
+ "Demucs v2: demucs_unittest": {
975
+ "demucs_unittest-09ebc15f.th":"https://dl.fbaipublicfiles.com/demucs/v3.0/demucs_unittest-09ebc15f.th"
976
+ },
977
+
978
+ "Demucs v1: demucs": {
979
+ "demucs.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/demucs.th"
980
+ },
981
+
982
+ "Demucs v1: demucs_extra": {
983
+ "demucs_extra.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/demucs_extra.th"
984
+ },
985
+
986
+ "Demucs v1: light": {
987
+ "light.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/light.th"
988
+ },
989
+
990
+ "Demucs v1: light_extra": {
991
+ "light_extra.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/light_extra.th"
992
+ },
993
+
994
+ "Demucs v1: tasnet": {
995
+ "tasnet.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/tasnet.th"
996
+ },
997
+
998
+ "Demucs v1: tasnet_extra": {
999
+ "tasnet_extra.th":"https://dl.fbaipublicfiles.com/demucs/v2.0/tasnet_extra.th"
1000
+ }
1001
+ }
1002
+
1003
+ # Main Menu Labels
1004
+
1005
+ CHOOSE_PROC_METHOD_MAIN_LABEL = 'CHOOSE PROCESS METHOD'
1006
+ SELECT_SAVED_SETTINGS_MAIN_LABEL = 'SELECT SAVED SETTINGS'
1007
+ CHOOSE_MDX_MODEL_MAIN_LABEL = 'CHOOSE MDX-NET MODEL'
1008
+ BATCHES_MDX_MAIN_LABEL = 'BATCH SIZE'
1009
+ VOL_COMP_MDX_MAIN_LABEL = 'VOLUME COMPENSATION'
1010
+ SELECT_VR_MODEL_MAIN_LABEL = 'CHOOSE VR MODEL'
1011
+ AGGRESSION_SETTING_MAIN_LABEL = 'AGGRESSION SETTING'
1012
+ WINDOW_SIZE_MAIN_LABEL = 'WINDOW SIZE'
1013
+ CHOOSE_DEMUCS_MODEL_MAIN_LABEL = 'CHOOSE DEMUCS MODEL'
1014
+ CHOOSE_DEMUCS_STEMS_MAIN_LABEL = 'CHOOSE STEM(S)'
1015
+ CHOOSE_SEGMENT_MAIN_LABEL = 'SEGMENT'
1016
+ ENSEMBLE_OPTIONS_MAIN_LABEL = 'ENSEMBLE OPTIONS'
1017
+ CHOOSE_MAIN_PAIR_MAIN_LABEL = 'MAIN STEM PAIR'
1018
+ CHOOSE_ENSEMBLE_ALGORITHM_MAIN_LABEL = 'ENSEMBLE ALGORITHM'
1019
+ AVAILABLE_MODELS_MAIN_LABEL = 'AVAILABLE MODELS'
1020
+ CHOOSE_AUDIO_TOOLS_MAIN_LABEL = 'CHOOSE AUDIO TOOL'
1021
+ CHOOSE_MANUAL_ALGORITHM_MAIN_LABEL = 'CHOOSE ALGORITHM'
1022
+ CHOOSE_RATE_MAIN_LABEL = 'RATE'
1023
+ CHOOSE_SEMITONES_MAIN_LABEL = 'SEMITONES'
1024
+ GPU_CONVERSION_MAIN_LABEL = 'GPU Conversion'
1025
+
1026
+ if OPERATING_SYSTEM=="Darwin":
1027
+ LICENSE_OS_SPECIFIC_TEXT = '• This application is intended for those running macOS Catalina and above.\n' +\
1028
+ '• Application functionality for systems running macOS Mojave or lower is not guaranteed.\n' +\
1029
+ '• Application functionality for older or budget Mac systems is not guaranteed.\n\n'
1030
+ FONT_SIZE_F1 = 13
1031
+ FONT_SIZE_F2 = 11
1032
+ FONT_SIZE_F3 = 12
1033
+ FONT_SIZE_0 = 9
1034
+ FONT_SIZE_1 = 11
1035
+ FONT_SIZE_2 = 12
1036
+ FONT_SIZE_3 = 13
1037
+ FONT_SIZE_4 = 14
1038
+ FONT_SIZE_5 = 15
1039
+ FONT_SIZE_6 = 17
1040
+ HELP_HINT_CHECKBOX_WIDTH = 13
1041
+ MDX_CHECKBOXS_WIDTH = 14
1042
+ VR_CHECKBOXS_WIDTH = 14
1043
+ ENSEMBLE_CHECKBOXS_WIDTH = 18
1044
+ DEMUCS_CHECKBOXS_WIDTH = 14
1045
+ DEMUCS_PRE_CHECKBOXS_WIDTH = 20
1046
+ GEN_SETTINGS_WIDTH = 17
1047
+ MENU_COMBOBOX_WIDTH = 16
1048
+
1049
+ elif OPERATING_SYSTEM=="Linux":
1050
+ LICENSE_OS_SPECIFIC_TEXT = '• This application is intended for those running Linux Ubuntu 18.04+.\n' +\
1051
+ '• Application functionality for systems running other Linux platforms is not guaranteed.\n' +\
1052
+ '• Application functionality for older or budget systems is not guaranteed.\n\n'
1053
+ FONT_SIZE_F1 = 10
1054
+ FONT_SIZE_F2 = 8
1055
+ FONT_SIZE_F3 = 9
1056
+ FONT_SIZE_0 = 7
1057
+ FONT_SIZE_1 = 8
1058
+ FONT_SIZE_2 = 9
1059
+ FONT_SIZE_3 = 10
1060
+ FONT_SIZE_4 = 11
1061
+ FONT_SIZE_5 = 12
1062
+ FONT_SIZE_6 = 15
1063
+ HELP_HINT_CHECKBOX_WIDTH = 13
1064
+ MDX_CHECKBOXS_WIDTH = 14
1065
+ VR_CHECKBOXS_WIDTH = 16
1066
+ ENSEMBLE_CHECKBOXS_WIDTH = 25
1067
+ DEMUCS_CHECKBOXS_WIDTH = 18
1068
+ DEMUCS_PRE_CHECKBOXS_WIDTH = 27
1069
+ GEN_SETTINGS_WIDTH = 17
1070
+ MENU_COMBOBOX_WIDTH = 19
1071
+
1072
+ elif OPERATING_SYSTEM=="Windows":
1073
+ LICENSE_OS_SPECIFIC_TEXT = '• This application is intended for those running Windows 10 or higher.\n' +\
1074
+ '• Application functionality for systems running Windows 7 or lower is not guaranteed.\n' +\
1075
+ '• Application functionality for systems with Intel Pentium & Celeron CPUs is not guaranteed.\n\n'
1076
+ FONT_SIZE_F1 = 10
1077
+ FONT_SIZE_F2 = 8
1078
+ FONT_SIZE_F3 = 9
1079
+ FONT_SIZE_0 = 7
1080
+ FONT_SIZE_1 = 8
1081
+ FONT_SIZE_2 = 9
1082
+ FONT_SIZE_3 = 10
1083
+ FONT_SIZE_4 = 11
1084
+ FONT_SIZE_5 = 12
1085
+ FONT_SIZE_6 = 15
1086
+ HELP_HINT_CHECKBOX_WIDTH = 16
1087
+ MDX_CHECKBOXS_WIDTH = 16
1088
+ VR_CHECKBOXS_WIDTH = 16
1089
+ ENSEMBLE_CHECKBOXS_WIDTH = 25
1090
+ DEMUCS_CHECKBOXS_WIDTH = 18
1091
+ DEMUCS_PRE_CHECKBOXS_WIDTH = 27
1092
+ GEN_SETTINGS_WIDTH = 23
1093
+ MENU_COMBOBOX_WIDTH = 19
1094
+
1095
+
1096
+ LICENSE_TEXT = lambda a, p:f'Current Application Version: Ultimate Vocal Remover {a}\n' +\
1097
+ f'Current Patch Version: {p}\n\n' +\
1098
+ 'Copyright (c) 2022 Ultimate Vocal Remover\n\n' +\
1099
+ 'UVR is free and open-source, but MIT licensed. Please credit us if you use our\n' +\
1100
+ f'models or code for projects unrelated to UVR.\n\n{LICENSE_OS_SPECIFIC_TEXT}' +\
1101
+ 'This bundle contains the UVR interface, Python, PyTorch, and other\n' +\
1102
+ 'dependencies needed to run the application effectively.\n\n' +\
1103
+ 'Website Links: This application, System or Service(s) may contain links to\n' +\
1104
+ 'other websites and downloads, and they are solely provided to you as an\n' +\
1105
+ 'additional convenience. You understand and acknowledge that by clicking\n' +\
1106
+ 'or activating such links you are accessing a site or service outside of\n' +\
1107
+ 'this application, and that we do not screen, review, approve, or otherwise\n' +\
1108
+ 'endorse any content or information contained in these linked websites.\n' +\
1109
+ 'You acknowledge and agree that we, our affiliates and partners are not\n' +\
1110
+ 'responsible for the contents of any of these linked websites, including\n' +\
1111
+ 'the accuracy or availability of information provided by the linked websites,\n' +\
1112
+ 'and we make no representations or warranties regarding your use of\n' +\
1113
+ 'the linked websites.\n\n' +\
1114
+ 'This application is MIT Licensed\n\n' +\
1115
+ 'Permission is hereby granted, free of charge, to any person obtaining a copy\n' +\
1116
+ 'of this software and associated documentation files (the "Software"), to deal\n' +\
1117
+ 'in the Software without restriction, including without limitation the rights\n' +\
1118
+ 'to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n' +\
1119
+ 'copies of the Software, and to permit persons to whom the Software is\n' +\
1120
+ 'furnished to do so, subject to the following conditions:\n\n' +\
1121
+ 'The above copyright notice and this permission notice shall be included in all\n' +\
1122
+ 'copies or substantial portions of the Software.\n\n' +\
1123
+ 'THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n' +\
1124
+ 'IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n' +\
1125
+ 'FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n' +\
1126
+ 'AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n' +\
1127
+ 'LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n' +\
1128
+ 'OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n' +\
1129
+ 'SOFTWARE.'
1130
+
1131
+ CHANGE_LOG_HEADER = lambda patch:f"Patch Version:\n\n{patch}"
1132
+
1133
+ #DND CONSTS
1134
+
1135
+ MAC_DND_CHECK = ('/Users/',
1136
+ '/Applications/',
1137
+ '/Library/',
1138
+ '/System/')
1139
+ LINUX_DND_CHECK = ('/home/',
1140
+ '/usr/')
1141
+ WINDOWS_DND_CHECK = ('A:', 'B:', 'C:', 'D:', 'E:', 'F:', 'G:', 'H:', 'I:', 'J:', 'K:', 'L:', 'M:', 'N:', 'O:', 'P:', 'Q:', 'R:', 'S:', 'T:', 'U:', 'V:', 'W:', 'X:', 'Y:', 'Z:')
1142
+
1143
+ WOOD_INST_MODEL_HASH = '0ec76fd9e65f81d8b4fbd13af4826ed8'
1144
+ WOOD_INST_PARAMS = {
1145
+ "vr_model_param": "4band_v3",
1146
+ "primary_stem": NO_WIND_INST_STEM
1147
+ }
uvr5/lib_v5/mdxnet.py ADDED
@@ -0,0 +1,140 @@
1
+ from abc import ABCMeta
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ from pytorch_lightning import LightningModule
6
+ from .modules import TFC_TDF
7
+
8
+ dim_s = 4
9
+
10
+ class AbstractMDXNet(LightningModule):
11
+ __metaclass__ = ABCMeta
12
+
13
+ def __init__(self, target_name, lr, optimizer, dim_c, dim_f, dim_t, n_fft, hop_length, overlap):
14
+ super().__init__()
15
+ self.target_name = target_name
16
+ self.lr = lr
17
+ self.optimizer = optimizer
18
+ self.dim_c = dim_c
19
+ self.dim_f = dim_f
20
+ self.dim_t = dim_t
21
+ self.n_fft = n_fft
22
+ self.n_bins = n_fft // 2 + 1
23
+ self.hop_length = hop_length
24
+ self.window = nn.Parameter(torch.hann_window(window_length=self.n_fft, periodic=True), requires_grad=False)
25
+ self.freq_pad = nn.Parameter(torch.zeros([1, dim_c, self.n_bins - self.dim_f, self.dim_t]), requires_grad=False)
26
+
27
+ def configure_optimizers(self):
28
+ if self.optimizer == 'rmsprop':
29
+ return torch.optim.RMSprop(self.parameters(), self.lr)
30
+
31
+ if self.optimizer == 'adamw':
32
+ return torch.optim.AdamW(self.parameters(), self.lr)
33
+
34
+ class ConvTDFNet(AbstractMDXNet):
35
+ def __init__(self, target_name, lr, optimizer, dim_c, dim_f, dim_t, n_fft, hop_length,
36
+ num_blocks, l, g, k, bn, bias, overlap):
37
+
38
+ super(ConvTDFNet, self).__init__(
39
+ target_name, lr, optimizer, dim_c, dim_f, dim_t, n_fft, hop_length, overlap)
40
+ self.save_hyperparameters()
41
+
42
+ self.num_blocks = num_blocks
43
+ self.l = l
44
+ self.g = g
45
+ self.k = k
46
+ self.bn = bn
47
+ self.bias = bias
48
+
49
+ if optimizer == 'rmsprop':
50
+ norm = nn.BatchNorm2d
51
+
52
+ if optimizer == 'adamw':
53
+ norm = lambda input:nn.GroupNorm(2, input)
54
+
55
+ self.n = num_blocks // 2
56
+ scale = (2, 2)
57
+
58
+ self.first_conv = nn.Sequential(
59
+ nn.Conv2d(in_channels=self.dim_c, out_channels=g, kernel_size=(1, 1)),
60
+ norm(g),
61
+ nn.ReLU(),
62
+ )
63
+
64
+ f = self.dim_f
65
+ c = g
66
+ self.encoding_blocks = nn.ModuleList()
67
+ self.ds = nn.ModuleList()
68
+ for i in range(self.n):
69
+ self.encoding_blocks.append(TFC_TDF(c, l, f, k, bn, bias=bias, norm=norm))
70
+ self.ds.append(
71
+ nn.Sequential(
72
+ nn.Conv2d(in_channels=c, out_channels=c + g, kernel_size=scale, stride=scale),
73
+ norm(c + g),
74
+ nn.ReLU()
75
+ )
76
+ )
77
+ f = f // 2
78
+ c += g
79
+
80
+ self.bottleneck_block = TFC_TDF(c, l, f, k, bn, bias=bias, norm=norm)
81
+
82
+ self.decoding_blocks = nn.ModuleList()
83
+ self.us = nn.ModuleList()
84
+ for i in range(self.n):
85
+ self.us.append(
86
+ nn.Sequential(
87
+ nn.ConvTranspose2d(in_channels=c, out_channels=c - g, kernel_size=scale, stride=scale),
88
+ norm(c - g),
89
+ nn.ReLU()
90
+ )
91
+ )
92
+ f = f * 2
93
+ c -= g
94
+
95
+ self.decoding_blocks.append(TFC_TDF(c, l, f, k, bn, bias=bias, norm=norm))
96
+
97
+ self.final_conv = nn.Sequential(
98
+ nn.Conv2d(in_channels=c, out_channels=self.dim_c, kernel_size=(1, 1)),
99
+ )
100
+
101
+ def forward(self, x):
102
+
103
+ x = self.first_conv(x)
104
+
105
+ x = x.transpose(-1, -2)
106
+
107
+ ds_outputs = []
108
+ for i in range(self.n):
109
+ x = self.encoding_blocks[i](x)
110
+ ds_outputs.append(x)
111
+ x = self.ds[i](x)
112
+
113
+ x = self.bottleneck_block(x)
114
+
115
+ for i in range(self.n):
116
+ x = self.us[i](x)
117
+ x *= ds_outputs[-i - 1]
118
+ x = self.decoding_blocks[i](x)
119
+
120
+ x = x.transpose(-1, -2)
121
+
122
+ x = self.final_conv(x)
123
+
124
+ return x
125
+
126
+ class Mixer(nn.Module):
127
+ def __init__(self, device, mixer_path):
128
+
129
+ super(Mixer, self).__init__()
130
+
131
+ self.linear = nn.Linear((dim_s+1)*2, dim_s*2, bias=False)
132
+
133
+ self.load_state_dict(
134
+ torch.load(mixer_path, map_location=device)
135
+ )
136
+
137
+ def forward(self, x):
138
+ x = x.reshape(1,(dim_s+1)*2,-1).transpose(-1,-2)
139
+ x = self.linear(x)
140
+ return x.transpose(-1,-2).reshape(dim_s,2,-1)
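For orientation, a minimal shape-check sketch of driving this network (the hyperparameters below are illustrative only, chosen to satisfy the encoder's (2, 2) downsampling divisibility; they are not the values of any shipped model, and the import path assumes the repository root is on sys.path):

    import torch
    from uvr5.lib_v5.mdxnet import ConvTDFNet

    net = ConvTDFNet(target_name='vocals', lr=1e-4, optimizer='adamw',
                     dim_c=4, dim_f=64, dim_t=32, n_fft=2048, hop_length=512,
                     num_blocks=4, l=3, g=8, k=3, bn=4, bias=False, overlap=0)
    x = torch.randn(1, 4, 64, 32)    # (batch, dim_c, dim_f, dim_t)
    assert net(x).shape == x.shape   # U-Net output matches the input shape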
uvr5/lib_v5/mixer.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea781bd52c6a523b825fa6cdbb6189f52e318edd8b17e6fe404f76f7af8caa9c
3
+ size 1208
uvr5/lib_v5/modules.py ADDED
@@ -0,0 +1,74 @@
1
+ import torch
2
+ import torch.nn as nn
3
+
4
+
5
+ class TFC(nn.Module):
6
+ def __init__(self, c, l, k, norm):
7
+ super(TFC, self).__init__()
8
+
9
+ self.H = nn.ModuleList()
10
+ for i in range(l):
11
+ self.H.append(
12
+ nn.Sequential(
13
+ nn.Conv2d(in_channels=c, out_channels=c, kernel_size=k, stride=1, padding=k // 2),
14
+ norm(c),
15
+ nn.ReLU(),
16
+ )
17
+ )
18
+
19
+ def forward(self, x):
20
+ for h in self.H:
21
+ x = h(x)
22
+ return x
23
+
24
+
25
+ class DenseTFC(nn.Module):
26
+ def __init__(self, c, l, k, norm):
27
+ super(DenseTFC, self).__init__()
28
+
29
+ self.conv = nn.ModuleList()
30
+ for i in range(l):
31
+ self.conv.append(
32
+ nn.Sequential(
33
+ nn.Conv2d(in_channels=c, out_channels=c, kernel_size=k, stride=1, padding=k // 2),
34
+ norm(c),
35
+ nn.ReLU(),
36
+ )
37
+ )
38
+
39
+ def forward(self, x):
40
+ for layer in self.conv[:-1]:
41
+ x = torch.cat([layer(x), x], 1)
42
+ return self.conv[-1](x)
43
+
44
+
45
+ class TFC_TDF(nn.Module):
46
+ def __init__(self, c, l, f, k, bn, dense=False, bias=True, norm=nn.BatchNorm2d):
47
+
48
+ super(TFC_TDF, self).__init__()
49
+
50
+ self.use_tdf = bn is not None
51
+
52
+ self.tfc = DenseTFC(c, l, k, norm) if dense else TFC(c, l, k, norm)
53
+
54
+ if self.use_tdf:
55
+ if bn == 0:
56
+ self.tdf = nn.Sequential(
57
+ nn.Linear(f, f, bias=bias),
58
+ norm(c),
59
+ nn.ReLU()
60
+ )
61
+ else:
62
+ self.tdf = nn.Sequential(
63
+ nn.Linear(f, f // bn, bias=bias),
64
+ norm(c),
65
+ nn.ReLU(),
66
+ nn.Linear(f // bn, f, bias=bias),
67
+ norm(c),
68
+ nn.ReLU()
69
+ )
70
+
71
+ def forward(self, x):
72
+ x = self.tfc(x)
73
+ return x + self.tdf(x) if self.use_tdf else x
74
+
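A quick shape check of the TFC_TDF block in isolation (illustrative sizes; the import assumes the repository root is on sys.path). The TDF branch is a bottlenecked pair of Linear layers applied along the last (frequency) axis and added residually to the TFC output:

    import torch
    from uvr5.lib_v5.modules import TFC_TDF

    block = TFC_TDF(c=8, l=2, f=64, k=3, bn=4)  # bn=4 -> Linear(64, 16) bottleneck
    x = torch.randn(1, 8, 100, 64)              # (batch, channels, time, freq)
    print(block(x).shape)                       # torch.Size([1, 8, 100, 64])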
uvr5/lib_v5/pyrb.py ADDED
@@ -0,0 +1,92 @@
1
+ import os
2
+ import subprocess
3
+ import tempfile
4
+ import six
5
+ import numpy as np
6
+ import soundfile as sf
7
+ import sys
8
+
9
+ if getattr(sys, 'frozen', False):
10
+ BASE_PATH_RUB = sys._MEIPASS
11
+ else:
12
+ BASE_PATH_RUB = os.path.dirname(os.path.abspath(__file__))
13
+
14
+ __all__ = ['time_stretch', 'pitch_shift']
15
+
16
+ __RUBBERBAND_UTIL = os.path.join(BASE_PATH_RUB, 'rubberband')
17
+
18
+ if six.PY2:
19
+ DEVNULL = open(os.devnull, 'w')
20
+ else:
21
+ DEVNULL = subprocess.DEVNULL
22
+
23
+ def __rubberband(y, sr, **kwargs):
24
+
25
+ assert sr > 0
26
+
27
+ # Get the input and output tempfile
28
+ fd, infile = tempfile.mkstemp(suffix='.wav')
29
+ os.close(fd)
30
+ fd, outfile = tempfile.mkstemp(suffix='.wav')
31
+ os.close(fd)
32
+
33
+ # dump the audio
34
+ sf.write(infile, y, sr)
35
+
36
+ try:
37
+ # Execute rubberband
38
+ arguments = [__RUBBERBAND_UTIL, '-q']
39
+
40
+ for key, value in six.iteritems(kwargs):
41
+ arguments.append(str(key))
42
+ arguments.append(str(value))
43
+
44
+ arguments.extend([infile, outfile])
45
+
46
+ subprocess.check_call(arguments, stdout=DEVNULL, stderr=DEVNULL)
47
+
48
+ # Load the processed audio.
49
+ y_out, _ = sf.read(outfile, always_2d=True)
50
+
51
+ # make sure that output dimensions matches input
52
+ if y.ndim == 1:
53
+ y_out = np.squeeze(y_out)
54
+
55
+ except OSError as exc:
56
+ six.raise_from(RuntimeError('Failed to execute rubberband. '
57
+ 'Please verify that rubberband-cli '
58
+ 'is installed.'),
59
+ exc)
60
+
61
+ finally:
62
+ # Remove temp files
63
+ os.unlink(infile)
64
+ os.unlink(outfile)
65
+
66
+ return y_out
67
+
68
+ def time_stretch(y, sr, rate, rbargs=None):
69
+ if rate <= 0:
70
+ raise ValueError('rate must be strictly positive')
71
+
72
+ if rate == 1.0:
73
+ return y
74
+
75
+ if rbargs is None:
76
+ rbargs = dict()
77
+
78
+ rbargs.setdefault('--tempo', rate)
79
+
80
+ return __rubberband(y, sr, **rbargs)
81
+
82
+ def pitch_shift(y, sr, n_steps, rbargs=None):
83
+
84
+ if n_steps == 0:
85
+ return y
86
+
87
+ if rbargs is None:
88
+ rbargs = dict()
89
+
90
+ rbargs.setdefault('--pitch', n_steps)
91
+
92
+ return __rubberband(y, sr, **rbargs)
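Typical usage, as a hedged example (the input path is hypothetical, and the module shells out to the bundled rubberband binary, which must be present and executable):

    import soundfile as sf
    from uvr5.lib_v5 import pyrb

    y, sr = sf.read('input.wav')                  # hypothetical input file
    slower = pyrb.time_stretch(y, sr, rate=0.5)   # rate < 1 slows down ('--tempo 0.5')
    higher = pyrb.pitch_shift(y, sr, n_steps=2)   # up two semitones ('--pitch 2')
    sf.write('slower.wav', slower, sr)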
uvr5/lib_v5/spec_utils.py ADDED
@@ -0,0 +1,703 @@
1
+ import librosa
2
+ import numpy as np
3
+ import soundfile as sf
4
+ import math
+ import random
+ import platform
+ import traceback
11
+ OPERATING_SYSTEM = platform.system()
12
+ SYSTEM_ARCH = platform.platform()
13
+ SYSTEM_PROC = platform.processor()
14
+ ARM = 'arm'
15
+
16
+ if OPERATING_SYSTEM == 'Windows':
17
+ from pyrubberband import pyrb
18
+ else:
19
+ from . import pyrb
20
+
21
+ if OPERATING_SYSTEM == 'Darwin':
22
+ wav_resolution = "polyphase" if SYSTEM_PROC == ARM or ARM in SYSTEM_ARCH else "sinc_fastest"
23
+ else:
24
+ wav_resolution = "sinc_fastest"
25
+
26
+ MAX_SPEC = 'Max Spec'
27
+ MIN_SPEC = 'Min Spec'
28
+ AVERAGE = 'Average'
29
+
30
+ def crop_center(h1, h2):
31
+ h1_shape = h1.size()
32
+ h2_shape = h2.size()
33
+
34
+ if h1_shape[3] == h2_shape[3]:
35
+ return h1
36
+ elif h1_shape[3] < h2_shape[3]:
37
+ raise ValueError('h1_shape[3] must be greater than h2_shape[3]')
38
+
39
+ s_time = (h1_shape[3] - h2_shape[3]) // 2
40
+ e_time = s_time + h2_shape[3]
41
+ h1 = h1[:, :, :, s_time:e_time]
42
+
43
+ return h1
44
+
45
+ def preprocess(X_spec):
46
+ X_mag = np.abs(X_spec)
47
+ X_phase = np.angle(X_spec)
48
+
49
+ return X_mag, X_phase
50
+
51
+ def make_padding(width, cropsize, offset):
52
+ left = offset
53
+ roi_size = cropsize - offset * 2
54
+ if roi_size == 0:
55
+ roi_size = cropsize
56
+ right = roi_size - (width % roi_size) + left
57
+
58
+ return left, right, roi_size
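A worked example of this padding arithmetic (values are illustrative):

    left, right, roi = make_padding(1000, 256, 64)
    # roi = 256 - 2 * 64 = 128; right = 128 - (1000 % 128) + 64 = 88
    # padded width 64 + 1000 + 88 = 1152 = 9 * 128, an exact multiple of roi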
59
+
60
+ def wave_to_spectrogram(wave, hop_length, n_fft, mid_side=False, mid_side_b2=False, reverse=False):
61
+ if reverse:
62
+ wave_left = np.flip(np.asfortranarray(wave[0]))
63
+ wave_right = np.flip(np.asfortranarray(wave[1]))
64
+ elif mid_side:
65
+ wave_left = np.asfortranarray(np.add(wave[0], wave[1]) / 2)
66
+ wave_right = np.asfortranarray(np.subtract(wave[0], wave[1]))
67
+ elif mid_side_b2:
68
+ wave_left = np.asfortranarray(np.add(wave[1], wave[0] * .5))
69
+ wave_right = np.asfortranarray(np.subtract(wave[0], wave[1] * .5))
70
+ else:
71
+ wave_left = np.asfortranarray(wave[0])
72
+ wave_right = np.asfortranarray(wave[1])
73
+
74
+ spec_left = librosa.stft(wave_left, n_fft=n_fft, hop_length=hop_length)
75
+ spec_right = librosa.stft(wave_right, n_fft=n_fft, hop_length=hop_length)
76
+
77
+ spec = np.asfortranarray([spec_left, spec_right])
78
+
79
+ return spec
80
+
81
+ def wave_to_spectrogram_mt(wave, hop_length, n_fft, mid_side=False, mid_side_b2=False, reverse=False):
82
+ import threading
83
+
84
+ if reverse:
85
+ wave_left = np.flip(np.asfortranarray(wave[0]))
86
+ wave_right = np.flip(np.asfortranarray(wave[1]))
87
+ elif mid_side:
88
+ wave_left = np.asfortranarray(np.add(wave[0], wave[1]) / 2)
89
+ wave_right = np.asfortranarray(np.subtract(wave[0], wave[1]))
90
+ elif mid_side_b2:
91
+ wave_left = np.asfortranarray(np.add(wave[1], wave[0] * .5))
92
+ wave_right = np.asfortranarray(np.subtract(wave[0], wave[1] * .5))
93
+ else:
94
+ wave_left = np.asfortranarray(wave[0])
95
+ wave_right = np.asfortranarray(wave[1])
96
+
97
+ def run_thread(**kwargs):
98
+ global spec_left
99
+ spec_left = librosa.stft(**kwargs)
100
+
101
+ thread = threading.Thread(target=run_thread, kwargs={'y': wave_left, 'n_fft': n_fft, 'hop_length': hop_length})
102
+ thread.start()
103
+ # print(wave_right.shape, n_fft, hop_length)
104
+ spec_right = librosa.stft(wave_right, n_fft=n_fft, hop_length=hop_length)
105
+ thread.join()
106
+
107
+ spec = np.asfortranarray([spec_left, spec_right])
108
+
109
+ return spec
110
+
111
+ def normalize(wave, is_normalize=False):
112
+ """Save output music files"""
113
+ maxv = np.max(np.abs(wave))
114
+ if maxv > 1.0:
115
+ print(f"\nNormalization Set {is_normalize}: Input above threshold for clipping. Max:{maxv}")
116
+ if is_normalize:
117
+ print(f"The result was normalized.")
118
+ wave /= maxv
119
+ else:
120
+ print(f"The result was not normalized.")
121
+ else:
122
+ print(f"\nNormalization Set {is_normalize}: Input not above threshold for clipping. Max:{maxv}")
123
+ # stereo to mono
124
+ if wave.shape[1] < wave.shape[0]:
125
+ wave = np.mean(wave, axis=1)
126
+ else:
127
+ wave = np.mean(wave, axis=0)
128
+ return wave
129
+
130
+ def normalize_two_stem(wave, mix, is_normalize=False):
131
+ """Save output music files"""
132
+
133
+ maxv = np.abs(wave).max()
134
+ max_mix = np.abs(mix).max()
135
+
136
+ if maxv > 1.0:
137
+ print(f"\nNormalization Set {is_normalize}: Primary source above threshold for clipping. Max:{maxv}")
138
+ print(f"\nNormalization Set {is_normalize}: Mixture above threshold for clipping. Max:{max_mix}")
139
+ if is_normalize:
140
+ print(f"The result was normalized.")
141
+ wave /= maxv
142
+ mix /= maxv
143
+ else:
144
+ print(f"The result was not normalized.")
145
+ else:
146
+ print(f"\nNormalization Set {is_normalize}: Input not above threshold for clipping. Max:{maxv}")
147
+
148
+
149
+ print(f"\nNormalization Set {is_normalize}: Primary source - Max:{np.abs(wave).max()}")
150
+ print(f"\nNormalization Set {is_normalize}: Mixture - Max:{np.abs(mix).max()}")
151
+
152
+ return wave, mix
153
+
154
+ def combine_spectrograms(specs, mp):
155
+ l = min([specs[i].shape[2] for i in specs])
156
+ spec_c = np.zeros(shape=(2, mp.param['bins'] + 1, l), dtype=np.complex64)
157
+ offset = 0
158
+ bands_n = len(mp.param['band'])
159
+
160
+ for d in range(1, bands_n + 1):
161
+ h = mp.param['band'][d]['crop_stop'] - mp.param['band'][d]['crop_start']
162
+ spec_c[:, offset:offset+h, :l] = specs[d][:, mp.param['band'][d]['crop_start']:mp.param['band'][d]['crop_stop'], :l]
163
+ offset += h
164
+
165
+ if offset > mp.param['bins']:
166
+ raise ValueError('Too many bins')
167
+
168
+ # lowpass filter
169
+ if mp.param['pre_filter_start'] > 0: # and mp.param['band'][bands_n]['res_type'] in ['scipy', 'polyphase']:
170
+ if bands_n == 1:
171
+ spec_c = fft_lp_filter(spec_c, mp.param['pre_filter_start'], mp.param['pre_filter_stop'])
172
+ else:
173
+ gp = 1
174
+ for b in range(mp.param['pre_filter_start'] + 1, mp.param['pre_filter_stop']):
175
+ g = math.pow(10, -(b - mp.param['pre_filter_start']) * (3.5 - gp) / 20.0)
176
+ gp = g
177
+ spec_c[:, b, :] *= g
178
+
179
+ return np.asfortranarray(spec_c)
180
+
181
+ def spectrogram_to_image(spec, mode='magnitude'):
182
+ if mode == 'magnitude':
183
+ if np.iscomplexobj(spec):
184
+ y = np.abs(spec)
185
+ else:
186
+ y = spec
187
+ y = np.log10(y ** 2 + 1e-8)
188
+ elif mode == 'phase':
189
+ if np.iscomplexobj(spec):
190
+ y = np.angle(spec)
191
+ else:
192
+ y = spec
193
+
194
+ y -= y.min()
195
+ y *= 255 / y.max()
196
+ img = np.uint8(y)
197
+
198
+ if y.ndim == 3:
199
+ img = img.transpose(1, 2, 0)
200
+ img = np.concatenate([
201
+ np.max(img, axis=2, keepdims=True), img
202
+ ], axis=2)
203
+
204
+ return img
205
+
206
+ def reduce_vocal_aggressively(X, y, softmask):
207
+ v = X - y
208
+ y_mag_tmp = np.abs(y)
209
+ v_mag_tmp = np.abs(v)
210
+
211
+ v_mask = v_mag_tmp > y_mag_tmp
212
+ y_mag = np.clip(y_mag_tmp - v_mag_tmp * v_mask * softmask, 0, np.inf)
213
+
214
+ return y_mag * np.exp(1.j * np.angle(y))
215
+
216
+ def merge_artifacts(y_mask, thres=0.01, min_range=64, fade_size=32):
217
+ mask = y_mask
218
+
219
+ try:
220
+ if min_range < fade_size * 2:
221
+ raise ValueError('min_range must be >= fade_size * 2')
222
+
223
+ idx = np.where(y_mask.min(axis=(0, 1)) > thres)[0]
224
+ start_idx = np.insert(idx[np.where(np.diff(idx) != 1)[0] + 1], 0, idx[0])
225
+ end_idx = np.append(idx[np.where(np.diff(idx) != 1)[0]], idx[-1])
226
+ artifact_idx = np.where(end_idx - start_idx > min_range)[0]
227
+ weight = np.zeros_like(y_mask)
228
+ if len(artifact_idx) > 0:
229
+ start_idx = start_idx[artifact_idx]
230
+ end_idx = end_idx[artifact_idx]
231
+ old_e = None
232
+ for s, e in zip(start_idx, end_idx):
233
+ if old_e is not None and s - old_e < fade_size:
234
+ s = old_e - fade_size * 2
235
+
236
+ if s != 0:
237
+ weight[:, :, s:s + fade_size] = np.linspace(0, 1, fade_size)
238
+ else:
239
+ s -= fade_size
240
+
241
+ if e != y_mask.shape[2]:
242
+ weight[:, :, e - fade_size:e] = np.linspace(1, 0, fade_size)
243
+ else:
244
+ e += fade_size
245
+
246
+ weight[:, :, s + fade_size:e - fade_size] = 1
247
+ old_e = e
248
+
249
+ v_mask = 1 - y_mask
250
+ y_mask += weight * v_mask
251
+
252
+ mask = y_mask
253
+ except Exception as e:
254
+ error_name = f'{type(e).__name__}'
255
+ traceback_text = ''.join(traceback.format_tb(e.__traceback__))
256
+ message = f'{error_name}: "{e}"\n{traceback_text}'
257
+ print('Post Process Failed: ', message)
258
+
259
+
260
+ return mask
261
+
262
+ def align_wave_head_and_tail(a, b):
263
+ l = min([a[0].size, b[0].size])
264
+
265
+ return a[:, :l], b[:, :l]
266
+
267
+ def spectrogram_to_wave(spec, hop_length, mid_side, mid_side_b2, reverse, clamp=False):
268
+ spec_left = np.asfortranarray(spec[0])
269
+ spec_right = np.asfortranarray(spec[1])
270
+
271
+ wave_left = librosa.istft(spec_left, hop_length=hop_length)
272
+ wave_right = librosa.istft(spec_right, hop_length=hop_length)
273
+
274
+ if reverse:
275
+ return np.asfortranarray([np.flip(wave_left), np.flip(wave_right)])
276
+ elif mid_side:
277
+ return np.asfortranarray([np.add(wave_left, wave_right / 2), np.subtract(wave_left, wave_right / 2)])
278
+ elif mid_side_b2:
279
+ return np.asfortranarray([np.add(wave_right / 1.25, .4 * wave_left), np.subtract(wave_left / 1.25, .4 * wave_right)])
280
+ else:
281
+ return np.asfortranarray([wave_left, wave_right])
282
+
283
+ def spectrogram_to_wave_mt(spec, hop_length, mid_side, reverse, mid_side_b2):
284
+ import threading
285
+
286
+ spec_left = np.asfortranarray(spec[0])
287
+ spec_right = np.asfortranarray(spec[1])
288
+
289
+ def run_thread(**kwargs):
290
+ global wave_left
291
+ wave_left = librosa.istft(**kwargs)
292
+
293
+ thread = threading.Thread(target=run_thread, kwargs={'stft_matrix': spec_left, 'hop_length': hop_length})
294
+ thread.start()
295
+ wave_right = librosa.istft(spec_right, hop_length=hop_length)
296
+ thread.join()
297
+
298
+ if reverse:
299
+ return np.asfortranarray([np.flip(wave_left), np.flip(wave_right)])
300
+ elif mid_side:
301
+ return np.asfortranarray([np.add(wave_left, wave_right / 2), np.subtract(wave_left, wave_right / 2)])
302
+ elif mid_side_b2:
303
+ return np.asfortranarray([np.add(wave_right / 1.25, .4 * wave_left), np.subtract(wave_left / 1.25, .4 * wave_right)])
304
+ else:
305
+ return np.asfortranarray([wave_left, wave_right])
306
+
307
+ def cmb_spectrogram_to_wave(spec_m, mp, extra_bins_h=None, extra_bins=None):
308
+ bands_n = len(mp.param['band'])
309
+ offset = 0
310
+ # print('spec_m: ', spec_m.shape, np.max(spec_m), np.min(spec_m))
311
+ for d in range(1, bands_n + 1):
312
+ bp = mp.param['band'][d]
313
+ spec_s = np.ndarray(shape=(2, bp['n_fft'] // 2 + 1, spec_m.shape[2]), dtype=complex)
314
+ h = bp['crop_stop'] - bp['crop_start']
315
+ spec_s[:, bp['crop_start']:bp['crop_stop'], :] = spec_m[:, offset:offset+h, :]
316
+ # print('\nbp', d, bands_n, bp)
317
+ # print('spec_s: ', spec_s.shape, np.max(spec_s), np.min(spec_s))
318
+ offset += h
319
+ if d == bands_n: # higher
320
+ # print('hpf_start: ', extra_bins_h, bp['hpf_start'])
321
+ if extra_bins_h: # if --high_end_process bypass
322
+ max_bin = bp['n_fft'] // 2
323
+ spec_s[:, max_bin-extra_bins_h:max_bin, :] = extra_bins[:, :extra_bins_h, :]
324
+ # print('extra_bins_h, max_bin, extra_bins: ', extra_bins_h, max_bin, extra_bins.shape, np.max(extra_bins), np.min(extra_bins))
325
+ # print('spec_s d=4: ', spec_s.shape, np.max(spec_s), np.min(spec_s))
326
+ if bp['hpf_start'] > 0:
327
+ spec_s = fft_hp_filter(spec_s, bp['hpf_start'], bp['hpf_stop'] - 1)
328
+ # print('spec_s fft: ', spec_s.shape, np.max(spec_s), np.min(spec_s) )
329
+ if bands_n == 1:
330
+ wave = spectrogram_to_wave(spec_s, bp['hl'], mp.param['mid_side'], mp.param['mid_side_b2'], mp.param['reverse'])
331
+ else:
332
+ wave = np.add(wave, spectrogram_to_wave(spec_s, bp['hl'], mp.param['mid_side'], mp.param['mid_side_b2'], mp.param['reverse']))
333
+ else:
334
+ sr = mp.param['band'][d+1]['sr']
335
+ if d == 1: # lower
336
+ spec_s = fft_lp_filter(spec_s, bp['lpf_start'], bp['lpf_stop'] - 1) # test
337
+ spec_s = fft_lp_filter(spec_s, bp['lpf_start'], bp['lpf_stop'])
338
+ wave = librosa.resample(spectrogram_to_wave(spec_s, bp['hl'], mp.param['mid_side'], mp.param['mid_side_b2'], mp.param['reverse']), bp['sr'], sr, res_type=wav_resolution)
339
+ else: # mid
340
+ spec_s = fft_hp_filter(spec_s, bp['hpf_start'], bp['hpf_stop'] - 1)
341
+ spec_s = fft_lp_filter(spec_s, bp['lpf_start'], bp['lpf_stop'])
342
+ wave2 = np.add(wave, spectrogram_to_wave(spec_s, bp['hl'], mp.param['mid_side'], mp.param['mid_side_b2'], mp.param['reverse']))
343
+ wave = librosa.resample(wave2, bp['sr'], sr, res_type=wav_resolution)
344
+ # print('spec to wav shape: ', d, wave.shape, np.max(wave), np.min(wave), spec_s.shape, np.max(spec_s), np.min(spec_s))
345
+ return wave
346
+
347
+ def fft_lp_filter(spec, bin_start, bin_stop):
348
+ g = 1.0
349
+ for b in range(bin_start, bin_stop):
350
+ g -= 1 / (bin_stop - bin_start)
351
+ spec[:, b, :] = g * spec[:, b, :]
352
+
353
+ spec[:, bin_stop:, :] *= 0
354
+
355
+ return spec
356
+
357
+ def fft_hp_filter(spec, bin_start, bin_stop):
358
+ g = 1.0
359
+ for b in range(bin_start, bin_stop, -1):
360
+ g -= 1 / (bin_start - bin_stop)
361
+ spec[:, b, :] = g * spec[:, b, :]
362
+
363
+ spec[:, 0:bin_stop+1, :] *= 0
364
+
365
+ return spec
366
+
367
+ def mirroring(a, spec_m, input_high_end, mp):
368
+ if 'mirroring' == a:
369
+ mirror = np.flip(np.abs(spec_m[:, mp.param['pre_filter_start']-10-input_high_end.shape[1]:mp.param['pre_filter_start']-10, :]), 1)
370
+ mirror = mirror * np.exp(1.j * np.angle(input_high_end))
371
+
372
+ return np.where(np.abs(input_high_end) <= np.abs(mirror), input_high_end, mirror)
373
+
374
+ if 'mirroring2' == a:
375
+ mirror = np.flip(np.abs(spec_m[:, mp.param['pre_filter_start']-10-input_high_end.shape[1]:mp.param['pre_filter_start']-10, :]), 1)
376
+ mi = np.multiply(mirror, input_high_end * 1.7)
377
+
378
+ return np.where(np.abs(input_high_end) <= np.abs(mi), input_high_end, mi)
379
+
380
+ def adjust_aggr(mask, is_non_accom_stem, aggressiveness):
381
+ aggr = aggressiveness['value']
382
+
383
+ if aggr != 0:
384
+ if is_non_accom_stem:
385
+ aggr = 1 - aggr
386
+
387
+ aggr = [aggr, aggr]
388
+
389
+ if aggressiveness['aggr_correction'] is not None:
390
+ aggr[0] += aggressiveness['aggr_correction']['left']
391
+ aggr[1] += aggressiveness['aggr_correction']['right']
392
+
393
+ for ch in range(2):
394
+ mask[ch, :aggressiveness['split_bin']] = np.power(mask[ch, :aggressiveness['split_bin']], 1 + aggr[ch] / 3)
395
+ mask[ch, aggressiveness['split_bin']:] = np.power(mask[ch, aggressiveness['split_bin']:], 1 + aggr[ch])
396
+
397
+ # if is_non_accom_stem:
398
+ # mask = (1.0 - mask)
399
+
400
+ return mask
401
+
402
+ def stft(wave, nfft, hl):
403
+ wave_left = np.asfortranarray(wave[0])
404
+ wave_right = np.asfortranarray(wave[1])
405
+ spec_left = librosa.stft(wave_left, nfft, hop_length=hl)
406
+ spec_right = librosa.stft(wave_right, nfft, hop_length=hl)
407
+ spec = np.asfortranarray([spec_left, spec_right])
408
+
409
+ return spec
410
+
411
+ def istft(spec, hl):
412
+ spec_left = np.asfortranarray(spec[0])
413
+ spec_right = np.asfortranarray(spec[1])
414
+ wave_left = librosa.istft(spec_left, hop_length=hl)
415
+ wave_right = librosa.istft(spec_right, hop_length=hl)
416
+ wave = np.asfortranarray([wave_left, wave_right])
417
+
418
+ return wave
419
+
420
+ def spec_effects(wave, algorithm='Default', value=None):
421
+ spec = [stft(wave[0],2048,1024), stft(wave[1],2048,1024)]
422
+ if algorithm == 'Min_Mag':
423
+ v_spec_m = np.where(np.abs(spec[1]) <= np.abs(spec[0]), spec[1], spec[0])
424
+ wave = istft(v_spec_m,1024)
425
+ elif algorithm == 'Max_Mag':
426
+ v_spec_m = np.where(np.abs(spec[1]) >= np.abs(spec[0]), spec[1], spec[0])
427
+ wave = istft(v_spec_m,1024)
428
+ elif algorithm == 'Default':
429
+ wave = (wave[1] * value) + (wave[0] * (1-value))
430
+ elif algorithm == 'Invert_p':
431
+ X_mag = np.abs(spec[0])
432
+ y_mag = np.abs(spec[1])
433
+ max_mag = np.where(X_mag >= y_mag, X_mag, y_mag)
434
+ v_spec = spec[1] - max_mag * np.exp(1.j * np.angle(spec[0]))
435
+ wave = istft(v_spec,1024)
436
+
437
+ return wave
438
+
439
+ def spectrogram_to_wave_no_mp(spec, n_fft=2048, hop_length=1024):
440
+ wave = librosa.istft(spec, n_fft=n_fft, hop_length=hop_length)
441
+
442
+ if wave.ndim == 1:
443
+ wave = np.asfortranarray([wave,wave])
444
+
445
+ return wave
446
+
447
+ def wave_to_spectrogram_no_mp(wave):
448
+
449
+ spec = librosa.stft(wave, n_fft=2048, hop_length=1024)
450
+
451
+ if spec.ndim == 1:
452
+ spec = np.asfortranarray([spec,spec])
453
+
454
+ return spec
455
+
456
+ def invert_audio(specs, invert_p=True):
457
+
458
+ ln = min([specs[0].shape[2], specs[1].shape[2]])
459
+ specs[0] = specs[0][:,:,:ln]
460
+ specs[1] = specs[1][:,:,:ln]
461
+
462
+ if invert_p:
463
+ X_mag = np.abs(specs[0])
464
+ y_mag = np.abs(specs[1])
465
+ max_mag = np.where(X_mag >= y_mag, X_mag, y_mag)
466
+ v_spec = specs[1] - max_mag * np.exp(1.j * np.angle(specs[0]))
467
+ else:
468
+ specs[1] = reduce_vocal_aggressively(specs[0], specs[1], 0.2)
469
+ v_spec = specs[0] - specs[1]
470
+
471
+ return v_spec
472
+
473
+ def invert_stem(mixture, stem):
474
+
475
+ mixture = wave_to_spectrogram_no_mp(mixture)
476
+ stem = wave_to_spectrogram_no_mp(stem)
477
+ output = spectrogram_to_wave_no_mp(invert_audio([mixture, stem]))
478
+
479
+ return -output.T
480
+
481
+ def ensembling(a, specs):
482
+ for i in range(1, len(specs)):
483
+ if i == 1:
484
+ spec = specs[0]
485
+
486
+ ln = min([spec.shape[2], specs[i].shape[2]])
487
+ spec = spec[:,:,:ln]
488
+ specs[i] = specs[i][:,:,:ln]
489
+
490
+ if MIN_SPEC == a:
491
+ spec = np.where(np.abs(specs[i]) <= np.abs(spec), specs[i], spec)
492
+ if MAX_SPEC == a:
493
+ spec = np.where(np.abs(specs[i]) >= np.abs(spec), specs[i], spec)
494
+ if AVERAGE == a:
495
+ spec = np.where(np.abs(specs[i]) == np.abs(spec), specs[i], spec)
496
+
497
+ return spec
498
+
499
+ def ensemble_inputs(audio_input, algorithm, is_normalization, wav_type_set, save_path):
500
+
501
+ wavs_ = []
502
+
503
+ if algorithm == AVERAGE:
504
+ output = average_audio(audio_input)
505
+ samplerate = 44100
506
+ else:
507
+ specs = []
508
+
509
+ for i in range(len(audio_input)):
510
+ wave, samplerate = librosa.load(audio_input[i], mono=False, sr=44100)
511
+ wavs_.append(wave)
512
+ spec = wave_to_spectrogram_no_mp(wave)
513
+ specs.append(spec)
514
+
515
+ wave_shapes = [w.shape[1] for w in wavs_]
516
+ target_shape = wavs_[wave_shapes.index(max(wave_shapes))]
517
+
518
+ output = spectrogram_to_wave_no_mp(ensembling(algorithm, specs))
519
+ output = to_shape(output, target_shape.shape)
520
+
521
+ sf.write(save_path, normalize(output.T, is_normalization), samplerate, subtype=wav_type_set)
522
+
523
+ def to_shape(x, target_shape):
524
+ padding_list = []
525
+ for x_dim, target_dim in zip(x.shape, target_shape):
526
+ pad_value = (target_dim - x_dim)
527
+ pad_tuple = ((0, pad_value))
528
+ padding_list.append(pad_tuple)
529
+
530
+ return np.pad(x, tuple(padding_list), mode='constant')
531
+
532
+ def to_shape_minimize(x: np.ndarray, target_shape):
533
+
534
+ padding_list = []
535
+ for x_dim, target_dim in zip(x.shape, target_shape):
536
+ pad_value = (target_dim - x_dim)
537
+ pad_tuple = ((0, pad_value))
538
+ padding_list.append(pad_tuple)
539
+
540
+ return np.pad(x, tuple(padding_list), mode='constant')
541
+
542
+ def augment_audio(export_path, audio_file, rate, is_normalization, wav_type_set, save_format=None, is_pitch=False):
543
+
544
+ wav, sr = librosa.load(audio_file, sr=44100, mono=False)
545
+
546
+ if wav.ndim == 1:
547
+ wav = np.asfortranarray([wav,wav])
548
+
549
+ if is_pitch:
550
+ wav_1 = pyrb.pitch_shift(wav[0], sr, rate, rbargs=None)
551
+ wav_2 = pyrb.pitch_shift(wav[1], sr, rate, rbargs=None)
552
+ else:
553
+ wav_1 = pyrb.time_stretch(wav[0], sr, rate, rbargs=None)
554
+ wav_2 = pyrb.time_stretch(wav[1], sr, rate, rbargs=None)
555
+
556
+ if wav_1.shape > wav_2.shape:
557
+ wav_2 = to_shape(wav_2, wav_1.shape)
558
+ if wav_1.shape < wav_2.shape:
559
+ wav_1 = to_shape(wav_1, wav_2.shape)
560
+
561
+ wav_mix = np.asfortranarray([wav_1, wav_2])
562
+
563
+ sf.write(export_path, normalize(wav_mix.T, is_normalization), sr, subtype=wav_type_set)
564
+ save_format(export_path)
565
+
566
+ def average_audio(audio):
567
+
568
+ waves = []
569
+ wave_shapes = []
570
+ final_waves = []
571
+
572
+ for i in range(len(audio)):
573
+ wave = librosa.load(audio[i], sr=44100, mono=False)
574
+ waves.append(wave[0])
575
+ wave_shapes.append(wave[0].shape[1])
576
+
577
+ wave_shapes_index = wave_shapes.index(max(wave_shapes))
578
+ target_shape = waves[wave_shapes_index]
579
+ waves.pop(wave_shapes_index)
580
+ final_waves.append(target_shape)
581
+
582
+ for n_array in waves:
583
+ wav_target = to_shape(n_array, target_shape.shape)
584
+ final_waves.append(wav_target)
585
+
586
+ waves = sum(final_waves)
587
+ waves = waves/len(audio)
588
+
589
+ return waves
590
+
591
+ def average_dual_sources(wav_1, wav_2, value):
592
+
593
+ if wav_1.shape > wav_2.shape:
594
+ wav_2 = to_shape(wav_2, wav_1.shape)
595
+ if wav_1.shape < wav_2.shape:
596
+ wav_1 = to_shape(wav_1, wav_2.shape)
597
+
598
+ wave = (wav_1 * value) + (wav_2 * (1-value))
599
+
600
+ return wave
601
+
602
+ def reshape_sources(wav_1: np.ndarray, wav_2: np.ndarray):
603
+
604
+ if wav_1.shape > wav_2.shape:
605
+ wav_2 = to_shape(wav_2, wav_1.shape)
606
+ if wav_1.shape < wav_2.shape:
607
+ ln = min([wav_1.shape[1], wav_2.shape[1]])
608
+ wav_2 = wav_2[:,:ln]
609
+
610
+ ln = min([wav_1.shape[1], wav_2.shape[1]])
611
+ wav_1 = wav_1[:,:ln]
612
+ wav_2 = wav_2[:,:ln]
613
+
614
+ return wav_2
615
+
616
+ def align_audio(file1, file2, file2_aligned, file_subtracted, wav_type_set, is_normalization, command_Text, progress_bar_main_var, save_format):
617
+ def get_diff(a, b):
618
+ corr = np.correlate(a, b, "full")
619
+ diff = corr.argmax() - (b.shape[0] - 1)
620
+ return diff
621
+
622
+ progress_bar_main_var.set(10)
623
+
624
+ # read tracks
625
+ wav1, sr1 = librosa.load(file1, sr=44100, mono=False)
626
+ wav2, sr2 = librosa.load(file2, sr=44100, mono=False)
627
+ wav1 = wav1.transpose()
628
+ wav2 = wav2.transpose()
629
+
630
+ command_Text(f"Audio file shapes: {wav1.shape} / {wav2.shape}\n")
631
+
632
+ wav2_org = wav2.copy()
633
+ progress_bar_main_var.set(20)
634
+
635
+ command_Text("Processing files... \n")
636
+
637
+ # pick random position and get diff
638
+
639
+ counts = {} # counting up for each diff value
640
+ progress = 20
641
+
642
+ check_range = 64
643
+
644
+ base = (64 / check_range)
645
+
646
+ for i in range(check_range):
647
+ index = int(random.uniform(44100 * 2, min(wav1.shape[0], wav2.shape[0]) - 44100 * 2))
648
+ shift = int(random.uniform(-22050,+22050))
649
+ samp1 = wav1[index :index +44100, 0] # currently use left channel
650
+ samp2 = wav2[index+shift:index+shift+44100, 0]
651
+ progress += 1 * base
652
+ progress_bar_main_var.set(progress)
653
+ diff = get_diff(samp1, samp2)
654
+ diff -= shift
655
+
656
+ if abs(diff) < 22050:
657
+ if not diff in counts:
658
+ counts[diff] = 0
659
+ counts[diff] += 1
660
+
661
+ # use max counted diff value
662
+ max_count = 0
663
+ est_diff = 0
664
+ for diff in counts.keys():
665
+ if counts[diff] > max_count:
666
+ max_count = counts[diff]
667
+ est_diff = diff
668
+
669
+ command_Text(f"Estimated difference is {est_diff} (count: {max_count})\n")
670
+
671
+ progress_bar_main_var.set(90)
672
+
673
+ audio_files = []
674
+
675
+ def save_aligned_audio(wav2_aligned):
676
+ command_Text(f"Aligned File 2 with File 1.\n")
677
+ command_Text(f"Saving files... ")
678
+ sf.write(file2_aligned, normalize(wav2_aligned, is_normalization), sr2, subtype=wav_type_set)
679
+ save_format(file2_aligned)
680
+ min_len = min(wav1.shape[0], wav2_aligned.shape[0])
681
+ wav_sub = wav1[:min_len] - wav2_aligned[:min_len]
682
+ audio_files.append(file2_aligned)
683
+ return min_len, wav_sub
684
+
685
+ # make aligned track 2
686
+ if est_diff > 0:
687
+ wav2_aligned = np.append(np.zeros((est_diff, 2)), wav2_org, axis=0)
688
+ min_len, wav_sub = save_aligned_audio(wav2_aligned)
689
+ elif est_diff < 0:
690
+ wav2_aligned = wav2_org[-est_diff:]
691
+ min_len, wav_sub = save_aligned_audio(wav2_aligned)
692
+ else:
693
+ command_Text(f"Audio files already aligned.\n")
694
+ command_Text(f"Saving inverted track... ")
695
+ min_len = min(wav1.shape[0], wav2.shape[0])
696
+ wav_sub = wav1[:min_len] - wav2[:min_len]
697
+
698
+ wav_sub = np.clip(wav_sub, -1, +1)
699
+
700
+ sf.write(file_subtracted, normalize(wav_sub, is_normalization), sr1, subtype=wav_type_set)
701
+ save_format(file_subtracted)
702
+
703
+ progress_bar_main_var.set(95)
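
The helpers above can be driven directly from a short script. Below is a minimal, hypothetical usage sketch for `spec_effects` (not part of the diff): the file names are placeholders, and both inputs are assumed to be stereo and the same length at 44.1 kHz, since the per-bin magnitude comparison requires matching spectrogram shapes.

```python
import librosa
import soundfile as sf

# Load two stereo stems as (2, n_samples) float arrays (placeholder paths).
inst, _ = librosa.load("instrumental.wav", sr=44100, mono=False)
voc, _ = librosa.load("vocals.wav", sr=44100, mono=False)

# 'Min_Mag' keeps, per time-frequency bin, whichever input has the smaller
# magnitude, a common way to suppress residual bleed between stems.
blend = spec_effects([inst, voc], algorithm="Min_Mag")
sf.write("blend.wav", blend.T, 44100)
```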
uvr5/lib_v5/vr_network/__init__.py ADDED
@@ -0,0 +1 @@
+ # VR init.
uvr5/lib_v5/vr_network/layers.py ADDED
@@ -0,0 +1,143 @@
+ import torch
+ from torch import nn
+ import torch.nn.functional as F
+
+ from lib_v5 import spec_utils
+
+ class Conv2DBNActiv(nn.Module):
+
+     def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU):
+         super(Conv2DBNActiv, self).__init__()
+         self.conv = nn.Sequential(
+             nn.Conv2d(
+                 nin, nout,
+                 kernel_size=ksize,
+                 stride=stride,
+                 padding=pad,
+                 dilation=dilation,
+                 bias=False),
+             nn.BatchNorm2d(nout),
+             activ()
+         )
+
+     def __call__(self, x):
+         return self.conv(x)
+
+ class SeperableConv2DBNActiv(nn.Module):
+
+     def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU):
+         super(SeperableConv2DBNActiv, self).__init__()
+         self.conv = nn.Sequential(
+             nn.Conv2d(
+                 nin, nin,
+                 kernel_size=ksize,
+                 stride=stride,
+                 padding=pad,
+                 dilation=dilation,
+                 groups=nin,
+                 bias=False),
+             nn.Conv2d(
+                 nin, nout,
+                 kernel_size=1,
+                 bias=False),
+             nn.BatchNorm2d(nout),
+             activ()
+         )
+
+     def __call__(self, x):
+         return self.conv(x)
+
+
+ class Encoder(nn.Module):
+
+     def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU):
+         super(Encoder, self).__init__()
+         self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ)
+         self.conv2 = Conv2DBNActiv(nout, nout, ksize, stride, pad, activ=activ)
+
+     def __call__(self, x):
+         skip = self.conv1(x)
+         h = self.conv2(skip)
+
+         return h, skip
+
+
+ class Decoder(nn.Module):
+
+     def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False):
+         super(Decoder, self).__init__()
+         self.conv = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ)
+         self.dropout = nn.Dropout2d(0.1) if dropout else None
+
+     def __call__(self, x, skip=None):
+         x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)
+         if skip is not None:
+             skip = spec_utils.crop_center(skip, x)
+             x = torch.cat([x, skip], dim=1)
+         h = self.conv(x)
+
+         if self.dropout is not None:
+             h = self.dropout(h)
+
+         return h
+
+
+ class ASPPModule(nn.Module):
+
+     def __init__(self, nn_architecture, nin, nout, dilations=(4, 8, 16), activ=nn.ReLU):
+         super(ASPPModule, self).__init__()
+         self.conv1 = nn.Sequential(
+             nn.AdaptiveAvgPool2d((1, None)),
+             Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ)
+         )
+
+         self.nn_architecture = nn_architecture
+         self.six_layer = [129605]
+         self.seven_layer = [537238, 537227, 33966]
+
+         extra_conv = SeperableConv2DBNActiv(
+             nin, nin, 3, 1, dilations[2], dilations[2], activ=activ)
+
+         self.conv2 = Conv2DBNActiv(nin, nin, 1, 1, 0, activ=activ)
+         self.conv3 = SeperableConv2DBNActiv(
+             nin, nin, 3, 1, dilations[0], dilations[0], activ=activ)
+         self.conv4 = SeperableConv2DBNActiv(
+             nin, nin, 3, 1, dilations[1], dilations[1], activ=activ)
+         self.conv5 = SeperableConv2DBNActiv(
+             nin, nin, 3, 1, dilations[2], dilations[2], activ=activ)
+
+         if self.nn_architecture in self.six_layer:
+             self.conv6 = extra_conv
+             nin_x = 6
+         elif self.nn_architecture in self.seven_layer:
+             self.conv6 = extra_conv
+             self.conv7 = extra_conv
+             nin_x = 7
+         else:
+             nin_x = 5
+
+         self.bottleneck = nn.Sequential(
+             Conv2DBNActiv(nin * nin_x, nout, 1, 1, 0, activ=activ),
+             nn.Dropout2d(0.1)
+         )
+
+     def forward(self, x):
+         _, _, h, w = x.size()
+         feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True)
+         feat2 = self.conv2(x)
+         feat3 = self.conv3(x)
+         feat4 = self.conv4(x)
+         feat5 = self.conv5(x)
+
+         if self.nn_architecture in self.six_layer:
+             feat6 = self.conv6(x)
+             out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6), dim=1)
+         elif self.nn_architecture in self.seven_layer:
+             feat6 = self.conv6(x)
+             feat7 = self.conv7(x)
+             out = torch.cat((feat1, feat2, feat3, feat4, feat5, feat6, feat7), dim=1)
+         else:
+             out = torch.cat((feat1, feat2, feat3, feat4, feat5), dim=1)
+
+         bottle = self.bottleneck(out)
+         return bottle
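
A quick shape check for the blocks above (illustrative only; `123821` is an arbitrary `nn_architecture` id chosen so that neither the six- nor seven-branch ASPP variant is selected):

```python
import torch

enc = Encoder(2, 32, ksize=3, stride=2, pad=1)
aspp = ASPPModule(123821, 32, 64)   # falls through to the 5-branch path

x = torch.randn(1, 2, 256, 128)     # (batch, channels, freq bins, frames)
h, skip = enc(x)                    # conv2's stride=2 halves both spatial dims
out = aspp(h)
print(h.shape, skip.shape, out.shape)
# torch.Size([1, 32, 128, 64]) torch.Size([1, 32, 256, 128]) torch.Size([1, 64, 128, 64])
```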
uvr5/lib_v5/vr_network/layers_new.py ADDED
@@ -0,0 +1,126 @@
+ import torch
+ from torch import nn
+ import torch.nn.functional as F
+
+ from lib_v5 import spec_utils
+
+ class Conv2DBNActiv(nn.Module):
+
+     def __init__(self, nin, nout, ksize=3, stride=1, pad=1, dilation=1, activ=nn.ReLU):
+         super(Conv2DBNActiv, self).__init__()
+         self.conv = nn.Sequential(
+             nn.Conv2d(
+                 nin, nout,
+                 kernel_size=ksize,
+                 stride=stride,
+                 padding=pad,
+                 dilation=dilation,
+                 bias=False),
+             nn.BatchNorm2d(nout),
+             activ()
+         )
+
+     def __call__(self, x):
+         return self.conv(x)
+
+ class Encoder(nn.Module):
+
+     def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.LeakyReLU):
+         super(Encoder, self).__init__()
+         self.conv1 = Conv2DBNActiv(nin, nout, ksize, stride, pad, activ=activ)
+         self.conv2 = Conv2DBNActiv(nout, nout, ksize, 1, pad, activ=activ)
+
+     def __call__(self, x):
+         h = self.conv1(x)
+         h = self.conv2(h)
+
+         return h
+
+
+ class Decoder(nn.Module):
+
+     def __init__(self, nin, nout, ksize=3, stride=1, pad=1, activ=nn.ReLU, dropout=False):
+         super(Decoder, self).__init__()
+         self.conv1 = Conv2DBNActiv(nin, nout, ksize, 1, pad, activ=activ)
+         # self.conv2 = Conv2DBNActiv(nout, nout, ksize, 1, pad, activ=activ)
+         self.dropout = nn.Dropout2d(0.1) if dropout else None
+
+     def __call__(self, x, skip=None):
+         x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)
+
+         if skip is not None:
+             skip = spec_utils.crop_center(skip, x)
+             x = torch.cat([x, skip], dim=1)
+
+         h = self.conv1(x)
+         # h = self.conv2(h)
+
+         if self.dropout is not None:
+             h = self.dropout(h)
+
+         return h
+
+
+ class ASPPModule(nn.Module):
+
+     def __init__(self, nin, nout, dilations=(4, 8, 12), activ=nn.ReLU, dropout=False):
+         super(ASPPModule, self).__init__()
+         self.conv1 = nn.Sequential(
+             nn.AdaptiveAvgPool2d((1, None)),
+             Conv2DBNActiv(nin, nout, 1, 1, 0, activ=activ)
+         )
+         self.conv2 = Conv2DBNActiv(nin, nout, 1, 1, 0, activ=activ)
+         self.conv3 = Conv2DBNActiv(
+             nin, nout, 3, 1, dilations[0], dilations[0], activ=activ
+         )
+         self.conv4 = Conv2DBNActiv(
+             nin, nout, 3, 1, dilations[1], dilations[1], activ=activ
+         )
+         self.conv5 = Conv2DBNActiv(
+             nin, nout, 3, 1, dilations[2], dilations[2], activ=activ
+         )
+         self.bottleneck = Conv2DBNActiv(nout * 5, nout, 1, 1, 0, activ=activ)
+         self.dropout = nn.Dropout2d(0.1) if dropout else None
+
+     def forward(self, x):
+         _, _, h, w = x.size()
+         feat1 = F.interpolate(self.conv1(x), size=(h, w), mode='bilinear', align_corners=True)
+         feat2 = self.conv2(x)
+         feat3 = self.conv3(x)
+         feat4 = self.conv4(x)
+         feat5 = self.conv5(x)
+         out = torch.cat((feat1, feat2, feat3, feat4, feat5), dim=1)
+         out = self.bottleneck(out)
+
+         if self.dropout is not None:
+             out = self.dropout(out)
+
+         return out
+
+
+ class LSTMModule(nn.Module):
+
+     def __init__(self, nin_conv, nin_lstm, nout_lstm):
+         super(LSTMModule, self).__init__()
+         self.conv = Conv2DBNActiv(nin_conv, 1, 1, 1, 0)
+         self.lstm = nn.LSTM(
+             input_size=nin_lstm,
+             hidden_size=nout_lstm // 2,
+             bidirectional=True
+         )
+         self.dense = nn.Sequential(
+             nn.Linear(nout_lstm, nin_lstm),
+             nn.BatchNorm1d(nin_lstm),
+             nn.ReLU()
+         )
+
+     def forward(self, x):
+         N, _, nbins, nframes = x.size()
+         h = self.conv(x)[:, 0]  # N, nbins, nframes
+         h = h.permute(2, 0, 1)  # nframes, N, nbins
+         h, _ = self.lstm(h)
+         h = self.dense(h.reshape(-1, h.size()[-1]))  # nframes * N, nbins
+         h = h.reshape(nframes, N, 1, nbins)
+         h = h.permute(1, 2, 3, 0)
+
+         return h
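
`LSTMModule` collapses the channel axis to a single feature map, runs a bidirectional LSTM along the frame axis, and projects each frame back to the bin count, so the output always has one channel. A shape sketch with arbitrary sizes:

```python
import torch

m = LSTMModule(nin_conv=16, nin_lstm=128, nout_lstm=256)
x = torch.randn(4, 16, 128, 32)   # (N, channels, nbins, nframes)
y = m(x)
print(y.shape)                    # torch.Size([4, 1, 128, 32])
```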
uvr5/lib_v5/vr_network/model_param_init.py ADDED
@@ -0,0 +1,59 @@
+ import json
+ import pathlib
+
+ default_param = {}
+ default_param['bins'] = 768
+ default_param['unstable_bins'] = 9  # training only
+ default_param['reduction_bins'] = 762  # training only
+ default_param['sr'] = 44100
+ default_param['pre_filter_start'] = 757
+ default_param['pre_filter_stop'] = 768
+ default_param['band'] = {}
+
+
+ default_param['band'][1] = {
+     'sr': 11025,
+     'hl': 128,
+     'n_fft': 960,
+     'crop_start': 0,
+     'crop_stop': 245,
+     'lpf_start': 61,  # inference only
+     'res_type': 'polyphase'
+ }
+
+ default_param['band'][2] = {
+     'sr': 44100,
+     'hl': 512,
+     'n_fft': 1536,
+     'crop_start': 24,
+     'crop_stop': 547,
+     'hpf_start': 81,  # inference only
+     'res_type': 'sinc_best'
+ }
+
+
+ def int_keys(d):
+     r = {}
+     for k, v in d:
+         if k.isdigit():
+             k = int(k)
+         r[k] = v
+     return r
+
+
+ class ModelParameters(object):
+     def __init__(self, config_path=''):
+         if '.pth' == pathlib.Path(config_path).suffix:
+             import zipfile
+
+             with zipfile.ZipFile(config_path, 'r') as zip:
+                 self.param = json.loads(zip.read('param.json'), object_pairs_hook=int_keys)
+         elif '.json' == pathlib.Path(config_path).suffix:
+             with open(config_path, 'r') as f:
+                 self.param = json.loads(f.read(), object_pairs_hook=int_keys)
+         else:
+             self.param = default_param
+
+         for k in ['mid_side', 'mid_side_b', 'mid_side_b2', 'stereo_w', 'stereo_n', 'reverse']:
+             if not k in self.param:
+                 self.param[k] = False
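
Loading one of the bundled parameter files looks like the sketch below (the path is assumed relative to the repository root; adjust to taste). `int_keys` turns the JSON's string band keys into integers, and any missing stereo/mid-side flags are back-filled with `False`:

```python
mp = ModelParameters('uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl512.json')

for d, bp in mp.param['band'].items():
    print(d, bp['sr'], bp['n_fft'], bp['hl'])     # 1 44100 2048 512
print(mp.param['mid_side'], mp.param['reverse'])  # False False
```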
uvr5/lib_v5/vr_network/modelparams/1band_sr16000_hl512.json ADDED
@@ -0,0 +1,19 @@
+ {
+     "bins": 1024,
+     "unstable_bins": 0,
+     "reduction_bins": 0,
+     "band": {
+         "1": {
+             "sr": 16000,
+             "hl": 512,
+             "n_fft": 2048,
+             "crop_start": 0,
+             "crop_stop": 1024,
+             "hpf_start": -1,
+             "res_type": "sinc_best"
+         }
+     },
+     "sr": 16000,
+     "pre_filter_start": 1023,
+     "pre_filter_stop": 1024
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr32000_hl512.json ADDED
@@ -0,0 +1,19 @@
+ {
+     "bins": 1024,
+     "unstable_bins": 0,
+     "reduction_bins": 0,
+     "band": {
+         "1": {
+             "sr": 32000,
+             "hl": 512,
+             "n_fft": 2048,
+             "crop_start": 0,
+             "crop_stop": 1024,
+             "hpf_start": -1,
+             "res_type": "kaiser_fast"
+         }
+     },
+     "sr": 32000,
+     "pre_filter_start": 1000,
+     "pre_filter_stop": 1021
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr33075_hl384.json ADDED
@@ -0,0 +1,19 @@
+ {
+     "bins": 1024,
+     "unstable_bins": 0,
+     "reduction_bins": 0,
+     "band": {
+         "1": {
+             "sr": 33075,
+             "hl": 384,
+             "n_fft": 2048,
+             "crop_start": 0,
+             "crop_stop": 1024,
+             "hpf_start": -1,
+             "res_type": "sinc_best"
+         }
+     },
+     "sr": 33075,
+     "pre_filter_start": 1000,
+     "pre_filter_stop": 1021
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl1024.json ADDED
@@ -0,0 +1,19 @@
+ {
+     "bins": 1024,
+     "unstable_bins": 0,
+     "reduction_bins": 0,
+     "band": {
+         "1": {
+             "sr": 44100,
+             "hl": 1024,
+             "n_fft": 2048,
+             "crop_start": 0,
+             "crop_stop": 1024,
+             "hpf_start": -1,
+             "res_type": "sinc_best"
+         }
+     },
+     "sr": 44100,
+     "pre_filter_start": 1023,
+     "pre_filter_stop": 1024
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl256.json ADDED
@@ -0,0 +1,19 @@
+ {
+     "bins": 256,
+     "unstable_bins": 0,
+     "reduction_bins": 0,
+     "band": {
+         "1": {
+             "sr": 44100,
+             "hl": 256,
+             "n_fft": 512,
+             "crop_start": 0,
+             "crop_stop": 256,
+             "hpf_start": -1,
+             "res_type": "sinc_best"
+         }
+     },
+     "sr": 44100,
+     "pre_filter_start": 256,
+     "pre_filter_stop": 256
+ }
uvr5/lib_v5/vr_network/modelparams/1band_sr44100_hl512.json ADDED
@@ -0,0 +1,19 @@
+ {
+     "bins": 1024,
+     "unstable_bins": 0,
+     "reduction_bins": 0,
+     "band": {
+         "1": {
+             "sr": 44100,
+             "hl": 512,
+             "n_fft": 2048,
+             "crop_start": 0,
+             "crop_stop": 1024,
+             "hpf_start": -1,
+             "res_type": "sinc_best"
+         }
+     },
+     "sr": 44100,
+     "pre_filter_start": 1023,
+     "pre_filter_stop": 1024
+ }
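
All of these single-band configs follow the same pattern: `crop_stop` equals `n_fft // 2` (the number of STFT bins kept), so the frequency resolution and hop duration follow directly from `sr`, `n_fft`, and `hl`. A quick sanity check over three of the files:

```python
for name, sr, n_fft, hl in [
    ("1band_sr16000_hl512", 16000, 2048, 512),
    ("1band_sr44100_hl512", 44100, 2048, 512),
    ("1band_sr44100_hl256", 44100, 512, 256),
]:
    print(f"{name}: {sr / n_fft:.1f} Hz per bin, hop = {1000 * hl / sr:.2f} ms")
```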