Upload folder using huggingface_hub
- README.md +63 -5
- config.json +52 -43
- model.ckpt +2 -2
README.md
CHANGED
|
@@ -7,11 +7,12 @@ tags:
|
|
| 7 |
- vits
|
| 8 |
- japanese
|
| 9 |
- piper
|
|
|
|
| 10 |
---
|
| 11 |
|
| 12 |
-
# Piper Plus Base Model (Japanese)
|
| 13 |
|
| 14 |
-
A pretrained base model for Japanese TTS.
|
| 15 |
|
| 16 |
## Model Details
|
| 17 |
|
|
@@ -22,7 +23,30 @@ tags:
|
|
| 22 |
| Sample rate | 22050 Hz |
|
| 23 |
| Quality | medium |
|
| 24 |
| Phoneme type | OpenJTalk |
|
|
|
|
| 25 |
| **prosody_dim** | **16** |
|
| 26 |
|
| 27 |
## Usage
|
| 28 |
|
|
@@ -41,8 +65,6 @@ uv run python -m piper_train.preprocess \
|
|
| 41 |
|
| 42 |
### Step 2: Add Prosody Features (Recommended)
|
| 43 |
|
| 44 |
-
Add prosody_features to an existing dataset:
|
| 45 |
-
|
| 46 |
```bash
|
| 47 |
uv run python add_prosody_features.py \
|
| 48 |
--input-dataset /path/to/dataset/dataset.jsonl \
|
|
@@ -67,6 +89,28 @@ uv run python -m piper_train \
|
|
| 67 |
--default_root_dir /path/to/output
|
| 68 |
```
|
| 69 |
|
| 70 |
## Recommended Parameters
|
| 71 |
|
| 72 |
| Parameter | Value | Description |
|
|
@@ -75,12 +119,26 @@ uv run python -m piper_train \
|
|
| 75 |
| `--disable_auto_lr_scaling` | - | Disables automatic learning-rate scaling |
|
| 76 |
| `--max_epochs` | 50-100 | Use fewer epochs for small datasets |
|
| 77 |
| `--batch-size` | 32 | Adjust to the available GPU memory |
|
|
| 78 |
|
| 79 |
## Citation
|
| 80 |
|
| 81 |
```bibtex
|
| 82 |
@software{piper_plus,
|
| 83 |
-
title = {Piper Plus: Japanese TTS with VITS and Prosody Features},
|
| 84 |
author = {ayousanz},
|
| 85 |
year = {2024},
|
| 86 |
url = {https://github.com/ayutaz/piper-plus}
|
|
|
|
| 7 |
- vits
|
| 8 |
- japanese
|
| 9 |
- piper
|
| 10 |
+
- wavlm
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# Piper Plus Base Model (Japanese) with WavLM Discriminator & Prosody Features
|
| 14 |
|
| 15 |
+
A pretrained base model for Japanese TTS. It was trained with a WavLM Discriminator for higher quality and supports prosody_features (A1/A2/A3).
|
| 16 |
|
| 17 |
## Model Details
|
| 18 |
|
|
|
|
| 23 |
| Sample rate | 22050 Hz |
|
| 24 |
| Quality | medium |
|
| 25 |
| Phoneme type | OpenJTalk |
|
| 26 |
+
| Number of speakers | 0 (for single-speaker fine-tuning) |
|
| 27 |
| **prosody_dim** | **16** |
|
| 28 |
+
| **WavLM Discriminator** | **Supported** |
|
| 29 |
+
| Number of phonemes | 65 (including the extended tokens from Issues #204 and #207) |
|
| 30 |
+
|
| 31 |
+
## Features
|
| 32 |
+
|
| 33 |
+
### WavLM Discriminator
|
| 34 |
+
This model was trained with a perceptual-quality discriminator based on Microsoft WavLM.
|
| 35 |
+
- MOS improvement: +0.15-0.25
|
| 36 |
+
- Impact on inference speed: none (used only during training)
|
| 37 |
+
|
| 38 |
+
### Prosody Features (A1/A2/A3)
|
| 39 |
+
Supports prosody features extracted from OpenJTalk:
|
| 40 |
+
|
| 41 |
+
| Field | Meaning | Example values |
|
| 42 |
+
|-----------|------|--------|
|
| 43 |
+
| A1 | Position relative to the accent nucleus | -4, -3, ..., 0, 1, ... |
|
| 44 |
+
| A2 | Mora position within the accent phrase | 1, 2, 3, ... |
|
| 45 |
+
| A3 | Total number of moras in the accent phrase | 1-10+ |
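
As a rough illustration of the table above, the following sketch derives the three values for every mora of a single accent phrase. The function name `prosody_for_phrase` and the convention that the accent nucleus is passed as a 1-based mora index are assumptions made for this example; they are not taken from the piper-plus code or from OpenJTalk itself.

```python
from typing import List, Tuple


def prosody_for_phrase(num_moras: int, accent_nucleus: int) -> List[Tuple[int, int, int]]:
    """Return (A1, A2, A3) for each mora of one accent phrase.

    Assumptions (hypothetical, for illustration only):
    - accent_nucleus is the 1-based index of the mora carrying the accent nucleus
    - A1 = mora position minus nucleus position (0 at the nucleus)
    - A2 = 1-based mora position inside the phrase
    - A3 = total number of moras in the phrase
    """
    if num_moras < 1:
        raise ValueError("an accent phrase needs at least one mora")
    features = []
    for position in range(1, num_moras + 1):
        a1 = position - accent_nucleus  # negative before the nucleus, positive after
        a2 = position
        a3 = num_moras
        features.append((a1, a2, a3))
    return features


if __name__ == "__main__":
    # A five-mora phrase whose nucleus sits on the third mora.
    for a1, a2, a3 in prosody_for_phrase(num_moras=5, accent_nucleus=3):
        print(a1, a2, a3)
```

Under these assumptions A3 is constant across a phrase, while A1 and A2 change from mora to mora.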
|
| 46 |
+
|
| 47 |
+
### Extended Phonemes
|
| 48 |
+
- Question markers (Issue #204): `?!`, `?.`, `?~`
|
| 49 |
+
- Context-dependent variants of 「ん」 (the moraic nasal) (Issue #207): `N_m`, `N_n`, `N_ng`, `N_uvular`
|
| 50 |
|
| 51 |
## Usage
|
| 52 |
|
|
|
|
| 65 |
|
| 66 |
### Step 2: Add Prosody Features (Recommended)
|
| 67 |
|
|
|
|
|
|
|
| 68 |
```bash
|
| 69 |
uv run python add_prosody_features.py \
|
| 70 |
--input-dataset /path/to/dataset/dataset.jsonl \
|
|
|
|
| 89 |
--default_root_dir /path/to/output
|
| 90 |
```
|
| 91 |
|
| 92 |
+
### Step 4: ONNX Export
|
| 93 |
+
|
| 94 |
+
The `--stochastic` flag is recommended for WavLM models:
|
| 95 |
+
|
| 96 |
+
```bash
|
| 97 |
+
CUDA_VISIBLE_DEVICES="" uv run python -m piper_train.export_onnx \
|
| 98 |
+
--stochastic \
|
| 99 |
+
/path/to/checkpoint.ckpt \
|
| 100 |
+
/path/to/output.onnx
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
### Step 5: Inference
|
| 104 |
+
|
| 105 |
+
```bash
|
| 106 |
+
CUDA_VISIBLE_DEVICES="" uv run python -m piper_train.infer_onnx \
|
| 107 |
+
--model /path/to/output.onnx \
|
| 108 |
+
--config /path/to/config.json \
|
| 109 |
+
--output-dir /path/to/output \
|
| 110 |
+
--text "こんにちは、今日は良い天気ですね。" \
|
| 111 |
+
--speaker-id 0 --noise-scale 0.5
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
## Recommended Parameters
|
| 115 |
|
| 116 |
| Parameter | Value | Description |
|
|
|
|
| 119 |
| `--disable_auto_lr_scaling` | - | Disables automatic learning-rate scaling |
|
| 120 |
| `--max_epochs` | 50-100 | Use fewer epochs for small datasets |
|
| 121 |
| `--batch-size` | 32 | Adjust to the available GPU memory |
|
| 122 |
+
| `--noise-scale` | 0.5 | Recommended value at inference time (WavLM models) |
|
| 123 |
+
|
| 124 |
+
## Origin
|
| 125 |
+
|
| 126 |
+
This base model was converted from a 20-speaker WavLM model (trained for 150 epochs):
|
| 127 |
+
- Source dataset: moe-speech-20speakers-v2 (60,164 utterances)
|
| 128 |
+
- Training configuration: WavLM Discriminator enabled, prosody_dim=16
|
| 129 |
+
- Speaker embedding layer removed
|
| 130 |
+
- prosody_dim=16 retained
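
The conversion listed above could look roughly like the sketch below, which works on a plain dictionary standing in for the checkpoint's state dict. The key prefix `model_g.emb_g.` and the function name `strip_speaker_embedding` are hypothetical; the actual key names in the piper-plus checkpoint and the script used for this commit are not shown here. Only the resulting `num_speakers` value of 0 is taken from `config.json`.

```python
from typing import Any, Dict, Tuple


def strip_speaker_embedding(
    state_dict: Dict[str, Any],
    config: Dict[str, Any],
    embedding_prefix: str = "model_g.emb_g.",  # hypothetical key prefix
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Drop speaker-embedding entries and mark the config as speaker-less.

    Every other entry, including any prosody-related weights, is kept unchanged.
    The inputs are not modified; new dictionaries are returned.
    """
    kept = {
        name: value
        for name, value in state_dict.items()
        if not name.startswith(embedding_prefix)
    }
    new_config = dict(config)
    new_config["num_speakers"] = 0
    return kept, new_config
```

Returning copies rather than editing in place keeps the original multi-speaker checkpoint data available for comparison.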
|
| 131 |
+
|
| 132 |
+
## Files
|
| 133 |
+
|
| 134 |
+
- `model.ckpt` - PyTorch Lightning checkpoint (includes the WavLM Discriminator weights)
|
| 135 |
+
- `config.json` - Model configuration (65-phoneme map, prosody settings, etc.)
|
| 136 |
|
| 137 |
## Citation
|
| 138 |
|
| 139 |
```bibtex
|
| 140 |
@software{piper_plus,
|
| 141 |
+
title = {Piper Plus: Japanese TTS with VITS, WavLM Discriminator and Prosody Features},
|
| 142 |
author = {ayousanz},
|
| 143 |
year = {2024},
|
| 144 |
url = {https://github.com/ayutaz/piper-plus}
|
config.json
CHANGED
|
@@ -1,45 +1,54 @@
|
|
| 1 |
{
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
"
|
| 42 |
-
"
|
| 43 |
-
"
|
| 44 |
-
|
| 45 |
}
|
|
|
|
| 1 |
{
|
| 2 |
+
"dataset": "moe-speech-20speakers-wavlm",
|
| 3 |
+
"audio": {
|
| 4 |
+
"sample_rate": 22050,
|
| 5 |
+
"quality": "medium"
|
| 6 |
+
},
|
| 7 |
+
"espeak": {
|
| 8 |
+
"voice": "ja"
|
| 9 |
+
},
|
| 10 |
+
"language": {
|
| 11 |
+
"code": "ja"
|
| 12 |
+
},
|
| 13 |
+
"inference": {
|
| 14 |
+
"noise_scale": 0.667,
|
| 15 |
+
"length_scale": 1,
|
| 16 |
+
"noise_w": 0.8
|
| 17 |
+
},
|
| 18 |
+
"phoneme_type": "openjtalk",
|
| 19 |
+
"phoneme_map": {},
|
| 20 |
+
"phoneme_id_map": {
|
| 21 |
+
"_": [0], "^": [1], "$": [2], "?": [3],
|
| 22 |
+
"\ue016": [4], "\ue017": [5], "\ue018": [6],
|
| 23 |
+
"#": [7], "[": [8], "]": [9],
|
| 24 |
+
"a": [10], "i": [11], "u": [12], "e": [13], "o": [14],
|
| 25 |
+
"A": [15], "I": [16], "U": [17], "E": [18], "O": [19],
|
| 26 |
+
"\u00e7": [20], "\u0255": [21], "\u026f": [22], "\u0274": [23], "\u027e": [24],
|
| 27 |
+
"N": [25],
|
| 28 |
+
"\ue019": [26], "\ue01a": [27], "\ue01b": [28], "\ue01c": [29],
|
| 29 |
+
"\u0291": [30], "q": [31], "k": [32],
|
| 30 |
+
"k\u02b2": [33], "\u0261\u02b2": [34], "g": [35], "\u0261": [36], "d\u0291": [37],
|
| 31 |
+
"t": [38], "t\u0255": [39], "d": [40], "d\u02b2": [41],
|
| 32 |
+
"p": [42], "p\u02b2": [43], "b": [44], "b\u02b2": [45],
|
| 33 |
+
"c\u00e7": [46], "\u00e7\u02d0": [47], "s": [48], "\u0283": [49],
|
| 34 |
+
"z": [50], "j": [51], "\u0272": [52],
|
| 35 |
+
"f": [53], "h": [54], "h\u02b2": [55], "v": [56],
|
| 36 |
+
"n": [57], "n\u02b2": [58], "m": [59], "m\u02b2": [60],
|
| 37 |
+
"r": [61], "\u027d": [62], "w": [63], "y": [64]
|
| 38 |
+
},
|
| 39 |
+
"num_symbols": 65,
|
| 40 |
+
"num_speakers": 0,
|
| 41 |
+
"piper_version": "1.5.4",
|
| 42 |
+
"prosody_dim": 16,
|
| 43 |
+
"prosody_features": {
|
| 44 |
+
"a1": "アクセント核からの相対位置",
|
| 45 |
+
"a2": "アクセント句内のモーラ位置",
|
| 46 |
+
"a3": "アクセント句内の総モーラ数"
|
| 47 |
+
},
|
| 48 |
+
"prosody_num_symbols": 11,
|
| 49 |
+
"prosody_id_map": {
|
| 50 |
+
"0": [0], "1": [1], "2": [2], "3": [3], "4": [4],
|
| 51 |
+
"5": [5], "6": [6], "7": [7], "8": [8], "9": [9], "10": [10]
|
| 52 |
+
},
|
| 53 |
+
"use_wavlm_discriminator": true
|
| 54 |
}
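
A minimal sketch of how a consumer might sanity-check this configuration and use `phoneme_id_map`: it verifies that `num_symbols` and `prosody_num_symbols` agree with the size of their maps, then flattens symbols into IDs. The helper names `load_config` and `symbols_to_ids` are made up for this example, and how piper-plus actually assembles the final ID sequence (padding, start and end tokens) is not shown in this diff.

```python
import json
from typing import Dict, List


def load_config(config_text: str) -> dict:
    """Parse config.json text and check that declared symbol counts match the maps."""
    config = json.loads(config_text)
    for count_key, map_key in (
        ("num_symbols", "phoneme_id_map"),
        ("prosody_num_symbols", "prosody_id_map"),
    ):
        declared = config[count_key]
        actual = len(config[map_key])
        if declared != actual:
            raise ValueError(f"{count_key}={declared} but {map_key} has {actual} entries")
    return config


def symbols_to_ids(symbols: List[str], id_map: Dict[str, List[int]]) -> List[int]:
    """Look up each symbol and flatten the per-symbol ID lists into one sequence."""
    ids: List[int] = []
    for symbol in symbols:
        if symbol not in id_map:
            raise KeyError(f"symbol {symbol!r} is not in the ID map")
        ids.extend(id_map[symbol])
    return ids
```

With the map in this commit, `symbols_to_ids(["^", "a", "$"], config["phoneme_id_map"])` yields `[1, 10, 2]`.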
|
model.ckpt
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a68ef464af9a7303fbd85f74cc3ccb421e9640931099b16517d5657b779b15bf
|
| 3 |
+
size 669746377
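
The pointer above records the size and SHA-256 digest of the real checkpoint, so a downloaded `model.ckpt` can be checked against it. The sketch below is one way to do that with the standard library only; the function names `parse_lfs_pointer` and `matches_pointer` are invented for this example and are not part of Git LFS or huggingface_hub.

```python
import hashlib
from pathlib import Path
from typing import Dict


def parse_lfs_pointer(pointer_text: str) -> Dict[str, str]:
    """Split a Git LFS pointer file into its 'key value' fields."""
    fields: Dict[str, str] = {}
    for line in pointer_text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer_text: str) -> bool:
    """Return True if the file at `path` has the size and digest named in the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    expected_size = int(fields["size"])
    algorithm, _, expected_digest = fields["oid"].partition(":")
    # Cheap check first: a size mismatch makes hashing unnecessary.
    if path.stat().st_size != expected_size:
        return False
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so a large checkpoint is never held in memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_digest
```

For this commit the expected values are the `oid` and `size` lines shown in the new pointer.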
|