Commit 89a8916
Parent: aee38e5

Add TTS Tokenizer, Technical Report, and Basic Tests
- Implemented `TTSTokenizer` for multi-lingual TTS models, including character and punctuation handling.
- Created a comprehensive technical report detailing the multi-lingual TTS system, architecture, API specifications, and healthcare use case.
- Added a basic test script to verify the functionality of the TTS system, including module imports, configuration checks, tokenizer functionality, text normalization, model downloading, and engine initialization.
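A character-level tokenizer of the kind the commit describes can be sketched as follows. This is an illustrative sketch only: the class name, constructor arguments, and unknown-character policy are assumptions, not the actual `TTSTokenizer` added in `src/tokenizer.py`.

```python
# Hypothetical sketch of a character-level TTS tokenizer with punctuation
# handling; names and behavior are assumed, not taken from this commit.
class CharTokenizer:
    def __init__(self, characters: str, punctuations: str, pad: str = "<PAD>"):
        # id 0 is reserved for padding, as is common in VITS-style vocabularies
        self.vocab = [pad] + sorted(set(characters) | set(punctuations))
        self.char_to_id = {c: i for i, c in enumerate(self.vocab)}

    def encode(self, text: str) -> list[int]:
        # Unknown characters are silently dropped rather than raising,
        # so rare symbols in user text do not crash synthesis
        return [self.char_to_id[c] for c in text if c in self.char_to_id]

    def decode(self, ids: list[int]) -> str:
        # Skip the pad id when reconstructing text
        return "".join(self.vocab[i] for i in ids if i != 0)
```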
Note: this view is limited to the first 50 files because the commit contains too many changes; see the raw diff for the full change set.
- .DS_Store +0 -0
- .gitignore +4 -0
- Procfile +1 -0
- README.md +246 -0
- download_models.py +55 -0
- models/.DS_Store +0 -0
- models/bho_female/.gitattributes +35 -0
- models/bho_female/README.md +3 -0
- models/bho_female/checkpoint_340000.pth +3 -0
- models/bho_female/config.json +257 -0
- models/bho_male/.gitattributes +35 -0
- models/bho_male/README.md +3 -0
- models/bho_male/checkpoint_200000.pth +3 -0
- models/bho_male/config.json +257 -0
- models/bn_female/bn_female_vits_30hrs.pt +3 -0
- models/bn_female/chars.txt +1 -0
- models/bn_female/jit_infer.py +32 -0
- models/bn_male/bn_male_vits_30hrs.pt +3 -0
- models/bn_male/chars.txt +1 -0
- models/bn_male/extra.py +787 -0
- models/bn_male/jit_infer.py +32 -0
- models/en_female/.gitattributes +35 -0
- models/en_female/README.md +3 -0
- models/en_female/chars.txt +1 -0
- models/en_female/en_female_vits_30hrs.pt +3 -0
- models/en_female/extra.py +787 -0
- models/en_female/jit_infer.py +33 -0
- models/en_male/.gitattributes +35 -0
- models/en_male/README.md +3 -0
- models/en_male/chars.txt +1 -0
- models/en_male/en_male_vits_30hrs.pt +3 -0
- models/en_male/extra.py +787 -0
- models/en_male/jit_infer.py +32 -0
- models/gu_mms/config.json +82 -0
- models/gu_mms/special_tokens_map.json +4 -0
- models/gu_mms/tokenizer_config.json +12 -0
- models/gu_mms/vocab.json +62 -0
- models/hi_female/__pycache__/extra.cpython-310.pyc +0 -0
- models/hi_female/chars.txt +1 -0
- models/hi_female/extra.py +787 -0
- models/hi_female/hi_female_vits_30hrs.pt +3 -0
- models/hi_female/jit_infer.py +32 -0
- models/hi_male/chars.txt +1 -0
- models/hi_male/extra.py +787 -0
- models/hi_male/hi_male_vits_30hrs.pt +3 -0
- models/hi_male/jit_infer.py +32 -0
- models/hne_female/.gitattributes +35 -0
- models/hne_female/README.md +3 -0
- models/hne_female/ch_female_vits_30hrs.pt +3 -0
- models/hne_female/chars.txt +1 -0
.DS_Store
ADDED
Binary file (8.2 kB)
.gitignore
ADDED
@@ -0,0 +1,4 @@
# Ignore generated audio + specs
outputs/
*.wav
Voicetech API Specification.pdf
Procfile
ADDED
@@ -0,0 +1 @@
web: python download_models.py && uvicorn src.api:app --host 0.0.0.0 --port ${PORT:-8000}
README.md
ADDED
@@ -0,0 +1,246 @@

# Voice Tech for All - Multi-lingual TTS System

A lightweight, multi-lingual Text-to-Speech system supporting **11 Indian languages** with **style/prosody control** and a REST API.

## 🎯 Hackathon: Voice Tech for All

Built for the healthcare assistant use case - helping pregnant mothers in low-income communities access healthcare information in their native languages.

## ✨ Features

- **11 Indian Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English, **Gujarati**
- **21 Voice Options**: Male & Female voices for each language
- **Style/Prosody Control**: 9 presets (happy, sad, calm, excited, etc.)
- **Pitch & Speed Control**: Fine-tune voice characteristics
- **Lightweight**: VITS-based models optimized for fast inference
- **REST API**: FastAPI-powered server with OpenAPI docs
- **Text Normalization**: Handles numbers and punctuation for Indian scripts

## 🚀 Quick Start

### 1. Installation

```bash
# Clone and navigate
git clone https://github.com/harshil748/VoiceAPI
cd VoiceAPI

# Create virtual environment
python3 -m venv tts
source tts/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Download Models

```bash
# Download Hindi models (male + female)
python -m src.cli download --lang hi

# Or download a specific voice
python -m src.cli download --voice hi_male

# Gujarati uses Facebook MMS (auto-downloads on first use)
```

### 3. Synthesize Speech

```bash
# Basic synthesis
python -m src.cli synthesize --text "नमस्ते दोस्तों" --voice hi_male --output hello.wav

# Play the audio (macOS)
afplay hello.wav
```

### 4. Start API Server

```bash
python -m src.cli serve --port 8000
```

Visit `http://localhost:8000/docs` for interactive API documentation.

## 🎨 Style Presets

| Preset    | Speed | Pitch | Energy | Best For                |
| --------- | ----- | ----- | ------ | ----------------------- |
| `default` | 1.0   | 1.0   | 1.0    | Normal speech           |
| `slow`    | 0.75  | 1.0   | 1.0    | Elderly users, clarity  |
| `fast`    | 1.25  | 1.0   | 1.0    | Quick information       |
| `soft`    | 0.9   | 0.95  | 0.7    | Calming content         |
| `loud`    | 1.0   | 1.05  | 1.3    | Alerts, emphasis        |
| `happy`   | 1.1   | 1.1   | 1.2    | Positive messages       |
| `sad`     | 0.85  | 0.9   | 0.8    | Empathetic responses    |
| `calm`    | 0.9   | 0.95  | 0.85   | **Healthcare guidance** |
| `excited` | 1.2   | 1.15  | 1.3    | Celebrations            |
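A preset row from the table above could translate into synthesis parameters roughly as sketched below. The names (`STYLE_PRESETS`, `apply_style`) and the mapping of speed onto the VITS `length_scale` are illustrative assumptions, not the actual code in `src/engine.py`; only three presets are shown.

```python
# Hypothetical mapping from style presets to synthesis parameters.
# Values for "slow" and "calm" are copied from the preset table above.
STYLE_PRESETS = {
    "default": {"speed": 1.0, "pitch": 1.0, "energy": 1.0},
    "slow":    {"speed": 0.75, "pitch": 1.0, "energy": 1.0},
    "calm":    {"speed": 0.9, "pitch": 0.95, "energy": 0.85},
}

def apply_style(style, speed=None, pitch=None, energy=None):
    """Resolve a preset, letting explicit per-request values override it."""
    params = dict(STYLE_PRESETS.get(style, STYLE_PRESETS["default"]))
    for key, value in (("speed", speed), ("pitch", pitch), ("energy", energy)):
        if value is not None:
            params[key] = value
    # In a VITS model, slower speech usually means a larger length_scale,
    # so a natural (assumed) conversion is the reciprocal of speed.
    params["length_scale"] = 1.0 / params["speed"]
    return params
```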

## 📡 API Usage

### 🏆 Hackathon API - GET /Get_Inference

**This is the official hackathon endpoint** that follows the Voice Tech for All specification:

```python
import requests

base_url = 'http://localhost:8000/Get_Inference'
WavPath = 'path/to/reference.wav'

params = {
    'text': 'ಮಾದರಿಯು ಸರಿಯಾಗಿ ಕಾರ್ಯನಿರ್ವಹಿಸುತ್ತಿದೆಯೇ ಎಂದು ಖಚಿತಪಡಿಸಿಕೊಳ್ಳಲು ಬಳಸಲಾಗುವ ಪರೀಕ್ಷಾ ವಾಕ್ಯ ಇದು.',
    'lang': 'kannada',
}

with open(WavPath, "rb") as AudioFile:
    response = requests.get(base_url, params=params, files={'speaker_wav': AudioFile})

if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
```

**Query Parameters:**

| Parameter     | Type   | Required  | Description |
| ------------- | ------ | --------- | ----------- |
| `text`        | string | Mandatory | Input text to convert to speech. For English, text must be lowercase. |
| `lang`        | string | Mandatory | Language: bhojpuri, bengali, english, gujarati, hindi, chhattisgarhi, kannada, magahi, maithili, marathi, telugu |
| `speaker_wav` | file   | Mandatory | Reference WAV file for the speaker voice |

**Response:** `200 OK` with `Content-Type: audio/wav`

---

### Synthesize with Style (POST)

```bash
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "आपका दिन शुभ हो",
    "voice": "hi_female",
    "style": "happy",
    "speed": 1.0,
    "pitch": 1.0
  }' \
  --output speech.wav
```

### Gujarati Synthesis

```bash
curl -X POST "http://localhost:8000/synthesize" \
  -H "Content-Type: application/json" \
  -d '{"text": "નમસ્તે, કેમ છો?", "voice": "gu_mms", "style": "calm"}' \
  --output gujarati.wav
```

### List Style Presets

```bash
curl http://localhost:8000/styles
```

## 🎤 Available Voices

| Language      | Code | Male        | Female        | Notes        |
| ------------- | ---- | ----------- | ------------- | ------------ |
| Hindi         | hi   | ✅ hi_male  | ✅ hi_female  | SYSPIN       |
| Bengali       | bn   | ✅ bn_male  | ✅ bn_female  | SYSPIN       |
| Marathi       | mr   | ✅ mr_male  | ✅ mr_female  | SYSPIN       |
| Telugu        | te   | ✅ te_male  | ✅ te_female  | SYSPIN       |
| Kannada       | kn   | ✅ kn_male  | ✅ kn_female  | SYSPIN       |
| Bhojpuri      | bho  | ✅ bho_male | ✅ bho_female | SYSPIN       |
| Chhattisgarhi | hne  | ✅ hne_male | ✅ hne_female | SYSPIN       |
| Maithili      | mai  | ✅ mai_male | ✅ mai_female | SYSPIN       |
| Magahi        | mag  | ✅ mag_male | ✅ mag_female | SYSPIN       |
| English       | en   | ✅ en_male  | ✅ en_female  | SYSPIN       |
| **Gujarati**  | gu   | ✅ gu_mms   | -             | Facebook MMS |
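The `/Get_Inference` endpoint takes full language names while the voices table uses short codes. A client-side helper bridging the two might look like this; it is an illustrative sketch built from the two tables above, not a function shipped in the repo.

```python
# Hypothetical helper: resolve /Get_Inference language names to voice IDs,
# using the code column of the voices table above.
LANG_TO_CODE = {
    "bhojpuri": "bho", "bengali": "bn", "english": "en", "gujarati": "gu",
    "hindi": "hi", "chhattisgarhi": "hne", "kannada": "kn", "magahi": "mag",
    "maithili": "mai", "marathi": "mr", "telugu": "te",
}

def resolve_voice(lang: str, gender: str = "female") -> str:
    code = LANG_TO_CODE[lang.lower()]
    if code == "gu":
        # Gujarati only ships the single Facebook MMS voice
        return "gu_mms"
    return f"{code}_{gender}"
```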
## 🐍 Python API

```python
from src.engine import TTSEngine

# Initialize engine
engine = TTSEngine(device="auto")

# Basic synthesis
output = engine.synthesize(
    text="गर्भावस्था में स्वस्थ आहार महत्वपूर्ण है",
    voice="hi_female"
)

# With style control
output = engine.synthesize(
    text="आपका दिन शुभ हो",
    voice="hi_male",
    style="happy",  # Use preset
    pitch=1.1,      # Or manual control
    speed=1.0,
    energy=1.2
)

# Gujarati
output = engine.synthesize(
    text="સ્વસ્થ રહો, ખુશ રહો",
    voice="gu_mms",
    style="calm"
)

# Save to file
engine.synthesize_to_file(
    text="નમસ્તે",
    output_path="hello.wav",
    voice="gu_mms",
    style="calm"
)
```

## 📁 Project Structure

```text
VoiceAPI/
├── src/
│   ├── config.py       # Language/voice/style configurations
│   ├── tokenizer.py    # Text tokenization & normalization
│   ├── engine.py       # Main TTS engine with style processor
│   ├── downloader.py   # HuggingFace model downloader
│   ├── api.py          # FastAPI REST server
│   └── cli.py          # Command-line interface
├── models/             # Downloaded models
├── dataset/            # SPICOR dataset (for fine-tuning)
├── technical_report.md
├── requirements.txt
└── README.md
```

## 📊 Performance

| Metric         | Value                           |
| -------------- | ------------------------------- |
| Languages      | 11                              |
| Voice Variants | 21                              |
| Style Presets  | 9                               |
| Model Size     | ~300MB (VITS), ~145MB (MMS)     |
| Inference Time | ~0.3s (M2 Mac, CPU)             |
| Sample Rate    | 22050 Hz (VITS), 16000 Hz (MMS) |

## 🙏 Credits

- **SYSPIN Models**: [IISc Bangalore](https://huggingface.co/SYSPIN)
- **MMS Models**: [Facebook Research](https://huggingface.co/facebook/mms-tts-guj)
- **Architecture**: VITS (Coqui AI)
- **Dataset**: SPICOR TTS Project, IISc SPIRE Lab

## 📜 License

CC BY 4.0 (SYSPIN), CC BY-NC 4.0 (MMS)

---

Built with ❤️ for the **Voice Tech for All Hackathon**
download_models.py
ADDED
@@ -0,0 +1,55 @@
#!/usr/bin/env python3
"""
Download all required TTS models from HuggingFace.
Run this on deployment to fetch models before starting the server.
"""

import os
import sys

# Add src to path
sys.path.insert(0, os.path.dirname(__file__))

from src.downloader import ModelDownloader
from src.config import LANGUAGE_CONFIGS


def main():
    print("=" * 60)
    print("Downloading TTS Models from HuggingFace...")
    print("=" * 60)

    downloader = ModelDownloader()

    # Download all configured models
    voices = list(LANGUAGE_CONFIGS.keys())
    print(f"\nModels to download: {len(voices)}")
    for v in voices:
        print(f"  - {v}")

    print("\n")

    success = 0
    failed = []

    for voice in voices:
        try:
            print(f"Downloading {voice}...")
            downloader.download_model(voice)
            success += 1
            print(f"  ✓ {voice} downloaded\n")
        except Exception as e:
            print(f"  ✗ {voice} failed: {e}\n")
            failed.append(voice)

    print("=" * 60)
    print(f"Download complete: {success}/{len(voices)} models")
    if failed:
        print(f"Failed: {', '.join(failed)}")
        return 1
    print("=" * 60)
    return 0


if __name__ == "__main__":
    sys.exit(main())
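The script above relies on `ModelDownloader` from `src/downloader.py`, which this diff view does not show. A minimal sketch of what such a class might look like is given below; the directory layout (`models/<voice>/`) matches the file list in this commit, but the class body itself is an assumption, not the repo's actual implementation (which would also fetch checkpoints, e.g. via `huggingface_hub`).

```python
from pathlib import Path

# Hypothetical sketch of src/downloader.py's ModelDownloader.
class ModelDownloader:
    def __init__(self, models_dir: str = "models"):
        self.models_dir = Path(models_dir)

    def target_dir(self, voice: str) -> Path:
        # e.g. "hi_female" -> models/hi_female, matching this commit's layout
        return self.models_dir / voice

    def download_model(self, voice: str) -> Path:
        dest = self.target_dir(voice)
        dest.mkdir(parents=True, exist_ok=True)
        # The real implementation would fetch model files here, e.g. with
        # huggingface_hub.snapshot_download(repo_id=..., local_dir=dest);
        # this sketch only prepares the destination directory.
        return dest
```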
models/.DS_Store
ADDED
Binary file (10.2 kB)
models/bho_female/.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
models/bho_female/README.md
ADDED
@@ -0,0 +1,3 @@
---
license: cc-by-4.0
---
models/bho_female/checkpoint_340000.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2182258024b05f739bf79002cb52cfa863605d54ee2eee5b4a5cd1fbaac797ab
size 997764677
models/bho_female/config.json
ADDED
@@ -0,0 +1,257 @@
{
  "output_path": ".",
  "logger_uri": null,
  "run_name": "vits_Bhojpuri_Female_30hrs",
  "project_name": null,
  "run_description": "\ud83d\udc38Coqui trainer run.",
  "print_step": 25,
  "plot_step": 100,
  "model_param_stats": false,
  "wandb_entity": null,
  "dashboard_logger": "tensorboard",
  "log_model_step": null,
  "save_step": 20000,
  "save_n_checkpoints": 1000,
  "save_checkpoints": true,
  "save_all_best": false,
  "save_best_after": 10000,
  "target_loss": null,
  "print_eval": true,
  "test_delay_epochs": -1,
  "run_eval": true,
  "run_eval_steps": null,
  "distributed_backend": "nccl",
  "distributed_url": "tcp://localhost:54321",
  "mixed_precision": true,
  "epochs": 1000,
  "batch_size": 40,
  "eval_batch_size": 16,
  "grad_clip": [1000, 1000],
  "scheduler_after_epoch": true,
  "lr": 0.001,
  "optimizer": "AdamW",
  "optimizer_params": {"betas": [0.8, 0.99], "eps": 1e-09, "weight_decay": 0.01},
  "lr_scheduler": null,
  "lr_scheduler_params": {},
  "use_grad_scaler": false,
  "cudnn_enable": true,
  "cudnn_deterministic": false,
  "cudnn_benchmark": false,
  "training_seed": 54321,
  "model": "vits",
  "num_loader_workers": 8,
  "num_eval_loader_workers": 4,
  "use_noise_augment": false,
  "audio": {
    "fft_size": 1024,
    "sample_rate": 22050,
    "win_length": 1024,
    "hop_length": 256,
    "num_mels": 80,
    "mel_fmin": 0,
    "mel_fmax": null
  },
  "use_phonemes": false,
  "phonemizer": null,
  "phoneme_language": "en-us",
  "compute_input_seq_cache": true,
  "text_cleaner": "multilingual_cleaners",
  "enable_eos_bos_chars": false,
  "test_sentences_file": "",
  "phoneme_cache_path": "./phoneme_cache",
  "characters": {
    "characters_class": "TTS.tts.models.vits.VitsCharacters",
    "vocab_dict": null,
    "pad": "<PAD>",
    "eos": "<EOS>",
    "bos": "<BOS>",
    "blank": "<BLNK>",
    "characters": "\u091a.\u0947\u0910\u0925\u092e\u0959\u091d\u0906\u0949?\u092d\u092a \u0939\u0928\u093d\u091f\u0940\u0938\u0935\u091b\u0923\u0921\u091e\u0926\u094b\u0915\u0924\u0948\u0943\u095b\u0941\u095e\u092c\u0908\u094c\u0927\u090b\u093e\u0922\u0907\u093c\u0902\u0937\u0920\u0905\u095c\u0913\u092f,\u093f\u0930\u0914\u0901\u092b\u0909\u0916\u0911\u094d\u0932\u091c\u090f\u090a\u0917\u0936\u095d\u0919\u0918\u0942",
    "punctuations": "!\u00a1'(),-.:;\u00bf? ",
    "phonemes": null,
    "is_unique": true,
    "is_sorted": true
  },
  "add_blank": true,
  "batch_group_size": 5,
  "loss_masking": null,
  "min_audio_len": 1,
  "max_audio_len": Infinity,
  "min_text_len": 1,
  "max_text_len": Infinity,
  "compute_f0": false,
  "compute_energy": false,
  "compute_linear_spec": true,
  "precompute_num_workers": 0,
  "start_by_longest": false,
  "shuffle": false,
  "drop_last": false,
  "datasets": [
    {
      "formatter": "syspin",
      "dataset_name": "",
      "path": ".",
      "meta_file_train": "../manifests/Bhojpuri_Female/30hrs.tsv",
      "ignored_speakers": null,
      "language": "",
      "phonemizer": "",
      "meta_file_val": "",
      "meta_file_attn_mask": ""
    }
  ],
  "test_sentences": [
    [
      "\u090f\u0928\u094d\u091f\u094d\u0930\u093e\u092a\u0940 \u0915\u0902\u092a\u094d\u092f\u0942\u091f\u093f\u0902\u0917 \u092e\u0947\u0902 \u090f\u0928\u094d\u091f\u094d\u0930\u094b\u092a\u0940 \u090a \u0911\u092a\u0930\u0947\u091f\u093f\u0902\u0917 \u0938\u093f\u0938\u094d\u091f\u092e \u0939 \u091c\u0947 \u092a\u0947 \u0938\u0930\u093e \u0915\u094d\u0930\u093f\u092a\u094d\u091f\u094b\u0917\u094d\u0930\u093e\u092b\u093f\u0915 \u092b\u0902\u0915\u094d\u0936\u0928 \u0938\u092c \u0915\u093e\u092e \u0915\u0930\u0947 \u0932\u0947\u0902",
      "Bhojpuri_Female",
      null,
      "bh"
    ]
  ],
  "eval_split_max_size": null,
  "eval_split_size": 0.01,
  "use_speaker_weighted_sampler": false,
  "speaker_weighted_sampler_alpha": 1.0,
  "use_language_weighted_sampler": false,
  "language_weighted_sampler_alpha": 1.0,
  "use_length_weighted_sampler": false,
  "length_weighted_sampler_alpha": 1.0,
  "model_args": {
    "num_chars": 85,
    "out_channels": 513,
    "spec_segment_size": 32,
    "hidden_channels": 192,
    "hidden_channels_ffn_text_encoder": 768,
    "num_heads_text_encoder": 2,
    "num_layers_text_encoder": 6,
    "kernel_size_text_encoder": 3,
    "dropout_p_text_encoder": 0.1,
    "dropout_p_duration_predictor": 0.5,
    "kernel_size_posterior_encoder": 5,
    "dilation_rate_posterior_encoder": 1,
    "num_layers_posterior_encoder": 16,
    "kernel_size_flow": 5,
    "dilation_rate_flow": 1,
    "num_layers_flow": 4,
    "resblock_type_decoder": "1",
    "resblock_kernel_sizes_decoder": [3, 7, 11],
    "resblock_dilation_sizes_decoder": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    "upsample_rates_decoder": [8, 8, 2, 2],
    "upsample_initial_channel_decoder": 512,
    "upsample_kernel_sizes_decoder": [16, 16, 4, 4],
    "periods_multi_period_discriminator": [2, 3, 5, 7, 11],
    "use_sdp": true,
    "noise_scale": 1.0,
    "inference_noise_scale": 0.667,
    "length_scale": 1,
    "noise_scale_dp": 1.0,
    "inference_noise_scale_dp": 1.0,
    "max_inference_len": null,
    "init_discriminator": true,
    "use_spectral_norm_disriminator": false,
    "use_speaker_embedding": false,
    "num_speakers": 0,
    "speakers_file": null,
    "d_vector_file": null,
    "speaker_embedding_channels": 256,
    "use_d_vector_file": false,
    "d_vector_dim": 0,
    "detach_dp_input": true,
    "use_language_embedding": false,
    "embedded_language_dim": 4,
    "num_languages": 0,
    "language_ids_file": null,
    "use_speaker_encoder_as_loss": false,
    "speaker_encoder_config_path": "",
    "speaker_encoder_model_path": "",
    "condition_dp_on_speaker": true,
    "freeze_encoder": false,
    "freeze_DP": false,
    "freeze_PE": false,
    "freeze_flow_decoder": false,
    "freeze_waveform_decoder": false,
    "encoder_sample_rate": null,
    "interpolate_z": true,
    "reinit_DP": false,
    "reinit_text_encoder": false
  },
  "lr_gen": 0.0002,
  "lr_disc": 0.0002,
  "lr_scheduler_gen": "ExponentialLR",
  "lr_scheduler_gen_params": {"gamma": 0.999875, "last_epoch": -1},
  "lr_scheduler_disc": "ExponentialLR",
  "lr_scheduler_disc_params": {"gamma": 0.999875, "last_epoch": -1},
  "kl_loss_alpha": 1.0,
  "disc_loss_alpha": 1.0,
  "gen_loss_alpha": 1.0,
  "feat_loss_alpha": 1.0,
  "mel_loss_alpha": 45.0,
  "dur_loss_alpha": 1.0,
  "speaker_encoder_loss_alpha": 1.0,
  "return_wav": true,
  "use_weighted_sampler": false,
  "weighted_sampler_attrs": {},
  "weighted_sampler_multipliers": {},
  "r": 1,
  "num_speakers": 0,
  "use_speaker_embedding": false,
  "speakers_file": null,
  "speaker_embedding_channels": 256,
  "language_ids_file": null,
  "use_language_embedding": false,
  "use_d_vector_file": false,
  "d_vector_file": null,
  "d_vector_dim": 0,
  "github_branch": "* dev"
}
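One internal consistency worth noting in the config above: the decoder's upsample rates multiply out to the audio hop length (8·8·2·2 = 256), so each latent frame produced by the VITS flow maps to exactly one STFT hop of generated waveform. A quick standalone check (values inlined from the config; loading `models/bho_female/config.json` with `json.load` would give the same numbers):

```python
from math import prod

# Copied from the config above: audio.hop_length and
# model_args.upsample_rates_decoder.
hop_length = 256
upsample_rates_decoder = [8, 8, 2, 2]

# The HiFi-GAN-style decoder upsamples latent frames by the product of
# its stage rates; that product must equal the STFT hop length, or the
# generated waveform length would not match the spectrogram frames.
assert prod(upsample_rates_decoder) == hop_length
```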
models/bho_male/.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
models/bho_male/README.md
ADDED
@@ -0,0 +1,3 @@
---
license: cc-by-4.0
---
models/bho_male/checkpoint_200000.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c4fb6ce54092c79ab526d4e9bc70514d7ea7f820b0184ef99e6ad3a7b9b72abc
size 997766981
models/bho_male/config.json
ADDED
@@ -0,0 +1,257 @@
{
    "output_path": ".",
    "logger_uri": null,
    "run_name": "vits_Bhojpuri_Male_30hrs",
    "project_name": null,
    "run_description": "\ud83d\udc38Coqui trainer run.",
    "print_step": 25,
    "plot_step": 100,
    "model_param_stats": false,
    "wandb_entity": null,
    "dashboard_logger": "tensorboard",
    "log_model_step": null,
    "save_step": 20000,
    "save_n_checkpoints": 1000,
    "save_checkpoints": true,
    "save_all_best": false,
    "save_best_after": 10000,
    "target_loss": null,
    "print_eval": true,
    "test_delay_epochs": -1,
    "run_eval": true,
    "run_eval_steps": null,
    "distributed_backend": "nccl",
    "distributed_url": "tcp://localhost:54321",
    "mixed_precision": true,
    "epochs": 500,
    "batch_size": 40,
    "eval_batch_size": 16,
    "grad_clip": [1000, 1000],
    "scheduler_after_epoch": true,
    "lr": 0.001,
    "optimizer": "AdamW",
    "optimizer_params": {
        "betas": [0.8, 0.99],
        "eps": 1e-09,
        "weight_decay": 0.01
    },
    "lr_scheduler": null,
    "lr_scheduler_params": {},
    "use_grad_scaler": false,
    "cudnn_enable": true,
    "cudnn_deterministic": false,
    "cudnn_benchmark": false,
    "training_seed": 54321,
    "model": "vits",
    "num_loader_workers": 8,
    "num_eval_loader_workers": 4,
    "use_noise_augment": false,
    "audio": {
        "fft_size": 1024,
        "sample_rate": 22050,
        "win_length": 1024,
        "hop_length": 256,
        "num_mels": 80,
        "mel_fmin": 0,
        "mel_fmax": null
    },
    "use_phonemes": false,
    "phonemizer": null,
    "phoneme_language": "en-us",
    "compute_input_seq_cache": true,
    "text_cleaner": "multilingual_cleaners",
    "enable_eos_bos_chars": false,
    "test_sentences_file": "",
    "phoneme_cache_path": "./phoneme_cache",
    "characters": {
        "characters_class": "TTS.tts.models.vits.VitsCharacters",
        "vocab_dict": null,
        "pad": "<PAD>",
        "eos": "<EOS>",
        "bos": "<BOS>",
        "blank": "<BLNK>",
        "characters": "\u091a.\u0947\u0910\u0925\u092e\u0959\u091d\u0906\u0949?\u092d \u092a\u0939\u0928\u093d\u091f\u0938\u0935\u0940\u091b\u0923\u0921\u091e\u0926\u094b\u0915\u0924\u0948\u0943\u095b\u0941\u095e\u092c\u0908\u0946\u094c\u0927\u090b\u093e\u0922\u0907\u093c\u0902\u0905\u0937\u0920\u095c\u0913\u092f,\u093f\u0930\u0901\u0914\u092b\u0909\u0916\u0911\u094d\u0932\u091c\u090f\u090a\u0917\u0936\u095d\u0919\u0918\u0942",
        "punctuations": "!\u00a1'(),-.:;\u00bf? ",
        "phonemes": null,
        "is_unique": true,
        "is_sorted": true
    },
    "add_blank": true,
    "batch_group_size": 5,
    "loss_masking": null,
    "min_audio_len": 1,
    "max_audio_len": Infinity,
    "min_text_len": 1,
    "max_text_len": Infinity,
    "compute_f0": false,
    "compute_energy": false,
    "compute_linear_spec": true,
    "precompute_num_workers": 0,
    "start_by_longest": false,
    "shuffle": false,
    "drop_last": false,
    "datasets": [
        {
            "formatter": "syspin",
            "dataset_name": "",
            "path": ".",
            "meta_file_train": "../manifests/Bhojpuri_Male/30hrs.tsv",
            "ignored_speakers": null,
            "language": "",
            "phonemizer": "",
            "meta_file_val": "",
            "meta_file_attn_mask": ""
        }
    ],
    "test_sentences": [
        [
            "\u090f\u0928\u094d\u091f\u094d\u0930\u093e\u092a\u0940 \u0915\u0902\u092a\u094d\u092f\u0942\u091f\u093f\u0902\u0917 \u092e\u0947\u0902 \u090f\u0928\u094d\u091f\u094d\u0930\u094b\u092a\u0940 \u090a \u0911\u092a\u0930\u0947\u091f\u093f\u0902\u0917 \u0938\u093f\u0938\u094d\u091f\u092e \u0939 \u091c\u0947 \u092a\u0947 \u0938\u0930\u093e \u0915\u094d\u0930\u093f\u092a\u094d\u091f\u094b\u0917\u094d\u0930\u093e\u092b\u093f\u0915 \u092b\u0902\u0915\u094d\u0936\u0928 \u0938\u092c \u0915\u093e\u092e \u0915\u0930\u0947 \u0932\u0947\u0902",
            "Bhojpuri_Male",
            null,
            "bh"
        ]
    ],
    "eval_split_max_size": null,
    "eval_split_size": 0.01,
    "use_speaker_weighted_sampler": false,
    "speaker_weighted_sampler_alpha": 1.0,
    "use_language_weighted_sampler": false,
    "language_weighted_sampler_alpha": 1.0,
    "use_length_weighted_sampler": false,
    "length_weighted_sampler_alpha": 1.0,
    "model_args": {
        "num_chars": 86,
        "out_channels": 513,
        "spec_segment_size": 32,
        "hidden_channels": 192,
        "hidden_channels_ffn_text_encoder": 768,
        "num_heads_text_encoder": 2,
        "num_layers_text_encoder": 6,
        "kernel_size_text_encoder": 3,
        "dropout_p_text_encoder": 0.1,
        "dropout_p_duration_predictor": 0.5,
        "kernel_size_posterior_encoder": 5,
        "dilation_rate_posterior_encoder": 1,
        "num_layers_posterior_encoder": 16,
        "kernel_size_flow": 5,
        "dilation_rate_flow": 1,
        "num_layers_flow": 4,
        "resblock_type_decoder": "1",
        "resblock_kernel_sizes_decoder": [3, 7, 11],
        "resblock_dilation_sizes_decoder": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
        "upsample_rates_decoder": [8, 8, 2, 2],
        "upsample_initial_channel_decoder": 512,
        "upsample_kernel_sizes_decoder": [16, 16, 4, 4],
        "periods_multi_period_discriminator": [2, 3, 5, 7, 11],
        "use_sdp": true,
        "noise_scale": 1.0,
        "inference_noise_scale": 0.667,
        "length_scale": 1,
        "noise_scale_dp": 1.0,
        "inference_noise_scale_dp": 1.0,
        "max_inference_len": null,
        "init_discriminator": true,
        "use_spectral_norm_disriminator": false,
        "use_speaker_embedding": false,
        "num_speakers": 0,
        "speakers_file": null,
        "d_vector_file": null,
        "speaker_embedding_channels": 256,
        "use_d_vector_file": false,
        "d_vector_dim": 0,
        "detach_dp_input": true,
        "use_language_embedding": false,
        "embedded_language_dim": 4,
        "num_languages": 0,
        "language_ids_file": null,
        "use_speaker_encoder_as_loss": false,
        "speaker_encoder_config_path": "",
        "speaker_encoder_model_path": "",
        "condition_dp_on_speaker": true,
        "freeze_encoder": false,
        "freeze_DP": false,
        "freeze_PE": false,
        "freeze_flow_decoder": false,
        "freeze_waveform_decoder": false,
        "encoder_sample_rate": null,
        "interpolate_z": true,
        "reinit_DP": false,
        "reinit_text_encoder": false
    },
    "lr_gen": 0.0002,
    "lr_disc": 0.0002,
    "lr_scheduler_gen": "ExponentialLR",
    "lr_scheduler_gen_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "lr_scheduler_disc": "ExponentialLR",
    "lr_scheduler_disc_params": {
        "gamma": 0.999875,
        "last_epoch": -1
    },
    "kl_loss_alpha": 1.0,
    "disc_loss_alpha": 1.0,
    "gen_loss_alpha": 1.0,
    "feat_loss_alpha": 1.0,
    "mel_loss_alpha": 45.0,
    "dur_loss_alpha": 1.0,
    "speaker_encoder_loss_alpha": 1.0,
    "return_wav": true,
    "use_weighted_sampler": false,
    "weighted_sampler_attrs": {},
    "weighted_sampler_multipliers": {},
    "r": 1,
    "num_speakers": 0,
    "use_speaker_embedding": false,
    "speakers_file": null,
    "speaker_embedding_channels": 256,
    "language_ids_file": null,
    "use_language_embedding": false,
    "use_d_vector_file": false,
    "d_vector_file": null,
    "d_vector_dim": 0,
    "github_branch": "* dev"
}
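The `audio` block in the config above fixes the vocoder's timing: with a 22050 Hz sample rate and a 256-sample hop, each mel frame covers roughly 11.6 ms. A minimal stdlib sketch (illustrative only, not part of the repo) of reading such fields; note that Coqui configs contain bare `Infinity` tokens, which Python's `json` module accepts by default:

```python
import json

# Coqui config files may contain bare Infinity tokens (e.g.
# "max_audio_len": Infinity); json.loads maps these to float("inf").
snippet = '{"sample_rate": 22050, "hop_length": 256, "max_audio_len": Infinity}'
cfg = json.loads(snippet)

assert cfg["max_audio_len"] == float("inf")
frame_ms = 1000 * cfg["hop_length"] / cfg["sample_rate"]
print(round(frame_ms, 2))  # 256 samples at 22050 Hz ≈ 11.61 ms per frame
```

Strict JSON parsers (and some other languages' libraries) will reject the `Infinity` tokens, so configs like this one are best read with a parser that follows Python's lenient defaults.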
models/bn_female/bn_female_vits_30hrs.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:53208e056050bb485df9192a0d444d3fa72eefe15b2c04840e9a500e4ac1bbf4
size 333255366
models/bn_female/chars.txt
ADDED
@@ -0,0 +1 @@
ূঞংঘঔদলৌআডখরথটোৗঙঐানষঝবছঅঢ়ঁপউধঢশগয়।?িক,যঈস্ত়ফঋৈজ'ীঠৰণওৎঃমচঊড়ইুভে এ"ৃহ
models/bn_female/jit_infer.py
ADDED
@@ -0,0 +1,32 @@
import numpy as np
import torch

from extra import TTSTokenizer, VitsConfig, CharactersConfig, VitsCharacters

# Bengali female voice: load the character inventory exported at training time.
# Only newlines are stripped, because the space character is part of the vocabulary.
with open("chars.txt", "r") as f:
    letters = f.read().strip("\n")

model = "bn_female_vits_30hrs.pt"
text = " হলেও আমাদের সবার সার্বিক শৃঙ্খলা বোধের উন্নতি হবে"

config = VitsConfig(
    text_cleaner="multilingual_cleaners",
    characters=CharactersConfig(
        characters_class=VitsCharacters,
        pad="<PAD>",
        eos="<EOS>",
        bos="<BOS>",
        blank="<BLNK>",
        characters=letters,
        punctuations="!¡'(),-.:;¿? ",
        phonemes=None,
    ),
)
tokenizer, config = TTSTokenizer.init_from_config(config)

# Tokenize, run the TorchScript model, and write the waveform to disk.
x = tokenizer.text_to_ids(text)
x = torch.from_numpy(np.array(x)).unsqueeze(0)
net = torch.jit.load(model)
with torch.no_grad():
    out = net(x)

import soundfile as sf

sf.write("jit.wav", out.squeeze().cpu().numpy(), 22050)
models/bn_male/bn_male_vits_30hrs.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9d8d52f0bc33ef01d733eef36fb00f1e17192b8c86123a0ccf84a24dbb80d0e
size 333249868
models/bn_male/chars.txt
ADDED
@@ -0,0 +1 @@
ূঞংঘঔদলৌআডখরঃটোৗঙঐনাঝষবঅছঢ়ঁপউধঢশগয়।?িক,যঈসত্ৈফ়ঊজ'ীঠৎণওঋৰমচড়ভুইে থএ"ৃহ
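Each `chars.txt` holds the model's character inventory as a single ordered line, and the tokenizer maps every character to its index in that string. An illustrative sketch of that mapping (using a stand-in alphabet, since the logic is independent of the script):

```python
# Stand-in inventory; the real one is the single Bengali line in chars.txt.
# The space character is a real vocabulary entry, which is why the loader
# strips only newlines from the file.
chars = "abc "

char_to_id = {c: i for i, c in enumerate(chars)}
id_to_char = {i: c for i, c in enumerate(chars)}

ids = [char_to_id[c] for c in "cab"]
print(ids)                                   # [2, 0, 1]
print("".join(id_to_char[i] for i in ids))   # cab
```

Because IDs are positional, the `chars.txt` shipped with each checkpoint must match the one used at training time; reordering the line would silently remap every token.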
models/bn_male/extra.py
ADDED
@@ -0,0 +1,787 @@
import re
from dataclasses import asdict, dataclass, field, replace
from typing import Callable, Dict, List, Union

_whitespace_re = re.compile(r"\s+")

# from TTS.tts.configs.shared_configs import BaseTTSConfig
# from TTS.tts.models.vits import VitsArgs, VitsAudioConfig


@dataclass
class CharactersConfig:
    characters_class: str = None

    # used by BaseVocabulary
    vocab_dict: Dict = None

    # used by BaseCharacters
    pad: str = None
    eos: str = None
    bos: str = None
    blank: str = None
    characters: str = None
    punctuations: str = None
    phonemes: str = None
    is_unique: bool = True  # for backwards compatibility of models trained with char sets with duplicates
    is_sorted: bool = True


@dataclass
class BaseTTSConfig:
    # audio: BaseAudioConfig = field(default_factory=BaseAudioConfig)
    # phoneme settings
    use_phonemes: bool = False
    phonemizer: str = None
    phoneme_language: str = None
    compute_input_seq_cache: bool = False
    text_cleaner: str = None
    enable_eos_bos_chars: bool = False
    test_sentences_file: str = ""
    phoneme_cache_path: str = None
    # vocabulary parameters
    characters: CharactersConfig = None
    add_blank: bool = False
    # training params
    batch_group_size: int = 0
    loss_masking: bool = None
    # dataloading
    min_audio_len: int = 1
    max_audio_len: int = float("inf")
    min_text_len: int = 1
    max_text_len: int = float("inf")
    compute_f0: bool = False
    compute_energy: bool = False
    compute_linear_spec: bool = False
    precompute_num_workers: int = 0
    use_noise_augment: bool = False
    start_by_longest: bool = False
    shuffle: bool = False
    drop_last: bool = False
    # dataset
    datasets: str = None
    # optimizer
    optimizer: str = "radam"
    optimizer_params: dict = None
    # scheduler
    lr_scheduler: str = None
    lr_scheduler_params: dict = field(default_factory=lambda: {})
    # testing
    test_sentences: List[str] = field(default_factory=lambda: [])
    # evaluation
    eval_split_max_size: int = None
    eval_split_size: float = 0.01
    # weighted samplers
    use_speaker_weighted_sampler: bool = False
    speaker_weighted_sampler_alpha: float = 1.0
    use_language_weighted_sampler: bool = False
    language_weighted_sampler_alpha: float = 1.0
    use_length_weighted_sampler: bool = False
    length_weighted_sampler_alpha: float = 1.0


@dataclass
class VitsAudioConfig:
    fft_size: int = 1024
    sample_rate: int = 22050
    win_length: int = 1024
    hop_length: int = 256
    num_mels: int = 80
    mel_fmin: int = 0
    mel_fmax: int = None


@dataclass
class VitsArgs:
    num_chars: int = 100
    out_channels: int = 513
    spec_segment_size: int = 32
    hidden_channels: int = 192
    hidden_channels_ffn_text_encoder: int = 768
    num_heads_text_encoder: int = 2
    num_layers_text_encoder: int = 6
    kernel_size_text_encoder: int = 3
    dropout_p_text_encoder: float = 0.1
    dropout_p_duration_predictor: float = 0.5
    kernel_size_posterior_encoder: int = 5
    dilation_rate_posterior_encoder: int = 1
    num_layers_posterior_encoder: int = 16
    kernel_size_flow: int = 5
    dilation_rate_flow: int = 1
    num_layers_flow: int = 4
    resblock_type_decoder: str = "1"
    resblock_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [3, 7, 11])
    resblock_dilation_sizes_decoder: List[List[int]] = field(default_factory=lambda: [[1, 3, 5], [1, 3, 5], [1, 3, 5]])
    upsample_rates_decoder: List[int] = field(default_factory=lambda: [8, 8, 2, 2])
    upsample_initial_channel_decoder: int = 512
    upsample_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [16, 16, 4, 4])
    periods_multi_period_discriminator: List[int] = field(default_factory=lambda: [2, 3, 5, 7, 11])
    use_sdp: bool = True
    noise_scale: float = 1.0
    inference_noise_scale: float = 0.667
    length_scale: float = 1
    noise_scale_dp: float = 1.0
    inference_noise_scale_dp: float = 1.0
    max_inference_len: int = None
    init_discriminator: bool = True
    use_spectral_norm_disriminator: bool = False
    use_speaker_embedding: bool = False
    num_speakers: int = 0
    speakers_file: str = None
    d_vector_file: List[str] = None
    speaker_embedding_channels: int = 256
    use_d_vector_file: bool = False
    d_vector_dim: int = 0
    detach_dp_input: bool = True
    use_language_embedding: bool = False
    embedded_language_dim: int = 4
    num_languages: int = 0
    language_ids_file: str = None
    use_speaker_encoder_as_loss: bool = False
    speaker_encoder_config_path: str = ""
    speaker_encoder_model_path: str = ""
    condition_dp_on_speaker: bool = True
    freeze_encoder: bool = False
    freeze_DP: bool = False
    freeze_PE: bool = False
    freeze_flow_decoder: bool = False
    freeze_waveform_decoder: bool = False
    encoder_sample_rate: int = None
    interpolate_z: bool = True
    reinit_DP: bool = False
    reinit_text_encoder: bool = False


@dataclass
class VitsConfig(BaseTTSConfig):
    model: str = "vits"
    # model specific params
    model_args: VitsArgs = field(default_factory=VitsArgs)
    audio: VitsAudioConfig = field(default_factory=VitsAudioConfig)

    # optimizer
    grad_clip: List[float] = field(default_factory=lambda: [1000, 1000])
    lr_gen: float = 0.0002
    lr_disc: float = 0.0002
    lr_scheduler_gen: str = "ExponentialLR"
    lr_scheduler_gen_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    lr_scheduler_disc: str = "ExponentialLR"
    lr_scheduler_disc_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    scheduler_after_epoch: bool = True
    optimizer: str = "AdamW"
    optimizer_params: dict = field(default_factory=lambda: {"betas": [0.8, 0.99], "eps": 1e-9, "weight_decay": 0.01})

    # loss params
    kl_loss_alpha: float = 1.0
    disc_loss_alpha: float = 1.0
    gen_loss_alpha: float = 1.0
    feat_loss_alpha: float = 1.0
    mel_loss_alpha: float = 45.0
    dur_loss_alpha: float = 1.0
    speaker_encoder_loss_alpha: float = 1.0

    # data loader params
    return_wav: bool = True
    compute_linear_spec: bool = True

    # sampler params
    use_weighted_sampler: bool = False  # TODO: move it to the base config
    weighted_sampler_attrs: dict = field(default_factory=lambda: {})
    weighted_sampler_multipliers: dict = field(default_factory=lambda: {})

    # overrides
    r: int = 1  # DO NOT CHANGE
    add_blank: bool = True

    # testing
    test_sentences: List[List] = field(
        default_factory=lambda: [
            ["It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."],
            ["Be a voice, not an echo."],
            ["I'm sorry Dave. I'm afraid I can't do that."],
            ["This cake is great. It's so delicious and moist."],
            ["Prior to November 22, 1963."],
        ]
    )

    # multi-speaker settings
    # use speaker embedding layer
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    speakers_file: str = None
    speaker_embedding_channels: int = 256
    language_ids_file: str = None
    use_language_embedding: bool = False

    # use d-vectors
    use_d_vector_file: bool = False
    d_vector_file: List[str] = None
    d_vector_dim: int = None

    def __post_init__(self):
        pass
        # for key, val in self.model_args.items():
        #     if hasattr(self, key):
        #         self[key] = val


def parse_symbols():
    return {
        "pad": _pad,
        "eos": _eos,
        "bos": _bos,
        "characters": _characters,
        "punctuations": _punctuations,
        "phonemes": _phonemes,
    }


# DEFAULT SET OF GRAPHEMES
_pad = "<PAD>"
_eos = "<EOS>"
_bos = "<BOS>"
_blank = "<BLNK>"  # TODO: check if we need this alongside with PAD
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_punctuations = "!'(),-.:;? "


# DEFAULT SET OF IPA PHONEMES
# Phonemes definition (All IPA characters)
_vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
_non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
_pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
_suprasegmentals = "ˈˌːˑ"
_other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
_diacrilics = "ɚ˞ɫ"
_phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics


class BaseVocabulary:
    """Base Vocabulary class.

    This class only needs a vocabulary dictionary without specifying the characters.

    Args:
        vocab (Dict): A dictionary of characters and their corresponding indices.
    """

    def __init__(self, vocab: Dict, pad: str = None, blank: str = None, bos: str = None, eos: str = None):
        self.vocab = vocab
        self.pad = pad
        self.blank = blank
        self.bos = bos
        self.eos = eos

    @property
    def pad_id(self) -> int:
        """Return the index of the padding character. If the padding character is not specified, return the length
        of the vocabulary."""
        return self.char_to_id(self.pad) if self.pad else len(self.vocab)

    @property
    def blank_id(self) -> int:
        """Return the index of the blank character. If the blank character is not specified, return the length of
        the vocabulary."""
        return self.char_to_id(self.blank) if self.blank else len(self.vocab)

    @property
    def bos_id(self) -> int:
        """Return the index of the bos character. If the bos character is not specified, return the length of the
        vocabulary."""
        return self.char_to_id(self.bos) if self.bos else len(self.vocab)

    @property
    def eos_id(self) -> int:
        """Return the index of the eos character. If the eos character is not specified, return the length of the
        vocabulary."""
        return self.char_to_id(self.eos) if self.eos else len(self.vocab)

    @property
    def vocab(self):
        """Return the vocabulary dictionary."""
        return self._vocab

    @vocab.setter
    def vocab(self, vocab):
        """Set the vocabulary dictionary and character mapping dictionaries."""
        self._vocab, self._char_to_id, self._id_to_char = None, None, None
        if vocab is not None:
            self._vocab = vocab
            self._char_to_id = {char: idx for idx, char in enumerate(self._vocab)}
            self._id_to_char = {
                idx: char for idx, char in enumerate(self._vocab)  # pylint: disable=unnecessary-comprehension
            }

    @staticmethod
    def init_from_config(config, **kwargs):
        """Initialize from the given config."""
        if config.characters is not None and "vocab_dict" in config.characters and config.characters.vocab_dict:
            return (
                BaseVocabulary(
                    config.characters.vocab_dict,
                    config.characters.pad,
                    config.characters.blank,
                    config.characters.bos,
                    config.characters.eos,
                ),
                config,
            )
        return BaseVocabulary(**kwargs), config

    def to_config(self):
        return CharactersConfig(
            vocab_dict=self._vocab,
            pad=self.pad,
            eos=self.eos,
            bos=self.bos,
            blank=self.blank,
            is_unique=False,
            is_sorted=False,
        )

    @property
    def num_chars(self):
        """Return number of tokens in the vocabulary."""
        return len(self._vocab)

    def char_to_id(self, char: str) -> int:
        """Map a character to a token ID."""
        try:
            return self._char_to_id[char]
        except KeyError as e:
            raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e

    def id_to_char(self, idx: int) -> str:
        """Map a token ID to a character."""
        return self._id_to_char[idx]


class BaseCharacters:
    def __init__(
        self,
        characters: str = None,
        punctuations: str = None,
        pad: str = None,
        eos: str = None,
        bos: str = None,
        blank: str = None,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        self._characters = characters
        self._punctuations = punctuations
        self._pad = pad
        self._eos = eos
        self._bos = bos
        self._blank = blank
        self.is_unique = is_unique
        self.is_sorted = is_sorted
        self._create_vocab()

    @property
    def pad_id(self) -> int:
        return self.char_to_id(self.pad) if self.pad else len(self.vocab)

    @property
    def blank_id(self) -> int:
        return self.char_to_id(self.blank) if self.blank else len(self.vocab)

    @property
    def eos_id(self) -> int:
        return self.char_to_id(self.eos) if self.eos else len(self.vocab)

    @property
    def bos_id(self) -> int:
        return self.char_to_id(self.bos) if self.bos else len(self.vocab)

    @property
    def characters(self):
        return self._characters

    @characters.setter
    def characters(self, characters):
        self._characters = characters
        self._create_vocab()

    @property
    def punctuations(self):
| 418 |
+
return self._punctuations
|
| 419 |
+
|
| 420 |
+
@punctuations.setter
|
| 421 |
+
def punctuations(self, punctuations):
|
| 422 |
+
self._punctuations = punctuations
|
| 423 |
+
self._create_vocab()
|
| 424 |
+
|
| 425 |
+
@property
|
| 426 |
+
def pad(self):
|
| 427 |
+
return self._pad
|
| 428 |
+
|
| 429 |
+
@pad.setter
|
| 430 |
+
def pad(self, pad):
|
| 431 |
+
self._pad = pad
|
| 432 |
+
self._create_vocab()
|
| 433 |
+
|
| 434 |
+
@property
|
| 435 |
+
def eos(self):
|
| 436 |
+
return self._eos
|
| 437 |
+
|
| 438 |
+
@eos.setter
|
| 439 |
+
def eos(self, eos):
|
| 440 |
+
self._eos = eos
|
| 441 |
+
self._create_vocab()
|
| 442 |
+
|
| 443 |
+
@property
|
| 444 |
+
def bos(self):
|
| 445 |
+
return self._bos
|
| 446 |
+
|
| 447 |
+
@bos.setter
|
| 448 |
+
def bos(self, bos):
|
| 449 |
+
self._bos = bos
|
| 450 |
+
self._create_vocab()
|
| 451 |
+
|
| 452 |
+
@property
|
| 453 |
+
def blank(self):
|
| 454 |
+
return self._blank
|
| 455 |
+
|
| 456 |
+
@blank.setter
|
| 457 |
+
def blank(self, blank):
|
| 458 |
+
self._blank = blank
|
| 459 |
+
self._create_vocab()
|
| 460 |
+
|
| 461 |
+
@property
|
| 462 |
+
def vocab(self):
|
| 463 |
+
return self._vocab
|
| 464 |
+
|
| 465 |
+
@vocab.setter
|
| 466 |
+
def vocab(self, vocab):
|
| 467 |
+
self._vocab = vocab
|
| 468 |
+
self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
|
| 469 |
+
self._id_to_char = {
|
| 470 |
+
idx: char for idx, char in enumerate(self.vocab) # pylint: disable=unnecessary-comprehension
|
| 471 |
+
}
|
| 472 |
+
|
| 473 |
+
@property
|
| 474 |
+
def num_chars(self):
|
| 475 |
+
return len(self._vocab)
|
| 476 |
+
|
| 477 |
+
def _create_vocab(self):
|
| 478 |
+
_vocab = self._characters
|
| 479 |
+
if self.is_unique:
|
| 480 |
+
_vocab = list(set(_vocab))
|
| 481 |
+
if self.is_sorted:
|
| 482 |
+
_vocab = sorted(_vocab)
|
| 483 |
+
_vocab = list(_vocab)
|
| 484 |
+
_vocab = [self._blank] + _vocab if self._blank is not None and len(self._blank) > 0 else _vocab
|
| 485 |
+
_vocab = [self._bos] + _vocab if self._bos is not None and len(self._bos) > 0 else _vocab
|
| 486 |
+
_vocab = [self._eos] + _vocab if self._eos is not None and len(self._eos) > 0 else _vocab
|
| 487 |
+
_vocab = [self._pad] + _vocab if self._pad is not None and len(self._pad) > 0 else _vocab
|
| 488 |
+
self.vocab = _vocab + list(self._punctuations)
|
| 489 |
+
if self.is_unique:
|
| 490 |
+
duplicates = {x for x in self.vocab if self.vocab.count(x) > 1}
|
| 491 |
+
assert (
|
| 492 |
+
len(self.vocab) == len(self._char_to_id) == len(self._id_to_char)
|
| 493 |
+
), f" [!] There are duplicate characters in the character set. {duplicates}"
|
| 494 |
+
|
| 495 |
+
def char_to_id(self, char: str) -> int:
|
| 496 |
+
try:
|
| 497 |
+
return self._char_to_id[char]
|
| 498 |
+
except KeyError as e:
|
| 499 |
+
raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e
|
| 500 |
+
|
| 501 |
+
def id_to_char(self, idx: int) -> str:
|
| 502 |
+
return self._id_to_char[idx]
|
| 503 |
+
|
| 504 |
+
def print_log(self, level: int = 0):
|
| 505 |
+
"""
|
| 506 |
+
Prints the vocabulary in a nice format.
|
| 507 |
+
"""
|
| 508 |
+
indent = "\t" * level
|
| 509 |
+
print(f"{indent}| > Characters: {self._characters}")
|
| 510 |
+
print(f"{indent}| > Punctuations: {self._punctuations}")
|
| 511 |
+
print(f"{indent}| > Pad: {self._pad}")
|
| 512 |
+
print(f"{indent}| > EOS: {self._eos}")
|
| 513 |
+
print(f"{indent}| > BOS: {self._bos}")
|
| 514 |
+
print(f"{indent}| > Blank: {self._blank}")
|
| 515 |
+
print(f"{indent}| > Vocab: {self.vocab}")
|
| 516 |
+
print(f"{indent}| > Num chars: {self.num_chars}")
|
| 517 |
+
|
| 518 |
+
@staticmethod
|
| 519 |
+
def init_from_config(config: "Coqpit"): # pylint: disable=unused-argument
|
| 520 |
+
"""Init your character class from a config.
|
| 521 |
+
|
| 522 |
+
Implement this method for your subclass.
|
| 523 |
+
"""
|
| 524 |
+
# use character set from config
|
| 525 |
+
if config.characters is not None:
|
| 526 |
+
return BaseCharacters(**config.characters), config
|
| 527 |
+
# return default character set
|
| 528 |
+
characters = BaseCharacters()
|
| 529 |
+
new_config = replace(config, characters=characters.to_config())
|
| 530 |
+
return characters, new_config
|
| 531 |
+
|
| 532 |
+
def to_config(self) -> "CharactersConfig":
|
| 533 |
+
return CharactersConfig(
|
| 534 |
+
characters=self._characters,
|
| 535 |
+
punctuations=self._punctuations,
|
| 536 |
+
pad=self._pad,
|
| 537 |
+
eos=self._eos,
|
| 538 |
+
bos=self._bos,
|
| 539 |
+
blank=self._blank,
|
| 540 |
+
is_unique=self.is_unique,
|
| 541 |
+
is_sorted=self.is_sorted,
|
| 542 |
+
)
|
| 543 |
+
|
| 544 |
+
|
| 545 |
+
class IPAPhonemes(BaseCharacters):
|
| 546 |
+
|
| 547 |
+
|
| 548 |
+
def __init__(
|
| 549 |
+
self,
|
| 550 |
+
characters: str = _phonemes,
|
| 551 |
+
punctuations: str = _punctuations,
|
| 552 |
+
pad: str = _pad,
|
| 553 |
+
eos: str = _eos,
|
| 554 |
+
bos: str = _bos,
|
| 555 |
+
blank: str = _blank,
|
| 556 |
+
is_unique: bool = False,
|
| 557 |
+
is_sorted: bool = True,
|
| 558 |
+
) -> None:
|
| 559 |
+
super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)
|
| 560 |
+
|
| 561 |
+
@staticmethod
|
| 562 |
+
def init_from_config(config: "Coqpit"):
|
| 563 |
+
"""Init a IPAPhonemes object from a model config
|
| 564 |
+
|
| 565 |
+
If characters are not defined in the config, it will be set to the default characters and the config
|
| 566 |
+
will be updated.
|
| 567 |
+
"""
|
| 568 |
+
# band-aid for compatibility with old models
|
| 569 |
+
if "characters" in config and config.characters is not None:
|
| 570 |
+
if "phonemes" in config.characters and config.characters.phonemes is not None:
|
| 571 |
+
config.characters["characters"] = config.characters["phonemes"]
|
| 572 |
+
return (
|
| 573 |
+
IPAPhonemes(
|
| 574 |
+
characters=config.characters["characters"],
|
| 575 |
+
punctuations=config.characters["punctuations"],
|
| 576 |
+
pad=config.characters["pad"],
|
| 577 |
+
eos=config.characters["eos"],
|
| 578 |
+
bos=config.characters["bos"],
|
| 579 |
+
blank=config.characters["blank"],
|
| 580 |
+
is_unique=config.characters["is_unique"],
|
| 581 |
+
is_sorted=config.characters["is_sorted"],
|
| 582 |
+
),
|
| 583 |
+
config,
|
| 584 |
+
)
|
| 585 |
+
# use character set from config
|
| 586 |
+
if config.characters is not None:
|
| 587 |
+
return IPAPhonemes(**config.characters), config
|
| 588 |
+
# return default character set
|
| 589 |
+
characters = IPAPhonemes()
|
| 590 |
+
new_config = replace(config, characters=characters.to_config())
|
| 591 |
+
return characters, new_config
|
| 592 |
+
|
| 593 |
+
|
| 594 |
+
class Graphemes(BaseCharacters):
|
| 595 |
+
|
| 596 |
+
|
| 597 |
+
def __init__(
|
| 598 |
+
self,
|
| 599 |
+
characters: str = _characters,
|
| 600 |
+
punctuations: str = _punctuations,
|
| 601 |
+
pad: str = _pad,
|
| 602 |
+
eos: str = _eos,
|
| 603 |
+
bos: str = _bos,
|
| 604 |
+
blank: str = _blank,
|
| 605 |
+
is_unique: bool = False,
|
| 606 |
+
is_sorted: bool = True,
|
| 607 |
+
) -> None:
|
| 608 |
+
super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)
|
| 609 |
+
|
| 610 |
+
@staticmethod
|
| 611 |
+
def init_from_config(config: "Coqpit"):
|
| 612 |
+
"""Init a Graphemes object from a model config
|
| 613 |
+
|
| 614 |
+
If characters are not defined in the config, it will be set to the default characters and the config
|
| 615 |
+
will be updated.
|
| 616 |
+
"""
|
| 617 |
+
if config.characters is not None:
|
| 618 |
+
# band-aid for compatibility with old models
|
| 619 |
+
if "phonemes" in config.characters:
|
| 620 |
+
return (
|
| 621 |
+
Graphemes(
|
| 622 |
+
characters=config.characters["characters"],
|
| 623 |
+
punctuations=config.characters["punctuations"],
|
| 624 |
+
pad=config.characters["pad"],
|
| 625 |
+
eos=config.characters["eos"],
|
| 626 |
+
bos=config.characters["bos"],
|
| 627 |
+
blank=config.characters["blank"],
|
| 628 |
+
is_unique=config.characters["is_unique"],
|
| 629 |
+
is_sorted=config.characters["is_sorted"],
|
| 630 |
+
),
|
| 631 |
+
config,
|
| 632 |
+
)
|
| 633 |
+
return Graphemes(**config.characters), config
|
| 634 |
+
characters = Graphemes()
|
| 635 |
+
new_config = replace(config, characters=characters.to_config())
|
| 636 |
+
return characters, new_config
|
| 637 |
+
|
| 638 |
+
|
| 639 |
+
if __name__ == "__main__":
|
| 640 |
+
gr = Graphemes()
|
| 641 |
+
ph = IPAPhonemes()
|
| 642 |
+
gr.print_log()
|
| 643 |
+
ph.print_log()
|
| 644 |
+
|
| 645 |
+
|
| 646 |
+
class VitsCharacters(BaseCharacters):
|
| 647 |
+
"""Characters class for VITs model for compatibility with pre-trained models"""
|
| 648 |
+
|
| 649 |
+
def __init__(
|
| 650 |
+
self,
|
| 651 |
+
graphemes: str = _characters,
|
| 652 |
+
punctuations: str = _punctuations,
|
| 653 |
+
pad: str = _pad,
|
| 654 |
+
ipa_characters: str = _phonemes,
|
| 655 |
+
) -> None:
|
| 656 |
+
if ipa_characters is not None:
|
| 657 |
+
graphemes += ipa_characters
|
| 658 |
+
super().__init__(graphemes, punctuations, pad, None, None, "<BLNK>", is_unique=False, is_sorted=True)
|
| 659 |
+
|
| 660 |
+
def _create_vocab(self):
|
| 661 |
+
self._vocab = [self._pad] + list(self._punctuations) + list(self._characters) + [self._blank]
|
| 662 |
+
self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
|
| 663 |
+
# pylint: disable=unnecessary-comprehension
|
| 664 |
+
self._id_to_char = {idx: char for idx, char in enumerate(self.vocab)}
|
| 665 |
+
|
| 666 |
+
@staticmethod
|
| 667 |
+
def init_from_config(config):
|
| 668 |
+
_pad = config.characters.pad
|
| 669 |
+
_punctuations = config.characters.punctuations
|
| 670 |
+
_letters = config.characters.characters
|
| 671 |
+
_letters_ipa = config.characters.phonemes
|
| 672 |
+
return (
|
| 673 |
+
VitsCharacters(graphemes=_letters, ipa_characters=_letters_ipa, punctuations=_punctuations, pad=_pad),
|
| 674 |
+
config,
|
| 675 |
+
)
|
| 676 |
+
|
| 677 |
+
def to_config(self) -> "CharactersConfig":
|
| 678 |
+
return CharactersConfig(
|
| 679 |
+
characters=self._characters,
|
| 680 |
+
punctuations=self._punctuations,
|
| 681 |
+
pad=self._pad,
|
| 682 |
+
eos=None,
|
| 683 |
+
bos=None,
|
| 684 |
+
blank=self._blank,
|
| 685 |
+
is_unique=False,
|
| 686 |
+
is_sorted=True,
|
| 687 |
+
)
|
| 688 |
+
|
| 689 |
+
class TTSTokenizer:
|
| 690 |
+
def __init__(
|
| 691 |
+
self,
|
| 692 |
+
text_cleaner: Callable = None,
|
| 693 |
+
characters: "BaseCharacters" = None,
|
| 694 |
+
):
|
| 695 |
+
self.text_cleaner = text_cleaner
|
| 696 |
+
self.characters = characters
|
| 697 |
+
self.not_found_characters = []
|
| 698 |
+
|
| 699 |
+
@property
|
| 700 |
+
def characters(self):
|
| 701 |
+
return self._characters
|
| 702 |
+
|
| 703 |
+
@characters.setter
|
| 704 |
+
def characters(self, new_characters):
|
| 705 |
+
self._characters = new_characters
|
| 706 |
+
self.pad_id = self.characters.char_to_id(self.characters.pad) if self.characters.pad else None
|
| 707 |
+
self.blank_id = self.characters.char_to_id(self.characters.blank) if self.characters.blank else None
|
| 708 |
+
|
| 709 |
+
def encode(self, text: str) -> List[int]:
|
| 710 |
+
"""Encodes a string of text as a sequence of IDs."""
|
| 711 |
+
token_ids = []
|
| 712 |
+
for char in text:
|
| 713 |
+
try:
|
| 714 |
+
idx = self.characters.char_to_id(char)
|
| 715 |
+
token_ids.append(idx)
|
| 716 |
+
except KeyError:
|
| 717 |
+
# discard but store not found characters
|
| 718 |
+
if char not in self.not_found_characters:
|
| 719 |
+
self.not_found_characters.append(char)
|
| 720 |
+
print(text)
|
| 721 |
+
print(f" [!] Character {repr(char)} not found in the vocabulary. Discarding it.")
|
| 722 |
+
return token_ids
|
| 723 |
+
|
| 724 |
+
def text_to_ids(self, text: str, language: str = None) -> List[int]: # pylint: disable=unused-argument
|
| 725 |
+
text = self.text_cleaner(text)
|
| 726 |
+
text = self.encode(text)
|
| 727 |
+
text = self.intersperse_blank_char(text, True)
|
| 728 |
+
return text
|
| 729 |
+
|
| 730 |
+
def pad_with_bos_eos(self, char_sequence: List[str]):
|
| 731 |
+
"""Pads a sequence with the special BOS and EOS characters."""
|
| 732 |
+
return [self.characters.bos_id] + list(char_sequence) + [self.characters.eos_id]
|
| 733 |
+
|
| 734 |
+
def intersperse_blank_char(self, char_sequence: List[str], use_blank_char: bool = False):
|
| 735 |
+
"""Intersperses the blank character between characters in a sequence.
|
| 736 |
+
|
| 737 |
+
Use the ```blank``` character if defined else use the ```pad``` character.
|
| 738 |
+
"""
|
| 739 |
+
char_to_use = self.characters.blank_id if use_blank_char else self.characters.pad
|
| 740 |
+
result = [char_to_use] * (len(char_sequence) * 2 + 1)
|
| 741 |
+
result[1::2] = char_sequence
|
| 742 |
+
return result
|
| 743 |
+
|
| 744 |
+
@staticmethod
|
| 745 |
+
def init_from_config(config: "Coqpit", characters: "BaseCharacters" = None):
|
| 746 |
+
text_cleaner = multilingual_cleaners
|
| 747 |
+
CharactersClass = VitsCharacters
|
| 748 |
+
characters, new_config = CharactersClass.init_from_config(config)
|
| 749 |
+
# new_config.characters.characters_class = get_import_path(characters)
|
| 750 |
+
new_config.characters.characters_class = VitsCharacters
|
| 751 |
+
return (
|
| 752 |
+
TTSTokenizer(text_cleaner, characters),new_config)
|
| 753 |
+
|
| 754 |
+
|
| 755 |
+
def multilingual_cleaners(text):
|
| 756 |
+
"""Pipeline for multilingual text"""
|
| 757 |
+
text = lowercase(text)
|
| 758 |
+
text = replace_symbols(text, lang=None)
|
| 759 |
+
text = remove_aux_symbols(text)
|
| 760 |
+
text = collapse_whitespace(text)
|
| 761 |
+
return text
|
| 762 |
+
|
| 763 |
+
def lowercase(text):
|
| 764 |
+
return text.lower()
|
| 765 |
+
|
| 766 |
+
def collapse_whitespace(text):
|
| 767 |
+
return re.sub(_whitespace_re, " ", text).strip()
|
| 768 |
+
|
| 769 |
+
def replace_symbols(text, lang="en"):
|
| 770 |
+
|
| 771 |
+
text = text.replace(";", ",")
|
| 772 |
+
text = text.replace("-", " ") if lang != "ca" else text.replace("-", "")
|
| 773 |
+
text = text.replace(":", ",")
|
| 774 |
+
if lang == "en":
|
| 775 |
+
text = text.replace("&", " and ")
|
| 776 |
+
elif lang == "fr":
|
| 777 |
+
text = text.replace("&", " et ")
|
| 778 |
+
elif lang == "pt":
|
| 779 |
+
text = text.replace("&", " e ")
|
| 780 |
+
elif lang == "ca":
|
| 781 |
+
text = text.replace("&", " i ")
|
| 782 |
+
text = text.replace("'", "")
|
| 783 |
+
return text
|
| 784 |
+
|
| 785 |
+
def remove_aux_symbols(text):
|
| 786 |
+
text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)
|
| 787 |
+
return text
|
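A quick standalone check of the cleaning pipeline above. The helper steps are inlined (for the `lang=None` path used by `multilingual_cleaners`, which skips the `&`-substitution branches) so the snippet runs on its own:

```python
import re

_whitespace_re = re.compile(r"\s+")

def multilingual_cleaners(text):
    # lowercase -> symbol replacement -> strip aux symbols -> collapse whitespace
    text = text.lower()
    text = text.replace(";", ",").replace("-", " ").replace(":", ",")
    text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)
    return re.sub(_whitespace_re, " ", text).strip()

print(multilingual_cleaners("Hello,   <World> - Test;"))
# -> "hello, world test,"
```

Note that the output still contains characters outside `chars.txt` only if they survive all four passes; anything else is dropped later by `TTSTokenizer.encode`.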
models/bn_male/jit_infer.py
ADDED
@@ -0,0 +1,32 @@
import os
from extra import TTSTokenizer, VitsConfig, CharactersConfig, VitsCharacters
import torch
import numpy as np
import soundfile as sf

# Bengali male voice
with open("chars.txt", "r") as f:
    letters = f.read().strip("\n")
model = "bn_male_vits_30hrs.pt"
text = " হলেও আমাদের সবার সার্বিক শৃঙ্খলা বোধের উন্নতি হবে"

config = VitsConfig(
    text_cleaner="multilingual_cleaners",
    characters=CharactersConfig(
        characters_class=VitsCharacters,
        pad="<PAD>",
        eos="<EOS>",
        bos="<BOS>",
        blank="<BLNK>",
        characters=letters,
        punctuations="!¡'(),-.:;¿? ",
        phonemes=None,
    ),
)
tokenizer, config = TTSTokenizer.init_from_config(config)

x = tokenizer.text_to_ids(text)
x = torch.from_numpy(np.array(x)).unsqueeze(0)
net = torch.jit.load(model)
with torch.no_grad():
    out2 = net(x)
sf.write("jit.wav", out2.squeeze().cpu().numpy(), 22050)
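Before the IDs reach the jitted network, `text_to_ids` interleaves the `<BLNK>` token ID between every character ID (the `intersperse_blank_char` step in `extra.py`). A minimal sketch of that interleaving, with illustrative names:

```python
def intersperse_blank(ids, blank_id):
    # [a, b, c] -> [blank, a, blank, b, blank, c, blank]
    result = [blank_id] * (len(ids) * 2 + 1)
    result[1::2] = ids
    return result

print(intersperse_blank([5, 9, 2], 0))
# -> [0, 5, 0, 9, 0, 2, 0]
```

So a text of N characters becomes a sequence of 2N + 1 token IDs, which is what the model input shape reflects.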
models/en_female/.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
models/en_female/README.md
ADDED
@@ -0,0 +1,3 @@
---
license: cc-by-4.0
---
models/en_female/chars.txt
ADDED
@@ -0,0 +1 @@
pqw'"sgufmxre?d!lcab,zk.iytoh jvn
models/en_female/en_female_vits_30hrs.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9dfa80f08da6ca7222a16cb6d919251fb733d3f03042848a20201fa6ae0d0b9c
size 333229574
models/en_female/extra.py
ADDED
@@ -0,0 +1,787 @@
| 1 |
+
from typing import Callable, Dict, List, Union
|
| 2 |
+
from dataclasses import asdict, dataclass, field
|
| 3 |
+
|
| 4 |
+
|
| 5 |
+
import re
|
| 6 |
+
from dataclasses import replace
|
| 7 |
+
from typing import Dict
|
| 8 |
+
_whitespace_re = re.compile(r"\s+")
|
| 9 |
+
|
| 10 |
+
from dataclasses import dataclass, field
|
| 11 |
+
from typing import List
|
| 12 |
+
|
| 13 |
+
# from TTS.tts.configs.shared_configs import BaseTTSConfig
|
| 14 |
+
# from TTS.tts.models.vits import VitsArgs, VitsAudioConfig
|
| 15 |
+
|
| 16 |
+
@dataclass
|
| 17 |
+
class CharactersConfig():
|
| 18 |
+
|
| 19 |
+
characters_class: str = None
|
| 20 |
+
|
| 21 |
+
# using BaseVocabulary
|
| 22 |
+
vocab_dict: Dict = None
|
| 23 |
+
|
| 24 |
+
# using on BaseCharacters
|
| 25 |
+
pad: str = None
|
| 26 |
+
eos: str = None
|
| 27 |
+
bos: str = None
|
| 28 |
+
blank: str = None
|
| 29 |
+
characters: str = None
|
| 30 |
+
punctuations: str = None
|
| 31 |
+
phonemes: str = None
|
| 32 |
+
is_unique: bool = True # for backwards compatibility of models trained with char sets with duplicates
|
| 33 |
+
is_sorted: bool = True
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
@dataclass
|
| 37 |
+
class BaseTTSConfig():
|
| 38 |
+
|
| 39 |
+
# audio: BaseAudioConfig = field(default_factory=BaseAudioConfig)
|
| 40 |
+
# phoneme settings
|
| 41 |
+
use_phonemes: bool = False
|
| 42 |
+
phonemizer: str = None
|
| 43 |
+
phoneme_language: str = None
|
| 44 |
+
compute_input_seq_cache: bool = False
|
| 45 |
+
text_cleaner: str = None
|
| 46 |
+
enable_eos_bos_chars: bool = False
|
| 47 |
+
test_sentences_file: str = ""
|
| 48 |
+
phoneme_cache_path: str = None
|
| 49 |
+
# vocabulary parameters
|
| 50 |
+
characters: CharactersConfig = None
|
| 51 |
+
add_blank: bool = False
|
| 52 |
+
# training params
|
| 53 |
+
batch_group_size: int = 0
|
| 54 |
+
loss_masking: bool = None
|
| 55 |
+
# dataloading
|
| 56 |
+
min_audio_len: int = 1
|
| 57 |
+
max_audio_len: int = float("inf")
|
| 58 |
+
min_text_len: int = 1
|
| 59 |
+
max_text_len: int = float("inf")
|
| 60 |
+
compute_f0: bool = False
|
| 61 |
+
compute_energy: bool = False
|
| 62 |
+
compute_linear_spec: bool = False
|
| 63 |
+
precompute_num_workers: int = 0
|
| 64 |
+
use_noise_augment: bool = False
|
| 65 |
+
start_by_longest: bool = False
|
| 66 |
+
shuffle: bool = False
|
| 67 |
+
drop_last: bool = False
|
| 68 |
+
# dataset
|
| 69 |
+
datasets: str = None
|
| 70 |
+
# optimizer
|
| 71 |
+
optimizer: str = "radam"
|
| 72 |
+
optimizer_params: dict = None
|
| 73 |
+
# scheduler
|
| 74 |
+
lr_scheduler: str = None
|
| 75 |
+
lr_scheduler_params: dict = field(default_factory=lambda: {})
|
| 76 |
+
# testing
|
| 77 |
+
test_sentences: List[str] = field(default_factory=lambda: [])
|
| 78 |
+
# evaluation
|
| 79 |
+
eval_split_max_size: int = None
|
| 80 |
+
eval_split_size: float = 0.01
|
| 81 |
+
# weighted samplers
|
| 82 |
+
use_speaker_weighted_sampler: bool = False
|
| 83 |
+
speaker_weighted_sampler_alpha: float = 1.0
|
| 84 |
+
use_language_weighted_sampler: bool = False
|
| 85 |
+
language_weighted_sampler_alpha: float = 1.0
|
| 86 |
+
use_length_weighted_sampler: bool = False
|
| 87 |
+
length_weighted_sampler_alpha: float = 1.0
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
@dataclass
|
| 91 |
+
class VitsAudioConfig():
|
| 92 |
+
fft_size: int = 1024
|
| 93 |
+
sample_rate: int = 22050
|
| 94 |
+
win_length: int = 1024
|
| 95 |
+
hop_length: int = 256
|
| 96 |
+
num_mels: int = 80
|
| 97 |
+
mel_fmin: int = 0
|
| 98 |
+
mel_fmax: int = None
|
| 99 |
+
|
| 100 |
+
@dataclass
|
| 101 |
+
class VitsArgs():
|
| 102 |
+
num_chars: int = 100
|
| 103 |
+
out_channels: int = 513
|
| 104 |
+
spec_segment_size: int = 32
|
| 105 |
+
hidden_channels: int = 192
|
| 106 |
+
hidden_channels_ffn_text_encoder: int = 768
|
| 107 |
+
num_heads_text_encoder: int = 2
|
| 108 |
+
num_layers_text_encoder: int = 6
|
| 109 |
+
kernel_size_text_encoder: int = 3
|
| 110 |
+
dropout_p_text_encoder: float = 0.1
|
| 111 |
+
dropout_p_duration_predictor: float = 0.5
|
| 112 |
+
kernel_size_posterior_encoder: int = 5
|
| 113 |
+
dilation_rate_posterior_encoder: int = 1
|
| 114 |
+
num_layers_posterior_encoder: int = 16
|
| 115 |
+
kernel_size_flow: int = 5
|
| 116 |
+
dilation_rate_flow: int = 1
|
| 117 |
+
num_layers_flow: int = 4
|
| 118 |
+
resblock_type_decoder: str = "1"
|
| 119 |
+
resblock_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [3, 7, 11])
|
| 120 |
+
resblock_dilation_sizes_decoder: List[List[int]] = field(default_factory=lambda: [[1, 3, 5], [1, 3, 5], [1, 3, 5]])
|
| 121 |
+
upsample_rates_decoder: List[int] = field(default_factory=lambda: [8, 8, 2, 2])
|
| 122 |
+
upsample_initial_channel_decoder: int = 512
|
| 123 |
+
upsample_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [16, 16, 4, 4])
|
| 124 |
+
periods_multi_period_discriminator: List[int] = field(default_factory=lambda: [2, 3, 5, 7, 11])
|
| 125 |
+
use_sdp: bool = True
|
| 126 |
+
noise_scale: float = 1.0
|
| 127 |
+
inference_noise_scale: float = 0.667
|
| 128 |
+
length_scale: float = 1
|
| 129 |
+
noise_scale_dp: float = 1.0
|
| 130 |
+
inference_noise_scale_dp: float = 1.0
|
| 131 |
+
max_inference_len: int = None
|
| 132 |
+
init_discriminator: bool = True
|
| 133 |
+
use_spectral_norm_disriminator: bool = False
|
| 134 |
+
use_speaker_embedding: bool = False
|
| 135 |
+
num_speakers: int = 0
|
| 136 |
+
speakers_file: str = None
|
| 137 |
+
d_vector_file: List[str] = None
|
| 138 |
+
speaker_embedding_channels: int = 256
|
| 139 |
+
use_d_vector_file: bool = False
|
| 140 |
+
d_vector_dim: int = 0
|
| 141 |
+
detach_dp_input: bool = True
|
| 142 |
+
use_language_embedding: bool = False
|
| 143 |
+
embedded_language_dim: int = 4
|
| 144 |
+
num_languages: int = 0
|
| 145 |
+
language_ids_file: str = None
|
| 146 |
+
use_speaker_encoder_as_loss: bool = False
|
| 147 |
+
speaker_encoder_config_path: str = ""
|
| 148 |
+
speaker_encoder_model_path: str = ""
|
| 149 |
+
condition_dp_on_speaker: bool = True
|
| 150 |
+
freeze_encoder: bool = False
|
| 151 |
+
freeze_DP: bool = False
|
| 152 |
+
freeze_PE: bool = False
|
| 153 |
+
freeze_flow_decoder: bool = False
|
| 154 |
+
freeze_waveform_decoder: bool = False
|
| 155 |
+
encoder_sample_rate: int = None
|
| 156 |
+
interpolate_z: bool = True
|
| 157 |
+
reinit_DP: bool = False
|
| 158 |
+
reinit_text_encoder: bool = False
|
| 159 |
+
@dataclass
class VitsConfig(BaseTTSConfig):
    model: str = "vits"
    # model specific params
    model_args: VitsArgs = field(default_factory=VitsArgs)
    audio: VitsAudioConfig = field(default_factory=VitsAudioConfig)

    # optimizer
    grad_clip: List[float] = field(default_factory=lambda: [1000, 1000])
    lr_gen: float = 0.0002
    lr_disc: float = 0.0002
    lr_scheduler_gen: str = "ExponentialLR"
    lr_scheduler_gen_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    lr_scheduler_disc: str = "ExponentialLR"
    lr_scheduler_disc_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    scheduler_after_epoch: bool = True
    optimizer: str = "AdamW"
    optimizer_params: dict = field(default_factory=lambda: {"betas": [0.8, 0.99], "eps": 1e-9, "weight_decay": 0.01})

    # loss params
    kl_loss_alpha: float = 1.0
    disc_loss_alpha: float = 1.0
    gen_loss_alpha: float = 1.0
    feat_loss_alpha: float = 1.0
    mel_loss_alpha: float = 45.0
    dur_loss_alpha: float = 1.0
    speaker_encoder_loss_alpha: float = 1.0

    # data loader params
    return_wav: bool = True
    compute_linear_spec: bool = True

    # sampler params
    use_weighted_sampler: bool = False  # TODO: move it to the base config
    weighted_sampler_attrs: dict = field(default_factory=lambda: {})
    weighted_sampler_multipliers: dict = field(default_factory=lambda: {})

    # overrides
    r: int = 1  # DO NOT CHANGE
    add_blank: bool = True

    # testing
    test_sentences: List[List] = field(
        default_factory=lambda: [
            ["It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."],
            ["Be a voice, not an echo."],
            ["I'm sorry Dave. I'm afraid I can't do that."],
            ["This cake is great. It's so delicious and moist."],
            ["Prior to November 22, 1963."],
        ]
    )

    # multi-speaker settings
    # use speaker embedding layer
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    speakers_file: str = None
    speaker_embedding_channels: int = 256
    language_ids_file: str = None
    use_language_embedding: bool = False

    # use d-vectors
    use_d_vector_file: bool = False
    d_vector_file: List[str] = None
    d_vector_dim: int = None

    def __post_init__(self):
        pass
        # for key, val in self.model_args.items():
        #     if hasattr(self, key):
        #         self[key] = val


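Every mutable default in the config above goes through `field(default_factory=...)` rather than a literal list or dict. A minimal standalone sketch (the `DecoderArgs` name is hypothetical, not part of the repo) of why dataclass configs are written this way:

```python
from dataclasses import dataclass, field
from typing import List

# Mutable defaults on dataclass fields must go through default_factory;
# a bare `upsample_rates: List[int] = [8, 8, 2, 2]` raises ValueError
# at class-creation time because the list would be shared by instances.
@dataclass
class DecoderArgs:
    upsample_rates: List[int] = field(default_factory=lambda: [8, 8, 2, 2])

a = DecoderArgs()
b = DecoderArgs()
a.upsample_rates.append(99)
# Each instance gets its own fresh list, so `b` is unaffected.
print(a.upsample_rates)  # [8, 8, 2, 2, 99]
print(b.upsample_rates)  # [8, 8, 2, 2]
```

This is also why `grad_clip`, `optimizer_params`, and `test_sentences` above all wrap their defaults in lambdas.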
def parse_symbols():
    return {
        "pad": _pad,
        "eos": _eos,
        "bos": _bos,
        "characters": _characters,
        "punctuations": _punctuations,
        "phonemes": _phonemes,
    }


# DEFAULT SET OF GRAPHEMES
_pad = "<PAD>"
_eos = "<EOS>"
_bos = "<BOS>"
_blank = "<BLNK>"  # TODO: check if we need this alongside with PAD
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_punctuations = "!'(),-.:;? "


# DEFAULT SET OF IPA PHONEMES
# Phonemes definition (All IPA characters)
_vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
_non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
_pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
_suprasegmentals = "ˈˌːˑ"
_other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
_diacrilics = "ɚ˞ɫ"
_phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics


class BaseVocabulary:
    """Base Vocabulary class.

    This class only needs a vocabulary dictionary without specifying the characters.

    Args:
        vocab (Dict): A dictionary of characters and their corresponding indices.
    """

    def __init__(self, vocab: Dict, pad: str = None, blank: str = None, bos: str = None, eos: str = None):
        self.vocab = vocab
        self.pad = pad
        self.blank = blank
        self.bos = bos
        self.eos = eos

    @property
    def pad_id(self) -> int:
        """Return the index of the padding character. If the padding character is not specified, return the length
        of the vocabulary."""
        return self.char_to_id(self.pad) if self.pad else len(self.vocab)

    @property
    def blank_id(self) -> int:
        """Return the index of the blank character. If the blank character is not specified, return the length of
        the vocabulary."""
        return self.char_to_id(self.blank) if self.blank else len(self.vocab)

    @property
    def bos_id(self) -> int:
        """Return the index of the bos character. If the bos character is not specified, return the length of the
        vocabulary."""
        return self.char_to_id(self.bos) if self.bos else len(self.vocab)

    @property
    def eos_id(self) -> int:
        """Return the index of the eos character. If the eos character is not specified, return the length of the
        vocabulary."""
        return self.char_to_id(self.eos) if self.eos else len(self.vocab)

    @property
    def vocab(self):
        """Return the vocabulary dictionary."""
        return self._vocab

    @vocab.setter
    def vocab(self, vocab):
        """Set the vocabulary dictionary and character mapping dictionaries."""
        self._vocab, self._char_to_id, self._id_to_char = None, None, None
        if vocab is not None:
            self._vocab = vocab
            self._char_to_id = {char: idx for idx, char in enumerate(self._vocab)}
            self._id_to_char = {
                idx: char for idx, char in enumerate(self._vocab)  # pylint: disable=unnecessary-comprehension
            }

    @staticmethod
    def init_from_config(config, **kwargs):
        """Initialize from the given config."""
        if config.characters is not None and "vocab_dict" in config.characters and config.characters.vocab_dict:
            return (
                BaseVocabulary(
                    config.characters.vocab_dict,
                    config.characters.pad,
                    config.characters.blank,
                    config.characters.bos,
                    config.characters.eos,
                ),
                config,
            )
        return BaseVocabulary(**kwargs), config

    def to_config(self):
        return CharactersConfig(
            vocab_dict=self._vocab,
            pad=self.pad,
            eos=self.eos,
            bos=self.bos,
            blank=self.blank,
            is_unique=False,
            is_sorted=False,
        )

    @property
    def num_chars(self):
        """Return number of tokens in the vocabulary."""
        return len(self._vocab)

    def char_to_id(self, char: str) -> int:
        """Map a character to a token ID."""
        try:
            return self._char_to_id[char]
        except KeyError as e:
            raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e

    def id_to_char(self, idx: int) -> str:
        """Map a token ID to a character."""
        return self._id_to_char[idx]


class BaseCharacters:
    def __init__(
        self,
        characters: str = None,
        punctuations: str = None,
        pad: str = None,
        eos: str = None,
        bos: str = None,
        blank: str = None,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        self._characters = characters
        self._punctuations = punctuations
        self._pad = pad
        self._eos = eos
        self._bos = bos
        self._blank = blank
        self.is_unique = is_unique
        self.is_sorted = is_sorted
        self._create_vocab()

    @property
    def pad_id(self) -> int:
        return self.char_to_id(self.pad) if self.pad else len(self.vocab)

    @property
    def blank_id(self) -> int:
        return self.char_to_id(self.blank) if self.blank else len(self.vocab)

    @property
    def eos_id(self) -> int:
        return self.char_to_id(self.eos) if self.eos else len(self.vocab)

    @property
    def bos_id(self) -> int:
        return self.char_to_id(self.bos) if self.bos else len(self.vocab)

    @property
    def characters(self):
        return self._characters

    @characters.setter
    def characters(self, characters):
        self._characters = characters
        self._create_vocab()

    @property
    def punctuations(self):
        return self._punctuations

    @punctuations.setter
    def punctuations(self, punctuations):
        self._punctuations = punctuations
        self._create_vocab()

    @property
    def pad(self):
        return self._pad

    @pad.setter
    def pad(self, pad):
        self._pad = pad
        self._create_vocab()

    @property
    def eos(self):
        return self._eos

    @eos.setter
    def eos(self, eos):
        self._eos = eos
        self._create_vocab()

    @property
    def bos(self):
        return self._bos

    @bos.setter
    def bos(self, bos):
        self._bos = bos
        self._create_vocab()

    @property
    def blank(self):
        return self._blank

    @blank.setter
    def blank(self, blank):
        self._blank = blank
        self._create_vocab()

    @property
    def vocab(self):
        return self._vocab

    @vocab.setter
    def vocab(self, vocab):
        self._vocab = vocab
        self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
        self._id_to_char = {
            idx: char for idx, char in enumerate(self.vocab)  # pylint: disable=unnecessary-comprehension
        }

    @property
    def num_chars(self):
        return len(self._vocab)

    def _create_vocab(self):
        _vocab = self._characters
        if self.is_unique:
            _vocab = list(set(_vocab))
        if self.is_sorted:
            _vocab = sorted(_vocab)
        _vocab = list(_vocab)
        _vocab = [self._blank] + _vocab if self._blank is not None and len(self._blank) > 0 else _vocab
        _vocab = [self._bos] + _vocab if self._bos is not None and len(self._bos) > 0 else _vocab
        _vocab = [self._eos] + _vocab if self._eos is not None and len(self._eos) > 0 else _vocab
        _vocab = [self._pad] + _vocab if self._pad is not None and len(self._pad) > 0 else _vocab
        self.vocab = _vocab + list(self._punctuations)
        if self.is_unique:
            duplicates = {x for x in self.vocab if self.vocab.count(x) > 1}
            assert (
                len(self.vocab) == len(self._char_to_id) == len(self._id_to_char)
            ), f" [!] There are duplicate characters in the character set. {duplicates}"

    def char_to_id(self, char: str) -> int:
        try:
            return self._char_to_id[char]
        except KeyError as e:
            raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e

    def id_to_char(self, idx: int) -> str:
        return self._id_to_char[idx]

    def print_log(self, level: int = 0):
        """
        Prints the vocabulary in a nice format.
        """
        indent = "\t" * level
        print(f"{indent}| > Characters: {self._characters}")
        print(f"{indent}| > Punctuations: {self._punctuations}")
        print(f"{indent}| > Pad: {self._pad}")
        print(f"{indent}| > EOS: {self._eos}")
        print(f"{indent}| > BOS: {self._bos}")
        print(f"{indent}| > Blank: {self._blank}")
        print(f"{indent}| > Vocab: {self.vocab}")
        print(f"{indent}| > Num chars: {self.num_chars}")

    @staticmethod
    def init_from_config(config: "Coqpit"):  # pylint: disable=unused-argument
        """Init your character class from a config.

        Implement this method for your subclass.
        """
        # use character set from config
        if config.characters is not None:
            return BaseCharacters(**config.characters), config
        # return default character set
        characters = BaseCharacters()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config

    def to_config(self) -> "CharactersConfig":
        return CharactersConfig(
            characters=self._characters,
            punctuations=self._punctuations,
            pad=self._pad,
            eos=self._eos,
            bos=self._bos,
            blank=self._blank,
            is_unique=self.is_unique,
            is_sorted=self.is_sorted,
        )


class IPAPhonemes(BaseCharacters):
    def __init__(
        self,
        characters: str = _phonemes,
        punctuations: str = _punctuations,
        pad: str = _pad,
        eos: str = _eos,
        bos: str = _bos,
        blank: str = _blank,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)

    @staticmethod
    def init_from_config(config: "Coqpit"):
        """Init an IPAPhonemes object from a model config.

        If characters are not defined in the config, it will be set to the default characters and the config
        will be updated.
        """
        # band-aid for compatibility with old models
        if "characters" in config and config.characters is not None:
            if "phonemes" in config.characters and config.characters.phonemes is not None:
                config.characters["characters"] = config.characters["phonemes"]
            return (
                IPAPhonemes(
                    characters=config.characters["characters"],
                    punctuations=config.characters["punctuations"],
                    pad=config.characters["pad"],
                    eos=config.characters["eos"],
                    bos=config.characters["bos"],
                    blank=config.characters["blank"],
                    is_unique=config.characters["is_unique"],
                    is_sorted=config.characters["is_sorted"],
                ),
                config,
            )
        # use character set from config
        if config.characters is not None:
            return IPAPhonemes(**config.characters), config
        # return default character set
        characters = IPAPhonemes()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config


class Graphemes(BaseCharacters):
    def __init__(
        self,
        characters: str = _characters,
        punctuations: str = _punctuations,
        pad: str = _pad,
        eos: str = _eos,
        bos: str = _bos,
        blank: str = _blank,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)

    @staticmethod
    def init_from_config(config: "Coqpit"):
        """Init a Graphemes object from a model config.

        If characters are not defined in the config, it will be set to the default characters and the config
        will be updated.
        """
        if config.characters is not None:
            # band-aid for compatibility with old models
            if "phonemes" in config.characters:
                return (
                    Graphemes(
                        characters=config.characters["characters"],
                        punctuations=config.characters["punctuations"],
                        pad=config.characters["pad"],
                        eos=config.characters["eos"],
                        bos=config.characters["bos"],
                        blank=config.characters["blank"],
                        is_unique=config.characters["is_unique"],
                        is_sorted=config.characters["is_sorted"],
                    ),
                    config,
                )
            return Graphemes(**config.characters), config
        characters = Graphemes()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config


if __name__ == "__main__":
    gr = Graphemes()
    ph = IPAPhonemes()
    gr.print_log()
    ph.print_log()


class VitsCharacters(BaseCharacters):
    """Characters class for the VITS model, for compatibility with pre-trained models."""

    def __init__(
        self,
        graphemes: str = _characters,
        punctuations: str = _punctuations,
        pad: str = _pad,
        ipa_characters: str = _phonemes,
    ) -> None:
        if ipa_characters is not None:
            graphemes += ipa_characters
        super().__init__(graphemes, punctuations, pad, None, None, "<BLNK>", is_unique=False, is_sorted=True)

    def _create_vocab(self):
        self._vocab = [self._pad] + list(self._punctuations) + list(self._characters) + [self._blank]
        self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
        # pylint: disable=unnecessary-comprehension
        self._id_to_char = {idx: char for idx, char in enumerate(self.vocab)}

    @staticmethod
    def init_from_config(config):
        _pad = config.characters.pad
        _punctuations = config.characters.punctuations
        _letters = config.characters.characters
        _letters_ipa = config.characters.phonemes
        return (
            VitsCharacters(graphemes=_letters, ipa_characters=_letters_ipa, punctuations=_punctuations, pad=_pad),
            config,
        )

    def to_config(self) -> "CharactersConfig":
        return CharactersConfig(
            characters=self._characters,
            punctuations=self._punctuations,
            pad=self._pad,
            eos=None,
            bos=None,
            blank=self._blank,
            is_unique=False,
            is_sorted=True,
        )

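The `_create_vocab` override above fixes the token layout: pad first, then punctuation, then the character set, with the blank token at the very end. A standalone sketch of that layout (this is an illustration, not the library class):

```python
# VitsCharacters-style vocabulary layout: index 0 is the pad token,
# punctuation comes next, then the character set, and the blank token
# sits at the very end of the vocabulary.
_pad, _blank = "<PAD>", "<BLNK>"
punctuations = "!'(),-.:;? "  # 11 punctuation marks, including space
characters = "abc"            # tiny character set for illustration

vocab = [_pad] + list(punctuations) + list(characters) + [_blank]
char_to_id = {c: i for i, c in enumerate(vocab)}

print(char_to_id["<PAD>"])   # 0 -- pad is always index 0
print(char_to_id["a"])       # 12 -- first character after the 11 punctuation marks
print(char_to_id["<BLNK>"])  # 15 -- blank is always the last index
```

Because the ID of every symbol depends on this exact ordering, changing the character set or punctuation string invalidates checkpoints trained against the old mapping.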
class TTSTokenizer:
    def __init__(
        self,
        text_cleaner: Callable = None,
        characters: "BaseCharacters" = None,
    ):
        self.text_cleaner = text_cleaner
        self.characters = characters
        self.not_found_characters = []

    @property
    def characters(self):
        return self._characters

    @characters.setter
    def characters(self, new_characters):
        self._characters = new_characters
        self.pad_id = self.characters.char_to_id(self.characters.pad) if self.characters.pad else None
        self.blank_id = self.characters.char_to_id(self.characters.blank) if self.characters.blank else None

    def encode(self, text: str) -> List[int]:
        """Encodes a string of text as a sequence of IDs."""
        token_ids = []
        for char in text:
            try:
                idx = self.characters.char_to_id(char)
                token_ids.append(idx)
            except KeyError:
                # discard but store not-found characters
                if char not in self.not_found_characters:
                    self.not_found_characters.append(char)
                    print(text)
                    print(f" [!] Character {repr(char)} not found in the vocabulary. Discarding it.")
        return token_ids

    def text_to_ids(self, text: str, language: str = None) -> List[int]:  # pylint: disable=unused-argument
        text = self.text_cleaner(text)
        text = self.encode(text)
        text = self.intersperse_blank_char(text, True)
        return text

    def pad_with_bos_eos(self, char_sequence: List[str]):
        """Pads a sequence with the special BOS and EOS characters."""
        return [self.characters.bos_id] + list(char_sequence) + [self.characters.eos_id]

    def intersperse_blank_char(self, char_sequence: List[str], use_blank_char: bool = False):
        """Intersperses the blank character between characters in a sequence.

        Use the ```blank``` character if defined else use the ```pad``` character.
        """
        char_to_use = self.characters.blank_id if use_blank_char else self.characters.pad
        result = [char_to_use] * (len(char_sequence) * 2 + 1)
        result[1::2] = char_sequence
        return result

    @staticmethod
    def init_from_config(config: "Coqpit", characters: "BaseCharacters" = None):
        text_cleaner = multilingual_cleaners
        CharactersClass = VitsCharacters
        characters, new_config = CharactersClass.init_from_config(config)
        # new_config.characters.characters_class = get_import_path(characters)
        new_config.characters.characters_class = VitsCharacters
        return (
            TTSTokenizer(text_cleaner, characters),
            new_config,
        )

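`text_to_ids` always runs `intersperse_blank_char` with the blank token, so every n-token sequence grows to 2n + 1 entries before it reaches the model. A minimal sketch of that step, isolated from the class:

```python
# Minimal re-creation of the blank interspersing in text_to_ids: the blank
# ID is placed before, between, and after the real token IDs, so a sequence
# of n tokens becomes 2n + 1 entries.
def intersperse(token_ids, blank_id):
    result = [blank_id] * (len(token_ids) * 2 + 1)
    result[1::2] = token_ids  # odd positions carry the real tokens
    return result

print(intersperse([7, 8, 9], 0))  # [0, 7, 0, 8, 0, 9, 0]
```

This matches the `add_blank: bool = True` override in `VitsConfig`: a checkpoint trained with interspersed blanks expects the same 2n + 1 layout at inference time.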
def multilingual_cleaners(text):
    """Pipeline for multilingual text."""
    text = lowercase(text)
    text = replace_symbols(text, lang=None)
    text = remove_aux_symbols(text)
    text = collapse_whitespace(text)
    return text


def lowercase(text):
    return text.lower()


def collapse_whitespace(text):
    return re.sub(_whitespace_re, " ", text).strip()


def replace_symbols(text, lang="en"):
    text = text.replace(";", ",")
    text = text.replace("-", " ") if lang != "ca" else text.replace("-", "")
    text = text.replace(":", ",")
    if lang == "en":
        text = text.replace("&", " and ")
    elif lang == "fr":
        text = text.replace("&", " et ")
    elif lang == "pt":
        text = text.replace("&", " e ")
    elif lang == "ca":
        text = text.replace("&", " i ")
    text = text.replace("'", "")
    return text


def remove_aux_symbols(text):
    text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)
    return text
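A self-contained sketch of the same cleaning pipeline (lowercase, symbol replacement with `lang=None`, aux-symbol removal, whitespace collapse), condensed into one function for illustration:

```python
import re

# Standalone re-creation of the cleaner pipeline above. With lang=None the
# "&" branches are skipped, "-" becomes a space, ";" and ":" become ",",
# apostrophes are dropped, brackets/quotes are stripped, and runs of
# whitespace collapse to a single space.
_whitespace_re = re.compile(r"\s+")

def clean(text):
    text = text.lower()
    text = text.replace(";", ",").replace("-", " ").replace(":", ",")
    text = text.replace("'", "")
    text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)
    return re.sub(_whitespace_re, " ", text).strip()

print(clean('Hello -  "World";  (it\'s a TEST)'))  # hello world, its a test
```

Note the order matters: apostrophes and brackets are removed before whitespace collapsing, so the gaps they leave behind are merged into single spaces.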
models/en_female/jit_infer.py
ADDED
@@ -0,0 +1,33 @@
import os
from extra import TTSTokenizer, VitsConfig, CharactersConfig, VitsCharacters
import torch
import numpy as np

# en female
with open("chars.txt", "r") as f:
    letters = f.read().strip("\n")
model = "en_female_vits_30hrs.pt"
# text = " হলেও আমাদের সবার সার্বিক শৃঙ্খলা বোধের উন্নতি হবে"
text = "My name is g p t, chat g p t"

config = VitsConfig(
    text_cleaner="multilingual_cleaners",
    characters=CharactersConfig(
        characters_class=VitsCharacters,
        pad="<PAD>",
        eos="<EOS>",
        bos="<BOS>",
        blank="<BLNK>",
        characters=letters,
        punctuations="!¡'(),-.:;¿? ",
        phonemes=None,
    ),
)
tokenizer, config = TTSTokenizer.init_from_config(config)

x = tokenizer.text_to_ids(text)
x = torch.from_numpy(np.array(x)).unsqueeze(0)
net = torch.jit.load(model)
with torch.no_grad():
    out2 = net(x)
import soundfile as sf

sf.write("jit.wav", out2.squeeze().cpu().numpy(), 22050)
models/en_male/.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
models/en_male/README.md
ADDED
@@ -0,0 +1,3 @@
---
license: cc-by-4.0
---
models/en_male/chars.txt
ADDED
@@ -0,0 +1 @@
pqw'"sgufmxre?d!lcab,zk.iytoh jvn
models/en_male/en_male_vits_30hrs.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ffa1099438a58c8a13e437d39ec304b530644156ef445032e64422d83e558666
size 333224012
models/en_male/extra.py
ADDED
|
@@ -0,0 +1,787 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
import re
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List

_whitespace_re = re.compile(r"\s+")

# from TTS.tts.configs.shared_configs import BaseTTSConfig
# from TTS.tts.models.vits import VitsArgs, VitsAudioConfig


@dataclass
class CharactersConfig:
    characters_class: str = None

    # used by BaseVocabulary
    vocab_dict: Dict = None

    # used by BaseCharacters
    pad: str = None
    eos: str = None
    bos: str = None
    blank: str = None
    characters: str = None
    punctuations: str = None
    phonemes: str = None
    is_unique: bool = True  # for backwards compatibility of models trained with char sets with duplicates
    is_sorted: bool = True


@dataclass
class BaseTTSConfig:
    # audio: BaseAudioConfig = field(default_factory=BaseAudioConfig)
    # phoneme settings
    use_phonemes: bool = False
    phonemizer: str = None
    phoneme_language: str = None
    compute_input_seq_cache: bool = False
    text_cleaner: str = None
    enable_eos_bos_chars: bool = False
    test_sentences_file: str = ""
    phoneme_cache_path: str = None
    # vocabulary parameters
    characters: CharactersConfig = None
    add_blank: bool = False
    # training params
    batch_group_size: int = 0
    loss_masking: bool = None
    # dataloading
    min_audio_len: int = 1
    max_audio_len: int = float("inf")
    min_text_len: int = 1
    max_text_len: int = float("inf")
    compute_f0: bool = False
    compute_energy: bool = False
    compute_linear_spec: bool = False
    precompute_num_workers: int = 0
    use_noise_augment: bool = False
    start_by_longest: bool = False
    shuffle: bool = False
    drop_last: bool = False
    # dataset
    datasets: str = None
    # optimizer
    optimizer: str = "radam"
    optimizer_params: dict = None
    # scheduler
    lr_scheduler: str = None
    lr_scheduler_params: dict = field(default_factory=lambda: {})
    # testing
    test_sentences: List[str] = field(default_factory=lambda: [])
    # evaluation
    eval_split_max_size: int = None
    eval_split_size: float = 0.01
    # weighted samplers
    use_speaker_weighted_sampler: bool = False
    speaker_weighted_sampler_alpha: float = 1.0
    use_language_weighted_sampler: bool = False
    language_weighted_sampler_alpha: float = 1.0
    use_length_weighted_sampler: bool = False
    length_weighted_sampler_alpha: float = 1.0


@dataclass
class VitsAudioConfig:
    fft_size: int = 1024
    sample_rate: int = 22050
    win_length: int = 1024
    hop_length: int = 256
    num_mels: int = 80
    mel_fmin: int = 0
    mel_fmax: int = None


@dataclass
class VitsArgs:
    num_chars: int = 100
    out_channels: int = 513
    spec_segment_size: int = 32
    hidden_channels: int = 192
    hidden_channels_ffn_text_encoder: int = 768
    num_heads_text_encoder: int = 2
    num_layers_text_encoder: int = 6
    kernel_size_text_encoder: int = 3
    dropout_p_text_encoder: float = 0.1
    dropout_p_duration_predictor: float = 0.5
    kernel_size_posterior_encoder: int = 5
    dilation_rate_posterior_encoder: int = 1
    num_layers_posterior_encoder: int = 16
    kernel_size_flow: int = 5
    dilation_rate_flow: int = 1
    num_layers_flow: int = 4
    resblock_type_decoder: str = "1"
    resblock_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [3, 7, 11])
    resblock_dilation_sizes_decoder: List[List[int]] = field(default_factory=lambda: [[1, 3, 5], [1, 3, 5], [1, 3, 5]])
    upsample_rates_decoder: List[int] = field(default_factory=lambda: [8, 8, 2, 2])
    upsample_initial_channel_decoder: int = 512
    upsample_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [16, 16, 4, 4])
    periods_multi_period_discriminator: List[int] = field(default_factory=lambda: [2, 3, 5, 7, 11])
    use_sdp: bool = True
    noise_scale: float = 1.0
    inference_noise_scale: float = 0.667
    length_scale: float = 1
    noise_scale_dp: float = 1.0
    inference_noise_scale_dp: float = 1.0
    max_inference_len: int = None
    init_discriminator: bool = True
    use_spectral_norm_disriminator: bool = False
    use_speaker_embedding: bool = False
    num_speakers: int = 0
    speakers_file: str = None
    d_vector_file: List[str] = None
    speaker_embedding_channels: int = 256
    use_d_vector_file: bool = False
    d_vector_dim: int = 0
    detach_dp_input: bool = True
    use_language_embedding: bool = False
    embedded_language_dim: int = 4
    num_languages: int = 0
    language_ids_file: str = None
    use_speaker_encoder_as_loss: bool = False
    speaker_encoder_config_path: str = ""
    speaker_encoder_model_path: str = ""
    condition_dp_on_speaker: bool = True
    freeze_encoder: bool = False
    freeze_DP: bool = False
    freeze_PE: bool = False
    freeze_flow_decoder: bool = False
    freeze_waveform_decoder: bool = False
    encoder_sample_rate: int = None
    interpolate_z: bool = True
    reinit_DP: bool = False
    reinit_text_encoder: bool = False

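The decoder settings above interlock with the audio config: the product of `upsample_rates_decoder` must equal `hop_length`, because the waveform decoder upsamples exactly one latent frame per spectrogram hop. A minimal sanity check (illustrative, not part of the original file):

```python
# Sanity check: the HiFi-GAN-style decoder upsamples latent frames back to
# audio samples, so the product of upsample_rates_decoder has to equal
# hop_length from VitsAudioConfig.
from math import prod

hop_length = 256
sample_rate = 22050
upsample_rates_decoder = [8, 8, 2, 2]

assert prod(upsample_rates_decoder) == hop_length

# One latent frame covers hop_length samples, i.e. ~86 frames per second:
frames_per_second = sample_rate / hop_length
print(round(frames_per_second, 2))  # 86.13
```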
@dataclass
class VitsConfig(BaseTTSConfig):
    model: str = "vits"
    # model specific params
    model_args: VitsArgs = field(default_factory=VitsArgs)
    audio: VitsAudioConfig = field(default_factory=VitsAudioConfig)

    # optimizer
    grad_clip: List[float] = field(default_factory=lambda: [1000, 1000])
    lr_gen: float = 0.0002
    lr_disc: float = 0.0002
    lr_scheduler_gen: str = "ExponentialLR"
    lr_scheduler_gen_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    lr_scheduler_disc: str = "ExponentialLR"
    lr_scheduler_disc_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    scheduler_after_epoch: bool = True
    optimizer: str = "AdamW"
    optimizer_params: dict = field(default_factory=lambda: {"betas": [0.8, 0.99], "eps": 1e-9, "weight_decay": 0.01})

    # loss params
    kl_loss_alpha: float = 1.0
    disc_loss_alpha: float = 1.0
    gen_loss_alpha: float = 1.0
    feat_loss_alpha: float = 1.0
    mel_loss_alpha: float = 45.0
    dur_loss_alpha: float = 1.0
    speaker_encoder_loss_alpha: float = 1.0

    # data loader params
    return_wav: bool = True
    compute_linear_spec: bool = True

    # sampler params
    use_weighted_sampler: bool = False  # TODO: move it to the base config
    weighted_sampler_attrs: dict = field(default_factory=lambda: {})
    weighted_sampler_multipliers: dict = field(default_factory=lambda: {})

    # overrides
    r: int = 1  # DO NOT CHANGE
    add_blank: bool = True

    # testing
    test_sentences: List[List] = field(
        default_factory=lambda: [
            ["It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."],
            ["Be a voice, not an echo."],
            ["I'm sorry Dave. I'm afraid I can't do that."],
            ["This cake is great. It's so delicious and moist."],
            ["Prior to November 22, 1963."],
        ]
    )

    # multi-speaker settings
    # use speaker embedding layer
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    speakers_file: str = None
    speaker_embedding_channels: int = 256
    language_ids_file: str = None
    use_language_embedding: bool = False

    # use d-vectors
    use_d_vector_file: bool = False
    d_vector_file: List[str] = None
    d_vector_dim: int = None

    def __post_init__(self):
        pass
        # for key, val in self.model_args.items():
        #     if hasattr(self, key):
        #         self[key] = val


def parse_symbols():
    return {
        "pad": _pad,
        "eos": _eos,
        "bos": _bos,
        "characters": _characters,
        "punctuations": _punctuations,
        "phonemes": _phonemes,
    }


# DEFAULT SET OF GRAPHEMES
_pad = "<PAD>"
_eos = "<EOS>"
_bos = "<BOS>"
_blank = "<BLNK>"  # TODO: check if we need this alongside with PAD
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_punctuations = "!'(),-.:;? "


# DEFAULT SET OF IPA PHONEMES
# Phonemes definition (all IPA characters)
_vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
_non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
_pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
_suprasegmentals = "ˈˌːˑ"
_other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
_diacrilics = "ɚ˞ɫ"
_phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics


class BaseVocabulary:
    """Base Vocabulary class.

    This class only needs a vocabulary dictionary without specifying the characters.

    Args:
        vocab (Dict): A dictionary of characters and their corresponding indices.
    """

    def __init__(self, vocab: Dict, pad: str = None, blank: str = None, bos: str = None, eos: str = None):
        self.vocab = vocab
        self.pad = pad
        self.blank = blank
        self.bos = bos
        self.eos = eos

    @property
    def pad_id(self) -> int:
        """Return the index of the padding character. If the padding character is not specified, return the length
        of the vocabulary."""
        return self.char_to_id(self.pad) if self.pad else len(self.vocab)

    @property
    def blank_id(self) -> int:
        """Return the index of the blank character. If the blank character is not specified, return the length of
        the vocabulary."""
        return self.char_to_id(self.blank) if self.blank else len(self.vocab)

    @property
    def bos_id(self) -> int:
        """Return the index of the bos character. If the bos character is not specified, return the length of the
        vocabulary."""
        return self.char_to_id(self.bos) if self.bos else len(self.vocab)

    @property
    def eos_id(self) -> int:
        """Return the index of the eos character. If the eos character is not specified, return the length of the
        vocabulary."""
        return self.char_to_id(self.eos) if self.eos else len(self.vocab)

    @property
    def vocab(self):
        """Return the vocabulary dictionary."""
        return self._vocab

    @vocab.setter
    def vocab(self, vocab):
        """Set the vocabulary dictionary and character mapping dictionaries."""
        self._vocab, self._char_to_id, self._id_to_char = None, None, None
        if vocab is not None:
            self._vocab = vocab
            self._char_to_id = {char: idx for idx, char in enumerate(self._vocab)}
            self._id_to_char = {
                idx: char for idx, char in enumerate(self._vocab)  # pylint: disable=unnecessary-comprehension
            }

    @staticmethod
    def init_from_config(config, **kwargs):
        """Initialize from the given config."""
        if config.characters is not None and "vocab_dict" in config.characters and config.characters.vocab_dict:
            return (
                BaseVocabulary(
                    config.characters.vocab_dict,
                    config.characters.pad,
                    config.characters.blank,
                    config.characters.bos,
                    config.characters.eos,
                ),
                config,
            )
        return BaseVocabulary(**kwargs), config

    def to_config(self):
        return CharactersConfig(
            vocab_dict=self._vocab,
            pad=self.pad,
            eos=self.eos,
            bos=self.bos,
            blank=self.blank,
            is_unique=False,
            is_sorted=False,
        )

    @property
    def num_chars(self):
        """Return the number of tokens in the vocabulary."""
        return len(self._vocab)

    def char_to_id(self, char: str) -> int:
        """Map a character to a token ID."""
        try:
            return self._char_to_id[char]
        except KeyError as e:
            raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e

    def id_to_char(self, idx: int) -> str:
        """Map a token ID to a character."""
        return self._id_to_char[idx]


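The id mapping in `BaseVocabulary` can be sketched standalone: ids follow the iteration order of the vocab, and an unspecified special token falls back to `len(vocab)`. A minimal illustration with a hypothetical three-character vocab (not taken from any shipped model):

```python
# Minimal standalone sketch of the BaseVocabulary id mapping: ids follow the
# iteration order of the vocab, and a missing special token falls back to
# len(vocab).
vocab = {"a": 0, "b": 1, "c": 2}
char_to_id = {char: idx for idx, char in enumerate(vocab)}
id_to_char = {idx: char for idx, char in enumerate(vocab)}

pad = None  # pad not specified -> pad_id defaults to len(vocab)
pad_id = char_to_id[pad] if pad else len(vocab)

print(char_to_id["b"])  # 1
print(id_to_char[2])    # c
print(pad_id)           # 3
```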
class BaseCharacters:

    def __init__(
        self,
        characters: str = None,
        punctuations: str = None,
        pad: str = None,
        eos: str = None,
        bos: str = None,
        blank: str = None,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        self._characters = characters
        self._punctuations = punctuations
        self._pad = pad
        self._eos = eos
        self._bos = bos
        self._blank = blank
        self.is_unique = is_unique
        self.is_sorted = is_sorted
        self._create_vocab()

    @property
    def pad_id(self) -> int:
        return self.char_to_id(self.pad) if self.pad else len(self.vocab)

    @property
    def blank_id(self) -> int:
        return self.char_to_id(self.blank) if self.blank else len(self.vocab)

    @property
    def eos_id(self) -> int:
        return self.char_to_id(self.eos) if self.eos else len(self.vocab)

    @property
    def bos_id(self) -> int:
        return self.char_to_id(self.bos) if self.bos else len(self.vocab)

    @property
    def characters(self):
        return self._characters

    @characters.setter
    def characters(self, characters):
        self._characters = characters
        self._create_vocab()

    @property
    def punctuations(self):
        return self._punctuations

    @punctuations.setter
    def punctuations(self, punctuations):
        self._punctuations = punctuations
        self._create_vocab()

    @property
    def pad(self):
        return self._pad

    @pad.setter
    def pad(self, pad):
        self._pad = pad
        self._create_vocab()

    @property
    def eos(self):
        return self._eos

    @eos.setter
    def eos(self, eos):
        self._eos = eos
        self._create_vocab()

    @property
    def bos(self):
        return self._bos

    @bos.setter
    def bos(self, bos):
        self._bos = bos
        self._create_vocab()

    @property
    def blank(self):
        return self._blank

    @blank.setter
    def blank(self, blank):
        self._blank = blank
        self._create_vocab()

    @property
    def vocab(self):
        return self._vocab

    @vocab.setter
    def vocab(self, vocab):
        self._vocab = vocab
        self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
        self._id_to_char = {
            idx: char for idx, char in enumerate(self.vocab)  # pylint: disable=unnecessary-comprehension
        }

    @property
    def num_chars(self):
        return len(self._vocab)

    def _create_vocab(self):
        _vocab = self._characters
        if self.is_unique:
            _vocab = list(set(_vocab))
        if self.is_sorted:
            _vocab = sorted(_vocab)
        _vocab = list(_vocab)
        _vocab = [self._blank] + _vocab if self._blank is not None and len(self._blank) > 0 else _vocab
        _vocab = [self._bos] + _vocab if self._bos is not None and len(self._bos) > 0 else _vocab
        _vocab = [self._eos] + _vocab if self._eos is not None and len(self._eos) > 0 else _vocab
        _vocab = [self._pad] + _vocab if self._pad is not None and len(self._pad) > 0 else _vocab
        self.vocab = _vocab + list(self._punctuations)
        if self.is_unique:
            duplicates = {x for x in self.vocab if self.vocab.count(x) > 1}
            assert (
                len(self.vocab) == len(self._char_to_id) == len(self._id_to_char)
            ), f" [!] There are duplicate characters in the character set. {duplicates}"

    def char_to_id(self, char: str) -> int:
        try:
            return self._char_to_id[char]
        except KeyError as e:
            raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e

    def id_to_char(self, idx: int) -> str:
        return self._id_to_char[idx]

    def print_log(self, level: int = 0):
        """
        Prints the vocabulary in a nice format.
        """
        indent = "\t" * level
        print(f"{indent}| > Characters: {self._characters}")
        print(f"{indent}| > Punctuations: {self._punctuations}")
        print(f"{indent}| > Pad: {self._pad}")
        print(f"{indent}| > EOS: {self._eos}")
        print(f"{indent}| > BOS: {self._bos}")
        print(f"{indent}| > Blank: {self._blank}")
        print(f"{indent}| > Vocab: {self.vocab}")
        print(f"{indent}| > Num chars: {self.num_chars}")

    @staticmethod
    def init_from_config(config: "Coqpit"):  # pylint: disable=unused-argument
        """Init your character class from a config.

        Implement this method for your subclass.
        """
        # use character set from config
        if config.characters is not None:
            return BaseCharacters(**config.characters), config
        # return default character set
        characters = BaseCharacters()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config

    def to_config(self) -> "CharactersConfig":
        return CharactersConfig(
            characters=self._characters,
            punctuations=self._punctuations,
            pad=self._pad,
            eos=self._eos,
            bos=self._bos,
            blank=self._blank,
            is_unique=self.is_unique,
            is_sorted=self.is_sorted,
        )


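`_create_vocab` builds the vocabulary in a fixed order: the special tokens are prepended so that the final layout is pad, eos, bos, blank, then the (optionally sorted) characters, with punctuations appended last. A standalone sketch with a hypothetical character set:

```python
# Standalone sketch of BaseCharacters._create_vocab: blank, bos, eos, pad are
# prepended in turn, which yields [pad, eos, bos, blank, ...]; punctuations go
# to the end.
characters = "cba"
punctuations = "!? "
pad, eos, bos, blank = "<PAD>", "<EOS>", "<BOS>", "<BLNK>"

_vocab = sorted(list(characters))  # is_sorted=True
_vocab = [blank] + _vocab
_vocab = [bos] + _vocab
_vocab = [eos] + _vocab
_vocab = [pad] + _vocab
vocab = _vocab + list(punctuations)

print(vocab)
# ['<PAD>', '<EOS>', '<BOS>', '<BLNK>', 'a', 'b', 'c', '!', '?', ' ']
```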
class IPAPhonemes(BaseCharacters):

    def __init__(
        self,
        characters: str = _phonemes,
        punctuations: str = _punctuations,
        pad: str = _pad,
        eos: str = _eos,
        bos: str = _bos,
        blank: str = _blank,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)

    @staticmethod
    def init_from_config(config: "Coqpit"):
        """Init an IPAPhonemes object from a model config.

        If characters are not defined in the config, it will be set to the default characters and the config
        will be updated.
        """
        # band-aid for compatibility with old models
        if "characters" in config and config.characters is not None:
            if "phonemes" in config.characters and config.characters.phonemes is not None:
                config.characters["characters"] = config.characters["phonemes"]
            return (
                IPAPhonemes(
                    characters=config.characters["characters"],
                    punctuations=config.characters["punctuations"],
                    pad=config.characters["pad"],
                    eos=config.characters["eos"],
                    bos=config.characters["bos"],
                    blank=config.characters["blank"],
                    is_unique=config.characters["is_unique"],
                    is_sorted=config.characters["is_sorted"],
                ),
                config,
            )
        # use character set from config
        if config.characters is not None:
            return IPAPhonemes(**config.characters), config
        # return default character set
        characters = IPAPhonemes()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config


class Graphemes(BaseCharacters):

    def __init__(
        self,
        characters: str = _characters,
        punctuations: str = _punctuations,
        pad: str = _pad,
        eos: str = _eos,
        bos: str = _bos,
        blank: str = _blank,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)

    @staticmethod
    def init_from_config(config: "Coqpit"):
        """Init a Graphemes object from a model config.

        If characters are not defined in the config, it will be set to the default characters and the config
        will be updated.
        """
        if config.characters is not None:
            # band-aid for compatibility with old models
            if "phonemes" in config.characters:
                return (
                    Graphemes(
                        characters=config.characters["characters"],
                        punctuations=config.characters["punctuations"],
                        pad=config.characters["pad"],
                        eos=config.characters["eos"],
                        bos=config.characters["bos"],
                        blank=config.characters["blank"],
                        is_unique=config.characters["is_unique"],
                        is_sorted=config.characters["is_sorted"],
                    ),
                    config,
                )
            return Graphemes(**config.characters), config
        characters = Graphemes()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config


if __name__ == "__main__":
    gr = Graphemes()
    ph = IPAPhonemes()
    gr.print_log()
    ph.print_log()


class VitsCharacters(BaseCharacters):
    """Characters class for the VITS model, for compatibility with pre-trained models."""

    def __init__(
        self,
        graphemes: str = _characters,
        punctuations: str = _punctuations,
        pad: str = _pad,
        ipa_characters: str = _phonemes,
    ) -> None:
        if ipa_characters is not None:
            graphemes += ipa_characters
        super().__init__(graphemes, punctuations, pad, None, None, "<BLNK>", is_unique=False, is_sorted=True)

    def _create_vocab(self):
        self._vocab = [self._pad] + list(self._punctuations) + list(self._characters) + [self._blank]
        self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
        # pylint: disable=unnecessary-comprehension
        self._id_to_char = {idx: char for idx, char in enumerate(self.vocab)}

    @staticmethod
    def init_from_config(config):
        _pad = config.characters.pad
        _punctuations = config.characters.punctuations
        _letters = config.characters.characters
        _letters_ipa = config.characters.phonemes
        return (
            VitsCharacters(graphemes=_letters, ipa_characters=_letters_ipa, punctuations=_punctuations, pad=_pad),
            config,
        )

    def to_config(self) -> "CharactersConfig":
        return CharactersConfig(
            characters=self._characters,
            punctuations=self._punctuations,
            pad=self._pad,
            eos=None,
            bos=None,
            blank=self._blank,
            is_unique=False,
            is_sorted=True,
        )


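`VitsCharacters` overrides `_create_vocab` with a different layout from `BaseCharacters`: pad at index 0, then punctuations, then graphemes, with the blank token last. This fixed layout is what keeps pre-trained checkpoints compatible. A standalone sketch with hypothetical character sets:

```python
# Sketch of VitsCharacters._create_vocab: pad first, punctuations, graphemes,
# blank last.
pad, blank = "<PAD>", "<BLNK>"
punctuations = "!? "
graphemes = "abc"

vocab = [pad] + list(punctuations) + list(graphemes) + [blank]
char_to_id = {char: idx for idx, char in enumerate(vocab)}

print(char_to_id["<PAD>"])   # 0
print(char_to_id["a"])       # 4
print(char_to_id["<BLNK>"])  # 7
```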
class TTSTokenizer:
    def __init__(
        self,
        text_cleaner: Callable = None,
        characters: "BaseCharacters" = None,
    ):
        self.text_cleaner = text_cleaner
        self.characters = characters
        self.not_found_characters = []

    @property
    def characters(self):
        return self._characters

    @characters.setter
    def characters(self, new_characters):
        self._characters = new_characters
        self.pad_id = self.characters.char_to_id(self.characters.pad) if self.characters.pad else None
        self.blank_id = self.characters.char_to_id(self.characters.blank) if self.characters.blank else None

    def encode(self, text: str) -> List[int]:
        """Encodes a string of text as a sequence of IDs."""
        token_ids = []
        for char in text:
            try:
                idx = self.characters.char_to_id(char)
                token_ids.append(idx)
            except KeyError:
                # discard but store not-found characters
                if char not in self.not_found_characters:
                    self.not_found_characters.append(char)
                    print(text)
                    print(f" [!] Character {repr(char)} not found in the vocabulary. Discarding it.")
        return token_ids

    def text_to_ids(self, text: str, language: str = None) -> List[int]:  # pylint: disable=unused-argument
        text = self.text_cleaner(text)
        text = self.encode(text)
        text = self.intersperse_blank_char(text, True)
        return text

    def pad_with_bos_eos(self, char_sequence: List[str]):
        """Pads a sequence with the special BOS and EOS characters."""
        return [self.characters.bos_id] + list(char_sequence) + [self.characters.eos_id]

    def intersperse_blank_char(self, char_sequence: List[str], use_blank_char: bool = False):
        """Intersperses the blank character between characters in a sequence.

        Use the ```blank``` character if defined else use the ```pad``` character.
        """
        char_to_use = self.characters.blank_id if use_blank_char else self.characters.pad
        result = [char_to_use] * (len(char_sequence) * 2 + 1)
        result[1::2] = char_sequence
        return result

    @staticmethod
    def init_from_config(config: "Coqpit", characters: "BaseCharacters" = None):
        text_cleaner = multilingual_cleaners
        CharactersClass = VitsCharacters
        characters, new_config = CharactersClass.init_from_config(config)
        # new_config.characters.characters_class = get_import_path(characters)
        new_config.characters.characters_class = VitsCharacters
        return TTSTokenizer(text_cleaner, characters), new_config


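`intersperse_blank_char` turns a sequence of n token ids into 2n + 1 ids with the blank id in every even slot, which is the input layout VITS expects when `add_blank` is enabled. A standalone sketch (blank id and token ids are hypothetical):

```python
# Standalone sketch of TTSTokenizer.intersperse_blank_char: fill a 2n + 1 list
# with the blank id, then place the original ids in the odd slots.
blank_id = 0
token_ids = [5, 6, 7]

result = [blank_id] * (len(token_ids) * 2 + 1)
result[1::2] = token_ids

print(result)  # [0, 5, 0, 6, 0, 7, 0]
```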
def multilingual_cleaners(text):
|
| 756 |
+
"""Pipeline for multilingual text"""
|
| 757 |
+
text = lowercase(text)
|
| 758 |
+
text = replace_symbols(text, lang=None)
|
| 759 |
+
text = remove_aux_symbols(text)
|
| 760 |
+
text = collapse_whitespace(text)
|
| 761 |
+
return text
|
| 762 |
+
|
| 763 |
+
def lowercase(text):
|
| 764 |
+
return text.lower()
|
| 765 |
+
|
| 766 |
+
def collapse_whitespace(text):
|
| 767 |
+
return re.sub(_whitespace_re, " ", text).strip()
|
| 768 |
+
|
| 769 |
+
def replace_symbols(text, lang="en"):
|
| 770 |
+
|
| 771 |
+
text = text.replace(";", ",")
|
| 772 |
+
text = text.replace("-", " ") if lang != "ca" else text.replace("-", "")
|
| 773 |
+
text = text.replace(":", ",")
|
| 774 |
+
if lang == "en":
|
| 775 |
+
text = text.replace("&", " and ")
|
| 776 |
+
elif lang == "fr":
|
| 777 |
+
text = text.replace("&", " et ")
|
| 778 |
+
elif lang == "pt":
|
| 779 |
+
text = text.replace("&", " e ")
|
| 780 |
+
elif lang == "ca":
|
| 781 |
+
text = text.replace("&", " i ")
|
| 782 |
+
text = text.replace("'", "")
|
| 783 |
+
return text
|
| 784 |
+
|
| 785 |
+
def remove_aux_symbols(text):
|
| 786 |
+
text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)
|
| 787 |
+
return text
|
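A minimal, standalone sketch of the cleaning pipeline above, with the steps inlined into a single function (English-only branch of the symbol replacement shown; the helper names and regexes mirror the functions in this file):

```python
import re

_whitespace_re = re.compile(r"\s+")

def clean(text, lang="en"):
    # lowercase -> symbol replacement -> aux-symbol removal -> whitespace collapse
    text = text.lower()
    text = text.replace(";", ",").replace("-", " ").replace(":", ",")
    if lang == "en":
        text = text.replace("&", " and ")
    text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)
    return re.sub(_whitespace_re, " ", text).strip()

print(clean('Hello  (World) - "Tom & Jerry":  OK'))
# hello world tom and jerry, ok
```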
models/en_male/jit_infer.py
ADDED
@@ -0,0 +1,32 @@
import os
from extra import TTSTokenizer, VitsConfig, CharactersConfig, VitsCharacters
import torch
import numpy as np

with open("chars.txt", "r") as f:
    letters = f.read().strip("\n")
model = "en_male_vits_30hrs.pt"
text = "This is a text to be spoken"

config = VitsConfig(
    text_cleaner="multilingual_cleaners",
    characters=CharactersConfig(
        characters_class=VitsCharacters,
        pad="<PAD>",
        eos="<EOS>",
        bos="<BOS>",
        blank="<BLNK>",
        characters=letters,
        punctuations="!¡'(),-.:;¿? ",
        phonemes=None,
    ),
)
tokenizer, config = TTSTokenizer.init_from_config(config)

x = tokenizer.text_to_ids(text)
x = torch.from_numpy(np.array(x)).unsqueeze(0)
net = torch.jit.load(model)
with torch.no_grad():
    out2 = net(x)
import soundfile as sf
sf.write("jit.wav", out2.squeeze().cpu().numpy(), 22050)
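The `unsqueeze(0)` call above adds a batch dimension before the JIT model is invoked. A NumPy sketch of the same shape manipulation (the token ids are illustrative values, not real tokenizer output):

```python
import numpy as np

ids = [4, 0, 17, 0, 23]      # example token ids, as text_to_ids would return
x = np.array(ids)[None, :]   # equivalent of torch.unsqueeze(0): shape (1, seq_len)
print(x.shape)  # (1, 5)
```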
models/gu_mms/config.json
ADDED
@@ -0,0 +1,82 @@
{
  "activation_dropout": 0.1,
  "architectures": ["VitsModel"],
  "attention_dropout": 0.1,
  "depth_separable_channels": 2,
  "depth_separable_num_layers": 3,
  "duration_predictor_dropout": 0.5,
  "duration_predictor_filter_channels": 256,
  "duration_predictor_flow_bins": 10,
  "duration_predictor_kernel_size": 3,
  "duration_predictor_num_flows": 4,
  "duration_predictor_tail_bound": 5.0,
  "ffn_dim": 768,
  "ffn_kernel_size": 3,
  "flow_size": 192,
  "hidden_act": "relu",
  "hidden_dropout": 0.1,
  "hidden_size": 192,
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.1,
  "leaky_relu_slope": 0.1,
  "model_type": "vits",
  "noise_scale": 0.667,
  "noise_scale_duration": 0.8,
  "num_attention_heads": 2,
  "num_hidden_layers": 6,
  "num_speakers": 1,
  "posterior_encoder_num_wavenet_layers": 16,
  "prior_encoder_num_flows": 4,
  "prior_encoder_num_wavenet_layers": 4,
  "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
  "resblock_kernel_sizes": [3, 7, 11],
  "sampling_rate": 16000,
  "speaker_embedding_size": 0,
  "speaking_rate": 1.0,
  "spectrogram_bins": 513,
  "torch_dtype": "float32",
  "transformers_version": "4.33.0.dev0",
  "upsample_initial_channel": 512,
  "upsample_kernel_sizes": [16, 16, 4, 4],
  "upsample_rates": [8, 8, 2, 2],
  "use_bias": true,
  "use_stochastic_duration_prediction": true,
  "vocab_size": 60,
  "wavenet_dilation_rate": 1,
  "wavenet_dropout": 0.0,
  "wavenet_kernel_size": 5,
  "window_size": 4
}
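The decoder's overall upsampling factor is the product of `upsample_rates`, which determines how many waveform samples each latent frame produces at the 16 kHz sampling rate. A quick check with the values from the config above:

```python
# Values from the config above.
upsample_rates = [8, 8, 2, 2]
factor = 1
for r in upsample_rates:
    factor *= r
print(factor)  # 256: waveform samples produced per latent frame
```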
models/gu_mms/special_tokens_map.json
ADDED
@@ -0,0 +1,4 @@
{
  "pad_token": "|",
  "unk_token": "<unk>"
}
models/gu_mms/tokenizer_config.json
ADDED
@@ -0,0 +1,12 @@
{
  "add_blank": true,
  "clean_up_tokenization_spaces": true,
  "is_uroman": false,
  "language": "guj",
  "model_max_length": 1000000000000000019884624838656,
  "normalize": true,
  "pad_token": "|",
  "phonemize": false,
  "tokenizer_class": "VitsTokenizer",
  "unk_token": "<unk>"
}
models/gu_mms/vocab.json
ADDED
@@ -0,0 +1,62 @@
{
  " ": 59,
  "'": 47,
  "-": 56,
  "|": 0,
  "ં": 10,
  "ઃ": 54,
  "અ": 28,
  "આ": 26,
  "ઇ": 49,
  "ઈ": 30,
  "ઉ": 42,
  "ઊ": 48,
  "ઋ": 57,
  "એ": 29,
  "ઐ": 58,
  "ઓ": 27,
  "ક": 9,
  "ખ": 33,
  "ગ": 32,
  "ઘ": 44,
  "ચ": 39,
  "છ": 23,
  "જ": 18,
  "ઝ": 51,
  "ઞ": 50,
  "ટ": 36,
  "ઠ": 45,
  "ડ": 40,
  "ઢ": 52,
  "ણ": 22,
  "ત": 3,
  "થ": 19,
  "દ": 25,
  "ધ": 34,
  "ન": 4,
  "પ": 12,
  "ફ": 43,
  "બ": 31,
  "ભ": 35,
  "મ": 7,
  "ય": 16,
  "ર": 5,
  "લ": 24,
  "ળ": 37,
  "વ": 13,
  "શ": 21,
  "ષ": 41,
  "સ": 15,
  "હ": 17,
  "ા": 1,
  "િ": 20,
  "ી": 8,
  "ુ": 14,
  "ૂ": 38,
  "ૃ": 46,
  "ે": 2,
  "ૈ": 53,
  "ો": 11,
  "ૌ": 55,
  "્": 6
}
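With `add_blank: true` in the tokenizer config, a VITS-style tokenizer interleaves the pad token `|` (id 0) between character ids. A minimal sketch with an illustrative subset of the vocabulary above (not the library's actual implementation):

```python
vocab = {"|": 0, "ત": 3, "ન": 4, "ર": 5}  # illustrative subset of vocab.json

def encode(text, add_blank=True):
    ids = [vocab[ch] for ch in text]
    if not add_blank:
        return ids
    # interleave the pad id (0) before, between, and after every character id
    out = [0] * (len(ids) * 2 + 1)
    out[1::2] = ids
    return out

print(encode("તન"))  # [0, 3, 0, 4, 0]
```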
models/hi_female/__pycache__/extra.cpython-310.pyc
ADDED
Binary file (26.3 kB).
models/hi_female/chars.txt
ADDED
@@ -0,0 +1 @@
शदऊतसओषमऱढै?ख़ौक़ड़ःिअनठय़ज़फ़्खँे।ंऋउ'हछङझ" ुणऔघयञृएईॆीपचॉॠवगडटइ,बॅूऐफकजलग़आधोथाभढ़ऑ
models/hi_female/extra.py
ADDED
@@ -0,0 +1,787 @@
import re
from dataclasses import asdict, dataclass, field, replace
from typing import Callable, Dict, List, Union

_whitespace_re = re.compile(r"\s+")

# from TTS.tts.configs.shared_configs import BaseTTSConfig
# from TTS.tts.models.vits import VitsArgs, VitsAudioConfig


@dataclass
class CharactersConfig:
    characters_class: str = None

    # used by BaseVocabulary
    vocab_dict: Dict = None

    # used by BaseCharacters
    pad: str = None
    eos: str = None
    bos: str = None
    blank: str = None
    characters: str = None
    punctuations: str = None
    phonemes: str = None
    is_unique: bool = True  # for backwards compatibility of models trained with char sets with duplicates
    is_sorted: bool = True


@dataclass
class BaseTTSConfig:
    # audio: BaseAudioConfig = field(default_factory=BaseAudioConfig)
    # phoneme settings
    use_phonemes: bool = False
    phonemizer: str = None
    phoneme_language: str = None
    compute_input_seq_cache: bool = False
    text_cleaner: str = None
    enable_eos_bos_chars: bool = False
    test_sentences_file: str = ""
    phoneme_cache_path: str = None
    # vocabulary parameters
    characters: CharactersConfig = None
    add_blank: bool = False
    # training params
    batch_group_size: int = 0
    loss_masking: bool = None
    # dataloading
    min_audio_len: int = 1
    max_audio_len: int = float("inf")
    min_text_len: int = 1
    max_text_len: int = float("inf")
    compute_f0: bool = False
    compute_energy: bool = False
    compute_linear_spec: bool = False
    precompute_num_workers: int = 0
    use_noise_augment: bool = False
    start_by_longest: bool = False
    shuffle: bool = False
    drop_last: bool = False
    # dataset
    datasets: str = None
    # optimizer
    optimizer: str = "radam"
    optimizer_params: dict = None
    # scheduler
    lr_scheduler: str = None
    lr_scheduler_params: dict = field(default_factory=lambda: {})
    # testing
    test_sentences: List[str] = field(default_factory=lambda: [])
    # evaluation
    eval_split_max_size: int = None
    eval_split_size: float = 0.01
    # weighted samplers
    use_speaker_weighted_sampler: bool = False
    speaker_weighted_sampler_alpha: float = 1.0
    use_language_weighted_sampler: bool = False
    language_weighted_sampler_alpha: float = 1.0
    use_length_weighted_sampler: bool = False
    length_weighted_sampler_alpha: float = 1.0


@dataclass
class VitsAudioConfig:
    fft_size: int = 1024
    sample_rate: int = 22050
    win_length: int = 1024
    hop_length: int = 256
    num_mels: int = 80
    mel_fmin: int = 0
    mel_fmax: int = None


@dataclass
class VitsArgs:
    num_chars: int = 100
    out_channels: int = 513
    spec_segment_size: int = 32
    hidden_channels: int = 192
    hidden_channels_ffn_text_encoder: int = 768
    num_heads_text_encoder: int = 2
    num_layers_text_encoder: int = 6
    kernel_size_text_encoder: int = 3
    dropout_p_text_encoder: float = 0.1
    dropout_p_duration_predictor: float = 0.5
    kernel_size_posterior_encoder: int = 5
    dilation_rate_posterior_encoder: int = 1
    num_layers_posterior_encoder: int = 16
    kernel_size_flow: int = 5
    dilation_rate_flow: int = 1
    num_layers_flow: int = 4
    resblock_type_decoder: str = "1"
    resblock_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [3, 7, 11])
    resblock_dilation_sizes_decoder: List[List[int]] = field(default_factory=lambda: [[1, 3, 5], [1, 3, 5], [1, 3, 5]])
    upsample_rates_decoder: List[int] = field(default_factory=lambda: [8, 8, 2, 2])
    upsample_initial_channel_decoder: int = 512
    upsample_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [16, 16, 4, 4])
    periods_multi_period_discriminator: List[int] = field(default_factory=lambda: [2, 3, 5, 7, 11])
    use_sdp: bool = True
    noise_scale: float = 1.0
    inference_noise_scale: float = 0.667
    length_scale: float = 1
    noise_scale_dp: float = 1.0
    inference_noise_scale_dp: float = 1.0
    max_inference_len: int = None
    init_discriminator: bool = True
    use_spectral_norm_disriminator: bool = False
    use_speaker_embedding: bool = False
    num_speakers: int = 0
    speakers_file: str = None
    d_vector_file: List[str] = None
    speaker_embedding_channels: int = 256
    use_d_vector_file: bool = False
    d_vector_dim: int = 0
    detach_dp_input: bool = True
    use_language_embedding: bool = False
    embedded_language_dim: int = 4
    num_languages: int = 0
    language_ids_file: str = None
    use_speaker_encoder_as_loss: bool = False
    speaker_encoder_config_path: str = ""
    speaker_encoder_model_path: str = ""
    condition_dp_on_speaker: bool = True
    freeze_encoder: bool = False
    freeze_DP: bool = False
    freeze_PE: bool = False
    freeze_flow_decoder: bool = False
    freeze_waveform_decoder: bool = False
    encoder_sample_rate: int = None
    interpolate_z: bool = True
    reinit_DP: bool = False
    reinit_text_encoder: bool = False


@dataclass
class VitsConfig(BaseTTSConfig):
    model: str = "vits"
    # model specific params
    model_args: VitsArgs = field(default_factory=VitsArgs)
    audio: VitsAudioConfig = field(default_factory=VitsAudioConfig)

    # optimizer
    grad_clip: List[float] = field(default_factory=lambda: [1000, 1000])
    lr_gen: float = 0.0002
    lr_disc: float = 0.0002
    lr_scheduler_gen: str = "ExponentialLR"
    lr_scheduler_gen_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    lr_scheduler_disc: str = "ExponentialLR"
    lr_scheduler_disc_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    scheduler_after_epoch: bool = True
    optimizer: str = "AdamW"
    optimizer_params: dict = field(default_factory=lambda: {"betas": [0.8, 0.99], "eps": 1e-9, "weight_decay": 0.01})

    # loss params
    kl_loss_alpha: float = 1.0
    disc_loss_alpha: float = 1.0
    gen_loss_alpha: float = 1.0
    feat_loss_alpha: float = 1.0
    mel_loss_alpha: float = 45.0
    dur_loss_alpha: float = 1.0
    speaker_encoder_loss_alpha: float = 1.0

    # data loader params
    return_wav: bool = True
    compute_linear_spec: bool = True

    # sampler params
    use_weighted_sampler: bool = False  # TODO: move it to the base config
    weighted_sampler_attrs: dict = field(default_factory=lambda: {})
    weighted_sampler_multipliers: dict = field(default_factory=lambda: {})

    # overrides
    r: int = 1  # DO NOT CHANGE
    add_blank: bool = True

    # testing
    test_sentences: List[List] = field(
        default_factory=lambda: [
            ["It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."],
            ["Be a voice, not an echo."],
            ["I'm sorry Dave. I'm afraid I can't do that."],
            ["This cake is great. It's so delicious and moist."],
            ["Prior to November 22, 1963."],
        ]
    )

    # multi-speaker settings
    # use speaker embedding layer
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    speakers_file: str = None
    speaker_embedding_channels: int = 256
    language_ids_file: str = None
    use_language_embedding: bool = False

    # use d-vectors
    use_d_vector_file: bool = False
    d_vector_file: List[str] = None
    d_vector_dim: int = None

    def __post_init__(self):
        pass
        # for key, val in self.model_args.items():
        #     if hasattr(self, key):
        #         self[key] = val


def parse_symbols():
    return {
        "pad": _pad,
        "eos": _eos,
        "bos": _bos,
        "characters": _characters,
        "punctuations": _punctuations,
        "phonemes": _phonemes,
    }


# DEFAULT SET OF GRAPHEMES
_pad = "<PAD>"
_eos = "<EOS>"
_bos = "<BOS>"
_blank = "<BLNK>"  # TODO: check if we need this alongside PAD
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
_punctuations = "!'(),-.:;? "


# DEFAULT SET OF IPA PHONEMES
# Phonemes definition (all IPA characters)
_vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
_non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
_pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
_suprasegmentals = "ˈˌːˑ"
_other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
_diacrilics = "ɚ˞ɫ"
_phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics


class BaseVocabulary:
    """Base Vocabulary class.

    This class only needs a vocabulary dictionary without specifying the characters.

    Args:
        vocab (Dict): A dictionary of characters and their corresponding indices.
    """

    def __init__(self, vocab: Dict, pad: str = None, blank: str = None, bos: str = None, eos: str = None):
        self.vocab = vocab
        self.pad = pad
        self.blank = blank
        self.bos = bos
        self.eos = eos

    @property
    def pad_id(self) -> int:
        """Return the index of the padding character. If the padding character is not specified,
        return the length of the vocabulary."""
        return self.char_to_id(self.pad) if self.pad else len(self.vocab)

    @property
    def blank_id(self) -> int:
        """Return the index of the blank character. If the blank character is not specified,
        return the length of the vocabulary."""
        return self.char_to_id(self.blank) if self.blank else len(self.vocab)

    @property
    def bos_id(self) -> int:
        """Return the index of the bos character. If the bos character is not specified,
        return the length of the vocabulary."""
        return self.char_to_id(self.bos) if self.bos else len(self.vocab)

    @property
    def eos_id(self) -> int:
        """Return the index of the eos character. If the eos character is not specified,
        return the length of the vocabulary."""
        return self.char_to_id(self.eos) if self.eos else len(self.vocab)

    @property
    def vocab(self):
        """Return the vocabulary dictionary."""
        return self._vocab

    @vocab.setter
    def vocab(self, vocab):
        """Set the vocabulary dictionary and character mapping dictionaries."""
        self._vocab, self._char_to_id, self._id_to_char = None, None, None
        if vocab is not None:
            self._vocab = vocab
            self._char_to_id = {char: idx for idx, char in enumerate(self._vocab)}
            self._id_to_char = {
                idx: char for idx, char in enumerate(self._vocab)  # pylint: disable=unnecessary-comprehension
            }

    @staticmethod
    def init_from_config(config, **kwargs):
        """Initialize from the given config."""
        if config.characters is not None and "vocab_dict" in config.characters and config.characters.vocab_dict:
            return (
                BaseVocabulary(
                    config.characters.vocab_dict,
                    config.characters.pad,
                    config.characters.blank,
                    config.characters.bos,
                    config.characters.eos,
                ),
                config,
            )
        return BaseVocabulary(**kwargs), config

    def to_config(self):
        return CharactersConfig(
            vocab_dict=self._vocab,
            pad=self.pad,
            eos=self.eos,
            bos=self.bos,
            blank=self.blank,
            is_unique=False,
            is_sorted=False,
        )

    @property
    def num_chars(self):
        """Return the number of tokens in the vocabulary."""
        return len(self._vocab)

    def char_to_id(self, char: str) -> int:
        """Map a character to a token ID."""
        try:
            return self._char_to_id[char]
        except KeyError as e:
            raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e

    def id_to_char(self, idx: int) -> str:
        """Map a token ID to a character."""
        return self._id_to_char[idx]


class BaseCharacters:

    def __init__(
        self,
        characters: str = None,
        punctuations: str = None,
        pad: str = None,
        eos: str = None,
        bos: str = None,
        blank: str = None,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        self._characters = characters
        self._punctuations = punctuations
        self._pad = pad
        self._eos = eos
        self._bos = bos
        self._blank = blank
        self.is_unique = is_unique
        self.is_sorted = is_sorted
        self._create_vocab()

    @property
    def pad_id(self) -> int:
        return self.char_to_id(self.pad) if self.pad else len(self.vocab)

    @property
    def blank_id(self) -> int:
        return self.char_to_id(self.blank) if self.blank else len(self.vocab)

    @property
    def eos_id(self) -> int:
        return self.char_to_id(self.eos) if self.eos else len(self.vocab)

    @property
    def bos_id(self) -> int:
        return self.char_to_id(self.bos) if self.bos else len(self.vocab)

    @property
    def characters(self):
        return self._characters

    @characters.setter
    def characters(self, characters):
        self._characters = characters
        self._create_vocab()

    @property
    def punctuations(self):
        return self._punctuations

    @punctuations.setter
    def punctuations(self, punctuations):
        self._punctuations = punctuations
        self._create_vocab()

    @property
    def pad(self):
        return self._pad

    @pad.setter
    def pad(self, pad):
        self._pad = pad
        self._create_vocab()

    @property
    def eos(self):
        return self._eos

    @eos.setter
    def eos(self, eos):
        self._eos = eos
        self._create_vocab()

    @property
    def bos(self):
        return self._bos

    @bos.setter
    def bos(self, bos):
        self._bos = bos
        self._create_vocab()

    @property
    def blank(self):
        return self._blank

    @blank.setter
    def blank(self, blank):
        self._blank = blank
        self._create_vocab()

    @property
    def vocab(self):
        return self._vocab

    @vocab.setter
    def vocab(self, vocab):
        self._vocab = vocab
        self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
        self._id_to_char = {
            idx: char for idx, char in enumerate(self.vocab)  # pylint: disable=unnecessary-comprehension
        }

    @property
    def num_chars(self):
        return len(self._vocab)

    def _create_vocab(self):
        _vocab = self._characters
        if self.is_unique:
            _vocab = list(set(_vocab))
        if self.is_sorted:
            _vocab = sorted(_vocab)
        _vocab = list(_vocab)
        _vocab = [self._blank] + _vocab if self._blank is not None and len(self._blank) > 0 else _vocab
        _vocab = [self._bos] + _vocab if self._bos is not None and len(self._bos) > 0 else _vocab
        _vocab = [self._eos] + _vocab if self._eos is not None and len(self._eos) > 0 else _vocab
        _vocab = [self._pad] + _vocab if self._pad is not None and len(self._pad) > 0 else _vocab
        self.vocab = _vocab + list(self._punctuations)
        if self.is_unique:
            duplicates = {x for x in self.vocab if self.vocab.count(x) > 1}
            assert (
                len(self.vocab) == len(self._char_to_id) == len(self._id_to_char)
            ), f" [!] There are duplicate characters in the character set. {duplicates}"

    def char_to_id(self, char: str) -> int:
        try:
            return self._char_to_id[char]
        except KeyError as e:
            raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e

    def id_to_char(self, idx: int) -> str:
        return self._id_to_char[idx]

    def print_log(self, level: int = 0):
        """Print the vocabulary in a readable format."""
        indent = "\t" * level
        print(f"{indent}| > Characters: {self._characters}")
        print(f"{indent}| > Punctuations: {self._punctuations}")
        print(f"{indent}| > Pad: {self._pad}")
        print(f"{indent}| > EOS: {self._eos}")
        print(f"{indent}| > BOS: {self._bos}")
        print(f"{indent}| > Blank: {self._blank}")
        print(f"{indent}| > Vocab: {self.vocab}")
        print(f"{indent}| > Num chars: {self.num_chars}")

    @staticmethod
    def init_from_config(config: "Coqpit"):  # pylint: disable=unused-argument
        """Init your character class from a config.

        Implement this method for your subclass.
        """
        # use character set from config
        if config.characters is not None:
            return BaseCharacters(**config.characters), config
        # return default character set
        characters = BaseCharacters()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config

    def to_config(self) -> "CharactersConfig":
        return CharactersConfig(
            characters=self._characters,
            punctuations=self._punctuations,
            pad=self._pad,
            eos=self._eos,
            bos=self._bos,
            blank=self._blank,
            is_unique=self.is_unique,
            is_sorted=self.is_sorted,
        )


class IPAPhonemes(BaseCharacters):

    def __init__(
        self,
        characters: str = _phonemes,
        punctuations: str = _punctuations,
        pad: str = _pad,
        eos: str = _eos,
        bos: str = _bos,
        blank: str = _blank,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)

    @staticmethod
    def init_from_config(config: "Coqpit"):
        """Init an IPAPhonemes object from a model config.

        If characters are not defined in the config, they are set to the defaults and
        the config is updated.
        """
        # band-aid for compatibility with old models
        if "characters" in config and config.characters is not None:
            if "phonemes" in config.characters and config.characters.phonemes is not None:
                config.characters["characters"] = config.characters["phonemes"]
return (
|
| 573 |
+
IPAPhonemes(
|
| 574 |
+
characters=config.characters["characters"],
|
| 575 |
+
punctuations=config.characters["punctuations"],
|
| 576 |
+
pad=config.characters["pad"],
|
| 577 |
+
eos=config.characters["eos"],
|
| 578 |
+
bos=config.characters["bos"],
|
| 579 |
+
blank=config.characters["blank"],
|
| 580 |
+
is_unique=config.characters["is_unique"],
|
| 581 |
+
is_sorted=config.characters["is_sorted"],
|
| 582 |
+
),
|
| 583 |
+
config,
|
| 584 |
+
)
|
| 585 |
+
# use character set from config
|
| 586 |
+
if config.characters is not None:
|
| 587 |
+
return IPAPhonemes(**config.characters), config
|
| 588 |
+
# return default character set
|
| 589 |
+
characters = IPAPhonemes()
|
| 590 |
+
new_config = replace(config, characters=characters.to_config())
|
| 591 |
+
return characters, new_config
|
| 592 |
+
|
| 593 |
+
|
| 594 |
+
class Graphemes(BaseCharacters):
|
| 595 |
+
|
| 596 |
+
|
| 597 |
+
def __init__(
|
| 598 |
+
self,
|
| 599 |
+
characters: str = _characters,
|
| 600 |
+
punctuations: str = _punctuations,
|
| 601 |
+
pad: str = _pad,
|
| 602 |
+
eos: str = _eos,
|
| 603 |
+
bos: str = _bos,
|
| 604 |
+
blank: str = _blank,
|
| 605 |
+
is_unique: bool = False,
|
| 606 |
+
is_sorted: bool = True,
|
| 607 |
+
) -> None:
|
| 608 |
+
super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)
|
| 609 |
+
|
| 610 |
+
@staticmethod
|
| 611 |
+
def init_from_config(config: "Coqpit"):
|
| 612 |
+
"""Init a Graphemes object from a model config
|
| 613 |
+
|
| 614 |
+
If characters are not defined in the config, it will be set to the default characters and the config
|
| 615 |
+
will be updated.
|
| 616 |
+
"""
|
| 617 |
+
if config.characters is not None:
|
| 618 |
+
# band-aid for compatibility with old models
|
| 619 |
+
if "phonemes" in config.characters:
|
| 620 |
+
return (
|
| 621 |
+
Graphemes(
|
| 622 |
+
characters=config.characters["characters"],
|
| 623 |
+
punctuations=config.characters["punctuations"],
|
| 624 |
+
pad=config.characters["pad"],
|
| 625 |
+
eos=config.characters["eos"],
|
| 626 |
+
bos=config.characters["bos"],
|
| 627 |
+
blank=config.characters["blank"],
|
| 628 |
+
is_unique=config.characters["is_unique"],
|
| 629 |
+
is_sorted=config.characters["is_sorted"],
|
| 630 |
+
),
|
| 631 |
+
config,
|
| 632 |
+
)
|
| 633 |
+
return Graphemes(**config.characters), config
|
| 634 |
+
characters = Graphemes()
|
| 635 |
+
new_config = replace(config, characters=characters.to_config())
|
| 636 |
+
return characters, new_config
|
| 637 |
+
|
| 638 |
+
|
| 639 |
+
if __name__ == "__main__":
|
| 640 |
+
gr = Graphemes()
|
| 641 |
+
ph = IPAPhonemes()
|
| 642 |
+
gr.print_log()
|
| 643 |
+
ph.print_log()
|
| 644 |
+
|
| 645 |
+
|
| 646 |
+
class VitsCharacters(BaseCharacters):
|
| 647 |
+
"""Characters class for VITs model for compatibility with pre-trained models"""
|
| 648 |
+
|
| 649 |
+
def __init__(
|
| 650 |
+
self,
|
| 651 |
+
graphemes: str = _characters,
|
| 652 |
+
punctuations: str = _punctuations,
|
| 653 |
+
pad: str = _pad,
|
| 654 |
+
ipa_characters: str = _phonemes,
|
| 655 |
+
) -> None:
|
| 656 |
+
if ipa_characters is not None:
|
| 657 |
+
graphemes += ipa_characters
|
| 658 |
+
super().__init__(graphemes, punctuations, pad, None, None, "<BLNK>", is_unique=False, is_sorted=True)
|
| 659 |
+
|
| 660 |
+
def _create_vocab(self):
|
| 661 |
+
self._vocab = [self._pad] + list(self._punctuations) + list(self._characters) + [self._blank]
|
| 662 |
+
self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
|
| 663 |
+
# pylint: disable=unnecessary-comprehension
|
| 664 |
+
self._id_to_char = {idx: char for idx, char in enumerate(self.vocab)}
|
| 665 |
+
|
| 666 |
+
@staticmethod
|
| 667 |
+
def init_from_config(config):
|
| 668 |
+
_pad = config.characters.pad
|
| 669 |
+
_punctuations = config.characters.punctuations
|
| 670 |
+
_letters = config.characters.characters
|
| 671 |
+
_letters_ipa = config.characters.phonemes
|
| 672 |
+
return (
|
| 673 |
+
VitsCharacters(graphemes=_letters, ipa_characters=_letters_ipa, punctuations=_punctuations, pad=_pad),
|
| 674 |
+
config,
|
| 675 |
+
)
|
| 676 |
+
|
| 677 |
+
def to_config(self) -> "CharactersConfig":
|
| 678 |
+
return CharactersConfig(
|
| 679 |
+
characters=self._characters,
|
| 680 |
+
punctuations=self._punctuations,
|
| 681 |
+
pad=self._pad,
|
| 682 |
+
eos=None,
|
| 683 |
+
bos=None,
|
| 684 |
+
blank=self._blank,
|
| 685 |
+
is_unique=False,
|
| 686 |
+
is_sorted=True,
|
| 687 |
+
)
|
| 688 |
+
|
| 689 |
+
class TTSTokenizer:
|
| 690 |
+
def __init__(
|
| 691 |
+
self,
|
| 692 |
+
text_cleaner: Callable = None,
|
| 693 |
+
characters: "BaseCharacters" = None,
|
| 694 |
+
):
|
| 695 |
+
self.text_cleaner = text_cleaner
|
| 696 |
+
self.characters = characters
|
| 697 |
+
self.not_found_characters = []
|
| 698 |
+
|
| 699 |
+
@property
|
| 700 |
+
def characters(self):
|
| 701 |
+
return self._characters
|
| 702 |
+
|
| 703 |
+
@characters.setter
|
| 704 |
+
def characters(self, new_characters):
|
| 705 |
+
self._characters = new_characters
|
| 706 |
+
self.pad_id = self.characters.char_to_id(self.characters.pad) if self.characters.pad else None
|
| 707 |
+
self.blank_id = self.characters.char_to_id(self.characters.blank) if self.characters.blank else None
|
| 708 |
+
|
| 709 |
+
def encode(self, text: str) -> List[int]:
|
| 710 |
+
"""Encodes a string of text as a sequence of IDs."""
|
| 711 |
+
token_ids = []
|
| 712 |
+
for char in text:
|
| 713 |
+
try:
|
| 714 |
+
idx = self.characters.char_to_id(char)
|
| 715 |
+
token_ids.append(idx)
|
| 716 |
+
except KeyError:
|
| 717 |
+
# discard but store not found characters
|
| 718 |
+
if char not in self.not_found_characters:
|
| 719 |
+
self.not_found_characters.append(char)
|
| 720 |
+
print(text)
|
| 721 |
+
print(f" [!] Character {repr(char)} not found in the vocabulary. Discarding it.")
|
| 722 |
+
return token_ids
|
| 723 |
+
|
| 724 |
+
def text_to_ids(self, text: str, language: str = None) -> List[int]: # pylint: disable=unused-argument
|
| 725 |
+
text = self.text_cleaner(text)
|
| 726 |
+
text = self.encode(text)
|
| 727 |
+
text = self.intersperse_blank_char(text, True)
|
| 728 |
+
return text
|
| 729 |
+
|
| 730 |
+
def pad_with_bos_eos(self, char_sequence: List[str]):
|
| 731 |
+
"""Pads a sequence with the special BOS and EOS characters."""
|
| 732 |
+
return [self.characters.bos_id] + list(char_sequence) + [self.characters.eos_id]
|
| 733 |
+
|
| 734 |
+
def intersperse_blank_char(self, char_sequence: List[str], use_blank_char: bool = False):
|
| 735 |
+
"""Intersperses the blank character between characters in a sequence.
|
| 736 |
+
|
| 737 |
+
Use the ```blank``` character if defined else use the ```pad``` character.
|
| 738 |
+
"""
|
| 739 |
+
char_to_use = self.characters.blank_id if use_blank_char else self.characters.pad
|
| 740 |
+
result = [char_to_use] * (len(char_sequence) * 2 + 1)
|
| 741 |
+
result[1::2] = char_sequence
|
| 742 |
+
return result
|
| 743 |
+
|
| 744 |
+
@staticmethod
|
| 745 |
+
def init_from_config(config: "Coqpit", characters: "BaseCharacters" = None):
|
| 746 |
+
text_cleaner = multilingual_cleaners
|
| 747 |
+
CharactersClass = VitsCharacters
|
| 748 |
+
characters, new_config = CharactersClass.init_from_config(config)
|
| 749 |
+
# new_config.characters.characters_class = get_import_path(characters)
|
| 750 |
+
new_config.characters.characters_class = VitsCharacters
|
| 751 |
+
return (
|
| 752 |
+
TTSTokenizer(text_cleaner, characters),new_config)
|
| 753 |
+
|
| 754 |
+
|
| 755 |
+
def multilingual_cleaners(text):
|
| 756 |
+
"""Pipeline for multilingual text"""
|
| 757 |
+
text = lowercase(text)
|
| 758 |
+
text = replace_symbols(text, lang=None)
|
| 759 |
+
text = remove_aux_symbols(text)
|
| 760 |
+
text = collapse_whitespace(text)
|
| 761 |
+
return text
|
| 762 |
+
|
| 763 |
+
def lowercase(text):
|
| 764 |
+
return text.lower()
|
| 765 |
+
|
| 766 |
+
def collapse_whitespace(text):
|
| 767 |
+
return re.sub(_whitespace_re, " ", text).strip()
|
| 768 |
+
|
| 769 |
+
def replace_symbols(text, lang="en"):
|
| 770 |
+
|
| 771 |
+
text = text.replace(";", ",")
|
| 772 |
+
text = text.replace("-", " ") if lang != "ca" else text.replace("-", "")
|
| 773 |
+
text = text.replace(":", ",")
|
| 774 |
+
if lang == "en":
|
| 775 |
+
text = text.replace("&", " and ")
|
| 776 |
+
elif lang == "fr":
|
| 777 |
+
text = text.replace("&", " et ")
|
| 778 |
+
elif lang == "pt":
|
| 779 |
+
text = text.replace("&", " e ")
|
| 780 |
+
elif lang == "ca":
|
| 781 |
+
text = text.replace("&", " i ")
|
| 782 |
+
text = text.replace("'", "")
|
| 783 |
+
return text
|
| 784 |
+
|
| 785 |
+
def remove_aux_symbols(text):
|
| 786 |
+
text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)
|
| 787 |
+
return text
|
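The cleaning pipeline above is plain deterministic string rewriting, so its behavior is easy to verify in isolation. Below is a minimal, self-contained sketch of the same four steps; the helper names mirror the file's functions, but this is an illustration rather than the module itself (the language-specific `&` branches are trimmed to `en` for brevity):

```python
import re

_whitespace_re = re.compile(r"\s+")


def replace_symbols(text, lang="en"):
    # ';' and ':' become commas; '-' becomes a space (removed entirely for Catalan)
    text = text.replace(";", ",")
    text = text.replace("-", " ") if lang != "ca" else text.replace("-", "")
    text = text.replace(":", ",")
    if lang == "en":
        text = text.replace("&", " and ")
    return text


def multilingual_cleaners(text):
    # lowercase -> symbol replacement -> drop aux symbols -> collapse whitespace
    text = text.lower()
    text = replace_symbols(text, lang=None)
    text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)
    return re.sub(_whitespace_re, " ", text).strip()


print(multilingual_cleaners('  Hello;  "World" - (Test):  '))  # hello, world test,
```

Note that the cleaner can emit trailing punctuation (as above): stray commas survive, which is intentional since punctuation is part of the TTS vocabulary.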
models/hi_female/hi_female_vits_30hrs.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2bcfb47f599b36e7cbfec27142604c366e538c17e89980a40519291f92a46327
size 333261446
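The `jit_infer.py` scripts in this commit rebuild a `VitsCharacters` vocabulary from `chars.txt`, so every token ID fed to the exported model depends entirely on a fixed ordering convention: pad first, then punctuation, then letters, then the blank token last. That ordering can be sketched with a small illustrative helper (not part of the repo):

```python
def make_vits_vocab(pad, punctuations, characters, blank):
    # VitsCharacters._create_vocab order: [pad] + punctuation + letters + [blank]
    return [pad] + list(punctuations) + list(characters) + [blank]


vocab = make_vits_vocab("<PAD>", "!? ", "abc", "<BLNK>")
char_to_id = {char: idx for idx, char in enumerate(vocab)}
print(char_to_id["<PAD>"], char_to_id["a"], char_to_id["<BLNK>"])  # 0 4 7
```

Because the mapping is positional, the `chars.txt` shipped next to each checkpoint must match the one used at training time, or IDs silently shift.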
models/hi_female/jit_infer.py
ADDED
@@ -0,0 +1,32 @@
import os
from extra import TTSTokenizer, VitsConfig, CharactersConfig, VitsCharacters
import torch
import numpy as np

# hi female
with open("chars.txt", "r") as f:
    letters = f.read().strip("\n")
model = "hi_female_vits_30hrs.pt"
text = "फिल्म गर्दिश में अमरीश पुरी के साथ जैकी श्रॉफ, ऐश्वर्या, डिंपल कपाड़िया"

config = VitsConfig(
    text_cleaner="multilingual_cleaners",
    characters=CharactersConfig(
        characters_class=VitsCharacters,
        pad="<PAD>",
        eos="<EOS>",
        bos="<BOS>",
        blank="<BLNK>",
        characters=letters,
        punctuations="!¡'(),-.:;¿? ",
        phonemes=None,
    ),
)
tokenizer, config = TTSTokenizer.init_from_config(config)

x = tokenizer.text_to_ids(text)
x = torch.from_numpy(np.array(x)).unsqueeze(0)
net = torch.jit.load(model)
with torch.no_grad():
    out2 = net(x)
import soundfile as sf
sf.write("jit.wav", out2.squeeze().cpu().numpy(), 22050)
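In the script above, `tokenizer.text_to_ids` does more than a character lookup: after cleaning and encoding, it intersperses the blank-token ID between every character ID (the VITS `add_blank` convention), which is why the tensor handed to the model is roughly twice the text length. A standalone sketch of that step:

```python
def intersperse_blank(ids, blank_id):
    # [i1, i2, i3] -> [b, i1, b, i2, b, i3, b]
    result = [blank_id] * (len(ids) * 2 + 1)
    result[1::2] = ids
    return result


print(intersperse_blank([5, 6, 7], 0))  # [0, 5, 0, 6, 0, 7, 0]
```

An empty input still yields a single blank token, since the output length is `2 * n + 1`.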
models/hi_male/chars.txt
ADDED
@@ -0,0 +1 @@
शदऊतओसषमऱढै?ख़ौक़ड़ःिअनठय़ज़फ़्खँे।ंऋउ'हछङझ" ुणऔयघञृएईॆीपचॉॠवगडटइ,बॅूऐफजकलग़आधोथाभढ़ऑ
models/hi_male/extra.py
ADDED
@@ -0,0 +1,787 @@
from typing import Callable, Dict, List, Union
|
| 2 |
+
from dataclasses import asdict, dataclass, field
|
| 3 |
+
|
| 4 |
+
|
| 5 |
+
import re
|
| 6 |
+
from dataclasses import replace
|
| 7 |
+
from typing import Dict
|
| 8 |
+
_whitespace_re = re.compile(r"\s+")
|
| 9 |
+
|
| 10 |
+
from dataclasses import dataclass, field
|
| 11 |
+
from typing import List
|
| 12 |
+
|
| 13 |
+
# from TTS.tts.configs.shared_configs import BaseTTSConfig
|
| 14 |
+
# from TTS.tts.models.vits import VitsArgs, VitsAudioConfig
|
| 15 |
+
|
| 16 |
+
@dataclass
|
| 17 |
+
class CharactersConfig():
|
| 18 |
+
|
| 19 |
+
characters_class: str = None
|
| 20 |
+
|
| 21 |
+
# using BaseVocabulary
|
| 22 |
+
vocab_dict: Dict = None
|
| 23 |
+
|
| 24 |
+
# using on BaseCharacters
|
| 25 |
+
pad: str = None
|
| 26 |
+
eos: str = None
|
| 27 |
+
bos: str = None
|
| 28 |
+
blank: str = None
|
| 29 |
+
characters: str = None
|
| 30 |
+
punctuations: str = None
|
| 31 |
+
phonemes: str = None
|
| 32 |
+
is_unique: bool = True # for backwards compatibility of models trained with char sets with duplicates
|
| 33 |
+
is_sorted: bool = True
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
@dataclass
|
| 37 |
+
class BaseTTSConfig():
|
| 38 |
+
|
| 39 |
+
# audio: BaseAudioConfig = field(default_factory=BaseAudioConfig)
|
| 40 |
+
# phoneme settings
|
| 41 |
+
use_phonemes: bool = False
|
| 42 |
+
phonemizer: str = None
|
| 43 |
+
phoneme_language: str = None
|
| 44 |
+
compute_input_seq_cache: bool = False
|
| 45 |
+
text_cleaner: str = None
|
| 46 |
+
enable_eos_bos_chars: bool = False
|
| 47 |
+
test_sentences_file: str = ""
|
| 48 |
+
phoneme_cache_path: str = None
|
| 49 |
+
# vocabulary parameters
|
| 50 |
+
characters: CharactersConfig = None
|
| 51 |
+
add_blank: bool = False
|
| 52 |
+
# training params
|
| 53 |
+
batch_group_size: int = 0
|
| 54 |
+
loss_masking: bool = None
|
| 55 |
+
# dataloading
|
| 56 |
+
min_audio_len: int = 1
|
| 57 |
+
max_audio_len: int = float("inf")
|
| 58 |
+
min_text_len: int = 1
|
| 59 |
+
max_text_len: int = float("inf")
|
| 60 |
+
compute_f0: bool = False
|
| 61 |
+
compute_energy: bool = False
|
| 62 |
+
compute_linear_spec: bool = False
|
| 63 |
+
precompute_num_workers: int = 0
|
| 64 |
+
use_noise_augment: bool = False
|
| 65 |
+
start_by_longest: bool = False
|
| 66 |
+
shuffle: bool = False
|
| 67 |
+
drop_last: bool = False
|
| 68 |
+
# dataset
|
| 69 |
+
datasets: str = None
|
| 70 |
+
# optimizer
|
| 71 |
+
optimizer: str = "radam"
|
| 72 |
+
optimizer_params: dict = None
|
| 73 |
+
# scheduler
|
| 74 |
+
lr_scheduler: str = None
|
| 75 |
+
lr_scheduler_params: dict = field(default_factory=lambda: {})
|
| 76 |
+
# testing
|
| 77 |
+
test_sentences: List[str] = field(default_factory=lambda: [])
|
| 78 |
+
# evaluation
|
| 79 |
+
eval_split_max_size: int = None
|
| 80 |
+
eval_split_size: float = 0.01
|
| 81 |
+
# weighted samplers
|
| 82 |
+
use_speaker_weighted_sampler: bool = False
|
| 83 |
+
speaker_weighted_sampler_alpha: float = 1.0
|
| 84 |
+
use_language_weighted_sampler: bool = False
|
| 85 |
+
language_weighted_sampler_alpha: float = 1.0
|
| 86 |
+
use_length_weighted_sampler: bool = False
|
| 87 |
+
length_weighted_sampler_alpha: float = 1.0
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
@dataclass
|
| 91 |
+
class VitsAudioConfig():
|
| 92 |
+
fft_size: int = 1024
|
| 93 |
+
sample_rate: int = 22050
|
| 94 |
+
win_length: int = 1024
|
| 95 |
+
hop_length: int = 256
|
| 96 |
+
num_mels: int = 80
|
| 97 |
+
mel_fmin: int = 0
|
| 98 |
+
mel_fmax: int = None
|
| 99 |
+
|
| 100 |
+
@dataclass
|
| 101 |
+
class VitsArgs():
|
| 102 |
+
num_chars: int = 100
|
| 103 |
+
out_channels: int = 513
|
| 104 |
+
spec_segment_size: int = 32
|
| 105 |
+
hidden_channels: int = 192
|
| 106 |
+
hidden_channels_ffn_text_encoder: int = 768
|
| 107 |
+
num_heads_text_encoder: int = 2
|
| 108 |
+
num_layers_text_encoder: int = 6
|
| 109 |
+
kernel_size_text_encoder: int = 3
|
| 110 |
+
dropout_p_text_encoder: float = 0.1
|
| 111 |
+
dropout_p_duration_predictor: float = 0.5
|
| 112 |
+
kernel_size_posterior_encoder: int = 5
|
| 113 |
+
dilation_rate_posterior_encoder: int = 1
|
| 114 |
+
num_layers_posterior_encoder: int = 16
|
| 115 |
+
kernel_size_flow: int = 5
|
| 116 |
+
dilation_rate_flow: int = 1
|
| 117 |
+
num_layers_flow: int = 4
|
| 118 |
+
resblock_type_decoder: str = "1"
|
| 119 |
+
resblock_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [3, 7, 11])
|
| 120 |
+
resblock_dilation_sizes_decoder: List[List[int]] = field(default_factory=lambda: [[1, 3, 5], [1, 3, 5], [1, 3, 5]])
|
| 121 |
+
upsample_rates_decoder: List[int] = field(default_factory=lambda: [8, 8, 2, 2])
|
| 122 |
+
upsample_initial_channel_decoder: int = 512
|
| 123 |
+
upsample_kernel_sizes_decoder: List[int] = field(default_factory=lambda: [16, 16, 4, 4])
|
| 124 |
+
periods_multi_period_discriminator: List[int] = field(default_factory=lambda: [2, 3, 5, 7, 11])
|
| 125 |
+
use_sdp: bool = True
|
| 126 |
+
noise_scale: float = 1.0
|
| 127 |
+
inference_noise_scale: float = 0.667
|
| 128 |
+
length_scale: float = 1
|
| 129 |
+
noise_scale_dp: float = 1.0
|
| 130 |
+
inference_noise_scale_dp: float = 1.0
|
| 131 |
+
max_inference_len: int = None
|
| 132 |
+
init_discriminator: bool = True
|
| 133 |
+
use_spectral_norm_disriminator: bool = False
|
| 134 |
+
use_speaker_embedding: bool = False
|
| 135 |
+
num_speakers: int = 0
|
| 136 |
+
speakers_file: str = None
|
| 137 |
+
d_vector_file: List[str] = None
|
| 138 |
+
speaker_embedding_channels: int = 256
|
| 139 |
+
use_d_vector_file: bool = False
|
| 140 |
+
d_vector_dim: int = 0
|
| 141 |
+
detach_dp_input: bool = True
|
| 142 |
+
use_language_embedding: bool = False
|
| 143 |
+
embedded_language_dim: int = 4
|
| 144 |
+
num_languages: int = 0
|
| 145 |
+
language_ids_file: str = None
|
| 146 |
+
use_speaker_encoder_as_loss: bool = False
|
| 147 |
+
speaker_encoder_config_path: str = ""
|
| 148 |
+
speaker_encoder_model_path: str = ""
|
| 149 |
+
condition_dp_on_speaker: bool = True
|
| 150 |
+
freeze_encoder: bool = False
|
| 151 |
+
freeze_DP: bool = False
|
| 152 |
+
freeze_PE: bool = False
|
| 153 |
+
freeze_flow_decoder: bool = False
|
| 154 |
+
freeze_waveform_decoder: bool = False
|
| 155 |
+
encoder_sample_rate: int = None
|
| 156 |
+
interpolate_z: bool = True
|
| 157 |
+
reinit_DP: bool = False
|
| 158 |
+
reinit_text_encoder: bool = False
|
| 159 |
+
@dataclass
|
| 160 |
+
class VitsConfig(BaseTTSConfig):
|
| 161 |
+
|
| 162 |
+
model: str = "vits"
|
| 163 |
+
# model specific params
|
| 164 |
+
model_args: VitsArgs = field(default_factory=VitsArgs)
|
| 165 |
+
audio: VitsAudioConfig = field(default_factory=VitsAudioConfig)
|
| 166 |
+
|
| 167 |
+
# optimizer
|
| 168 |
+
grad_clip: List[float] = field(default_factory=lambda: [1000, 1000])
|
| 169 |
+
lr_gen: float = 0.0002
|
| 170 |
+
lr_disc: float = 0.0002
|
| 171 |
+
lr_scheduler_gen: str = "ExponentialLR"
|
| 172 |
+
lr_scheduler_gen_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
|
| 173 |
+
lr_scheduler_disc: str = "ExponentialLR"
|
| 174 |
+
lr_scheduler_disc_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
|
| 175 |
+
scheduler_after_epoch: bool = True
|
| 176 |
+
optimizer: str = "AdamW"
|
| 177 |
+
optimizer_params: dict = field(default_factory=lambda: {"betas": [0.8, 0.99], "eps": 1e-9, "weight_decay": 0.01})
|
| 178 |
+
|
| 179 |
+
# loss params
|
| 180 |
+
kl_loss_alpha: float = 1.0
|
| 181 |
+
disc_loss_alpha: float = 1.0
|
| 182 |
+
gen_loss_alpha: float = 1.0
|
| 183 |
+
feat_loss_alpha: float = 1.0
|
| 184 |
+
mel_loss_alpha: float = 45.0
|
| 185 |
+
dur_loss_alpha: float = 1.0
|
| 186 |
+
speaker_encoder_loss_alpha: float = 1.0
|
| 187 |
+
|
| 188 |
+
# data loader params
|
| 189 |
+
return_wav: bool = True
|
| 190 |
+
compute_linear_spec: bool = True
|
| 191 |
+
|
| 192 |
+
# sampler params
|
| 193 |
+
use_weighted_sampler: bool = False # TODO: move it to the base config
|
| 194 |
+
weighted_sampler_attrs: dict = field(default_factory=lambda: {})
|
| 195 |
+
weighted_sampler_multipliers: dict = field(default_factory=lambda: {})
|
| 196 |
+
|
| 197 |
+
# overrides
|
| 198 |
+
r: int = 1 # DO NOT CHANGE
|
| 199 |
+
add_blank: bool = True
|
| 200 |
+
|
| 201 |
+
# testing
|
| 202 |
+
test_sentences: List[List] = field(
|
| 203 |
+
default_factory=lambda: [
|
| 204 |
+
["It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent."],
|
| 205 |
+
["Be a voice, not an echo."],
|
| 206 |
+
["I'm sorry Dave. I'm afraid I can't do that."],
|
| 207 |
+
["This cake is great. It's so delicious and moist."],
|
| 208 |
+
["Prior to November 22, 1963."],
|
| 209 |
+
]
|
| 210 |
+
)
|
| 211 |
+
|
| 212 |
+
# multi-speaker settings
|
| 213 |
+
# use speaker embedding layer
|
| 214 |
+
num_speakers: int = 0
|
| 215 |
+
use_speaker_embedding: bool = False
|
| 216 |
+
speakers_file: str = None
|
| 217 |
+
speaker_embedding_channels: int = 256
|
| 218 |
+
language_ids_file: str = None
|
| 219 |
+
use_language_embedding: bool = False
|
| 220 |
+
|
| 221 |
+
# use d-vectors
|
| 222 |
+
use_d_vector_file: bool = False
|
| 223 |
+
d_vector_file: List[str] = None
|
| 224 |
+
d_vector_dim: int = None
|
| 225 |
+
|
| 226 |
+
def __post_init__(self):
|
| 227 |
+
pass
|
| 228 |
+
# for key, val in self.model_args.items():
|
| 229 |
+
# if hasattr(self, key):
|
| 230 |
+
# self[key] = val
|
| 231 |
+
|
| 232 |
+
|
| 233 |
+
|
| 234 |
+
|
| 235 |
+
|
| 236 |
+
def parse_symbols():
|
| 237 |
+
return {
|
| 238 |
+
"pad": _pad,
|
| 239 |
+
"eos": _eos,
|
| 240 |
+
"bos": _bos,
|
| 241 |
+
"characters": _characters,
|
| 242 |
+
"punctuations": _punctuations,
|
| 243 |
+
"phonemes": _phonemes,
|
| 244 |
+
}
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
# DEFAULT SET OF GRAPHEMES
|
| 248 |
+
_pad = "<PAD>"
|
| 249 |
+
_eos = "<EOS>"
|
| 250 |
+
_bos = "<BOS>"
|
| 251 |
+
_blank = "<BLNK>" # TODO: check if we need this alongside with PAD
|
| 252 |
+
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
|
| 253 |
+
_punctuations = "!'(),-.:;? "
|
| 254 |
+
|
| 255 |
+
|
| 256 |
+
# DEFAULT SET OF IPA PHONEMES
|
| 257 |
+
# Phonemes definition (All IPA characters)
|
| 258 |
+
_vowels = "iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ"
|
| 259 |
+
_non_pulmonic_consonants = "ʘɓǀɗǃʄǂɠǁʛ"
|
| 260 |
+
_pulmonic_consonants = "pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ"
|
| 261 |
+
_suprasegmentals = "ˈˌːˑ"
|
| 262 |
+
_other_symbols = "ʍwɥʜʢʡɕʑɺɧʲ"
|
| 263 |
+
_diacrilics = "ɚ˞ɫ"
|
| 264 |
+
_phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics
|
| 265 |
+
|
| 266 |
+
|
| 267 |
+
class BaseVocabulary:
|
| 268 |
+
"""Base Vocabulary class.
|
| 269 |
+
|
| 270 |
+
This class only needs a vocabulary dictionary without specifying the characters.
|
| 271 |
+
|
| 272 |
+
Args:
|
| 273 |
+
vocab (Dict): A dictionary of characters and their corresponding indices.
|
| 274 |
+
"""
|
| 275 |
+
|
| 276 |
+
def __init__(self, vocab: Dict, pad: str = None, blank: str = None, bos: str = None, eos: str = None):
|
| 277 |
+
self.vocab = vocab
|
| 278 |
+
self.pad = pad
|
| 279 |
+
self.blank = blank
|
| 280 |
+
self.bos = bos
|
| 281 |
+
self.eos = eos
|
| 282 |
+
|
| 283 |
+
@property
|
| 284 |
+
def pad_id(self) -> int:
|
| 285 |
+
"""Return the index of the padding character. If the padding character is not specified, return the length
|
| 286 |
+
of the vocabulary."""
|
| 287 |
+
return self.char_to_id(self.pad) if self.pad else len(self.vocab)
|
| 288 |
+
|
| 289 |
+
@property
|
| 290 |
+
def blank_id(self) -> int:
|
| 291 |
+
"""Return the index of the blank character. If the blank character is not specified, return the length of
|
| 292 |
+
the vocabulary."""
|
| 293 |
+
return self.char_to_id(self.blank) if self.blank else len(self.vocab)
|
| 294 |
+
|
| 295 |
+
@property
|
| 296 |
+
def bos_id(self) -> int:
|
| 297 |
+
"""Return the index of the bos character. If the bos character is not specified, return the length of the
|
| 298 |
+
vocabulary."""
|
| 299 |
+
return self.char_to_id(self.bos) if self.bos else len(self.vocab)
|
| 300 |
+
|
| 301 |
+
@property
|
| 302 |
+
def eos_id(self) -> int:
|
| 303 |
+
"""Return the index of the eos character. If the eos character is not specified, return the length of the
|
| 304 |
+
vocabulary."""
|
| 305 |
+
return self.char_to_id(self.eos) if self.eos else len(self.vocab)
|
| 306 |
+
|
| 307 |
+
@property
|
| 308 |
+
def vocab(self):
|
| 309 |
+
"""Return the vocabulary dictionary."""
|
| 310 |
+
return self._vocab
|
| 311 |
+
|
| 312 |
+
@vocab.setter
|
| 313 |
+
def vocab(self, vocab):
|
| 314 |
+
"""Set the vocabulary dictionary and character mapping dictionaries."""
|
| 315 |
+
self._vocab, self._char_to_id, self._id_to_char = None, None, None
|
| 316 |
+
if vocab is not None:
|
| 317 |
+
self._vocab = vocab
|
| 318 |
+
self._char_to_id = {char: idx for idx, char in enumerate(self._vocab)}
|
| 319 |
+
self._id_to_char = {
|
| 320 |
+
idx: char for idx, char in enumerate(self._vocab) # pylint: disable=unnecessary-comprehension
|
| 321 |
+
}
|
| 322 |
+
|
| 323 |
+
@staticmethod
|
| 324 |
+
def init_from_config(config, **kwargs):
|
| 325 |
+
"""Initialize from the given config."""
|
| 326 |
+
if config.characters is not None and "vocab_dict" in config.characters and config.characters.vocab_dict:
|
| 327 |
+
return (
|
| 328 |
+
BaseVocabulary(
|
| 329 |
+
config.characters.vocab_dict,
|
| 330 |
+
config.characters.pad,
|
| 331 |
+
config.characters.blank,
|
| 332 |
+
config.characters.bos,
|
| 333 |
+
config.characters.eos,
|
| 334 |
+
),
|
| 335 |
+
config,
|
| 336 |
+
)
|
| 337 |
+
return BaseVocabulary(**kwargs), config
|
| 338 |
+
|
| 339 |
+
def to_config(self):
|
| 340 |
+
return CharactersConfig(
|
| 341 |
+
vocab_dict=self._vocab,
|
| 342 |
+
pad=self.pad,
|
| 343 |
+
eos=self.eos,
|
| 344 |
+
bos=self.bos,
|
| 345 |
+
blank=self.blank,
|
| 346 |
+
is_unique=False,
|
| 347 |
+
is_sorted=False,
|
| 348 |
+
)
|
| 349 |
+
|
| 350 |
+
@property
|
| 351 |
+
def num_chars(self):
|
| 352 |
+
"""Return number of tokens in the vocabulary."""
|
| 353 |
+
return len(self._vocab)
|
| 354 |
+
|
| 355 |
+
def char_to_id(self, char: str) -> int:
|
| 356 |
+
"""Map a character to an token ID."""
|
| 357 |
+
try:
|
| 358 |
+
return self._char_to_id[char]
|
| 359 |
+
except KeyError as e:
|
| 360 |
+
raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e
|
| 361 |
+
|
| 362 |
+
def id_to_char(self, idx: int) -> str:
|
| 363 |
+
"""Map an token ID to a character."""
|
| 364 |
+
return self._id_to_char[idx]
|
| 365 |
+
|
| 366 |
+
|
| 367 |
+
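The vocabulary setter above builds two mirrored lookup tables from one character list. A minimal standalone sketch of that round-trip (the vocabulary values are illustrative, not taken from the module):

```python
# Hypothetical vocabulary; the mapping pattern mirrors the setter above.
vocab = ["<PAD>", "a", "b", "c", "<BLNK>"]
char_to_id = {char: idx for idx, char in enumerate(vocab)}
id_to_char = {idx: char for idx, char in enumerate(vocab)}

# Round-trip: every character maps to a unique ID and back.
assert all(id_to_char[char_to_id[c]] == c for c in vocab)
assert char_to_id["a"] == 1
```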
class BaseCharacters:
    def __init__(
        self,
        characters: str = None,
        punctuations: str = None,
        pad: str = None,
        eos: str = None,
        bos: str = None,
        blank: str = None,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        self._characters = characters
        self._punctuations = punctuations
        self._pad = pad
        self._eos = eos
        self._bos = bos
        self._blank = blank
        self.is_unique = is_unique
        self.is_sorted = is_sorted
        self._create_vocab()

    @property
    def pad_id(self) -> int:
        return self.char_to_id(self.pad) if self.pad else len(self.vocab)

    @property
    def blank_id(self) -> int:
        return self.char_to_id(self.blank) if self.blank else len(self.vocab)

    @property
    def eos_id(self) -> int:
        return self.char_to_id(self.eos) if self.eos else len(self.vocab)

    @property
    def bos_id(self) -> int:
        return self.char_to_id(self.bos) if self.bos else len(self.vocab)

    @property
    def characters(self):
        return self._characters

    @characters.setter
    def characters(self, characters):
        self._characters = characters
        self._create_vocab()

    @property
    def punctuations(self):
        return self._punctuations

    @punctuations.setter
    def punctuations(self, punctuations):
        self._punctuations = punctuations
        self._create_vocab()

    @property
    def pad(self):
        return self._pad

    @pad.setter
    def pad(self, pad):
        self._pad = pad
        self._create_vocab()

    @property
    def eos(self):
        return self._eos

    @eos.setter
    def eos(self, eos):
        self._eos = eos
        self._create_vocab()

    @property
    def bos(self):
        return self._bos

    @bos.setter
    def bos(self, bos):
        self._bos = bos
        self._create_vocab()

    @property
    def blank(self):
        return self._blank

    @blank.setter
    def blank(self, blank):
        self._blank = blank
        self._create_vocab()

    @property
    def vocab(self):
        return self._vocab

    @vocab.setter
    def vocab(self, vocab):
        self._vocab = vocab
        self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
        self._id_to_char = {
            idx: char for idx, char in enumerate(self.vocab)  # pylint: disable=unnecessary-comprehension
        }

    @property
    def num_chars(self):
        return len(self._vocab)

    def _create_vocab(self):
        _vocab = self._characters
        if self.is_unique:
            _vocab = list(set(_vocab))
        if self.is_sorted:
            _vocab = sorted(_vocab)
        _vocab = list(_vocab)
        _vocab = [self._blank] + _vocab if self._blank is not None and len(self._blank) > 0 else _vocab
        _vocab = [self._bos] + _vocab if self._bos is not None and len(self._bos) > 0 else _vocab
        _vocab = [self._eos] + _vocab if self._eos is not None and len(self._eos) > 0 else _vocab
        _vocab = [self._pad] + _vocab if self._pad is not None and len(self._pad) > 0 else _vocab
        self.vocab = _vocab + list(self._punctuations)
        if self.is_unique:
            duplicates = {x for x in self.vocab if self.vocab.count(x) > 1}
            assert (
                len(self.vocab) == len(self._char_to_id) == len(self._id_to_char)
            ), f" [!] There are duplicate characters in the character set. {duplicates}"

    def char_to_id(self, char: str) -> int:
        try:
            return self._char_to_id[char]
        except KeyError as e:
            raise KeyError(f" [!] {repr(char)} is not in the vocabulary.") from e

    def id_to_char(self, idx: int) -> str:
        return self._id_to_char[idx]

    def print_log(self, level: int = 0):
        """Print the vocabulary in a readable format."""
        indent = "\t" * level
        print(f"{indent}| > Characters: {self._characters}")
        print(f"{indent}| > Punctuations: {self._punctuations}")
        print(f"{indent}| > Pad: {self._pad}")
        print(f"{indent}| > EOS: {self._eos}")
        print(f"{indent}| > BOS: {self._bos}")
        print(f"{indent}| > Blank: {self._blank}")
        print(f"{indent}| > Vocab: {self.vocab}")
        print(f"{indent}| > Num chars: {self.num_chars}")

    @staticmethod
    def init_from_config(config: "Coqpit"):  # pylint: disable=unused-argument
        """Init your character class from a config.

        Implement this method for your subclass.
        """
        # use character set from config
        if config.characters is not None:
            return BaseCharacters(**config.characters), config
        # return default character set
        characters = BaseCharacters()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config

    def to_config(self) -> "CharactersConfig":
        return CharactersConfig(
            characters=self._characters,
            punctuations=self._punctuations,
            pad=self._pad,
            eos=self._eos,
            bos=self._bos,
            blank=self._blank,
            is_unique=self.is_unique,
            is_sorted=self.is_sorted,
        )


class IPAPhonemes(BaseCharacters):
    def __init__(
        self,
        characters: str = _phonemes,
        punctuations: str = _punctuations,
        pad: str = _pad,
        eos: str = _eos,
        bos: str = _bos,
        blank: str = _blank,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)

    @staticmethod
    def init_from_config(config: "Coqpit"):
        """Init an IPAPhonemes object from a model config.

        If characters are not defined in the config, the default characters are used and the
        config is updated.
        """
        # band-aid for compatibility with old models
        if "characters" in config and config.characters is not None:
            if "phonemes" in config.characters and config.characters.phonemes is not None:
                config.characters["characters"] = config.characters["phonemes"]
            return (
                IPAPhonemes(
                    characters=config.characters["characters"],
                    punctuations=config.characters["punctuations"],
                    pad=config.characters["pad"],
                    eos=config.characters["eos"],
                    bos=config.characters["bos"],
                    blank=config.characters["blank"],
                    is_unique=config.characters["is_unique"],
                    is_sorted=config.characters["is_sorted"],
                ),
                config,
            )
        # use character set from config
        if config.characters is not None:
            return IPAPhonemes(**config.characters), config
        # return default character set
        characters = IPAPhonemes()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config


class Graphemes(BaseCharacters):
    def __init__(
        self,
        characters: str = _characters,
        punctuations: str = _punctuations,
        pad: str = _pad,
        eos: str = _eos,
        bos: str = _bos,
        blank: str = _blank,
        is_unique: bool = False,
        is_sorted: bool = True,
    ) -> None:
        super().__init__(characters, punctuations, pad, eos, bos, blank, is_unique, is_sorted)

    @staticmethod
    def init_from_config(config: "Coqpit"):
        """Init a Graphemes object from a model config.

        If characters are not defined in the config, the default characters are used and the
        config is updated.
        """
        if config.characters is not None:
            # band-aid for compatibility with old models
            if "phonemes" in config.characters:
                return (
                    Graphemes(
                        characters=config.characters["characters"],
                        punctuations=config.characters["punctuations"],
                        pad=config.characters["pad"],
                        eos=config.characters["eos"],
                        bos=config.characters["bos"],
                        blank=config.characters["blank"],
                        is_unique=config.characters["is_unique"],
                        is_sorted=config.characters["is_sorted"],
                    ),
                    config,
                )
            return Graphemes(**config.characters), config
        # return default character set
        characters = Graphemes()
        new_config = replace(config, characters=characters.to_config())
        return characters, new_config


if __name__ == "__main__":
|
| 640 |
+
gr = Graphemes()
|
| 641 |
+
ph = IPAPhonemes()
|
| 642 |
+
gr.print_log()
|
| 643 |
+
ph.print_log()
|
| 644 |
+
|
| 645 |
+
|
| 646 |
+
class VitsCharacters(BaseCharacters):
    """Characters class for the VITS model, kept for compatibility with pre-trained models."""

    def __init__(
        self,
        graphemes: str = _characters,
        punctuations: str = _punctuations,
        pad: str = _pad,
        ipa_characters: str = _phonemes,
    ) -> None:
        if ipa_characters is not None:
            graphemes += ipa_characters
        super().__init__(graphemes, punctuations, pad, None, None, "<BLNK>", is_unique=False, is_sorted=True)

    def _create_vocab(self):
        self._vocab = [self._pad] + list(self._punctuations) + list(self._characters) + [self._blank]
        self._char_to_id = {char: idx for idx, char in enumerate(self.vocab)}
        # pylint: disable=unnecessary-comprehension
        self._id_to_char = {idx: char for idx, char in enumerate(self.vocab)}

    @staticmethod
    def init_from_config(config):
        _pad = config.characters.pad
        _punctuations = config.characters.punctuations
        _letters = config.characters.characters
        _letters_ipa = config.characters.phonemes
        return (
            VitsCharacters(graphemes=_letters, ipa_characters=_letters_ipa, punctuations=_punctuations, pad=_pad),
            config,
        )

    def to_config(self) -> "CharactersConfig":
        return CharactersConfig(
            characters=self._characters,
            punctuations=self._punctuations,
            pad=self._pad,
            eos=None,
            bos=None,
            blank=self._blank,
            is_unique=False,
            is_sorted=True,
        )

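`VitsCharacters._create_vocab` fixes the token order expected by pre-trained VITS checkpoints: pad first, then punctuation, then graphemes, then the blank token last. A minimal sketch of that layout with made-up symbol sets (the values below are illustrative, not the model's real character set):

```python
# Hypothetical symbols; only the ordering mirrors VitsCharacters._create_vocab.
pad, blank = "<PAD>", "<BLNK>"
punctuations = "!,.? "
graphemes = "abc"

vocab = [pad] + list(punctuations) + list(graphemes) + [blank]
char_to_id = {char: idx for idx, char in enumerate(vocab)}

assert char_to_id[pad] == 0                  # pad is always index 0
assert char_to_id[blank] == len(vocab) - 1   # blank is always the last index
```

Because the IDs are positional, any change to this ordering silently breaks compatibility with checkpoints trained against it.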
class TTSTokenizer:
    def __init__(
        self,
        text_cleaner: Callable = None,
        characters: "BaseCharacters" = None,
    ):
        self.text_cleaner = text_cleaner
        self.characters = characters
        self.not_found_characters = []

    @property
    def characters(self):
        return self._characters

    @characters.setter
    def characters(self, new_characters):
        self._characters = new_characters
        self.pad_id = self.characters.char_to_id(self.characters.pad) if self.characters.pad else None
        self.blank_id = self.characters.char_to_id(self.characters.blank) if self.characters.blank else None

    def encode(self, text: str) -> List[int]:
        """Encode a string of text as a sequence of IDs."""
        token_ids = []
        for char in text:
            try:
                idx = self.characters.char_to_id(char)
                token_ids.append(idx)
            except KeyError:
                # discard but store not-found characters
                if char not in self.not_found_characters:
                    self.not_found_characters.append(char)
                    print(text)
                    print(f" [!] Character {repr(char)} not found in the vocabulary. Discarding it.")
        return token_ids

    def text_to_ids(self, text: str, language: str = None) -> List[int]:  # pylint: disable=unused-argument
        text = self.text_cleaner(text)
        text = self.encode(text)
        text = self.intersperse_blank_char(text, True)
        return text

    def pad_with_bos_eos(self, char_sequence: List[str]):
        """Pad a sequence with the special BOS and EOS characters."""
        return [self.characters.bos_id] + list(char_sequence) + [self.characters.eos_id]

    def intersperse_blank_char(self, char_sequence: List[str], use_blank_char: bool = False):
        """Intersperse the blank character between characters in a sequence.

        Uses the ``blank`` character if defined, else the ``pad`` character.
        """
        char_to_use = self.characters.blank_id if use_blank_char else self.characters.pad
        result = [char_to_use] * (len(char_sequence) * 2 + 1)
        result[1::2] = char_sequence
        return result

    @staticmethod
    def init_from_config(config: "Coqpit", characters: "BaseCharacters" = None):
        text_cleaner = multilingual_cleaners
        CharactersClass = VitsCharacters
        characters, new_config = CharactersClass.init_from_config(config)
        # new_config.characters.characters_class = get_import_path(characters)
        new_config.characters.characters_class = VitsCharacters
        return (
            TTSTokenizer(text_cleaner, characters),
            new_config,
        )

def multilingual_cleaners(text):
    """Pipeline for multilingual text."""
    text = lowercase(text)
    text = replace_symbols(text, lang=None)
    text = remove_aux_symbols(text)
    text = collapse_whitespace(text)
    return text


def lowercase(text):
    return text.lower()


def collapse_whitespace(text):
    return re.sub(_whitespace_re, " ", text).strip()


def replace_symbols(text, lang="en"):
    text = text.replace(";", ",")
    text = text.replace("-", " ") if lang != "ca" else text.replace("-", "")
    text = text.replace(":", ",")
    if lang == "en":
        text = text.replace("&", " and ")
    elif lang == "fr":
        text = text.replace("&", " et ")
    elif lang == "pt":
        text = text.replace("&", " e ")
    elif lang == "ca":
        text = text.replace("&", " i ")
    text = text.replace("'", "")
    return text


def remove_aux_symbols(text):
    text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)
    return text
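Chained together, the cleaner functions lowercase the text, rewrite a few punctuation symbols, strip bracket-like characters, and collapse whitespace. A condensed, self-contained sketch of that pipeline for the `lang=None` path used by `multilingual_cleaners` (so no `&` substitution is applied):

```python
import re

_whitespace_re = re.compile(r"\s+")

def clean(text):
    text = text.lower()
    # lang=None path: ; and : become commas, - becomes space, ' is dropped
    text = text.replace(";", ",").replace(":", ",").replace("-", " ").replace("'", "")
    text = re.sub(r"[\<\>\(\)\[\]\"]+", "", text)   # drop aux symbols
    return _whitespace_re.sub(" ", text).strip()    # collapse whitespace

assert clean('Hello;  "World" - don\'t') == "hello, world dont"
```

Note that apostrophes are removed outright, so contractions lose their mark before tokenization.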
models/hi_male/hi_male_vits_30hrs.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eb36eca2d90214662f1647e83eb6979ead93b72f269606c6411f52959acf77a8
size 333256012
models/hi_male/jit_infer.py
ADDED
@@ -0,0 +1,32 @@
import os
from extra import TTSTokenizer, VitsConfig, CharactersConfig, VitsCharacters
import torch
import numpy as np

# hi male
with open("chars.txt", "r") as f:
    letters = f.read().strip("\n")
model = "hi_male_vits_30hrs.pt"
text = "फिल्म गर्दिश में अमरीश पुरी के साथ जैकी श्रॉफ, ऐश्वर्या, डिंपल कपाड़िया"

config = VitsConfig(
    text_cleaner="multilingual_cleaners",
    characters=CharactersConfig(
        characters_class=VitsCharacters,
        pad="<PAD>",
        eos="<EOS>",
        bos="<BOS>",
        blank="<BLNK>",
        characters=letters,
        punctuations="!¡'(),-.:;¿? ",
        phonemes=None,
    ),
)
tokenizer, config = TTSTokenizer.init_from_config(config)

x = tokenizer.text_to_ids(text)
x = torch.from_numpy(np.array(x)).unsqueeze(0)
net = torch.jit.load(model)
with torch.no_grad():
    out2 = net(x)
import soundfile as sf
sf.write("jit.wav", out2.squeeze().cpu().numpy(), 22050)
models/hne_female/.gitattributes
ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
models/hne_female/README.md
ADDED
@@ -0,0 +1,3 @@
---
license: cc-by-4.0
---
models/hne_female/ch_female_vits_30hrs.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3393916262f03807d8338aa8dce79379582c71a0ada346457e36ea6f72a6635
size 333255366
models/hne_female/chars.txt
ADDED
@@ -0,0 +1 @@
खछगचऊुलशौढ़इणज़झैठढजफ़औ्ड़फूेानटॅयव़ऋदप.थअँऑआघहतषरसभउञडएईऐक़ िओ?धी,ॉंख़कोबमृ