Upload folder using huggingface_hub

- README.md +158 -0
- ssrn.pth +3 -0
- t2m_step-102000_first.pth +3 -0

README.md CHANGED
@@ -1,3 +1,161 @@
---
license: mit
language:
- en
tags:
- text-to-speech
- tts
- dctts
- pytorch
- speech-synthesis
- deep-convolutional-tts
pipeline_tag: text-to-speech
---

# DC-TTS Geralt Voice Model

A Deep Convolutional Text-to-Speech (DC-TTS) model trained to synthesize speech in the voice of Geralt of Rivia from The Witcher series.

## Model Description

This model is part of the [Deepstory](https://github.com/thetobysiu/deepstory) project, which combines natural language generation, text-to-speech, and animation technologies to create interactive storytelling experiences.

The DC-TTS architecture is based on the paper:

> Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara. "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" ([arXiv:1710.08969](https://arxiv.org/abs/1710.08969))

## Model Architecture

This model consists of two networks applied in sequence: Text2Mel predicts a mel-spectrogram from text, and SSRN upsamples it to a full spectrogram, which Griffin-Lim then inverts to a waveform.

### Text2Mel Network

Converts text input to mel-spectrograms.

| Parameter | Value |
|-----------|-------|
| Embedding Dimension (e) | 128 |
| Hidden Unit Dimension (d) | 512 |
| Vocabulary | `PE abcdefghijklmnopqrstuvwxyz'.,!?` |
| Max Characters (N) | 259 |
| Max Mel Frames (T) | 326 |
| Basic Block Type | Gated Convolution |
| Normalization | Layer Normalization |
| Dropout Rate | 0.05 |

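The first two vocabulary symbols are special: by the usual DC-TTS convention `P` is the padding symbol and `E` marks end of sentence (the usage code below stops decoding at `E`). A minimal sketch of the lookup tables, assuming they mirror `hp.vocab` and `hp.char2idx`:

```python
# Assumed to mirror hp.vocab / hp.char2idx; "P" = padding, "E" = end of sentence.
vocab = "PE abcdefghijklmnopqrstuvwxyz'.,!?"
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = {idx: char for idx, char in enumerate(vocab)}
```
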
### SSRN (Spectrogram Super-Resolution Network)

Upsamples mel-spectrograms to full spectrograms for audio synthesis.

| Parameter | Value |
|-----------|-------|
| Hidden Unit Dimension (c) | 640 (512 + 128) |
| Number of Mel Bins (f) | 80 |
| FFT Points | 2048 |
| Full Spectrogram Dimension | 1025 |
| Reduction Rate | 4 |
| Basic Block Type | Residual |
| Normalization | Weight Normalization |
| Weight Initialization | Kaiming |

### Audio Parameters

| Parameter | Value |
|-----------|-------|
| Sample Rate | 22050 Hz |
| Frame Shift | 0.0125 s (12.5 ms) |
| Frame Length | 0.05 s (50 ms) |
| Hop Length | 276 samples |
| Win Length | 1102 samples |
| Power | 1.5 |
| Preemphasis | 0.97 |
| Max dB | 100 |
| Reference dB | 20 |
| Griffin-Lim Iterations | 50 |

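These parameters drive the final inversion step, where Griffin-Lim reconstructs phase for the magnitude spectrogram produced by SSRN. A minimal sketch of that step, assuming `librosa` and `scipy` and the dB-normalization recipe of the acknowledged upstream implementation; the project's actual helper is `spectrogram2wav` in `modules.dctts`:

```python
import librosa
import numpy as np
from scipy.signal import lfilter

# Values from the audio-parameters table above.
N_FFT, HOP, WIN = 2048, 276, 1102
POWER, PREEMPHASIS, MAX_DB, REF_DB, GL_ITERS = 1.5, 0.97, 100, 20, 50

def spectrogram2wav_sketch(mag):
    """Invert a (frames, 1025) normalized log-magnitude spectrogram to audio."""
    # Undo the [0, 1] normalization back to dB, then to linear magnitude.
    mag = np.clip(mag, 0, 1) * MAX_DB - MAX_DB + REF_DB
    mag = np.power(10.0, mag * 0.05)
    # Raise magnitudes to POWER to sharpen them, then reconstruct phase.
    wav = librosa.griffinlim(mag.T ** POWER, n_iter=GL_ITERS, n_fft=N_FFT,
                             hop_length=HOP, win_length=WIN)
    # Undo the preemphasis filter applied during feature extraction.
    return lfilter([1], [1, -PREEMPHASIS], wav).astype(np.float32)
```
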
## Files

- `t2m_step-102000_first.pth` - Text2Mel model checkpoint
- `ssrn.pth` - SSRN model checkpoint

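The checkpoints can also be fetched programmatically. A minimal sketch using `huggingface_hub`; the repo id below is a placeholder, substitute this repository's actual id:

```python
from huggingface_hub import hf_hub_download

# "<user>/<repo>" is a placeholder, not the real repo id.
t2m_path = hf_hub_download(repo_id="<user>/<repo>", filename="t2m_step-102000_first.pth")
ssrn_path = hf_hub_download(repo_id="<user>/<repo>", filename="ssrn.pth")
```
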
## Usage

The snippet below is adapted from Deepstory. The `modules.dctts` import assumes you are running inside the Deepstory repository, and `normalize_text` here is a minimal stand-in for the project's text normalizer:

```python
import numpy as np
import torch
from modules.dctts import Text2Mel, SSRN, hp, spectrogram2wav

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the Text2Mel and SSRN checkpoints
text2mel = Text2Mel(hp.vocab).to(device).eval()
text2mel.load_state_dict(torch.load('t2m_step-102000_first.pth', map_location=device)['state_dict'])

ssrn = SSRN().to(device).eval()
ssrn.load_state_dict(torch.load('ssrn.pth', map_location=device)['state_dict'])

def normalize_text(text):
    # Minimal stand-in for Deepstory's normalizer: lowercase and drop
    # anything outside the model's vocabulary.
    return ''.join(char for char in text.lower() if char in hp.vocab)

def synthesize(text, timeout=10000):
    normalized_text = normalize_text(text) + "E"  # E: end of sentence
    L = torch.from_numpy(np.array([[hp.char2idx[char] for char in normalized_text]], np.int64)).to(device)
    zeros = torch.from_numpy(np.zeros((1, hp.n_mels, 1), np.float32)).to(device)
    Y = zeros

    with torch.no_grad():
        # Autoregressive decoding: append one mel frame per step and stop
        # once the attention reaches the EOS character (or timeout is hit).
        for _ in range(timeout):
            _, Y_t, A = text2mel(L, Y, monotonic_attention=True)
            Y = torch.cat((zeros, Y_t), -1)
            _, attention = torch.max(A[0, :, -1], 0)
            if L[0, attention.item()] == hp.vocab.index('E'):
                break

        # Upsample the mel-spectrogram to a full linear spectrogram
        _, Z = ssrn(Y)

    # Invert the magnitude spectrogram to a waveform (Griffin-Lim)
    wav = spectrogram2wav(Z.cpu().numpy()[0, :, :].T)
    return wav
```
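
For example, to synthesize a line and write it to disk at the 22050 Hz sample rate from the table above (`scipy` is an assumption; any WAV writer works):

```python
from scipy.io import wavfile

wav = synthesize("evil is evil. lesser, greater, middling, it's all the same.")
wavfile.write("geralt.wav", 22050, wav)
```
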
## Training Data

The model was trained on audio samples of Geralt's voice from The Witcher 3: Wild Hunt video game.

## Intended Use

This model is intended for:

- Research and experimentation in speech synthesis
- Creative projects and fan content
- Educational purposes

## Limitations

- The model works best with English text
- The vocabulary is limited to lowercase letters and basic punctuation
- Audio quality may vary depending on input text complexity
- The character voice is based on copyrighted material

## Citation

If you use this model, please cite the original DC-TTS paper and the Deepstory project:

```bibtex
@article{tachibana2018efficiently,
  title={Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention},
  author={Tachibana, Hideyuki and Uenoyama, Katsuya and Aihara, Shunsuke},
  journal={arXiv preprint arXiv:1710.08969},
  year={2018}
}

@misc{deepstory,
  author = {Siu King Wai},
  title = {Deepstory},
  year = {2020},
  publisher = {GitHub},
  url = {https://github.com/thetobysiu/deepstory}
}
```

## License

This model is released under the MIT License. Please note that the voice characteristics are based on copyrighted material from The Witcher 3: Wild Hunt.

## Acknowledgments

- Original DC-TTS implementation: [tugstugi/pytorch-dc-tts](https://github.com/tugstugi/pytorch-dc-tts)
- The Witcher 3: Wild Hunt by CD Projekt Red

ssrn.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1d0c4335f53daa9b06341d92ed033dcb7370cb31c290a50ccf3c87e842464948
size 497068180

t2m_step-102000_first.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f3ea666057b34709c1219deee14bf2bf1df47ad0f6200aa48d5df29bd1c9d34a
size 1146268309