thetobysiu committed
Commit 98b2faa · verified · 1 Parent(s): 49b1f8c

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +158 -0
  2. ssrn.pth +3 -0
  3. t2m_step-102000_first.pth +3 -0
README.md CHANGED
@@ -1,3 +1,161 @@
  ---
  license: mit
+ language:
+ - en
+ tags:
+ - text-to-speech
+ - tts
+ - dctts
+ - pytorch
+ - speech-synthesis
+ - deep-convolutional-tts
+ pipeline_tag: text-to-speech
  ---
+
+ # DC-TTS Geralt Voice Model
+
+ A Deep Convolutional Text-to-Speech (DC-TTS) model trained to synthesize speech in the voice of Geralt of Rivia from The Witcher series.
+
+ ## Model Description
+
+ This model is part of the [Deepstory](https://github.com/thetobysiu/deepstory) project, which combines Natural Language Generation, Text-to-Speech, and animation technologies to create interactive storytelling experiences.
+
+ The DC-TTS architecture is based on the paper:
+ > Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara. "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" ([arXiv:1710.08969](https://arxiv.org/abs/1710.08969))
+
+ ## Model Architecture
+
+ This model consists of two components:
+
+ ### Text2Mel Network
+ Converts text input to mel-spectrograms.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Embedding Dimension (e) | 128 |
+ | Hidden Unit Dimension (d) | 512 |
+ | Vocabulary | `PE abcdefghijklmnopqrstuvwxyz'.,!?` |
+ | Max Characters (N) | 259 |
+ | Max Mel Frames (T) | 326 |
+ | Basic Block Type | Gated Convolution |
+ | Normalization | Layer Normalization |
+ | Dropout Rate | 0.05 |
+
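As a rough illustration of how the vocabulary string in the table can map text to the integer indices that Text2Mel consumes, here is a minimal sketch. `encode` is a hypothetical helper, not the project's `normalize_text`, and reading `P` as the padding symbol is an assumption (the diff itself only confirms `E` as end-of-sequence).

```python
# Vocabulary string copied from the Text2Mel table; a character's index is
# simply its position in the string.
VOCAB = "PE abcdefghijklmnopqrstuvwxyz'.,!?"
CHAR2IDX = {ch: i for i, ch in enumerate(VOCAB)}


def encode(text):
    """Hypothetical helper: lowercase, drop unsupported characters, append EOS."""
    # After lowercasing, the reserved uppercase symbols P and E cannot occur,
    # so a plain membership test only lets through real input characters.
    cleaned = [ch for ch in text.lower() if ch in CHAR2IDX]
    return [CHAR2IDX[ch] for ch in cleaned] + [CHAR2IDX["E"]]


print(len(VOCAB))      # number of symbols in the vocabulary
print(encode("Hi!"))   # indices for "h", "i", "!", then EOS
```

Silently dropping unknown characters is a simplification chosen for the sketch; a real normalizer might instead spell out digits or replace symbols.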
+ ### SSRN (Spectrogram Super-Resolution Network)
+ Upsamples mel-spectrograms to full spectrograms for audio synthesis.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Hidden Unit Dimension (c) | 640 (512 + 128) |
+ | Number of Mel Bins (f) | 80 |
+ | FFT Points | 2048 |
+ | Full Spectrogram Dimension | 1025 |
+ | Reduction Rate | 4 |
+ | Basic Block Type | Residual |
+ | Normalization | Weight Normalization |
+ | Weight Initialization | Kaiming |
+
+ ### Audio Parameters
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Sample Rate | 22050 Hz |
+ | Frame Shift | 0.0125 s (12.5 ms) |
+ | Frame Length | 0.05 s (50 ms) |
+ | Hop Length | 276 samples |
+ | Win Length | 1102 samples |
+ | Power | 1.5 |
+ | Preemphasis | 0.97 |
+ | Max dB | 100 |
+ | Reference dB | 20 |
+ | Griffin-Lim Iterations | 50 |
+
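The sample-based values in the tables follow from the sample rate, frame timings, and FFT size. The sketch below reproduces that arithmetic. Two points are assumptions rather than facts from the diff: whether the underlying code truncates or rounds the hop length (the exact product is 275.625, so truncation gives 275 while the table lists 276), and that `T = 326` counts mel frames at the reduced rate, so that one mel frame covers 4 frame shifts.

```python
import math

SAMPLE_RATE = 22050      # Hz
FRAME_SHIFT = 0.0125     # seconds
FRAME_LENGTH = 0.05      # seconds
N_FFT = 2048
REDUCTION_RATE = 4
MAX_MEL_FRAMES = 326     # T from the Text2Mel table

hop_exact = SAMPLE_RATE * FRAME_SHIFT    # about 275.625 samples
win_exact = SAMPLE_RATE * FRAME_LENGTH   # about 1102.5 samples

hop_truncated = int(hop_exact)           # what int() truncation would give
hop_rounded_up = math.ceil(hop_exact)    # matches the 276 listed in the table
win_truncated = int(win_exact)           # matches the 1102 listed in the table

# A real-valued FFT of size N yields N // 2 + 1 frequency bins.
n_freq_bins = N_FFT // 2 + 1

# Assumed upper bound on utterance length implied by T (see lead-in).
max_seconds = MAX_MEL_FRAMES * REDUCTION_RATE * FRAME_SHIFT

print(hop_truncated, hop_rounded_up, win_truncated, n_freq_bins, max_seconds)
```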
+ ## Files
+
+ - `t2m_step-102000_first.pth` - Text2Mel model checkpoint
+ - `ssrn.pth` - SSRN model checkpoint
+
+ ## Usage
+
+ ```python
+ import numpy as np
+ import torch
+ from modules.dctts import Text2Mel, SSRN, hp, spectrogram2wav
+ # normalize_text is used below but not imported in the original snippet; it is
+ # assumed to come from the Deepstory codebase and must be imported from there.
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ # Load models
+ text2mel = Text2Mel(hp.vocab).to(device).eval()
+ text2mel.load_state_dict(torch.load('t2m_step-102000_first.pth', map_location=device)['state_dict'])
+
+ ssrn = SSRN().to(device).eval()
+ ssrn.load_state_dict(torch.load('ssrn.pth', map_location=device)['state_dict'])
+
+ # Synthesize speech
+ def synthesize(text, timeout=10000):
+     normalized_text = normalize_text(text) + "E"  # E: EOS
+     # np.int64 is used instead of np.long, an alias that NumPy 1.24 removed
+     L = torch.from_numpy(np.array([[hp.char2idx[char] for char in normalized_text]], np.int64)).to(device)
+     zeros = torch.from_numpy(np.zeros((1, hp.n_mels, 1), np.float32)).to(device)
+     Y = zeros
+
+     with torch.no_grad():
+         # Generate mel frames autoregressively until attention reaches EOS
+         for i in range(timeout):
+             _, Y_t, A = text2mel(L, Y, monotonic_attention=True)
+             Y = torch.cat((zeros, Y_t), -1)
+             _, attention = torch.max(A[0, :, -1], 0)
+             if L[0, attention.item()] == hp.vocab.index('E'):
+                 break
+
+         # Upsample the mel-spectrogram to a full spectrogram
+         _, Z = ssrn(Y)
+         Z = Z.cpu().numpy()
+
+     wav = spectrogram2wav(Z[0, :, :].T)
+     return wav
+ ```
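The usage snippet stops at returning `wav`. As one possible follow-up, the sketch below writes a waveform to a 16-bit mono WAV file using only the standard library. `save_wav` is a hypothetical helper, and it assumes the waveform is a sequence of floats roughly in [-1, 1] at the 22050 Hz sample rate from the table; a synthetic tone stands in for real model output.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sample_rate=22050):
    """Hypothetical helper: write float samples in [-1, 1] as 16-bit mono PCM."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))            # guard against overshoot
        frames += struct.pack("<h", int(clipped * 32767))  # little-endian int16
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Stand-in signal: one second of a 440 Hz tone instead of real model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
out_path = os.path.join(tempfile.gettempdir(), "dctts_example.wav")
save_wav(out_path, tone)
```

With the models loaded, the same helper could be called as `save_wav("out.wav", synthesize("some text"))`, assuming the returned array holds float samples.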
+
+ ## Training Data
+
+ The model was trained on audio samples of Geralt's voice from The Witcher 3: Wild Hunt video game.
+
+ ## Intended Use
+
+ This model is intended for:
+ - Research and experimentation in speech synthesis
+ - Creative projects and fan content
+ - Educational purposes
+
+ ## Limitations
+
+ - The model works best with English text
+ - Vocabulary is limited to lowercase letters and basic punctuation
+ - Audio quality may vary depending on input text complexity
+ - The character voice is based on copyrighted material
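
The vocabulary limitation can be checked before synthesis. The following sketch reports characters that fall outside the vocabulary and flags inputs that would exceed the maximum character count. `check_text` is a hypothetical helper, and counting the end-of-sequence marker toward the limit of 259 is an assumption.

```python
VOCAB = "PE abcdefghijklmnopqrstuvwxyz'.,!?"
MAX_CHARS = 259  # N from the Text2Mel table


def check_text(text):
    """Hypothetical pre-check: list unsupported characters and flag long input."""
    lowered = text.lower()
    unsupported = sorted({ch for ch in lowered if ch not in VOCAB})
    # +1 accounts for the EOS marker appended before synthesis (assumed to
    # count toward the limit).
    too_long = len(lowered) + 1 > MAX_CHARS
    return unsupported, too_long


print(check_text("Wind's howling."))
print(check_text("Toss a coin: 100 crowns"))
```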
+
+ ## Citation
+
+ If you use this model, please cite the original DC-TTS paper and the Deepstory project:
+
+ ```bibtex
+ @article{tachibana2018efficiently,
+   title={Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention},
+   author={Tachibana, Hideyuki and Uenoyama, Katsuya and Aihara, Shunsuke},
+   journal={arXiv preprint arXiv:1710.08969},
+   year={2018}
+ }
+
+ @misc{deepstory,
+   author = {Siu King Wai},
+   title = {Deepstory},
+   year = {2020},
+   publisher = {GitHub},
+   url = {https://github.com/thetobysiu/deepstory}
+ }
+ ```
+
+ ## License
+
+ This model is released under the MIT License. Please note that the voice characteristics are based on copyrighted material from The Witcher 3: Wild Hunt.
+
+ ## Acknowledgments
+
+ - Original DC-TTS implementation: [tugstugi/pytorch-dc-tts](https://github.com/tugstugi/pytorch-dc-tts)
+ - The Witcher 3: Wild Hunt by CD Projekt Red
ssrn.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1d0c4335f53daa9b06341d92ed033dcb7370cb31c290a50ccf3c87e842464948
+ size 497068180
t2m_step-102000_first.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f3ea666057b34709c1219deee14bf2bf1df47ad0f6200aa48d5df29bd1c9d34a
+ size 1146268309
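
The two `.pth` entries above are Git LFS pointer files: each records the SHA-256 digest and byte size of the real checkpoint. A downloaded checkpoint could be checked against its pointer roughly as sketched below. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path in the usage note is an assumption.

```python
import hashlib
import os


def parse_lfs_pointer(pointer_text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in pointer_text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and digest in `pointer`."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        # Hash in chunks so large checkpoints are never fully loaded in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


# Pointer contents copied from the ssrn.pth entry in this commit.
SSRN_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:1d0c4335f53daa9b06341d92ed033dcb7370cb31c290a50ccf3c87e842464948
size 497068180
"""
ssrn_pointer = parse_lfs_pointer(SSRN_POINTER)
```

After downloading, `matches_pointer("ssrn.pth", ssrn_pointer)` would confirm that the local file is the checkpoint this commit refers to.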