Upload folder using huggingface_hub

- README.md +158 -0
- ssrn.pth +3 -0
- t2m_step-102000_first.pth +3 -0

README.md CHANGED
@@ -1,3 +1,161 @@
---
license: mit
language:
- en
tags:
- text-to-speech
- tts
- dctts
- pytorch
- speech-synthesis
- deep-convolutional-tts
pipeline_tag: text-to-speech
---

# DC-TTS Geralt Voice Model

A Deep Convolutional Text-to-Speech (DC-TTS) model trained to synthesize speech in the voice of Geralt of Rivia from The Witcher series.

## Model Description

This model is part of the [Deepstory](https://github.com/thetobysiu/deepstory) project, which combines natural language generation, text-to-speech, and animation technologies to create interactive storytelling experiences.

The DC-TTS architecture is based on the paper:

> Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara. "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" ([arXiv:1710.08969](https://arxiv.org/abs/1710.08969))

## Model Architecture

This model consists of two networks applied in sequence: Text2Mel predicts a mel-spectrogram from text, and SSRN upsamples it to a full spectrogram, which Griffin-Lim then inverts to a waveform.

### Text2Mel Network

Converts text input to mel-spectrograms.

| Parameter | Value |
|-----------|-------|
| Embedding Dimension (e) | 128 |
| Hidden Unit Dimension (d) | 512 |
| Vocabulary | `PE abcdefghijklmnopqrstuvwxyz'.,!?` |
| Max Characters (N) | 259 |
| Max Mel Frames (T) | 326 |
| Basic Block Type | Gated Convolution |
| Normalization | Layer Normalization |
| Dropout Rate | 0.05 |

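The first two vocabulary symbols are special: by the usual DC-TTS convention `P` is the padding symbol and `E` marks end of sentence (the usage code below stops decoding at `E`). A minimal sketch of the lookup tables, assuming they mirror `hp.vocab` and `hp.char2idx`:

```python
# Assumed to mirror hp.vocab / hp.char2idx; "P" = padding, "E" = end of sentence.
vocab = "PE abcdefghijklmnopqrstuvwxyz'.,!?"
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = {idx: char for idx, char in enumerate(vocab)}
```
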
### SSRN (Spectrogram Super-Resolution Network)

Upsamples mel-spectrograms to full spectrograms for audio synthesis.

| Parameter | Value |
|-----------|-------|
| Hidden Unit Dimension (c) | 640 (512 + 128) |
| Number of Mel Bins (f) | 80 |
| FFT Points | 2048 |
| Full Spectrogram Dimension | 1025 |
| Reduction Rate | 4 |
| Basic Block Type | Residual |
| Normalization | Weight Normalization |
| Weight Initialization | Kaiming |

### Audio Parameters

| Parameter | Value |
|-----------|-------|
| Sample Rate | 22050 Hz |
| Frame Shift | 0.0125 s (12.5 ms) |
| Frame Length | 0.05 s (50 ms) |
| Hop Length | 276 samples |
| Win Length | 1102 samples |
| Power | 1.5 |
| Preemphasis | 0.97 |
| Max dB | 100 |
| Reference dB | 20 |
| Griffin-Lim Iterations | 50 |

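These parameters drive the final inversion step, where Griffin-Lim reconstructs phase for the magnitude spectrogram produced by SSRN. A minimal sketch of that step, assuming `librosa` and `scipy` and the dB-normalization recipe of the acknowledged upstream implementation; the project's actual helper is `spectrogram2wav` in `modules.dctts`:

```python
import librosa
import numpy as np
from scipy.signal import lfilter

# Values from the audio-parameters table above.
N_FFT, HOP, WIN = 2048, 276, 1102
POWER, PREEMPHASIS, MAX_DB, REF_DB, GL_ITERS = 1.5, 0.97, 100, 20, 50

def spectrogram2wav_sketch(mag):
    """Invert a (frames, 1025) normalized log-magnitude spectrogram to audio."""
    # Undo the [0, 1] normalization back to dB, then to linear magnitude.
    mag = np.clip(mag, 0, 1) * MAX_DB - MAX_DB + REF_DB
    mag = np.power(10.0, mag * 0.05)
    # Raise magnitudes to POWER to sharpen them, then reconstruct phase.
    wav = librosa.griffinlim(mag.T ** POWER, n_iter=GL_ITERS, n_fft=N_FFT,
                             hop_length=HOP, win_length=WIN)
    # Undo the preemphasis filter applied during feature extraction.
    return lfilter([1], [1, -PREEMPHASIS], wav).astype(np.float32)
```
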
## Files

- `t2m_step-102000_first.pth` - Text2Mel model checkpoint
- `ssrn.pth` - SSRN model checkpoint

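The checkpoints can also be fetched programmatically. A minimal sketch using `huggingface_hub`; the repo id below is a placeholder, substitute this repository's actual id:

```python
from huggingface_hub import hf_hub_download

# "<user>/<repo>" is a placeholder, not the real repo id.
t2m_path = hf_hub_download(repo_id="<user>/<repo>", filename="t2m_step-102000_first.pth")
ssrn_path = hf_hub_download(repo_id="<user>/<repo>", filename="ssrn.pth")
```
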
## Usage

The snippet below is adapted from Deepstory. The `modules.dctts` import assumes you are running inside the Deepstory repository, and `normalize_text` here is a minimal stand-in for the project's text normalizer:

```python
import numpy as np
import torch
from modules.dctts import Text2Mel, SSRN, hp, spectrogram2wav

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the Text2Mel and SSRN checkpoints
text2mel = Text2Mel(hp.vocab).to(device).eval()
text2mel.load_state_dict(torch.load('t2m_step-102000_first.pth', map_location=device)['state_dict'])

ssrn = SSRN().to(device).eval()
ssrn.load_state_dict(torch.load('ssrn.pth', map_location=device)['state_dict'])

def normalize_text(text):
    # Minimal stand-in for Deepstory's normalizer: lowercase and drop
    # anything outside the model's vocabulary.
    return ''.join(char for char in text.lower() if char in hp.vocab)

def synthesize(text, timeout=10000):
    normalized_text = normalize_text(text) + "E"  # E: end of sentence
    L = torch.from_numpy(np.array([[hp.char2idx[char] for char in normalized_text]], np.int64)).to(device)
    zeros = torch.from_numpy(np.zeros((1, hp.n_mels, 1), np.float32)).to(device)
    Y = zeros

    with torch.no_grad():
        # Autoregressive decoding: append one mel frame per step and stop
        # once the attention reaches the EOS character (or timeout is hit).
        for _ in range(timeout):
            _, Y_t, A = text2mel(L, Y, monotonic_attention=True)
            Y = torch.cat((zeros, Y_t), -1)
            _, attention = torch.max(A[0, :, -1], 0)
            if L[0, attention.item()] == hp.vocab.index('E'):
                break

        # Upsample the mel-spectrogram to a full linear spectrogram
        _, Z = ssrn(Y)

    # Invert the magnitude spectrogram to a waveform (Griffin-Lim)
    wav = spectrogram2wav(Z.cpu().numpy()[0, :, :].T)
    return wav
```
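
For example, to synthesize a line and write it to disk at the 22050 Hz sample rate from the table above (`scipy` is an assumption; any WAV writer works):

```python
from scipy.io import wavfile

wav = synthesize("evil is evil. lesser, greater, middling, it's all the same.")
wavfile.write("geralt.wav", 22050, wav)
```
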
## Training Data

The model was trained on audio samples of Geralt's voice from The Witcher 3: Wild Hunt video game.

## Intended Use

This model is intended for:

- Research and experimentation in speech synthesis
- Creative projects and fan content
- Educational purposes

## Limitations

- The model works best with English text
- The vocabulary is limited to lowercase letters and basic punctuation
- Audio quality may vary depending on input text complexity
- The character voice is based on copyrighted material

## Citation

If you use this model, please cite the original DC-TTS paper and the Deepstory project:

```bibtex
@article{tachibana2018efficiently,
  title={Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention},
  author={Tachibana, Hideyuki and Uenoyama, Katsuya and Aihara, Shunsuke},
  journal={arXiv preprint arXiv:1710.08969},
  year={2018}
}

@misc{deepstory,
  author = {Siu King Wai},
  title = {Deepstory},
  year = {2020},
  publisher = {GitHub},
  url = {https://github.com/thetobysiu/deepstory}
}
```

## License

This model is released under the MIT License. Please note that the voice characteristics are based on copyrighted material from The Witcher 3: Wild Hunt.

## Acknowledgments

- Original DC-TTS implementation: [tugstugi/pytorch-dc-tts](https://github.com/tugstugi/pytorch-dc-tts)
- The Witcher 3: Wild Hunt by CD Projekt Red

ssrn.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1d0c4335f53daa9b06341d92ed033dcb7370cb31c290a50ccf3c87e842464948
size 497068180

t2m_step-102000_first.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f3ea666057b34709c1219deee14bf2bf1df47ad0f6200aa48d5df29bd1c9d34a
size 1146268309