FarmerlineML committed on
Commit
b0f32e8
·
verified ·
1 Parent(s): 3f476bb

Update README.md

Files changed (1)
  1. README.md +0 -75
README.md CHANGED
@@ -1,75 +0,0 @@
1
-
2
- ---
3
- license: cc-by-nc-4.0
4
- tags:
5
- - mms
6
- - vits
7
- pipeline_tag: text-to-speech
8
- ---
9
-
10
-
11
- ## Model Details
12
-
13
- VITS (**V**ariational **I**nference with adversarial learning for end-to-end **T**ext-to-**S**peech) is an end-to-end
14
- speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational
15
- autoencoder (VAE) composed of a posterior encoder, decoder, and conditional prior.
16
-
17
- A set of spectrogram-based acoustic features is predicted by the flow-based module, which consists of a Transformer-based
18
- text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers,
19
- much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text
20
- input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to
21
- synthesise speech with different rhythms from the same input text.
22
-
23
- The model is trained end-to-end with a combination of losses derived from the variational lower bound and adversarial training.
24
- To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During
25
- inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the
26
- waveform using a cascade of the flow module and HiFi-GAN decoder. Due to the stochastic nature of the duration predictor,
27
- the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform.
28
-
29
- For the MMS project, a separate VITS checkpoint is trained on each language.
30
-
31
- ## Usage
32
-
33
- MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint,
34
- first install the latest version of the library:
35
-
36
- ```
37
- pip install --upgrade transformers accelerate
38
- ```
39
-
40
- Then, run inference with the following code snippet:
41
-
42
- ```python
43
- from transformers import VitsModel, AutoTokenizer
44
- import torch
45
-
46
- model = VitsModel.from_pretrained("facebook/mms-tts-lug")
47
- tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-lug")
48
-
49
- text = "some example text in the Ganda language"
50
- inputs = tokenizer(text, return_tensors="pt")
51
-
52
- with torch.no_grad():
53
-     output = model(**inputs).waveform
54
- ```
55
-
56
- The resulting waveform can be saved as a `.wav` file:
57
-
58
- ```python
59
- import scipy.io.wavfile
60
-
61
- scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=output.squeeze().cpu().numpy())
62
- ```
63
-
64
- Or displayed in a Jupyter Notebook / Google Colab:
65
-
66
- ```python
67
- from IPython.display import Audio
68
-
69
- Audio(output.numpy(), rate=model.config.sampling_rate)
70
- ```
71
-
72
-
73
- ## License
74
-
75
- The model is licensed as **CC-BY-NC 4.0**.
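
The removed README saves the waveform through scipy. As a dependency-free alternative, the sketch below writes mono float samples to a 16-bit PCM `.wav` file using only the Python standard library. The helper name `write_wav`, the file name `example.wav`, and the 16 kHz sine tone standing in for the model output are all assumptions made for illustration; none of them come from the README.

```python
import math
import struct
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM .wav file.

    Hypothetical helper for illustration; not part of transformers or scipy.
    """
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for the model output: one second of a 440 Hz tone.
# The 16 kHz rate is an assumption; with a real checkpoint, read it
# from model.config.sampling_rate instead.
sampling_rate = 16000
samples = [
    0.5 * math.sin(2 * math.pi * 440 * n / sampling_rate)
    for n in range(sampling_rate)
]
write_wav("example.wav", samples, sampling_rate)
```

With the inference snippet from the README, the same helper could presumably be called as `write_wav("techno.wav", output.squeeze().tolist(), model.config.sampling_rate)`, since the waveform tensor carries a leading batch dimension that has to be dropped first.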