---
license: cc-by-4.0
datasets:
- cdminix/libritts-aligned
language:
- en
tags:
- speech recognition
- speech synthesis
- text-to-speech
---

This model requires the Vocex library, which can be installed with

```pip install vocex```

Vocex extracts several acoustic measures (as well as d-vectors) from audio.

You can read more here: https://github.com/minixc/vocex

## Usage

```python
from vocex import Vocex
import torchaudio  # or any other audio loading library

model = Vocex.from_checkpoint('vocex/cdminix')  # an fp16 model is loaded by default
model = Vocex.from_checkpoint('vocex/cdminix', fp16=False)  # to load an fp32 model
model = Vocex.from_checkpoint('some/path/model.ckpt')  # to load a local checkpoint

audio = ...  # a numpy or torch array with shape [batch_size, length_in_samples] or just [length_in_samples]
sample_rate = ...  # the sample rate must be given if the audio is not sampled at 22050 Hz

outputs = model(audio, sample_rate)
pitch, energy, snr, srmr = (
    outputs["measures"]["pitch"],
    outputs["measures"]["energy"],
    outputs["measures"]["snr"],
    outputs["measures"]["srmr"],
)
d_vector = outputs["d_vector"]  # a torch tensor with shape [batch_size, 256]

# you can also get activations and attention weights at all layers of the model
outputs = model(audio, sample_rate, return_activations=True, return_attention=True)
activations = outputs["activations"]  # a list of torch tensors with shape [batch_size, layers, ...]
attention = outputs["attention"]  # a list of torch tensors with shape [batch_size, layers, ...]

# there are also speaker avatars, which are a 2D representation of the speaker's voice
outputs = model(audio, sample_rate, return_avatar=True)
avatar = outputs["avatars"]  # a torch tensor with shape [batch_size, 256, 256]
```
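Since the model expects batched input of shape `[batch_size, length_in_samples]`, clips of different lengths need to be padded to a common length before being stacked. A minimal sketch of such a helper (`pad_batch` is illustrative, not part of the Vocex API):

```python
import numpy as np

def pad_batch(clips):
    """Zero-pad a list of 1-D audio arrays to a common length.

    Returns an array of shape [batch_size, max_length_in_samples],
    suitable for passing to model(batch, sample_rate).
    """
    max_len = max(len(c) for c in clips)
    return np.stack([np.pad(c, (0, max_len - len(c))) for c in clips])

# two clips of different lengths become one [2, 22050] batch
batch = pad_batch([np.zeros(22050), np.zeros(11025)])
```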
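A common use of d-vectors is comparing two utterances by the same or different speakers via cosine similarity. A small sketch under that assumption (`speaker_similarity` is a hypothetical helper, not part of Vocex; it only relies on the d-vectors being fixed-length vectors):

```python
import numpy as np

def speaker_similarity(d_vector_a, d_vector_b):
    """Cosine similarity between two d-vectors (e.g. the 256-dim Vocex output)."""
    a = np.asarray(d_vector_a, dtype=np.float64).ravel()
    b = np.asarray(d_vector_b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# identical vectors score 1.0; orthogonal vectors score 0.0
```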