cdminix committed on
Commit
4d3249f
·
1 Parent(s): 2fc3a6d

Update README.md

Files changed (1)
  1. README.md +46 -0
README.md CHANGED
@@ -1,3 +1,49 @@
1
  ---
2
  license: cc-by-4.0
3
  ---
1
  ---
2
  license: cc-by-4.0
3
+ datasets:
4
+ - cdminix/libritts-aligned
5
+ language:
6
+ - en
7
+ tags:
8
+ - speech recognition, speech synthesis, text-to-speech
9
  ---
10
+
11
+ This model requires the Vocex library, which can be installed with
12
+ ```pip install vocex```
13
+
14
+
15
+ Vocex extracts several measures (pitch, energy, SNR, and SRMR) as well as d-vectors from audio.
16
+ ![summary](https://raw.githubusercontent.com/MiniXC/vocex/main/demo/summary.png)
17
+ You can read more here:
18
+ https://github.com/minixc/vocex
19
+
20
+ ## Usage
21
+ ```python
22
+ from vocex import Vocex
23
+ import torchaudio # or any other audio loading library
24
+
25
+ model = Vocex.from_checkpoint('vocex/cdminix') # an fp16 model is loaded by default
26
+ model = Vocex.from_checkpoint('vocex/cdminix', fp16=False) # to load an fp32 model
27
+ model = Vocex.from_checkpoint('some/path/model.ckpt') # to load a local checkpoint
28
+
29
+ audio = ... # a NumPy array or torch tensor with shape [batch_size, length_in_samples] or just [length_in_samples]
30
+ sample_rate = ... # must be specified if the audio is not sampled at 22050 Hz
31
+
32
+ outputs = model(audio, sample_rate)
33
+ pitch, energy, snr, srmr = (
34
+ outputs["measures"]["pitch"],
35
+ outputs["measures"]["energy"],
36
+ outputs["measures"]["snr"],
37
+ outputs["measures"]["srmr"],
38
+ )
39
+ d_vector = outputs["d_vector"] # a torch tensor with shape [batch_size, 256]
40
+
41
+ # you can also get activations and attention weights at all layers of the model
42
+ outputs = model(audio, sample_rate, return_activations=True, return_attention=True)
43
+ activations = outputs["activations"] # a list of torch tensors with shape [batch_size, layers, ...]
44
+ attention = outputs["attention"] # a list of torch tensors with shape [batch_size, layers, ...]
45
+
46
+ # there are also speaker avatars, which are a 2D representation of the speaker's voice
47
+ outputs = model(audio, sample_rate, return_avatar=True)
48
+ avatar = outputs["avatars"] # a torch tensor with shape [batch_size, 256, 256]
49
+ ```
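
The usage snippet leaves `audio = ...` open. A common pitfall is that loaders such as `torchaudio.load` return a channels-first array of shape `[channels, length]`, which could be mistaken for `[batch_size, length_in_samples]`. The following is a minimal sketch of one way to prepare the input. It assumes the model expects mono audio (the README only documents the two shapes above), and `to_mono_batch` is a hypothetical helper, not part of Vocex.

```python
import numpy as np


def to_mono_batch(waveform: np.ndarray) -> np.ndarray:
    """Convert [length] or [channels, length] audio to float32 with shape [1, length].

    Hypothetical helper for illustration; averaging channels is an assumption.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim == 1:
        # Already mono: only add the batch dimension.
        return waveform[np.newaxis, :]
    if waveform.ndim == 2:
        # Average the channels so they are not mistaken for batch items.
        return waveform.mean(axis=0, keepdims=True)
    raise ValueError(f"expected 1 or 2 dimensions, got {waveform.ndim}")


# Synthetic stereo clip: one second of a 220 Hz tone sampled at 22050 Hz.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 220 * t)
stereo = np.stack([tone, 0.5 * tone])  # shape [2, 22050]

audio = to_mono_batch(stereo)
print(audio.shape)  # (1, 22050)
```

The resulting `audio` and `sample_rate` can then be passed as `model(audio, sample_rate)`. When loading with torchaudio, calling `.numpy()` on the returned tensor gives an array this helper accepts.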