timofeiiz committed on
Commit
8d545cf
·
verified ·
1 Parent(s): ad0bcd5

Upload folder using huggingface_hub

  1. README.md +57 -1
README.md CHANGED
@@ -1 +1,57 @@
1
- Soundstream implementation. Sample rate 16000.
1
+ ---
2
+ library_name: transformers
3
+ tags:
4
+ - audio
5
+ - soundstream
6
+ license: mit
7
+ language:
8
+ - en
9
+ pipeline_tag: audio-to-audio
10
+ ---
11
+
12
+ # SoundStream
13
+
14
+ A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. Accepts only 16 kHz audio.
15
+
16
+ Encodes speech into discrete tokens (8 codebooks, each emitting 80 tokens/sec at the 200x downsampling factor) and decodes them back to audio.
17
+ ## Architecture
18
+
19
+ - **Encoder**: Causal convolutions with residual units and strided downsampling (strides 2, 4, 5, 5, for 200x total downsampling)
20
+ - **Quantizer**: Residual Vector Quantizer with 8 codebooks of 1024 entries each
21
+ - **Decoder**: Mirrored encoder with transposed convolutions
22
+ - **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator
23
+
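The quantizer bullet above can be illustrated with a minimal NumPy sketch of residual vector quantization. This is not the repository's code: the function names `rvq_encode` and `rvq_decode` are hypothetical, the codebooks are random rather than learned, and only the sizes (8 codebooks, 1024 entries, latent dim 512) come from this README.

```python
import numpy as np


def rvq_encode(latents, codebooks):
    """Quantize latents (T, D) with codebooks (Q, K, D); return indices (Q, T)."""
    residual = latents.astype(np.float64).copy()
    indices = []
    for codebook in codebooks:
        # Squared Euclidean distance from every frame to every entry: (T, K).
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        # The next codebook quantizes whatever this stage failed to capture.
        residual = residual - codebook[idx]
    return np.stack(indices)


def rvq_decode(indices, codebooks):
    """Sum the selected entry of each codebook to rebuild the latents (T, D)."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, indices))


# Toy data: random codebooks stand in for learned ones (assumption).
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(8, 1024, 512))  # (num quantizers, codebook size, latent dim)
latents = rng.normal(size=(4, 512))          # 4 frames of encoder output

indices = rvq_encode(latents, codebooks)
approx = rvq_decode(indices, codebooks)
```

Each frame ends up described by 8 integers in the range 0 to 1023, one per codebook, which is the token layout the model card describes.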
24
+ | Parameter | Value |
25
+ |---|---|
26
+ | Sample rate | 16 kHz |
27
+ | Channels | 32 |
28
+ | Latent dim | 512 |
29
+ | Codebook size | 1024 |
30
+ | Num quantizers | 8 |
31
+ | Downsampling factor | 200 |
32
+
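The token rate and bitrate follow from the table by simple arithmetic. The sketch below derives them; the bitrate figure assumes each token is stored as a plain log2(1024) = 10-bit index with no entropy coding, which the README does not state.

```python
import math

SAMPLE_RATE = 16_000    # Hz, from the table
STRIDES = (2, 4, 5, 5)  # encoder strides listed under Architecture
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024

downsampling = math.prod(STRIDES)                  # total downsampling factor
frame_rate = SAMPLE_RATE / downsampling            # latent frames per second
tokens_per_second = frame_rate * NUM_QUANTIZERS    # across all codebooks
bits_per_token = math.log2(CODEBOOK_SIZE)          # assumption: fixed-width indices
bitrate_bps = tokens_per_second * bits_per_token

print(downsampling, frame_rate, tokens_per_second, bitrate_bps)
```

Under these assumptions the codec runs at 80 frames/sec, 640 tokens/sec in total, and 6.4 kbps.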
33
+ ## Usage
34
+
35
+ ```python
36
+ import torchaudio
37
+ from transformers import AutoModel
38
+
39
+ # Load model
40
+ model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True)
41
+ model.eval()
42
+
43
+ waveform, sr = torchaudio.load("audio.wav")
44
+ assert sr == 16000, f"expected 16 kHz audio, got {sr} Hz"  # Only 16 kHz sample rate is supported
45
+
46
+ # Encode to discrete tokens
47
+ indices = model.encode(waveform.unsqueeze(0)) # (1, 8, T)
48
+
49
+ # Decode back to audio
50
+ reconstructed = model.decode(indices, original_length=waveform.size(-1))
51
+
52
+ torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), 16000)
53
+ ```
54
+
55
+ ## License
56
+
57
+ MIT