Commit 939b5d8 by timofeiiz · verified · 1 Parent(s): 8d545cf

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +9 -8
README.md CHANGED
@@ -14,6 +14,14 @@ pipeline_tag: audio-to-audio
 A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. Accepts only 16 kHz audio.
 
 Encodes speech into discrete tokens (8 codebooks × 80 tokens/sec) and decodes them back to audio.
+
+## Metrics
+
+Evaluated on LibriSpeech test-clean:
+
+- **STOI**: 0.804
+- **NISQA**: 2.276
+
 ## Architecture
 
 - **Encoder**: Causal convolutions with residual units and strided downsampling (2, 4, 5, 5 = 200x compression)
@@ -21,14 +29,7 @@ Encodes speech into discrete tokens (8 codebooks × 80 tokens/sec) and decodes t
 - **Decoder**: Mirrored encoder with transposed convolutions
 - **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator
 
-| Parameter | Value |
-|---|---|
-| Sample rate | 16 kHz |
-| Channels | 32 |
-| Latent dim | 512 |
-| Codebook size | 1024 |
-| Num quantizers | 8 |
-| Downsampling factor | 200 |
+**Model parameters**: 16 kHz, 32 channels, latent dim 512, codebook size 1024, 8 quantizers, 200x downsampling
 
 ## Usage
 
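
The figures quoted in the README text of this diff (16 kHz input, strides 2, 4, 5, 5, 8 quantizers, codebook size 1024) imply a frame rate and a nominal bitrate, which can be checked with a few lines of arithmetic. The sketch below is not part of the commit or the repository: `codec_rates` is a hypothetical helper, and the bitrate assumes each codebook index is stored with a fixed log2(codebook size) bits and no entropy coding.

```python
import math

# Values copied from the README's model parameters.
SAMPLE_RATE_HZ = 16_000
STRIDES = (2, 4, 5, 5)
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024


def codec_rates(sample_rate_hz, strides, num_quantizers, codebook_size):
    """Return (downsampling factor, frames/sec, tokens/sec, bits/sec).

    Hypothetical helper for illustration; not an API of the model repo.
    """
    # Total temporal compression is the product of the encoder strides.
    downsampling = math.prod(strides)
    # One latent frame is produced per `downsampling` input samples.
    frames_per_sec = sample_rate_hz / downsampling
    # Each frame yields one token per residual quantizer (codebook).
    tokens_per_sec = frames_per_sec * num_quantizers
    # Assumption: fixed-length coding of each index, no entropy coding.
    bits_per_token = math.log2(codebook_size)
    bits_per_sec = tokens_per_sec * bits_per_token
    return downsampling, frames_per_sec, tokens_per_sec, bits_per_sec


downsampling, frames_per_sec, tokens_per_sec, bits_per_sec = codec_rates(
    SAMPLE_RATE_HZ, STRIDES, NUM_QUANTIZERS, CODEBOOK_SIZE
)
print(f"downsampling factor: {downsampling}x")
print(f"frames per second:   {frames_per_sec:g}")
print(f"tokens per second:   {tokens_per_sec:g} (all codebooks)")
print(f"nominal bitrate:     {bits_per_sec / 1000:g} kbit/s")
```

Under these assumptions, the 80 tokens/sec in the README is a per-codebook rate: the 8 codebooks together give 640 tokens/sec, or about 6.4 kbit/s.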