timofeiiz/soundstream-impl

@@ -14,6 +14,14 @@ pipeline_tag: audio-to-audio
 A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. Accepts only 16 kHz audio.
 Encodes speech into discrete tokens (8 codebooks × 80 tokens/sec) and decodes them back to audio.
 ## Architecture
 - **Encoder**: Causal convolutions with residual units and strided downsampling (2, 4, 5, 5 = 200x compression)
@@ -21,14 +29,7 @@ Encodes speech into discrete tokens (8 codebooks × 80 tokens/sec) and decodes t
 - **Decoder**: Mirrored encoder with transposed convolutions
 - **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator
-| Parameter | Value |
-|---|---|
-| Sample rate | 16 kHz |
-| Channels | 32 |
-| Latent dim | 512 |
-| Codebook size | 1024 |
-| Num quantizers | 8 |
-| Downsampling factor | 200 |
 ## Usage

 A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. Accepts only 16 kHz audio.
 Encodes speech into discrete tokens (8 codebooks × 80 tokens/sec) and decodes them back to audio.
+## Metrics
+Evaluated on LibriSpeech test-clean:
+- **STOI**: 0.804
+- **NISQA**: 2.276
 ## Architecture
 - **Encoder**: Causal convolutions with residual units and strided downsampling (2, 4, 5, 5 = 200x compression)
 - **Decoder**: Mirrored encoder with transposed convolutions
 - **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator
+**Model parameters**: 16 kHz sample rate, 32 channels, latent dim 512, codebook size 1024, 8 quantizers, 200x downsampling
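The parameters above imply a frame rate and a bitrate that the card does not spell out. The following is a minimal sketch of that arithmetic, not code from this repository: the constant names and the `tokens_for_duration` helper are hypothetical, and it assumes each codebook index is stored in exactly log2(1024) = 10 bits with no entropy coding, and that a partial final frame is padded up to a full one.

```python
import math

# Values taken from the model card; the names are illustrative only.
SAMPLE_RATE_HZ = 16_000
STRIDES = (2, 4, 5, 5)      # encoder downsampling strides
NUM_QUANTIZERS = 8          # codebooks per frame
CODEBOOK_SIZE = 1024        # entries per codebook

# Total temporal downsampling is the product of the strides.
downsampling = math.prod(STRIDES)

# Latent frames produced per second of 16 kHz audio.
frames_per_second = SAMPLE_RATE_HZ // downsampling

# Assumption: each index is stored as a fixed-width integer.
bits_per_code = int(math.log2(CODEBOOK_SIZE))

# Bits per second across all codebooks.
bitrate_bps = frames_per_second * NUM_QUANTIZERS * bits_per_code


def tokens_for_duration(seconds: float) -> int:
    """Total tokens (frames x codebooks) for a clip of the given length.

    Assumption: a partial final frame is padded to a full frame.
    """
    num_samples = int(seconds * SAMPLE_RATE_HZ)
    num_frames = math.ceil(num_samples / downsampling)
    return num_frames * NUM_QUANTIZERS


print(downsampling, frames_per_second, bitrate_bps)
print(tokens_for_duration(1.0))
```

Under these assumptions, one second of audio maps to 80 frames of 8 tokens each, which works out to 6.4 kbps. The actual padding behavior of the model may differ at clip boundaries.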
 ## Usage