+---
+library_name: transformers
+tags:
+  - audio
+  - soundstream
+license: mit
+language:
+  - en
+pipeline_tag: audio-to-audio
+---
+# SoundStream
+A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. It accepts 16 kHz audio only.
+The model encodes speech into discrete tokens (80 frames per second, with 8 codebook indices per frame) and decodes them back to audio.
+## Architecture
+- **Encoder**: Causal convolutions with residual units and strided downsampling (strides 2, 4, 5, 5, for a total downsampling factor of 2 × 4 × 5 × 5 = 200)
+- **Quantizer**: Residual Vector Quantizer with 8 codebooks of 1024 entries each
+- **Decoder**: Mirrored encoder with transposed convolutions
+- **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator
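To make the Quantizer bullet concrete, here is a minimal pure-Python sketch of residual vector quantization. It is not the code shipped in this repository: the function names, the random toy codebooks (16 entries of dimension 4 instead of 1024 entries of dimension 512), and the nearest-neighbour search are illustrative assumptions.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next stage only sees what is left over
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruct a vector by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy setup (assumption): 8 codebooks of 16 random 4-dimensional entries
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0, 1.0 / (stage + 1)) for _ in range(4)] for _ in range(16)]
    for stage in range(8)
]
frame = [0.3, -1.2, 0.7, 0.1]           # one hypothetical latent frame
indices = rvq_encode(codebooks, frame)  # 8 integers, one per codebook
approx = rvq_decode(codebooks, indices)
```

In this sketch, decoding with only the first few indices still yields a coarser approximation, since each stage only refines the residual left by the previous ones.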
+| Parameter | Value |
+|---|---|
+| Sample rate | 16 kHz |
+| Channels | 32 |
+| Latent dim | 512 |
+| Codebook size | 1024 |
+| Num quantizers | 8 |
+| Downsampling factor | 200 |
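The values in the table fix the token rate and the bitrate. The arithmetic can be sketched as follows; the variable names are illustrative and not part of the model's API. The sketch assumes each codebook index costs log2(1024) = 10 bits with no entropy coding, and that inputs are padded up to a multiple of the downsampling factor.

```python
import math

SAMPLE_RATE = 16_000      # Hz, from the table above
STRIDES = (2, 4, 5, 5)    # encoder downsampling strides
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024

downsampling = math.prod(STRIDES)                 # 2 * 4 * 5 * 5 = 200
frame_rate = SAMPLE_RATE // downsampling          # 80 frames per second
bits_per_index = int(math.log2(CODEBOOK_SIZE))    # 10 bits per codebook index
tokens_per_second = frame_rate * NUM_QUANTIZERS   # 640 indices per second
bitrate_bps = tokens_per_second * bits_per_index  # 6400 bit/s = 6.4 kbit/s

def num_frames(num_samples: int) -> int:
    """Frames for a clip, assuming padding up to a multiple of the downsampling factor."""
    return math.ceil(num_samples / downsampling)
```

Under these assumptions, one second of 16 kHz audio maps to 80 frames, that is, 640 codebook indices.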
+## Usage
+```python
+import torch
+import torchaudio
+from transformers import AutoModel
+# Load the model; trust_remote_code=True runs the model code shipped in the repository
+model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True)
+model.eval()
+# Load audio as a (channels, samples) tensor
+waveform, sr = torchaudio.load("audio.wav")
+assert sr == 16000, "Only a 16 kHz sample rate is supported"
+with torch.no_grad():  # inference only, so skip gradient tracking
+    # Encode to discrete tokens
+    indices = model.encode(waveform.unsqueeze(0))  # (1, 8, T), T = number of frames
+    # Decode back to audio, trimmed to the input length
+    reconstructed = model.decode(indices, original_length=waveform.size(-1))
+torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), 16000)
+```
+## License
+MIT