---
library_name: transformers
tags:
- audio
- soundstream
license: mit
language:
- en
pipeline_tag: audio-to-audio
---
# SoundStream
A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. It accepts 16 kHz audio only.
It encodes speech into discrete tokens (8 codebooks at 80 frames per second, i.e. 640 tokens per second) and decodes them back to audio.
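
The token rate quoted above follows from the sample rate and the encoder strides listed under Architecture. The sketch below reproduces that arithmetic; the bitrate it prints is a derived figure, assuming each index is stored with log2(1024) = 10 bits and no entropy coding, and is not a number reported by this model card. The function name `codec_rates` is hypothetical.

```python
import math

# Values taken from this model card.
SAMPLE_RATE_HZ = 16_000
STRIDES = (2, 4, 5, 5)
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024


def codec_rates(sample_rate_hz, strides, num_codebooks, codebook_size):
    """Return (frames/sec, tokens/sec, bits/sec) for a strided RVQ codec."""
    hop = math.prod(strides)                  # samples consumed per latent frame
    frames_per_sec = sample_rate_hz / hop
    tokens_per_sec = frames_per_sec * num_codebooks
    bits_per_token = math.log2(codebook_size)  # assumes fixed-length index coding
    return frames_per_sec, tokens_per_sec, tokens_per_sec * bits_per_token


frames, tokens, bps = codec_rates(SAMPLE_RATE_HZ, STRIDES, NUM_CODEBOOKS, CODEBOOK_SIZE)
print(f"{frames:.0f} frames/s, {tokens:.0f} tokens/s, {bps / 1000:.1f} kbps")
```
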
## Metrics
Evaluated on LibriSpeech test-clean:
- **STOI**: 0.804 (intelligibility, 0 to 1, higher is better)
- **NISQA**: 2.276 (NISQA scores are on a 1 to 5 MOS scale, higher is better)
## Architecture
- **Encoder**: Causal convolutions with residual units and strided downsampling (strides 2, 4, 5, 5; 2 × 4 × 5 × 5 = 200x temporal downsampling)
- **Quantizer**: Residual Vector Quantizer with 8 codebooks of 1024 entries each
- **Decoder**: Mirrors the encoder, using transposed convolutions for upsampling
- **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator

**Model parameters**: 16 kHz sample rate, 32 channels, latent dim 512, codebook size 1024, 8 quantizers, 200x downsampling
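
To make the quantizer entry above concrete, here is a dependency-free toy sketch of residual vector quantization: each stage picks the nearest codebook entry for whatever the previous stages left unexplained, and decoding sums the selected entries. It is an illustration of the general technique, not this repository's implementation; the function names, the random codebooks, and the toy sizes are all hypothetical (the model itself uses 8 codebooks of 1024 entries with a 512-dimensional latent).

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct the vector as the sum of the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy sizes for illustration only (hypothetical, much smaller than the model's).
rng = random.Random(0)
NUM_QUANTIZERS, CODEBOOK_SIZE, DIM = 4, 16, 8
codebooks = [
    [[rng.gauss(0, 1.0 / (stage + 1)) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for stage in range(NUM_QUANTIZERS)
]
frame = [rng.gauss(0, 1) for _ in range(DIM)]

indices = rvq_encode(codebooks, frame)    # one index per quantizer stage
approx = rvq_decode(codebooks, indices)   # approximation of the input frame
print(indices)
```

In a trained codec the codebooks are learned so that later stages refine earlier ones; with the random codebooks used here the reconstruction is only a rough approximation.
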
## Usage
```python
import torch
import torchaudio
from transformers import AutoModel

# Load the model (its code lives in the Hub repo, hence trust_remote_code=True)
model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True)
model.eval()

# Load audio as a (channels, samples) tensor
waveform, sr = torchaudio.load("audio.wav")
assert sr == 16000  # Only 16 kHz sample rate is supported

# Inference only: disable autograd so the output can be saved directly
with torch.no_grad():
    # Encode to discrete tokens
    indices = model.encode(waveform.unsqueeze(0))  # (1, 8, T)
    # Decode back to audio
    reconstructed = model.decode(indices, original_length=waveform.size(-1))

torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), 16000)
```
## License
MIT