---
library_name: transformers
tags:
- audio
- soundstream
license: mit
language:
- en
pipeline_tag: audio-to-audio
---

# SoundStream

A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. The model accepts only 16 kHz audio. It encodes speech into discrete tokens (8 codebooks × 80 tokens/sec each) and decodes them back to audio.

## Metrics

Evaluated on LibriSpeech test-clean:

- **STOI**: 0.804
- **NISQA**: 2.276

## Architecture

- **Encoder**: Causal convolutions with residual units and strided downsampling (strides 2, 4, 5, 5, for 200x compression overall)
- **Quantizer**: Residual Vector Quantizer with 8 codebooks of 1024 entries each
- **Decoder**: Mirrored encoder with transposed convolutions
- **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator

**Model parameters**: 16 kHz, 32 channels, latent dim 512, codebook size 1024, 8 quantizers, 200x downsampling

## Usage

```python
import torch
import torchaudio
from transformers import AutoModel

# Load the model; trust_remote_code is required because the model class ships with the checkpoint
model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True)
model.eval()

# torchaudio.load returns a (channels, samples) tensor and the file's sample rate
waveform, sr = torchaudio.load("audio.wav")
assert sr == 16000, "Only 16 kHz sample rate is supported"

# Inference only, so disable gradient tracking
with torch.no_grad():
    # Encode to discrete tokens
    indices = model.encode(waveform.unsqueeze(0))  # (1, 8, T)

    # Decode back to audio, trimmed to the input length
    reconstructed = model.decode(indices, original_length=waveform.size(-1))

torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), 16000)
```

## License

MIT
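
## Token rate and bitrate

The figures quoted in the Architecture section follow from simple arithmetic, sketched below. The helper `codec_rates` is hypothetical and not part of this repository. The bitrate assumes each codebook index is stored as a plain log2(1024) = 10-bit integer with no entropy coding, which this model card does not state.

```python
import math


def codec_rates(sample_rate=16_000, strides=(2, 4, 5, 5),
                num_quantizers=8, codebook_size=1024):
    """Derive downsampling factor, frame rate, and raw bitrate from the model parameters.

    Hypothetical helper for illustration; the bitrate assumes fixed-width
    indices (log2(codebook_size) bits each) and no entropy coding.
    """
    downsampling = math.prod(strides)              # product of the encoder strides
    frames_per_sec = sample_rate / downsampling    # token frames per second, per codebook
    bits_per_frame = num_quantizers * math.log2(codebook_size)
    bits_per_sec = frames_per_sec * bits_per_frame
    return downsampling, frames_per_sec, bits_per_sec


downsampling, frames_per_sec, bits_per_sec = codec_rates()
print(downsampling, frames_per_sec, bits_per_sec)  # 200 80.0 6400.0
```

With the default parameters this gives 200x downsampling, 80 frames per second per codebook (640 indices per second across all 8 codebooks), and 6.4 kbps under the fixed-width assumption.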