SoundStream

A PyTorch implementation of the SoundStream neural audio codec. Accepts only 16 kHz audio.

Encodes speech into discrete tokens (8 codebooks at 80 frames per second, i.e. 640 tokens per second in total) and decodes them back to audio.

Metrics

Evaluated on LibriSpeech test-clean:

  • STOI: 0.804
  • NISQA: 2.276

Architecture

  • Encoder: Causal convolutions with residual units and strided downsampling (strides 2, 4, 5, 5; 2 × 4 × 5 × 5 = 200x temporal downsampling)
  • Quantizer: Residual Vector Quantizer with 8 codebooks of 1024 entries each
  • Decoder: Mirrored encoder with transposed convolutions
  • Discriminator (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator
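
To make the quantizer bullet concrete, here is a minimal sketch of residual vector quantization in PyTorch. It is not this repository's code: the function names `rvq_encode` and `rvq_decode` are hypothetical, the codebooks are random stand-ins rather than trained weights, and details a real quantizer may include (codebook learning, commitment loss, straight-through gradients) are left out.

```python
import torch


def rvq_encode(latents: torch.Tensor, codebooks: torch.Tensor) -> torch.Tensor:
    """Quantize latents (N, D) with codebooks (Q, K, D); return indices (N, Q)."""
    residual = latents
    indices = []
    for codebook in codebooks:  # one quantization stage per codebook
        # Pick the nearest codeword (Euclidean distance) to the current residual.
        idx = torch.cdist(residual, codebook).argmin(dim=-1)
        indices.append(idx)
        # The next stage only sees what this stage failed to capture.
        residual = residual - codebook[idx]
    return torch.stack(indices, dim=-1)


def rvq_decode(indices: torch.Tensor, codebooks: torch.Tensor) -> torch.Tensor:
    """Rebuild latents (N, D) by summing the selected codeword of every stage."""
    return sum(codebook[indices[:, q]] for q, codebook in enumerate(codebooks))


torch.manual_seed(0)
codebooks = torch.randn(8, 1024, 512)  # 8 stages, 1024 entries, latent dim 512 (random, untrained)
latents = torch.randn(80, 512)         # one second of frames at 80 frames/sec
indices = rvq_encode(latents, codebooks)
reconstructed = rvq_decode(indices, codebooks)
print(indices.shape, reconstructed.shape)
```

Each frame ends up as 8 integers in the range 0 to 1023, one per codebook, which is where the 8 codebooks per frame in the token layout come from.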

Model parameters: 16 kHz sample rate, 32 channels, latent dim 512, codebook size 1024, 8 quantizers, 200x downsampling
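
The frame rate, token rate, and bitrate follow directly from these parameters. The arithmetic below is a worked example, assuming each index is stored at log2(1024) = 10 bits with no entropy coding; the variable names are illustrative.

```python
import math

sample_rate = 16_000       # Hz
strides = (2, 4, 5, 5)     # encoder downsampling factors
num_quantizers = 8
codebook_size = 1024

hop = math.prod(strides)                           # 200 samples per frame
frame_rate = sample_rate / hop                     # 80.0 frames per second
bits_per_index = math.log2(codebook_size)          # 10.0 bits per index
tokens_per_second = frame_rate * num_quantizers    # 640.0 tokens per second
bitrate_bps = tokens_per_second * bits_per_index   # 6400.0 bits per second

print(hop, frame_rate, tokens_per_second, bitrate_bps)
```

Under that assumption, the token stream carries 6.4 kbit/s.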

Usage

import torch
import torchaudio
from transformers import AutoModel

# Load model (trust_remote_code runs the model code shipped with the repository)
model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True)
model.eval()

waveform, sr = torchaudio.load("audio.wav")  # (channels, samples)
assert sr == 16000  # Only 16 kHz sample rate is supported

# Disable gradient tracking for inference so the output can be saved directly
with torch.no_grad():
    # Encode to discrete tokens
    indices = model.encode(waveform.unsqueeze(0))  # (1, 8, T), T = number of frames

    # Decode back to audio
    reconstructed = model.decode(indices, original_length=waveform.size(-1))

torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), 16000)

License

MIT

Model size

18.6M params (F32, Safetensors)