SoundStream: An End-to-End Neural Audio Codec
Paper: arXiv:2107.03312
How to use timofeiiz/soundstream-impl with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True, dtype="auto")

A PyTorch implementation of the SoundStream neural audio codec. Accepts only 16 kHz audio.
Encodes speech into discrete tokens (8 codebooks × 80 tokens/sec) and decodes them back to audio.
Evaluated on LibriSpeech test-clean:
Model parameters: 16 kHz, 32 channels, latent dim 512, codebook size 1024, 8 quantizers, 200x downsampling
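The token rate quoted above follows directly from these parameters. The sketch below works through the arithmetic. The bitrate figure assumes each token is stored as a plain log2(codebook size)-bit index with no entropy coding, which is an assumption for illustration and not something the model card states. The function name `codec_rates` is hypothetical.

```python
import math

# Values taken from the model parameters listed above
SAMPLE_RATE_HZ = 16_000
DOWNSAMPLING = 200
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024


def codec_rates(sample_rate_hz, downsampling, num_quantizers, codebook_size):
    """Return (frames/sec, tokens/sec, bits/sec) for a codec with these settings."""
    # One latent frame is produced per `downsampling` input samples
    frames_per_sec = sample_rate_hz / downsampling
    # Each frame yields one token per quantizer (codebook)
    tokens_per_sec = frames_per_sec * num_quantizers
    # Assumption: every token is a fixed-width index into its codebook
    bits_per_token = math.log2(codebook_size)
    bitrate_bps = tokens_per_sec * bits_per_token
    return frames_per_sec, tokens_per_sec, bitrate_bps


frames_per_sec, tokens_per_sec, bitrate_bps = codec_rates(
    SAMPLE_RATE_HZ, DOWNSAMPLING, NUM_QUANTIZERS, CODEBOOK_SIZE
)
print(f"{frames_per_sec:.0f} frames/sec per codebook")
print(f"{tokens_per_sec:.0f} tokens/sec across all codebooks")
print(f"{bitrate_bps / 1000:.1f} kbps (fixed-width indices)")
```

Under that assumption, a 10-second clip maps to 800 frames, or 6400 tokens across the 8 codebooks.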
import torch
import torchaudio
from transformers import AutoModel

# Load model (trust_remote_code=True runs the custom model code shipped in the repo)
model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True)
model.eval()

# Load audio as a (channels, samples) tensor
waveform, sr = torchaudio.load("audio.wav")
assert sr == 16000  # Only 16 kHz sample rate is supported

# Inference only, so skip gradient tracking
with torch.no_grad():
    # Encode to discrete tokens
    indices = model.encode(waveform.unsqueeze(0))  # (1, 8, T)
    # Decode back to audio
    reconstructed = model.decode(indices, original_length=waveform.size(-1))

torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), 16000)
License: MIT