---
library_name: transformers
tags:
  - audio
  - soundstream
license: mit
language:
  - en
pipeline_tag: audio-to-audio
---

# SoundStream

A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. Accepts only 16 kHz audio.

It encodes speech into discrete tokens (8 codebooks, each emitting 80 tokens per second) and decodes them back to audio.
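The token rate follows from the figures stated in this card: 16 kHz input, 200x downsampling, and 8 codebooks of 1024 entries. A minimal sketch of that arithmetic is below; the resulting bitrate is derived here from those figures (assuming each token is stored as a fixed-length index), not a number reported by the author.

```python
import math

# Values taken from this model card
SAMPLE_RATE = 16_000    # Hz
DOWNSAMPLING = 200      # 2 * 4 * 5 * 5
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024

# One latent frame per 200 input samples
frame_rate = SAMPLE_RATE // DOWNSAMPLING           # frames per second

# Each frame yields one token per codebook
tokens_per_second = NUM_CODEBOOKS * frame_rate

# Assumption: each token is a fixed-length index into its codebook
bits_per_token = math.log2(CODEBOOK_SIZE)
bitrate_bps = tokens_per_second * bits_per_token

print(f"{frame_rate} frames/s, {tokens_per_second} tokens/s, {bitrate_bps / 1000:.1f} kbps")
```

Under these assumptions the codec runs at 80 frames per second and 6.4 kbps.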

## Metrics

Evaluated on LibriSpeech test-clean:

- **STOI**: 0.804
- **NISQA**: 2.276

## Architecture

- **Encoder**: Causal convolutions with residual units and strided downsampling (strides 2, 4, 5, 5; 2 × 4 × 5 × 5 = 200x compression)
- **Quantizer**: Residual Vector Quantizer with 8 codebooks of 1024 entries each
- **Decoder**: Mirrored encoder with transposed convolutions
- **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator

**Model parameters**: 16 kHz, 32 channels, latent dim 512, codebook size 1024, 8 quantizers, 200x downsampling
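To illustrate what the residual vector quantizer does, here is a toy sketch in plain Python. It assumes nearest-neighbour lookup by squared Euclidean distance and uses tiny hand-written codebooks; the function names are hypothetical and this is not the repository's implementation, which uses 8 learned codebooks of 1024 entries over 512-dimensional latents.

```python
def nearest(codebook, vector):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vector):
    """Quantize stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next stage refines what is left
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy example: 2 stages, 2 entries each, 1-dimensional vectors
toy_codebooks = [
    [[0.0], [1.0]],    # coarse stage
    [[0.0], [0.25]],   # finer stage, applied to the residual
]
toy_indices = rvq_encode(toy_codebooks, [1.2])
toy_reconstruction = rvq_decode(toy_codebooks, toy_indices)
print(toy_indices, toy_reconstruction)
```

Each additional stage quantizes the error left by the stages before it, which is why decoding is just a sum over the selected entries.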

## Usage

```python
import torch
import torchaudio
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True)
model.eval()

# Load audio as a (channels, samples) tensor
waveform, sr = torchaudio.load("audio.wav")
assert sr == 16000  # Only 16 kHz sample rate is supported

# Inference only: disable gradient tracking
with torch.no_grad():
    # Encode to discrete tokens
    indices = model.encode(waveform.unsqueeze(0))  # (1, 8, T)

    # Decode back to audio
    reconstructed = model.decode(indices, original_length=waveform.size(-1))

torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), 16000)
```

## License

MIT