Upload folder using huggingface_hub
README.md
---
library_name: transformers
tags:
- audio
- soundstream
license: mit
language:
- en
pipeline_tag: audio-to-audio
---

# SoundStream

A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. It accepts only 16 kHz audio.

It encodes speech into discrete tokens (8 codebooks at 80 frames per second each, 640 tokens/sec in total) and decodes them back to audio.

## Architecture

- **Encoder**: Causal convolutions with residual units and strided downsampling (strides 2, 4, 5, 5, for 2 × 4 × 5 × 5 = 200× temporal downsampling)
- **Quantizer**: Residual Vector Quantizer with 8 codebooks of 1024 entries each
- **Decoder**: Mirror of the encoder, using transposed convolutions for upsampling
- **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator
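
The residual quantization step listed above can be sketched in plain Python. This is a toy illustration, not the code in this repository: the function names `nearest`, `rvq_encode`, and `rvq_decode` and the two small 2-D codebooks are assumptions made for the example, whereas the model itself uses 8 codebooks of 1024 entries.

```python
# Toy residual vector quantization (RVQ); illustrative only, not this repo's code.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage codes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next stage sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two hypothetical stages: a coarse codebook and a finer one for the residual.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
indices = rvq_encode(codebooks, [1.2, 0.9])
approx = rvq_decode(codebooks, indices)
```

Each additional stage refines the approximation, which is why one frame is represented by one index per codebook.
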

| Parameter | Value |
|---|---|
| Sample rate | 16 kHz |
| Channels | 32 |
| Latent dim | 512 |
| Codebook size | 1024 |
| Num quantizers | 8 |
| Downsampling factor | 200 |
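
The figures above are related by simple arithmetic, worked through below. The bitrate is not stated in this README; it is derived here under the assumption that each index is stored as a plain log2(1024) = 10-bit integer with no entropy coding.

```python
import math

SAMPLE_RATE = 16_000     # Hz, from the table
STRIDES = (2, 4, 5, 5)   # encoder strides, from the Architecture list
NUM_QUANTIZERS = 8       # from the table
CODEBOOK_SIZE = 1024     # from the table

downsampling = math.prod(STRIDES)                 # product of the strides
frames_per_sec = SAMPLE_RATE // downsampling      # latent frames per second
tokens_per_sec = frames_per_sec * NUM_QUANTIZERS  # indices across all codebooks
bits_per_token = int(math.log2(CODEBOOK_SIZE))    # assumption: fixed-width indices
bitrate_bps = tokens_per_sec * bits_per_token     # bits per second

print(downsampling, frames_per_sec, tokens_per_sec, bitrate_bps)
```

Under that assumption, the token stream costs 6400 bits per second, i.e. 6.4 kbps.
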

## Usage

```python
import torch
import torchaudio
from transformers import AutoModel

# Load the model; trust_remote_code=True runs the model code shipped in the repository
model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True)
model.eval()

# waveform has shape (channels, samples)
waveform, sr = torchaudio.load("audio.wav")
assert sr == 16000  # Only 16 kHz sample rate is supported

# Inference only, so gradients are not needed
with torch.no_grad():
    # Encode to discrete tokens
    indices = model.encode(waveform.unsqueeze(0))  # (1, 8, T)

    # Decode back to audio
    reconstructed = model.decode(indices, original_length=waveform.size(-1))

torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), 16000)
```

## License

MIT