Instructions to use timofeiiz/soundstream-impl with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use timofeiiz/soundstream-impl with Transformers:

    # Load model directly
    from transformers import AutoModel

    model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True, dtype="auto")

- Notebooks
- Google Colab
- Kaggle
Upload folder using huggingface_hub
README.md
CHANGED
@@ -14,6 +14,14 @@ pipeline_tag: audio-to-audio
 A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. Accepts only 16 kHz audio.
 
 Encodes speech into discrete tokens (8 codebooks × 80 tokens/sec) and decodes them back to audio.
+
+## Metrics
+
+Evaluated on LibriSpeech test-clean:
+
+- **STOI**: 0.804
+- **NISQA**: 2.276
+
 ## Architecture
 
 - **Encoder**: Causal convolutions with residual units and strided downsampling (2, 4, 5, 5 = 200x compression)
@@ -21,14 +29,7 @@ Encodes speech into discrete tokens (8 codebooks × 80 tokens/sec) and decodes t
 - **Decoder**: Mirrored encoder with transposed convolutions
 - **Discriminator** (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator
 
-
-|---|---|
-| Sample rate | 16 kHz |
-| Channels | 32 |
-| Latent dim | 512 |
-| Codebook size | 1024 |
-| Num quantizers | 8 |
-| Downsampling factor | 200 |
+**Model parameters**: 16 kHz, 32 channels, latent dim 512, codebook size 1024, 8 quantizers, 200x downsampling
 
 ## Usage
 
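The figures quoted in this README change (16 kHz input, strides 2, 4, 5, 5, 8 quantizers, codebook size 1024) can be cross-checked with a little arithmetic. The sketch below is not part of the repository: the function name `codec_rates` and the dictionary keys are hypothetical, and the bitrate assumes each token is stored as a fixed-length index of log2(codebook size) bits with no entropy coding.

```python
import math

# Values taken from the README text in the diff above.
SAMPLE_RATE_HZ = 16_000
ENCODER_STRIDES = (2, 4, 5, 5)
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024


def codec_rates(sample_rate_hz=SAMPLE_RATE_HZ,
                strides=ENCODER_STRIDES,
                num_quantizers=NUM_QUANTIZERS,
                codebook_size=CODEBOOK_SIZE):
    """Derive frame rate, token rate, and nominal bitrate (hypothetical helper)."""
    # Total downsampling is the product of the encoder strides.
    downsampling = math.prod(strides)
    # One latent frame is produced per `downsampling` input samples.
    frames_per_second = sample_rate_hz / downsampling
    # Assumption: each token is a fixed-length index into its codebook.
    bits_per_token = math.log2(codebook_size)
    # Every frame yields one token per quantizer.
    tokens_per_second = frames_per_second * num_quantizers
    bitrate_bps = tokens_per_second * bits_per_token
    return {
        "downsampling": downsampling,
        "frames_per_second": frames_per_second,
        "bits_per_token": bits_per_token,
        "tokens_per_second": tokens_per_second,
        "bitrate_bps": bitrate_bps,
    }


rates = codec_rates()
for name, value in rates.items():
    print(f"{name}: {value}")
```

Under these assumptions the strides multiply to 200, which gives 80 frames per second per codebook (matching "8 codebooks × 80 tokens/sec") and a nominal rate of 6400 bits per second.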