SoundStream — LibriSpeech 16 kHz
SoundStream-style neural audio codec for speech, trained from scratch on
train-clean-100 of LibriSpeech at 16 kHz. The residual vector quantization
(RVQ) module is implemented by hand rather than taken from a library. Training
uses delayed adversarial losses: an STFT discriminator and a small waveform
discriminator are switched on at step 20k, adding hinge loss and feature
matching on top of the multi-scale mel reconstruction loss.
Code: https://github.com/tolyaho/NeuralAudioCodec
Results
Full test-clean (LibriSpeech), 16 kHz mono:
| model | STOI | NISQA |
|---|---|---|
| reconstruction only | 0.9144 | 1.9335 |
| two-stage STFT-GAN (from reconstruction) | 0.9317 | 2.3278 |
| scratch EMA RVQ + delayed STFT/wave GAN (this model) | 0.9399 | 2.5212 |
Full-precision scores for this model: STOI 0.9398752048725391, NISQA 2.5212083049857887.
Architecture
- Encoder/decoder strides [2, 4, 5, 5] → 80 frames/s at 16 kHz. base_channels = 32, latent_dim = 512.
- RVQ with num_quantizers = 8, codebook_size = 1024, EMA codebook updates, straight-through estimator, commitment loss weight 1.0.
- Bitrate ≈ 80 × 8 × log₂(1024) = 6.4 kbps.
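The frame rate and bitrate above follow directly from the strides and codebook settings, and the residual quantization step can be illustrated in a few lines. The block below is a minimal pure-Python sketch, not the repo's implementation: `nearest_code` and `rvq_encode` are hypothetical names, the toy demo uses 4-dimensional latents and 16 codes per stage instead of 512 and 1024, and EMA updates, the straight-through estimator, and the commitment loss are left out.

```python
import math
import random

SAMPLE_RATE = 16_000
STRIDES = [2, 4, 5, 5]
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024

hop = math.prod(STRIDES)                       # 200 samples per latent frame
frame_rate = SAMPLE_RATE // hop                # 80 frames per second
bits_per_frame = NUM_QUANTIZERS * int(math.log2(CODEBOOK_SIZE))  # 8 * 10 bits
bitrate_bps = frame_rate * bits_per_frame      # 6400 bit/s, i.e. 6.4 kbps


def nearest_code(codebook, vec):
    """Index of the codebook entry with the smallest squared distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize one latent frame; each stage codes the previous stage's residual."""
    residual = list(vec)
    quantized = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(codebook, residual)
        code = codebook[idx]
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, quantized


# Toy demo (assumed sizes): random codebooks, one random latent frame.
rng = random.Random(0)
toy_codebooks = [
    [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]
    for _ in range(NUM_QUANTIZERS)
]
frame = [rng.gauss(0.0, 1.0) for _ in range(4)]
indices, quantized = rvq_encode(frame, toy_codebooks)
```

Each frame is transmitted as `NUM_QUANTIZERS` indices, one per stage, which is where the 8 × 10 bits per frame come from; the decoder only needs to sum the selected codes.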
Discriminators:
- STFT discriminator, base_channels = 16, multi-scale.
- Waveform discriminator, base_channels = 8, multi-scale.
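Both discriminators are trained with hinge loss and feature matching. As a rough illustration, the sketch below writes these losses in plain Python over lists of numbers. It assumes the SoundStream-paper form of the hinge loss for both discriminator and generator; the repo may use a different generator term, and the function names are hypothetical. In practice the inputs would be tensors of discriminator logits and intermediate feature maps.

```python
def hinge_d_loss(real_logits, fake_logits):
    """Discriminator hinge loss: push real logits above +1 and fake logits below -1."""
    real_term = sum(max(0.0, 1.0 - r) for r in real_logits) / len(real_logits)
    fake_term = sum(max(0.0, 1.0 + f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def hinge_g_loss(fake_logits):
    """Generator hinge loss (assumed form): push fake logits above +1."""
    return sum(max(0.0, 1.0 - f) for f in fake_logits) / len(fake_logits)


def feature_matching_loss(real_feats, fake_feats):
    """Mean absolute difference per layer, averaged over discriminator layers."""
    per_layer = [
        sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
        for real, fake in zip(real_feats, fake_feats)
    ]
    return sum(per_layer) / len(per_layer)


# Tiny worked example with hand-picked values.
d_loss = hinge_d_loss([2.0, 0.5], [-2.0, 0.0])
g_loss = hinge_g_loss([2.0, 0.0])
fm_loss = feature_matching_loss([[1.0, 2.0], [0.0]], [[1.5, 2.0], [1.0]])
```

With multi-scale discriminators, these terms would typically be computed per scale and then averaged or summed; how the repo aggregates them is not stated here.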
Training setup
Single Kaggle T4, batch size 12, 0.5 s random crops, 45000 steps.
- Reconstruction phase: multi-scale mel loss + commitment loss, lr = 1e-4.
- Adversarial phase: starts at step 20000 with a 15000-step linear warmup of the adversarial weight from 0 to adv_weight = 0.02.
- Discriminator update every 4 steps, disc_lr = 5e-7.
- Feature matching weight 3.0.
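The schedule above can be sketched as two small helpers. This is an illustration under stated assumptions rather than the repo's code: the function names are hypothetical, the adversarial weight is assumed to be exactly 0 before step 20000 and held at 0.02 once the warmup ends at step 35000, and the discriminator interval is assumed to be counted on the global step (which coincides with counting from step 20000, since 20000 is divisible by 4).

```python
ADV_START_STEP = 20_000      # adversarial phase begins here
ADV_WARMUP_STEPS = 15_000    # linear ramp length
ADV_WEIGHT = 0.02            # final adversarial weight
DISC_UPDATE_EVERY = 4        # discriminator steps once per 4 training steps


def adv_weight_at(step):
    """Adversarial loss weight: 0 before the start, then a clipped linear ramp."""
    if step < ADV_START_STEP:
        return 0.0
    progress = min(1.0, (step - ADV_START_STEP) / ADV_WARMUP_STEPS)
    return ADV_WEIGHT * progress


def should_update_discriminator(step):
    """True on the steps where the discriminator would take an optimizer step."""
    return step >= ADV_START_STEP and step % DISC_UPDATE_EVERY == 0
```

Under these assumptions the weight reaches half its final value (0.01) at step 27500 and stays at 0.02 for the last 10000 of the 45000 steps.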
The command for the final run is in the repo's README.md, in the section
"Final scratch EMA + STFT/wave GAN".
Usage
git clone https://github.com/tolyaho/NeuralAudioCodec
cd NeuralAudioCodec
pip install -r requirements.txt
bash scripts/download_checkpoint.sh # pulls final_soundstream.pt here
python scripts/inference.py checkpoint=checkpoints/final_soundstream.pt limit=2 save_audio=0
For an end-to-end "paste a URL, get a reconstructed clip" demo, see
notebooks/demo.ipynb.
License
MIT.