SoundStream — LibriSpeech 16 kHz

SoundStream-style neural audio codec for speech, trained from scratch on the train-clean-100 subset of LibriSpeech at 16 kHz. The RVQ (residual vector quantization) module is implemented by hand rather than taken from a library. Training uses delayed adversarial losses: an STFT discriminator and a small waveform discriminator are switched on at step 20k, adding hinge and feature-matching losses on top of the multi-scale mel reconstruction loss.

Code: https://github.com/tolyaho/NeuralAudioCodec

Results

Full test-clean (LibriSpeech), 16 kHz mono:

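| model                                                | STOI   | NISQA  |
|------------------------------------------------------|--------|--------|
| reconstruction only                                  | 0.9144 | 1.9335 |
| two-stage STFT-GAN (from reconstruction)             | 0.9317 | 2.3278 |
| scratch EMA RVQ + delayed STFT/wave GAN (this model) | 0.9399 | 2.5212 |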

Unrounded scores for this model: STOI 0.9398752048725391, NISQA 2.5212083049857887.

Architecture

  • Encoder/decoder strides [2, 4, 5, 5] → 80 frames/s at 16 kHz.
  • base_channels = 32, latent_dim = 512.
  • RVQ with num_quantizers = 8, codebook_size = 1024, EMA codebook updates, straight-through estimator, commitment loss weight 1.0.
  • Bitrate ≈ 80 × 8 × log₂(1024) = 6400 bit/s = 6.4 kbps.
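
The numbers in this list can be checked with a short NumPy sketch. It is not the repo's RVQ module: the codebooks are random stand-ins, the function name rvq_encode is hypothetical, and the EMA codebook updates, straight-through estimator, and commitment loss used in training are left out. It only illustrates the residual quantization loop and the frame-rate and bitrate arithmetic.

```python
import math
import numpy as np

# Values taken from the architecture list above.
STRIDES = [2, 4, 5, 5]
SAMPLE_RATE = 16_000
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024
LATENT_DIM = 512


def rvq_encode(latents, codebooks):
    """Quantize (T, D) latents with a stack of (K, D) codebooks.

    Returns the (T, num_quantizers) code indices and the (T, D) quantized sum.
    Hypothetical helper for illustration only.
    """
    residual = latents.copy()
    quantized = np.zeros_like(latents)
    indices = []
    for codebook in codebooks:
        # Squared distances ||r||^2 - 2 r.c + ||c||^2, shape (T, K).
        dists = (
            (residual ** 2).sum(axis=1, keepdims=True)
            - 2.0 * residual @ codebook.T
            + (codebook ** 2).sum(axis=1)
        )
        idx = dists.argmin(axis=1)
        chosen = codebook[idx]
        quantized += chosen   # running sum of the selected codewords
        residual -= chosen    # the next stage quantizes what is left
        indices.append(idx)
    return np.stack(indices, axis=1), quantized


hop = math.prod(STRIDES)                                          # 200 samples per frame
frame_rate = SAMPLE_RATE // hop                                   # 80 frames/s
bits_per_frame = NUM_QUANTIZERS * int(math.log2(CODEBOOK_SIZE))   # 8 * 10 = 80 bits
bitrate_bps = frame_rate * bits_per_frame                         # 6400 bit/s = 6.4 kbps

# Random stand-ins for the learned codebooks and for one second of encoder output.
rng = np.random.default_rng(0)
codebooks = [
    rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM)).astype(np.float32)
    for _ in range(NUM_QUANTIZERS)
]
latents = rng.normal(size=(frame_rate, LATENT_DIM)).astype(np.float32)

codes, quantized = rvq_encode(latents, codebooks)
print(codes.shape, bitrate_bps)  # (80, 8) 6400
```

Each second of audio becomes an 80 × 8 grid of 10-bit indices, which is where the 6.4 kbps figure comes from. Because the codebooks here are random, the reconstruction quality of this sketch says nothing about the trained model.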

Discriminators:

  • STFT discriminator, base_channels = 16, multi-scale.
  • Waveform discriminator, base_channels = 8, multi-scale.
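
The hinge and feature-matching losses applied to these discriminators can be sketched in NumPy as follows. This is one common formulation and not necessarily what the repo does: the function names are hypothetical, the averaging over scales and layers is an assumption, and the generator term uses the unclipped form -D(G(x)) (another variant clips it as max(0, 1 - D(G(x)))).

```python
import numpy as np


def d_hinge_loss(real_logits, fake_logits):
    """Discriminator hinge loss, averaged over discriminator scales.

    Both arguments are lists with one logit array per scale.
    """
    total = 0.0
    for real, fake in zip(real_logits, fake_logits):
        total += np.mean(np.maximum(0.0, 1.0 - real))  # push real logits above +1
        total += np.mean(np.maximum(0.0, 1.0 + fake))  # push fake logits below -1
    return total / len(real_logits)


def g_hinge_loss(fake_logits):
    """Generator adversarial loss (unclipped variant), averaged over scales."""
    return sum(-np.mean(fake) for fake in fake_logits) / len(fake_logits)


def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between intermediate discriminator features.

    Both arguments are lists (per scale) of lists (per layer) of arrays.
    """
    terms = [
        np.mean(np.abs(r - f))
        for real_layers, fake_layers in zip(real_feats, fake_feats)
        for r, f in zip(real_layers, fake_layers)
    ]
    return sum(terms) / len(terms)


# Toy values for a single scale with a single feature layer.
real_logits = [np.array([2.0, 0.5])]
fake_logits = [np.array([-2.0, 0.0])]
d_loss = d_hinge_loss(real_logits, fake_logits)                      # 0.25 + 0.5 = 0.75
g_loss = g_hinge_loss(fake_logits)                                   # 1.0
fm_loss = feature_matching_loss([[np.ones(4)]], [[np.zeros(4)]])     # 1.0
print(d_loss, g_loss, fm_loss)
```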

Training setup

Single Kaggle T4, batch size 12, 0.5 s random crops, 45000 steps.
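
As a rough illustration of what these numbers mean per batch, the sketch below draws 0.5 s random crops and stacks 12 of them. The helper random_crop is hypothetical, and zero-padding clips shorter than the crop length is an assumption, not a description of the repo's data loader.

```python
import numpy as np

SAMPLE_RATE = 16_000
CROP_SECONDS = 0.5
BATCH_SIZE = 12
HOP = 2 * 4 * 5 * 5  # product of the encoder strides

CROP_LEN = int(SAMPLE_RATE * CROP_SECONDS)  # 8000 samples


def random_crop(wave, rng, crop_len=CROP_LEN):
    """Return a random crop of crop_len samples (hypothetical helper).

    Clips shorter than crop_len are zero-padded at the end (an assumption).
    """
    if len(wave) <= crop_len:
        return np.pad(wave, (0, crop_len - len(wave)))
    start = rng.integers(0, len(wave) - crop_len + 1)  # upper bound is exclusive
    return wave[start:start + crop_len]


rng = np.random.default_rng(0)
# Stand-in utterances of 3 s each; real training data would come from LibriSpeech.
utterances = [rng.normal(size=3 * SAMPLE_RATE).astype(np.float32) for _ in range(BATCH_SIZE)]
batch = np.stack([random_crop(u, rng) for u in utterances])

frames_per_crop = CROP_LEN // HOP  # 40 latent frames per crop
print(batch.shape, frames_per_crop)  # (12, 8000) 40
```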

  • Reconstruction phase: multi-scale mel loss + commitment loss, lr = 1e-4.
  • Adversarial phase: starts at step 20000 with a 15000-step linear warmup of the adversarial weight from 0 to adv_weight = 0.02.
  • Discriminator update every 4 steps, disc_lr = 5e-7.
  • Feature matching weight 3.0.
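
The schedule described in this list can be sketched as a pair of small functions. The function names are hypothetical, and two details are assumptions: that the ramp is computed as a clamped linear fraction of the warmup length, and that the "every 4 steps" counter starts at the first adversarial step rather than at step 0.

```python
# Values taken from the training setup above.
TOTAL_STEPS = 45_000
ADV_START = 20_000
ADV_WARMUP = 15_000
ADV_WEIGHT = 0.02
DISC_EVERY = 4


def adv_weight_at(step):
    """Adversarial loss weight: 0 before ADV_START, then a linear ramp to ADV_WEIGHT."""
    if step < ADV_START:
        return 0.0
    progress = min(1.0, (step - ADV_START) / ADV_WARMUP)
    return ADV_WEIGHT * progress


def update_discriminator(step):
    """Whether the discriminators take an optimizer step (assumed counting from ADV_START)."""
    return step >= ADV_START and (step - ADV_START) % DISC_EVERY == 0


for step in (0, 20_000, 27_500, 35_000, 44_999):
    print(step, adv_weight_at(step), update_discriminator(step))
```

Under these assumptions the weight reaches its full value of 0.02 at step 35000, leaving the last 10000 of the 45000 steps at full adversarial weight.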

The command for the final run is in the repo's README.md under "Final scratch EMA + STFT/wave GAN".

Usage

git clone https://github.com/tolyaho/NeuralAudioCodec
cd NeuralAudioCodec
pip install -r requirements.txt
bash scripts/download_checkpoint.sh   # pulls final_soundstream.pt here
python scripts/inference.py checkpoint=checkpoints/final_soundstream.pt limit=2 save_audio=0

For an end-to-end "paste a URL, get a reconstructed clip" demo, see notebooks/demo.ipynb.

License

MIT.

