SoundStream — LibriSpeech 16 kHz

SoundStream-style neural audio codec for speech, trained from scratch on the LibriSpeech train-clean-100 subset at 16 kHz. The residual vector quantization (RVQ) module is implemented by hand rather than taken from a library. Training delays the adversarial losses: an STFT discriminator and a small waveform discriminator are switched on at step 20k, adding hinge and feature-matching losses on top of the multi-scale mel reconstruction loss.

Code: https://github.com/tolyaho/NeuralAudioCodec

Results

Evaluated on the full LibriSpeech test-clean set, 16 kHz mono:

model	STOI	NISQA
reconstruction only	0.9144	1.9335
two-stage STFT-GAN (from reconstruction)	0.9317	2.3278
scratch EMA RVQ + delayed STFT/wave GAN (this)	0.9399	2.5212

Unrounded scores for this model: STOI 0.9398752048725391, NISQA 2.5212083049857887.

Architecture

Encoder/decoder strides [2, 4, 5, 5] → 80 frames/s at 16 kHz.
base_channels = 32, latent_dim = 512.
RVQ with num_quantizers = 8, codebook_size = 1024, EMA codebook updates, straight-through estimator, commitment loss weight 1.0.
Bitrate = 80 × 8 × log₂(1024) = 6400 bit/s = 6.4 kbps.
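The frame rate and bitrate above follow directly from the strides and the RVQ settings. The sketch below checks that arithmetic and shows the residual quantization idea on a toy example. It is a minimal illustration, not the repo's implementation: the function names, the random toy codebooks, and the plain nearest-neighbour search are assumptions, and EMA codebook updates, the straight-through estimator, and the commitment loss are left out.

```python
import math
import random

SAMPLE_RATE = 16_000
STRIDES = [2, 4, 5, 5]
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024


def frame_rate(sample_rate, strides):
    """Latent frames per second: sample rate divided by the total stride."""
    return sample_rate / math.prod(strides)


def bitrate_bps(frames_per_s, num_quantizers, codebook_size):
    """Each frame emits one index of log2(codebook_size) bits per quantizer."""
    return frames_per_s * num_quantizers * math.log2(codebook_size)


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage codes the previous stage's residual."""
    residual = list(vec)
    quantized = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, quantized


fps = frame_rate(SAMPLE_RATE, STRIDES)
kbps = bitrate_bps(fps, NUM_QUANTIZERS, CODEBOOK_SIZE) / 1000
print(f"{fps:.0f} frames/s, {kbps:.1f} kbps")

# Toy example (hypothetical sizes): 3 stages, 16 entries each, 4-dim latents.
rng = random.Random(0)
toy_codebooks = [
    [[rng.gauss(0.0, 0.5 ** stage) for _ in range(4)] for _ in range(16)]
    for stage in range(3)
]
toy_vec = [rng.gauss(0.0, 1.0) for _ in range(4)]
toy_indices, toy_quantized = rvq_encode(toy_codebooks, toy_vec)
print(toy_indices)  # one codebook index per stage
```

In the real codec each latent frame would yield 8 indices in [0, 1024), which is where the 8 × 10 bits per frame come from.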

Discriminators:

STFT discriminator, base_channels = 16, multi-scale.
Waveform discriminator, base_channels = 8, multi-scale.
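Both discriminators are trained with the hinge loss and feature matching named in the introduction. A minimal sketch of those terms on plain Python lists follows. The function names are hypothetical, and the generator term uses the common `-mean(D(G(x)))` form of the hinge objective; the repo may use a clamped variant instead, so treat this as an illustration rather than the exact loss used in training.

```python
def hinge_d_loss(real_logits, fake_logits):
    """Discriminator hinge loss: push real logits above +1 and fake logits below -1."""
    real_term = sum(max(0.0, 1.0 - r) for r in real_logits) / len(real_logits)
    fake_term = sum(max(0.0, 1.0 + f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def hinge_g_loss(fake_logits):
    """Generator term: raise the discriminator's logits on reconstructions."""
    return -sum(fake_logits) / len(fake_logits)


def feature_matching_loss(real_feats, fake_feats):
    """Mean absolute difference of discriminator activations, averaged over layers."""
    per_layer = [
        sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
        for real, fake in zip(real_feats, fake_feats)
    ]
    return sum(per_layer) / len(per_layer)


# Made-up logits and activations, only to exercise the functions.
d_loss = hinge_d_loss(real_logits=[2.0, 0.5], fake_logits=[-2.0, 0.0])
g_loss = hinge_g_loss(fake_logits=[-2.0, 0.0])
fm_loss = feature_matching_loss(
    real_feats=[[1.0, 2.0], [0.0]],
    fake_feats=[[1.5, 2.0], [1.0]],
)
print(d_loss, g_loss, fm_loss)
```

With multi-scale discriminators, these terms would typically be averaged over every scale of both the STFT and the waveform discriminator.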

Training setup

Single Kaggle T4, batch size 12, 0.5 s random crops, 45000 steps.

Reconstruction phase: multi-scale mel loss + commitment loss, lr = 1e-4.
Adversarial phase: starts at step 20000 with a 15000-step linear warmup of the adversarial weight from 0 to adv_weight = 0.02, so the full weight is reached at step 35000.
Discriminator update every 4 steps, disc_lr = 5e-7.
Feature matching weight 3.0.
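The schedule above can be written as two small functions. This is a sketch under stated assumptions: the function names are hypothetical, steps are counted from 0, and the discriminator is assumed to update on global steps divisible by 4 once the adversarial phase has started. The repo may count the update interval differently.

```python
ADV_START = 20_000   # step at which the discriminators are switched on
ADV_WARMUP = 15_000  # length of the linear warmup, in steps
ADV_WEIGHT = 0.02    # final adversarial weight
DISC_EVERY = 4       # discriminator update interval, in steps


def adv_weight_at(step):
    """Adversarial weight: 0 before ADV_START, then a linear ramp up to ADV_WEIGHT."""
    if step < ADV_START:
        return 0.0
    progress = min(1.0, (step - ADV_START) / ADV_WARMUP)
    return ADV_WEIGHT * progress


def should_update_discriminator(step):
    """Assumed cadence: every DISC_EVERY-th global step during the adversarial phase."""
    return step >= ADV_START and step % DISC_EVERY == 0


for step in (0, 20_000, 27_500, 35_000, 45_000):
    print(step, adv_weight_at(step), should_update_discriminator(step))
```

Under these assumptions the last 10000 of the 45000 steps run at the full adversarial weight.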

The command for the final run is in the repo's README.md under "Final scratch EMA + STFT/wave GAN".

Usage

git clone https://github.com/tolyaho/NeuralAudioCodec
cd NeuralAudioCodec
pip install -r requirements.txt
bash scripts/download_checkpoint.sh   # pulls final_soundstream.pt here
python scripts/inference.py checkpoint=checkpoints/final_soundstream.pt limit=2 save_audio=0

For an end-to-end "paste a URL, get a reconstructed clip" demo, see notebooks/demo.ipynb.

License

MIT.
