
	---
	library_name: transformers
	tags:
	- audio
	- soundstream
	license: mit
	language:
	- en
	pipeline_tag: audio-to-audio
	---

	# SoundStream

	A PyTorch implementation of the [SoundStream](https://arxiv.org/abs/2107.03312) neural audio codec. Accepts only 16 kHz audio.

	Encodes speech into discrete tokens (8 codebooks, each producing 80 tokens per second of audio) and decodes them back to audio.

	## Metrics

	Evaluated on LibriSpeech test-clean:

	- STOI: 0.804
	- NISQA: 2.276

	## Architecture

	- Encoder: Causal convolutions with residual units and strided downsampling (strides 2, 4, 5, 5, for 200x total downsampling in time)
	- Quantizer: Residual Vector Quantizer with 8 codebooks of 1024 entries each
	- Decoder: Mirrored encoder with transposed convolutions
	- Discriminator (training only): 3 multi-scale waveform discriminators + 1 STFT-based discriminator
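To illustrate the quantizer stage, here is a minimal NumPy sketch of residual vector quantization. It is not the code shipped in this repository: the function names `rvq_encode` and `rvq_decode` are hypothetical, the codebooks are random rather than learned, and the toy sizes (16 entries, 4 dimensions) stand in for the real 1024 entries and latent dim 512 to keep the example small.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Greedy residual VQ (illustrative sketch).

    x:         (T, D) array of latent frames
    codebooks: (Q, K, D) array, Q codebooks with K entries each
    returns:   (Q, T) integer indices, one row per codebook
    """
    residual = x.copy()
    indices = []
    for codebook in codebooks:
        # Squared Euclidean distance from every frame to every entry: (T, K)
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        # The next codebook quantizes whatever this one failed to capture
        residual = residual - codebook[idx]
    return np.stack(indices)


def rvq_decode(indices, codebooks):
    """Sum the selected entry from each codebook: (Q, T) -> (T, D)."""
    out = np.zeros((indices.shape[1], codebooks.shape[2]))
    for codebook, idx in zip(codebooks, indices):
        out += codebook[idx]
    return out


# Toy sizes (assumption): 8 codebooks, 16 entries, 4-dim latents, 5 frames
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(8, 16, 4))
latents = rng.normal(size=(5, 4))

indices = rvq_encode(latents, codebooks)
approx = rvq_decode(indices, codebooks)
print(indices.shape, approx.shape)
```

Each codebook quantizes the error left by the previous ones, which is why a frame is represented by 8 indices rather than one.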

	Model parameters: 16 kHz, 32 channels, latent dim 512, codebook size 1024, 8 quantizers, 200x downsampling
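The token rate follows directly from these parameters. The short calculation below derives it, along with a nominal bitrate; the bitrate figure is not stated in this card and assumes each index is stored as a fixed-length 10-bit code with no entropy coding.

```python
import math

SAMPLE_RATE = 16_000      # Hz
STRIDES = (2, 4, 5, 5)    # encoder downsampling strides
NUM_QUANTIZERS = 8
CODEBOOK_SIZE = 1024

downsampling = math.prod(STRIDES)                     # 2 * 4 * 5 * 5 = 200
frame_rate = SAMPLE_RATE // downsampling              # 80 frames per second
bits_per_index = math.ceil(math.log2(CODEBOOK_SIZE))  # 10 bits per index
indices_per_second = frame_rate * NUM_QUANTIZERS      # 640 indices per second
bitrate_bps = indices_per_second * bits_per_index     # 6400 bit/s

print(f"{frame_rate} frames/s, {indices_per_second} indices/s, "
      f"{bitrate_bps / 1000:.1f} kbit/s")
```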

	## Usage

	```python
	import torch
	import torchaudio
	from transformers import AutoModel

	# Load the model; trust_remote_code=True runs the model code shipped in the repo
	model = AutoModel.from_pretrained("timofeiiz/soundstream-impl", trust_remote_code=True)
	model.eval()

	# Load audio as a (channels, num_samples) tensor
	waveform, sr = torchaudio.load("audio.wav")
	assert sr == 16000, "Only 16 kHz sample rate is supported"

	with torch.no_grad():  # inference only, so skip gradient tracking
	    # Encode to discrete tokens
	    indices = model.encode(waveform.unsqueeze(0))  # (1, 8, T), T = number of frames

	    # Decode back to audio
	    reconstructed = model.decode(indices, original_length=waveform.size(-1))

	torchaudio.save("reconstructed.wav", reconstructed.squeeze(0).cpu(), 16000)
	```

	## License

	MIT